Projects in Awesome Lists tagged with large-dataset
A curated list of projects in awesome lists tagged with large-dataset.
https://github.com/opendrivelab/driveagi
[CVPR 2024 Highlight] GenAD: Generalized Predictive Model for Autonomous Driving & Foundation Models in Autonomous System
autonomous-driving embodied-ai foundation-model general-artificial-intelligence large-dataset policy-learning video-dataset video-generation world-models
Last synced: 15 May 2025
https://github.com/DiskFrame/disk.frame
Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
data data-science large-dataset manipulation-data medium-data r
Last synced: 14 Mar 2025
https://github.com/xiaodaigh/disk.frame
Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
data data-science large-dataset manipulation-data medium-data r
Last synced: 14 Mar 2025
https://github.com/fair-acc/chart-fx
A scientific charting library focused on performance-optimised real-time data visualisation at 25 Hz update rates for data sets with a few tens of thousands up to 5 million data points.
chart-fx charting-libraries data-visualisation hacktoberfest java javafx large-dataset scientific-visualization
Last synced: 04 Apr 2025
https://github.com/GSI-CS-CO/chart-fx
A scientific charting library focused on performance-optimised real-time data visualisation at 25 Hz update rates for data sets with a few tens of thousands up to 5 million data points.
chart-fx charting-libraries data-visualisation hacktoberfest java javafx large-dataset scientific-visualization
Last synced: 21 Dec 2024
https://github.com/zzw922cn/tensorflow-input-pipeline
TensorFlow Input Pipeline Examples based on multi-thread and FIFOQueue
fifo-queue input-pipeline large-dataset mini-batch multi-threading small-dataset tensorflow tfrecords
Last synced: 26 Apr 2025
https://github.com/privefl/bigreadr
R package to read large text files based on splitting + data.table::fread
large-dataset r-package read-csv
Last synced: 22 Nov 2024
https://github.com/kyegomez/EXA-1
An EXA-Scale repository of Multi-Modality AI resources from papers and models, to foundational libraries!
artificial-intelligence dataset gpt4 jax kosmos large-dataset large-language-models multimodal multimodal-data multimodality pytorch pytorch-implementation triton
Last synced: 28 Mar 2025
https://github.com/matteodelabre/saxophone
Fast and lightweight event-driven streaming XML parser in pure JavaScript
javascript large-dataset parser sax xml
Last synced: 16 Mar 2025
https://github.com/maxhalford/tuna
:fish: A streaming ETL for fish
etl feature-extraction go golang large-dataset machine-learning online-algorithms stream stream-processing
Last synced: 07 May 2025
https://github.com/guypeer8/csv-streamer
💧 A stream based csv aggregator for limiting RAM usage while processing large data sets.
csv gzip large-dataset nodejs sqlite stream zlib
Last synced: 21 Nov 2024
https://github.com/gjcampbell/ooffice
Some components for internal, line of business angular apps
angular angular2 large-dataset performance tree virtualized virtualizer
Last synced: 17 Feb 2025
https://github.com/imdeepmind/amazonreview-languagegenerationdataset
Processed Amazon Review Dataset for Language Generation (Character Level)
deep-learning language-datasets language-generation-dataset language-model large-dataset machine-learning nlp python
Last synced: 11 Feb 2025
https://github.com/davidssmith/rawarray.jl
Raw array (RA) file format for simple, robust, and user-friendly N-dimensional array storage
bytes complex-numbers data-science file-format julia large-dataset large-files ra-format rawarray scientific-computing storage
Last synced: 07 May 2025
https://github.com/vjgpt/home-credit-default-risk
The objective of this competition is to use historical loan application data to predict whether an applicant will be able to repay a loan.
banking credit-risk gradient-boosting large-dataset lightgbm loan
Last synced: 10 Apr 2025
https://github.com/bugthesystem/cerebro
Finding The Median In Large Sets Of Numbers Split Across N Servers using zeromq and nodejs (experimental)
average distributed experimental large-dataset median nodejs zeromq
Last synced: 19 Feb 2025
https://github.com/m-wells/alignedbinaryformat.jl
Memory-mapping made easy.
file-format input-output julia julia-language julia-package julialang large-dataset load memory-mapping save serialization
Last synced: 10 Apr 2025
https://github.com/Lizhecheng02/Kaggle-LLM-Detect_AI_Generated_Text
Detect whether the text is AI-generated by training a new tokenizer and combining it with tree classification models or by training language models on a large dataset of human & AI-generated texts.
ai-generated bpe classification ensemble large-dataset llm tokenizer wordpiece
Last synced: 06 Jan 2025
https://github.com/emahtab/mysql-test-dataset
Repository for MySQL test data set
Last synced: 02 Apr 2025
https://github.com/shreckye/jgrapht-memory-efficient-bipartite-graph
A memory-efficient matching algorithm (Kuhn–Munkres and Hopcroft–Karp) implementation based on JGraphT in Java
bipartite-graphs hopcroft-karp jgrapht kotlin kuhn-munkres large-dataset memory-efficient
Last synced: 04 Apr 2025
https://github.com/avijit-jana/classifying_cybersecurity_incidents
This project focuses on building a machine learning classification model to enhance the efficiency of Security Operation Centers (SOCs). Using the comprehensive GUIDE dataset, the model predicts the triage grade of cybersecurity incidents (True Positive, Benign Positive, or False Positive).
exploratory-data-analysis large-dataset machine-learning pandas python3 visualization
Last synced: 25 Feb 2025
https://github.com/rajkumargara/bike_rental_data_analysis
Chicago bike rental data analysis for business insights using R programming
data-analysis data-visualization data-wrangling large-dataset machine-learning-algorithms
Last synced: 03 Mar 2025
https://github.com/pngo1997/data-mining-sql
Data mining practice on Twitter tweets.
data-mining database large-dataset python sql
Last synced: 28 Feb 2025
https://github.com/mehrantsi/common-crawl-analyzer
Tools to extract and analyze domains and URLs from Common Crawl data files.
common-crawl large-dataset stemmer term-analysis term-frequency-inverse-document
Last synced: 16 May 2025
https://github.com/dimitrivavoulisportfolio/aws-serverless-nlp-sentiment-4m-product-reviews
This is a production-ready DistilBERT sentiment analysis model for product reviews, designed to work as a low-cost market research tool with the nuance of an actual market researcher.
aws distilbert distilbert-model large-dataset market-research nlp nlp-machine-learning product-reviews sentiment-analysis serverless
Last synced: 04 Apr 2025