Projects in Awesome Lists tagged with big-data
A curated list of projects in awesome lists tagged with big-data.
https://github.com/clickhouse/clickhouse
ClickHouse® is a real-time analytics database management system
ai analytics big-data clickhouse cloud-native cpp database dbms distributed embedded hacktoberfest lakehouse mpp olap rust self-hosted sql
Last synced: 04 May 2026
https://github.com/ClickHouse/ClickHouse
ClickHouse® is a real-time analytics DBMS
ai analytics big-data clickhouse cpp dbms distributed-database hacktoberfest mpp olap rust sql
Last synced: 14 Mar 2025
https://github.com/donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano
Last synced: 12 May 2025
https://github.com/amark/gun
An open source cybersecurity protocol for syncing decentralized graph data.
artificial-intelligence big-data blockchain crdt crypto cryptography dapp database decentralized dweb encryption end-to-end graph machine-learning metaverse offline-first p2p protocol realtime web3
Last synced: 12 May 2025
https://github.com/andkret/cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 25 Jan 2026
https://github.com/andkret/Cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 14 Mar 2025
https://github.com/apache/predictionio
PredictionIO, a machine learning server for developers and ML engineers.
Last synced: 05 Oct 2025
https://github.com/yahoo/cmak
CMAK is a tool for managing Apache Kafka clusters
big-data cluster-management kafka scala
Last synced: 13 May 2025
https://github.com/yahoo/CMAK
CMAK is a tool for managing Apache Kafka clusters
big-data cluster-management kafka scala
Last synced: 03 Apr 2025
https://github.com/vesoft-inc/nebula
A distributed, fast open-source graph database featuring horizontal scalability and high availability
big-data cpp database distributed distributed-systems graph graph-database graphdb hacktoberfest nebula nebula-graph nebulagraph raft scalability
Last synced: 13 May 2025
https://github.com/trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino
Last synced: 02 Apr 2026
https://github.com/provectus/kafka-ui
Open-Source Web UI for Apache Kafka Management
apache-kafka big-data cluster-management event-streaming hacktoberfest kafka kafka-brokers kafka-client kafka-cluster kafka-connect kafka-manager kafka-producer kafka-streams kafka-ui opensource streaming-data streams web-ui
Last synced: 12 May 2025
https://github.com/starrocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Last synced: 16 Feb 2026
https://github.com/cython/cython
The most widely used Python to C compiler
big-data c cpp cpython cpython-extensions cython performance python
Last synced: 04 Jan 2026
https://github.com/quickwit-oss/quickwit
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
big-data cloud-native cloud-storage distributed-tracing log-management logs open-source rust search-engine tantivy
Last synced: 29 Jan 2026
https://github.com/StarRocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized
Last synced: 14 Mar 2025
https://github.com/catboost/catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
big-data catboost categorical-features coreml cuda data-mining data-science decision-trees gbdt gbm gpu gpu-computing gradient-boosting kaggle machine-learning python r tutorial
Last synced: 12 May 2025
https://github.com/delta-io/delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
acid analytics big-data delta-lake spark
Last synced: 12 May 2025
https://github.com/apache/datafusion
Apache DataFusion SQL Query Engine
arrow big-data dataframe datafusion olap python query-engine rust sql
Last synced: 12 Dec 2025
https://github.com/h2oai/h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark
Last synced: 24 Dec 2025
https://github.com/paradedb/paradedb
ParadeDB is a modern Elasticsearch alternative built on Postgres. Built for real-time, update-heavy workloads.
aggregations analytics big-data bm25 database elasticsearch full-text-search htap hybrid-search mpp object-storage olap postgresql real-time-analytics similarity-search sparse-vector sql
Last synced: 23 Apr 2026
https://github.com/RisingWaveLabs/risingwave
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
analytics big-data cloud-native data-engineering database distributed-database etl flink kafka ksqldb materialized-view postgres postgresql real-time real-time-analytics rust serverless spark-streaming sql stream-processing
Last synced: 29 Mar 2025
https://github.com/vespa-engine/vespa
AI + Data, online. https://vespa.ai
ai big-data java machine-learning rag search search-engine server serving-recommendation tensor vector vector-database vector-search vespa
Last synced: 01 Apr 2026
https://github.com/feast-dev/feast
The Open Source Feature Store for AI/ML
big-data data-engineering data-quality data-science feature-store features machine-learning ml mlops python
Last synced: 04 May 2026
https://github.com/arkime/arkime
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
big-data c javascript network-monitoring nsm packet-capture pcap security
Last synced: 06 Apr 2026
https://github.com/apache/couchdb
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
big-data cloud content couchdb database erlang http javascript network-client network-server
Last synced: 12 May 2025
https://github.com/apache/zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
big-data database flink java javascript nosql scala spark zeppelin
Last synced: 12 May 2025
https://github.com/hazelcast/hazelcast
Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
big-data caching data-in-motion data-insights distributed distributed-computing distributed-systems hacktoberfest hazelcast in-memory java low-latency real-time scalability stream-processing
Last synced: 09 Sep 2025
https://github.com/pachyderm/pachyderm
Data-Centric Pipelines and Data Versioning
analytics big-data containers data-analysis data-science distributed-systems docker go kubernetes pachyderm
Last synced: 16 Dec 2025
https://github.com/apache/iotdb
Apache IoTDB
big-data database iot java nosql timeseries tsdb
Last synced: 11 Jan 2026
https://github.com/microsoft/synapseml
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 13 May 2025
https://microsoft.github.io/SynapseML/
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 29 Apr 2025
https://github.com/microsoft/SynapseML
Simple and Distributed Machine Learning
ai apache-spark azure big-data cognitive-services data-science databricks deep-learning http lightgbm machine-learning microsoft ml model-deployment onnx opencv pyspark scala spark synapse
Last synced: 14 Mar 2025
https://github.com/apache/ignite
Apache Ignite
big-data cache cloud data-management-platform database distributed-sql-database hadoop ignite in-memory-computing in-memory-database iot network-client network-server osgi sql
Last synced: 14 May 2025
https://github.com/apache/calcite
Apache Calcite
big-data calcite geospatial hadoop java sql
Last synced: 17 Dec 2025
https://github.com/tschellenbach/stream-framework
Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
activity-feed activity-stream big-data cassandra feed news news-feed redis
Last synced: 14 May 2025
https://github.com/tschellenbach/Stream-Framework
Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
activity-feed activity-stream big-data cassandra feed news news-feed redis
Last synced: 14 Mar 2025
https://github.com/tangbc/vue-virtual-scroll-list
⚡️ A Vue component that supports lists with large amounts of data, with high rendering performance and efficiency.
big-data infinite-scroll virtual-list
Last synced: 11 May 2025
https://github.com/crate/crate
CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
analytics big-data cratedb database dbms distributed distributed-database distributed-sql-database elasticsearch industrial-iot iot iot-analytics iot-database lucene olap postgresql sql time-series tsdb vector-database
Last synced: 16 Jan 2026
https://github.com/alibaba/fastjson2
🚄 FASTJSON2 is a Java JSON library with excellent performance.
android big-data deserialization fastjson fastjson2 graal graalvm-native-image high-performance java java-json json json-deserialization json-parser json-path json-serialization json-serializer jsonb serialization
Last synced: 12 May 2025
https://github.com/rom1504/img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
big-data dataset deep-learning download-images image image-dataset multimodal
Last synced: 13 May 2025
https://github.com/moataz-elmesmary/data-science-roadmap
Data Science Roadmap from A to Z
big-data chatgpt cheatsheet cv-template data-analysis data-engineering data-science data-visualization deep-learning interview-questions linear-algebra llms machine-learning mathematics neural-network nlp probability python sql statistics
Last synced: 14 May 2025
https://github.com/Moataz-Elmesmary/Data-Science-Roadmap
Data Science Roadmap from A to Z
big-data chatgpt cheatsheet cv-template data-analysis data-engineering data-science data-visualization deep-learning interview-questions linear-algebra llms machine-learning mathematics neural-network nlp probability python sql statistics
Last synced: 25 Mar 2025
https://github.com/alibaba/graphscope
🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba | One-stop graph computing system
analytics big-data data-science graph graph-analytics graph-computation graph-computing graph-data graph-neural-networks gremlin
Last synced: 11 May 2025
https://github.com/alibaba/GraphScope
🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba | One-stop graph computing system
analytics big-data data-science graph graph-analytics graph-computation graph-computing graph-data graph-neural-networks gremlin
Last synced: 08 Apr 2025
https://github.com/databricks/koalas
Koalas: pandas API on Apache Spark
big-data data-science dataframe mlflow pandas pydata spark
Last synced: 13 May 2025
https://github.com/sqlparser-rs/sqlparser-rs
Extensible SQL Lexer and Parser for Rust
Last synced: 08 Jul 2025
https://github.com/apache/datafusion-sqlparser-rs
Extensible SQL Lexer and Parser for Rust
Last synced: 13 May 2025
https://github.com/TuiQiao/CBoard
An easy to use, self-service open BI reporting and BI dashboard platform.
big-data business-intelligence cboard dashboard data-visualization metabase olap superset
Last synced: 26 Mar 2025
https://github.com/eventual-inc/daft
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
big-data data-engineering data-science dataframe distributed-computing machine-learning python rust
Last synced: 14 May 2026
https://github.com/apache/paimon
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
big-data data-ingestion flink paimon real-time-analytics spark streaming-datalake table-store
Last synced: 17 Mar 2026
https://github.com/apache/incubator-hugegraph
A graph database that supports 100+ billion data entries with high performance and scalability (includes OLTP engine, REST API, and backends)
big-data database graph graph-database graphdb gremlin
Last synced: 11 Jan 2026
https://github.com/lakesoul-io/lakesoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox
Last synced: 14 May 2025
https://github.com/Eventual-Inc/Daft
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
big-data data-engineering data-science dataframe distributed-computing machine-learning python rust
Last synced: 09 Apr 2025
https://github.com/featurebasedb/featurebase
A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
big-data bitmap database go index pilosa sql
Last synced: 15 Dec 2025
https://github.com/FeatureBaseDB/featurebase
A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
big-data bitmap database go index pilosa sql
Last synced: 28 Mar 2025
https://github.com/pilosa/pilosa
A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
big-data bitmap database go index pilosa sql
Last synced: 11 Jul 2025
https://github.com/jostmey/NakedTensor
Bare-bones examples of machine learning in TensorFlow
big-data distributed-computing linear-regression simple tensorflow tensorflow-examples tensorflow-exercises tensorflow-tutorials
Last synced: 13 May 2025
https://github.com/jostmey/nakedtensor
Bare-bones examples of machine learning in TensorFlow
big-data distributed-computing linear-regression simple tensorflow tensorflow-examples tensorflow-exercises tensorflow-tutorials
Last synced: 15 May 2025
https://github.com/lakesoul-io/LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox
Last synced: 27 Mar 2025
https://github.com/man-group/arcticdb
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading
Last synced: 04 May 2026
https://github.com/apache/ambari
Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.
ambari big-data java javascript python
Last synced: 14 May 2025
https://github.com/quarylabs/quary
Open-source BI for engineers
analytics big-data business-intelligence data-modeling elt
Last synced: 24 Jan 2026
https://github.com/ytsaurus/ytsaurus
YTsaurus is a scalable and fault-tolerant open-source big data platform.
big-data clickhouse distributed-database lakehouse olap-database spark sql ytsaurus
Last synced: 02 Apr 2026
https://github.com/qihoo360/poseidon
A search engine which can hold 100 trillion lines of log data.
big-data golang map-reduce poseidon search-engine
Last synced: 08 Apr 2025
https://github.com/Qihoo360/poseidon
A search engine which can hold 100 trillion lines of log data.
big-data golang map-reduce poseidon search-engine
Last synced: 11 Apr 2025
https://github.com/apache/bookkeeper
Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
apache big-data bookkeeper distributed-log distributed-systems wal
Last synced: 13 May 2025
https://github.com/gchq/gaffer
A large-scale entity and relation database supporting aggregation of properties
accumulo aggregation big-data graph graph-database hadoop hbase parquet spark
Last synced: 12 May 2025
https://github.com/gchq/Gaffer
A large-scale entity and relation database supporting aggregation of properties
accumulo aggregation big-data graph graph-database hadoop hbase parquet spark
Last synced: 04 May 2025
https://netflix.github.io/genie/
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 16 Nov 2025
https://github.com/fluid-cloudnative/fluid
Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
ai-framework alluxio big-data data-abstraction distributed-cache kubernetes
Last synced: 13 May 2025
https://github.com/apache/datafusion-ballista
Apache DataFusion Ballista Distributed Query Engine
arrow big-data dataframe distributed olap python query-engine rust sql
Last synced: 12 Dec 2025
https://github.com/netflix/genie
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 13 May 2025
https://github.com/Netflix/genie
Distributed Big Data Orchestration Service
big-data bigdata cloud configuration configuration-management distributed-systems java microservice microservices netflix-oss netflixoss orchestration spring-boot
Last synced: 04 Apr 2025
https://github.com/bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data records every day.
big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time
Last synced: 15 May 2025
https://github.com/jadianes/spark-py-notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark
Last synced: 15 May 2025
https://github.com/kantord/just-dashboard
📊 📋 Dashboards using YAML or JSON files
big-data business-intelligence chart csv d3 d3js dashboard data data-driven data-engineering data-science data-visualization gist github-gist json just-dashboard yaml
Last synced: 15 May 2025
https://github.com/matanolabs/matano
Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
alerting apache-iceberg aws aws-security big-data cloud cloud-native cloud-security cybersecurity detection-engineering dfir log-analytics log-management rust secops security security-tools serverless siem threat-hunting
Last synced: 14 May 2025
https://github.com/apache/auron
The Auron accelerator for distributed computing frameworks (e.g., Spark) leverages native vectorized execution to accelerate query processing
big-data datafusion rust-lang spark
Last synced: 28 Aug 2025
https://github.com/man-group/ArcticDB
ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading
Last synced: 12 Mar 2025
https://github.com/kwai/blaze
Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
big-data datafusion rust-lang spark
Last synced: 14 May 2025
https://github.com/yahoo/mysql_perf_analyzer
MySQL performance monitoring and analysis.
big-data java mysql performance-analysis
Last synced: 16 May 2025
https://github.com/apache/carbondata
High performance data store solution
apache big-data carbondata data-format hadoop java scala spark
Last synced: 13 May 2025
https://github.com/dremio/dremio-oss
Dremio - the missing link in modern data
analytics big-data data-analytics ui
Last synced: 14 May 2025
https://github.com/mtth/avsc
Avro for JavaScript ⚡
avro big-data binary-format encoding javascript schema-evolution serialization typescript
Last synced: 12 May 2025
https://github.com/lakehq/sail
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.
arrow artificial-intelligence big-data data data-engineering datafusion distributed-computing machine-learning pyspark python rust spark sql
Last synced: 14 Apr 2026
https://github.com/uxlfoundation/scikit-learn-intelex
Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
ai-inference ai-machine-learning ai-training analytics big-data data-analysis gpu machine-learning machine-learning-algorithms oneapi python scikit-learn swrepo
Last synced: 14 Feb 2026
https://github.com/mahmoudparsian/pyspark-tutorial
PySpark-Tutorial provides basic algorithms using PySpark
big-data big-data-analytics data-algorithms pyspark spark spark-dataframes spark-rdd
Last synced: 14 May 2025
https://github.com/apachecn/spark-doc-zh
Chinese translation of the official Apache Spark documentation
big-data documentation java spark
Last synced: 07 Apr 2025
https://github.com/yahoo/egads
A Java package to automatically detect anomalies in large scale time-series data
anomaly-detection-models big-data java time-series
Last synced: 15 May 2025
https://github.com/DeepWisdom/AutoDL
Automated Deep Learning without ANY human intervention. 1st-place solution for the AutoDL challenge @ NeurIPS.
ai artificial-intelligence autodl autodl-challenge automated-machine-learning automl big-data data-science deeplearning feature-engineering full-automl lightgbm machine-learning model-selection multi-label nas python pytorch resnet tensorflow
Last synced: 12 May 2025