An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with big-data

A curated list of projects in awesome lists tagged with big-data.

https://github.com/apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

big-data java jdbc python r scala spark sql

Last synced: 09 Sep 2025

https://github.com/donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

aws big-data caffe data-science deep-learning hadoop kaggle keras machine-learning mapreduce matplotlib numpy pandas python scikit-learn scipy spark tensorflow theano

Last synced: 12 May 2025

https://github.com/apache/flink

Apache Flink

big-data flink java python scala sql

Last synced: 08 Apr 2026

https://github.com/prestodb/presto

The official home of the Presto distributed SQL query engine for big data

big-data data hadoop hive java lakehouse presto query sql

Last synced: 14 Apr 2026

https://github.com/apache/predictionio

PredictionIO, a machine learning server for developers and ML engineers.

big-data predictionio scala

Last synced: 05 Oct 2025

https://github.com/yahoo/cmak

CMAK is a tool for managing Apache Kafka clusters

big-data cluster-management kafka scala

Last synced: 13 May 2025

https://github.com/yahoo/CMAK

CMAK is a tool for managing Apache Kafka clusters

big-data cluster-management kafka scala

Last synced: 03 Apr 2025

https://github.com/vesoft-inc/nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

big-data cpp database distributed distributed-systems graph graph-database graphdb hacktoberfest nebula nebula-graph nebulagraph raft scalability

Last synced: 13 May 2025

https://github.com/trinodb/trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

analytics big-data data-science database databases datalake delta-lake distributed-database distributed-systems hadoop hive iceberg java jdbc presto prestodb query-engine sql trino

Last synced: 02 Apr 2026

https://github.com/starrocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

Last synced: 16 Feb 2026

https://github.com/cython/cython

The most widely used Python to C compiler

big-data c cpp cpython cpython-extensions cython performance python

Last synced: 04 Jan 2026

https://github.com/quickwit-oss/quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

big-data cloud-native cloud-storage distributed-tracing log-management logs open-source rust search-engine tantivy

Last synced: 29 Jan 2026

https://github.com/StarRocks/starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

analytics big-data cloudnative database datalake delta-lake distributed-database hudi iceberg join lakehouse lakehouse-platform mpp olap real-time-analytics real-time-updates realtime-database sql star-schema vectorized

Last synced: 14 Mar 2025

https://github.com/catboost/catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

big-data catboost categorical-features coreml cuda data-mining data-science decision-trees gbdt gbm gpu gpu-computing gradient-boosting kaggle machine-learning python r tutorial

Last synced: 12 May 2025

https://github.com/apache/beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

batch beam big-data golang java python sql streaming

Last synced: 12 May 2025

https://github.com/delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

acid analytics big-data delta-lake spark

Last synced: 12 May 2025

https://github.com/apache/datafusion

Apache DataFusion SQL Query Engine

arrow big-data dataframe datafusion olap python query-engine rust sql

Last synced: 12 Dec 2025

https://github.com/h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark

Last synced: 24 Dec 2025

https://github.com/paradedb/paradedb

ParadeDB is a modern Elasticsearch alternative built on Postgres. Built for real-time, update-heavy workloads.

aggregations analytics big-data bm25 database elasticsearch full-text-search htap hybrid-search mpp object-storage olap postgresql real-time-analytics similarity-search sparse-vector sql

Last synced: 23 Apr 2026

https://github.com/RisingWaveLabs/risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

analytics big-data cloud-native data-engineering database distributed-database etl flink kafka ksqldb materialized-view postgres postgresql real-time real-time-analytics rust serverless spark-streaming sql stream-processing

Last synced: 29 Mar 2025

https://github.com/arkime/arkime

Arkime is an open source, large scale, full packet capturing, indexing, and database system.

big-data c javascript network-monitoring nsm packet-capture pcap security

Last synced: 06 Apr 2026

https://github.com/apache/couchdb

Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability

big-data cloud content couchdb database erlang http javascript network-client network-server

Last synced: 12 May 2025

https://github.com/apache/zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

big-data database flink java javascript nosql scala spark zeppelin

Last synced: 12 May 2025

https://github.com/hazelcast/hazelcast

Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

big-data caching data-in-motion data-insights distributed distributed-computing distributed-systems hacktoberfest hazelcast in-memory java low-latency real-time scalability stream-processing

Last synced: 09 Sep 2025

https://github.com/tschellenbach/stream-framework

Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.

activity-feed activity-stream big-data cassandra feed news news-feed redis

Last synced: 14 May 2025

https://github.com/tschellenbach/Stream-Framework

Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.

activity-feed activity-stream big-data cassandra feed news news-feed redis

Last synced: 14 Mar 2025

https://github.com/tangbc/vue-virtual-scroll-list

⚡️ A Vue component that supports lists with large amounts of data, with high and efficient rendering performance.

big-data infinite-scroll virtual-list

Last synced: 11 May 2025

https://github.com/crate/crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.

analytics big-data cratedb database dbms distributed distributed-database distributed-sql-database elasticsearch industrial-iot iot iot-analytics iot-database lucene olap postgresql sql time-series tsdb vector-database

Last synced: 16 Jan 2026

https://github.com/rom1504/img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.

big-data dataset deep-learning download-images image image-dataset multimodal

Last synced: 13 May 2025

https://github.com/alibaba/graphscope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba

analytics big-data data-science graph graph-analytics graph-computation graph-computing graph-data graph-neural-networks gremlin

Last synced: 11 May 2025

https://github.com/alibaba/GraphScope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba

analytics big-data data-science graph graph-analytics graph-computation graph-computing graph-data graph-neural-networks gremlin

Last synced: 08 Apr 2025

https://github.com/databricks/koalas

Koalas: pandas API on Apache Spark

big-data data-science dataframe mlflow pandas pydata spark

Last synced: 13 May 2025

https://github.com/sqlparser-rs/sqlparser-rs

Extensible SQL Lexer and Parser for Rust

big-data rust sql

Last synced: 08 Jul 2025

https://github.com/apache/datafusion-sqlparser-rs

Extensible SQL Lexer and Parser for Rust

big-data rust sql

Last synced: 13 May 2025

https://github.com/TuiQiao/CBoard

An easy to use, self-service open BI reporting and BI dashboard platform.

big-data business-intelligence cboard dashboard data-visualization metabase olap superset

Last synced: 26 Mar 2025

https://github.com/eventual-inc/daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

big-data data-engineering data-science dataframe distributed-computing machine-learning python rust

Last synced: 14 May 2026

https://github.com/apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

big-data data-ingestion flink paimon real-time-analytics spark streaming-datalake table-store

Last synced: 17 Mar 2026

https://github.com/apache/incubator-hugegraph

A graph database that supports 100+ billion data entries with high performance and scalability (includes OLTP engine, REST API, and backends)

big-data database graph graph-database graphdb gremlin

Last synced: 11 Jan 2026

https://github.com/lakesoul-io/lakesoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 14 May 2025

https://github.com/Eventual-Inc/Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust

big-data data-engineering data-science dataframe distributed-computing machine-learning python rust

Last synced: 09 Apr 2025

https://github.com/featurebasedb/featurebase

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase

big-data bitmap database go index pilosa sql

Last synced: 15 Dec 2025

https://github.com/FeatureBaseDB/featurebase

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase

big-data bitmap database go index pilosa sql

Last synced: 28 Mar 2025

https://github.com/pilosa/pilosa

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase

big-data bitmap database go index pilosa sql

Last synced: 11 Jul 2025

https://github.com/lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

arrow big-data datafusion datalake flink huggingface lakehouse lakesoul postgresql python pytorch rust spark sql streaming vectorized velox

Last synced: 27 Mar 2025

https://github.com/man-group/arcticdb

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading

Last synced: 04 May 2026

https://github.com/apache/ambari

Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.

ambari big-data java javascript python

Last synced: 14 May 2025

https://github.com/quarylabs/quary

Open-source BI for engineers

analytics big-data business-intelligence data-modeling elt

Last synced: 24 Jan 2026

https://github.com/ytsaurus/ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

big-data clickhouse distributed-database lakehouse olap-database spark sql ytsaurus

Last synced: 02 Apr 2026

https://github.com/qihoo360/poseidon

A search engine which can hold 100 trillion lines of log data.

big-data golang map-reduce poseidon search-engine

Last synced: 08 Apr 2025

https://github.com/apache/drill

Apache Drill is a distributed MPP query layer for self-describing data

big-data drill hadoop hive java jdbc parquet sql

Last synced: 13 May 2025

https://github.com/Qihoo360/poseidon

A search engine which can hold 100 trillion lines of log data.

big-data golang map-reduce poseidon search-engine

Last synced: 11 Apr 2025

https://github.com/apache/bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads

apache big-data bookkeeper distributed-log distributed-systems wal

Last synced: 13 May 2025

https://github.com/apache/kudu

Mirror of Apache Kudu

big-data cplusplus kudu

Last synced: 14 May 2025

https://github.com/gchq/gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 12 May 2025

https://github.com/gchq/Gaffer

A large-scale entity and relation database supporting aggregation of properties

accumulo aggregation big-data graph graph-database hadoop hbase parquet spark

Last synced: 04 May 2025

https://github.com/fluid-cloudnative/fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

ai-framework alluxio big-data data-abstraction distributed-cache kubernetes

Last synced: 13 May 2025

https://github.com/apache/datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

arrow big-data dataframe distributed olap python query-engine rust sql

Last synced: 12 Dec 2025

https://github.com/bytedance/bitsail

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time

Last synced: 15 May 2025

https://github.com/jadianes/spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

big-data bigdata data-analysis data-science ipython ipython-notebook machine-learning mllib notebook pyspark python spark

Last synced: 15 May 2025

https://github.com/matanolabs/matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS

alerting apache-iceberg aws aws-security big-data cloud cloud-native cloud-security cybersecurity detection-engineering dfir log-analytics log-management rust secops security security-tools serverless siem threat-hunting

Last synced: 14 May 2025

https://github.com/apache/auron

The Auron accelerator for distributed computing frameworks (e.g., Spark) leverages native vectorized execution to accelerate query processing

big-data datafusion rust-lang spark

Last synced: 28 Aug 2025

https://github.com/man-group/ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

big-data data data-analysis data-science database dataframe pandas quantitative-analysis quantitative-finance quantitative-trading

Last synced: 12 Mar 2025

https://github.com/kwai/blaze

A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

big-data datafusion rust-lang spark

Last synced: 14 May 2025

https://github.com/yahoo/mysql_perf_analyzer

MySQL performance monitoring and analysis.

big-data java mysql performance-analysis

Last synced: 16 May 2025

https://github.com/apache/carbondata

High performance data store solution

apache big-data carbondata data-format hadoop java scala spark

Last synced: 13 May 2025

https://github.com/dremio/dremio-oss

Dremio - the missing link in modern data

analytics big-data data-analytics ui

Last synced: 14 May 2025

https://github.com/lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive AI workloads.

arrow artificial-intelligence big-data data data-engineering datafusion distributed-computing machine-learning pyspark python rust spark sql

Last synced: 14 Apr 2026

https://github.com/mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark

big-data big-data-analytics data-algorithms pyspark spark spark-dataframes spark-rdd

Last synced: 14 May 2025

https://github.com/apachecn/spark-doc-zh

Chinese edition of the official Apache Spark documentation

big-data documentation java spark

Last synced: 07 Apr 2025

https://github.com/yahoo/egads

A Java package to automatically detect anomalies in large scale time-series data

anomaly-detection-models big-data java time-series

Last synced: 15 May 2025