awesome-bigdata
A curated list of awesome big data frameworks, resources and other awesomeness.
https://github.com/oxnr/awesome-bigdata
-
Machine Learning
- Concurrent Pattern - machine learning library for Cascading.
- convnetjs - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
- Decider - Flexible and Extensible Machine Learning in Ruby.
- Etsy Conjecture - scalable Machine Learning in Scalding.
- Karate Club - An unsupervised machine learning library for graph-structured data (Python).
- Lambdo - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
- Little Ball of Fur - A subsampling library for graph-structured data (Python).
- MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- ML Workspace - All-in-one web-based IDE specialized for machine learning and data science.
- PyTorch Geometric Temporal - a temporal extension library for PyTorch Geometric.
- scikit-learn - scikit-learn: machine learning in Python.
- Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
- TensorFlow - Library from Google for machine learning using data flow graphs.
- Velox - System for serving machine learning predictions.
- BIDMach - CPU and GPU-accelerated Machine Learning Library.
- H2O - statistical, machine learning and math runtime with Hadoop. R and Python.
- WEKA - suite of machine learning software.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- Keras - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
- nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- Sibyl - System for Large Scale Machine Learning at Google.
- MonkeyLearn - Text mining made easy. Extract and classify data from text.
- Aim - open-source AI metadata tracker for experiments and training runs.
- isolation-forest - distributed Spark and Scala implementation of isolation forest for unsupervised outlier detection.
- Neptune - experiment tracking and model registry for research and production machine learning teams.
-
Memcached forks and evolutions
- Twemproxy - A fast, light-weight proxy for memcached and redis.
- Twitter Fatcache - key/value cache for flash storage.
- Twitter Twemcache - fork of Memcached.
-
MySQL forks and evolutions
- Amazon RDS - MySQL databases in Amazon's cloud.
- Drizzle - evolution of MySQL 6.0.
- MariaDB - enhanced, drop-in replacement for MySQL.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
- TokuDB - TokuDB is a storage engine for MySQL and MariaDB.
- WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
- Percona Server - enhanced, drop-in replacement for MySQL.
- ProxySQL - High Performance Proxy for MySQL.
-
NewSQL Databases
- CitusDB - scales out PostgreSQL through sharding and replication.
- FoundationDB - distributed database, inspired by F1.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- InfiniSQL - infinitely scalable RDBMS.
- Map-D - GPU in-memory database, big data analysis and visualization platform.
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
- SAP HANA - is an in-memory, column-oriented, relational database management system.
- Sky - database used for flexible, high performance analysis of behavioral data.
- yugabyteDB - open source, high-performance, distributed SQL database compatible with PostgreSQL.
- ActorDB - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
- BayesDB - statistics-oriented SQL database.
- Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
- Comdb2 - a clustered RDBMS built on optimistic concurrency control techniques.
- Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- KarelDB - a relational database backed by Apache Kafka.
- SenseiDB - distributed, realtime, semi-structured database.
- TiDB - TiDB is a distributed SQL database. Inspired by the design of Google F1.
- SymmetricDS - open source software for both file and database synchronization.
- Actian Ingres - commercially supported, open-source SQL relational database management system.
- NuoDB - SQL/ACID compliant distributed database.
- H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
- Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
-
PostgreSQL forks and evolutions
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
- RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
- Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
- Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
- HadoopDB - hybrid of MapReduce and DBMS.
- IBM Netezza - high-performance data warehouse appliances.
- TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries
-
RDBMS
- MySQL
- PostgreSQL
- Teradata - high-performance MPP data warehouse platform.
- Oracle Database - object-relational database management system.
-
Scheduling
- LinkedIn Azkaban - batch workflow job scheduler.
- Apache Airflow - a platform to programmatically author, schedule and monitor workflows.
- Cronicle - Distributed, easy to install, NodeJS based, task scheduler
- Dagster - a data orchestrator for machine learning, analytics, and ETL.
- Schedoscope - Scala DSL for agile scheduling of Hadoop jobs.
- Sparrow - scheduling platform.
- Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
-
Search engine and framework
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Enigma.io
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
- LinkedIn Galene - search architecture at LinkedIn.
- Sphinx Search Server - fulltext search engine.
- Elassandra - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
- LinkedIn Bobo - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Zoie - is a realtime search/indexing system written in Java.
- MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
- Facebook Faiss - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
- Annoy - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
- Weaviate - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
- LinkedIn Cleo - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
-
Security
-
Service Programming
- Google Chubby - a lock service for loosely-coupled distributed systems.
- OpenMPI - message passing framework.
- Hydrosphere Mist - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
- Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
- Twitter Elephant Bird - libraries for working with LZOP-compressed data.
- Twitter Finagle - asynchronous network stack for the JVM.
- Serf - decentralized solution for service discovery and orchestration.
- Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Mara - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
-
SQL-like processing
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- Apache HCatalog - table and storage management layer for Hadoop.
- Aster Database - SQL-like analytic processing for MapReduce.
- Cloudera Impala - framework for interactive analysis, inspired by Dremel.
- Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
- Facebook PrestoDB - distributed SQL query engine.
- Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
- Materialize - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
- Concurrent Lingual - SQL-like query language for Cascading.
- Iceberg - an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.
- Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
- Pivotal HDB - SQL-like data warehouse system for Hadoop.
- PipelineDB - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
- Apache Phoenix - SQL skin over HBase.
- Apache Doris - real-time analytical database for high-concurrency SQL analytics, search, and warehousing.
- DuckDB - in-process analytical SQL database for local analytics over files, data lakes, and data frames.
- rawquery - managed lakehouse query service using DuckDB over Apache Iceberg tables on object storage.
- RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
- StarRocks - high-performance MPP SQL engine for real-time analytics and lakehouse queries.
- Trino - distributed SQL query engine for querying large datasets across heterogeneous data sources.
-
System Deployment
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - Similar to Apache BigTop based on Groovy language.
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
- Marathon - Mesos framework for long-running services.
- Apache Slider - is a YARN application to deploy existing distributed applications on YARN.
- Linkis - Linkis helps easily connect to various back-end computation/storage engines.
- Terraform - infrastructure as code tool for provisioning and managing cloud and on-premises infrastructure.
-
Time-Series Databases
- Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
- InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
- QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
- M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
- Prometheus - a time series database and service monitoring system.
- Rhombus - time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- Chronix - a time series storage built to store time series highly compressed and for fast access times.
- Cube - uses MongoDB to store time series data.
- Heroic - is a scalable time series database based on Cassandra and Elasticsearch.
- KairosDB - similar to OpenTSDB but allows for Cassandra.
- Newts - a time series database based on Apache Cassandra.
- TDengine - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data
- Beringei - Facebook's in-memory time-series database.
- Akumuli - time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Dalmatiner DB
- Blueflood
- Timely
- VictoriaMetrics - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
- Druid
- IronDB - scalable, general-purpose time series database.
- Thanos - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
-
Vector Databases
- Chroma - open-source embedding database for AI applications.
- Infinity - AI-native database for hybrid vector, sparse vector, tensor, full-text, and structured search.
- LanceDB - open-source embedded vector database built on the Lance columnar format.
- Milvus - open-source vector database for scalable similarity search.
- Qdrant - vector database and similarity search engine with REST, gRPC, and client SDKs.
- Weaviate - open-source vector database for semantic search with structured filtering.
-
Videos
-
- Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
- Machine Learning, Data Science and Deep Learning with Python - LiveVideo tutorial that covers machine learning, TensorFlow, artificial intelligence, and neural networks.
- Data warehouse schema design - dimensional modeling and star schema - Introduction to schema design for data warehouse using the star schema method.
- Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
-