awesome-bigdata

A curated list of awesome big data frameworks, resources and other awesomeness
https://github.com/Anyz01/awesome-bigdata

Machine Learning
- WEKA - suite of machine learning software.
- Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
- Cloudera Oryx - real-time large-scale machine learning.
- DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
- Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
- etcML - text classification with machine learning.
- GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MonkeyLearn - Text mining made easy. Extract and classify data from text.
- ND4J - A matrix library for the JVM. Numpy for Java.
- RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem.
- Sibyl - System for Large Scale Machine Learning at Google.
- Theano - A Python-focused machine learning library supported by the University of Montreal.
- Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
- Mahout - An Apache-backed machine learning library for Hadoop.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- SAMOA - distributed streaming machine learning framework.
RDBMS
- MySQL
- PostgreSQL
- Teradata - high-performance MPP data warehouse platform.
- Oracle Database - object-relational database management system.
Distributed Programming
- Apache APEX - a unified, enterprise platform for big data stream and batch processing.
- Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
- Apache Gearpump - real-time big data streaming engine based on Akka.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- Google MapReduce - map reduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Nokia Disco - MapReduce framework developed by Nokia.
- Pinterest Pinlater - asynchronous job execution system.
- Rackerlabs Blueflood - multi-tenant distributed metric processing system
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
- Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
- Facebook Peregrine - Map Reduce framework.
- Google Dataflow - create data pipelines that ingest, transform and analyze data.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
Distributed Filesystem
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Google Colossus - distributed filesystem (GFS2).
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud
- Quantcast File System QFS - open-source distributed file system.
- Tahoe-LAFS - decentralized cloud storage system.
SQL-like processing
- Datasalt Splout SQL - full SQL query engine for big datasets.
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- Apache HCatalog - table and storage management layer for Hadoop.
- Aster Database - SQL-like analytic processing for MapReduce.
- Facebook PrestoDB - distributed SQL query engine.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Spark Catalyst - a query optimization framework for Spark and Shark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
- Tajo - distributed data warehouse system on Hadoop.
- Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
Document Data Model
- MongoDB - Document-oriented database system.
- RavenDB - A transactional, open-source Document Database.
- RethinkDB - document database that supports queries like table joins and group by.
Key Map Data Model
- Distinguishing two major types of Column Stores
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - a fully managed, schemaless database for storing non-relational data over BigTable.
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
Key-value Data Model
- Ignite - an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
- LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
- LinkedIn Voldemort - distributed key/value storage system.
- Redis - in-memory key-value datastore.
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
Graph Data Model
- MapGraph - Massively Parallel Graph processing on GPUs.
- Neo4j - graph database written entirely in Java.
- Titan - distributed graph database, built over Cassandra.
- AgensGraph - a new generation multi-model graph database for the modern complex data environment.
- NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
Columnar Databases
- C-Store - column oriented DBMS.
- MonetDB - column store database.
- Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
- EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
NewSQL Databases
- BayesDB - statistic oriented SQL database.
- CitusDB - scales out PostgreSQL through sharding and replication.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- InfiniSQL - infinitely scalable RDBMS.
- Map-D - GPU in-memory database, big data analysis and visualization platform.
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
- SAP HANA - an in-memory, column-oriented, relational database management system.
- Sky - database used for flexible, high performance analysis of behavioral data.
- FoundationDB - distributed database, inspired by F1.
Time-Series Databases
- Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
- InfluxDB - distributed time series database.
- Prometheus - a time series database and service monitoring system.
- Rhombus - time-series object store for Cassandra that handles all the complexity of building wide row indexes.
Data Ingestion
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Apache NiFi - an integrated data logistics platform for automating the movement of data between disparate systems.
- Cloudera Morphlines - framework that helps ETL to Solr, HBase and HDFS.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Kestrel - distributed message queue system.
- StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
- Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
Service Programming
- Google Chubby - a lock service for loosely-coupled distributed systems.
- OpenMPI - message passing framework.
- Apache Curator - Java libraries for Apache ZooKeeper.
Scheduling
- LinkedIn Azkaban - batch workflow job scheduler.
Benchmarking
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from Yahoo's engineering team.
- Deeplearning4j Benchmarks
- Intel HiBench - a Hadoop benchmark suite.
System Deployment
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - similar to Apache Bigtop, based on the Groovy language.
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
- Apache Ambari - operational framework for Hadoop management.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Slider - a YARN application to deploy existing distributed applications on YARN.
- Apache Whirr - set of libraries for running cloud services.
- Facebook Prism - multi-datacenter replication system.
- Hortonworks HOYA - application that can deploy HBase cluster on YARN.
- Apache YARN - Cluster manager.
Applications
- Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
- Apache Tika - content analysis toolkit.
- HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
- Hunk - Splunk analytics for Hadoop.
- Imhotep - Large scale analytics platform by Indeed.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- Sense - Cloud Platform for Data Science and Big Data Analytics.
- Splunk - analyzer for machine-generated data.
- Sumo Logic - cloud based analyzer for machine-generated data.
- Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
- MADlib - data-processing library of an RDBMS to analyze data.
- Kylin - open source Distributed Analytics Engine from eBay.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
Search engine and framework
- Enigma.io
- Facebook Unicorn - social graph search platform.
- Google Percolator - continuous indexing system.
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
- LinkedIn Galene - search architecture at LinkedIn.
- Sphinx Search Server - fulltext search engine.
- Apache Lucene - Search engine library.
- HBase Coprocessor - implementation of Percolator, part of HBase.
- Vespa - an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
- ElasticSearch - Search and analytics engine based on Apache Lucene.
MySQL forks and evolutions
- Amazon RDS - MySQL databases in Amazon's cloud.
- Drizzle - evolution of MySQL 6.0.
- MariaDB - enhanced, drop-in replacement for MySQL.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
- TokuDB - a storage engine for MySQL and MariaDB.
- WebScaleSQL - a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
PostgreSQL forks and evolutions
- RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
- Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
- Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
Business Intelligence
- BIME Analytics - business intelligence platform in the cloud.
- GoodData - platform for data products and embedded analytics.
- Jaspersoft - powerful business intelligence suite.
- Jedox Palo - customisable Business Intelligence platform.
- Jethrodata - Interactive Big Data Analytics.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
- SparklineData SNAP - modern BI platform powered by Apache Spark.
- Tableau - business intelligence platform.
- Chartio - lean business intelligence platform to visualize and explore your data.
- SpagoBI - open source business intelligence platform.
Data Visualization
- chartd - responsive, retina-compatible charts with just an img tag.
- D3 - JavaScript library for manipulating documents.
- FnordMetric - write SQL queries that return SVG charts rather than tables
- Grafana - Graphite dashboard frontend, editor and graph composer.
- Graphite - scalable Realtime Graphing.
- Highcharts - simple and flexible charting API.
- MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data
- Zing Charts - JavaScript charting library for big data.
- ReCharts - A composable charting library built on React components
- Kibana - visualize logs and time-stamped data
- Lumify - open source big data analysis and visualization platform
Internet of things and sensor data
- Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
- TempoIQ - Cloud-based sensor analytics.
- Pubnub - Data stream network
- IFTTT - If this then that
- Evrything - Making products smart
- NetLytics - Analytics platform to process network data on Spark.
Interesting Readings
- Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
- Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
- Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
Interesting Papers
- 2015 - 2016
  - 2015 - **Facebook** - One Trillion Edges: Graph Processing at Facebook-Scale.
- 2013 - 2014
  - 2014 - **Stanford** - Mining of Massive Datasets.
  - 2013 - **AMPLab** - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
  - 2013 - **AMPLab** - MLbase: A Distributed Machine-learning System.
  - 2013 - **AMPLab** - Shark: SQL and Rich Analytics at Scale.
  - 2013 - **AMPLab** - GraphX: A Resilient Distributed Graph System on Spark.
  - 2013 - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  - 2013 - **Metamarkets** - Druid: A Real-time Analytical Data Store.
  - 2013 - **Google** - F1: A Distributed SQL Database That Scales.
  - 2013 - **Facebook** - Scaling Memcache at Facebook.
  - 2013 - **Facebook** - Scuba: Diving into Data at Facebook.
  - 2013 - **Facebook** - Unicorn: A System for Searching the Social Graph.
  - 2013 - **Google** - Online, Asynchronous Schema Change in F1.
  - 2013 - **Google** - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
- 2011 - 2012
  - 2012 - **Twitter** - The Unified Logging Infrastructure
  - 2012 - **AMPLab** - Blink and It’s Done: Interactive Queries on Very Large Data.
  - 2012 - **AMPLab** - Fast and Interactive Analytics over Hadoop Data with Spark.
  - 2012 - **AMPLab** - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
  - 2012 - **Microsoft** - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
  - 2012 - **AMPLab** - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
  - 2012 - **Google** - Processing a trillion cells per mouse click.
  - 2012 - **Google** - Spanner: Google’s Globally-Distributed Database.
  - 2011 - **AMPLab** - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
  - 2011 - **AMPLab** - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
  - 2011 - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
- 2001 - 2010
  - 2010 - **Facebook** - Finding a needle in Haystack: Facebook’s photo storage.
  - 2010 - **AMPLab** - Spark: Cluster Computing with Working Sets.
  - 2010 - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets.
  - 2010 - **Yahoo** - S4: Distributed Stream Computing Platform.
  - 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  - 2008 - **AMPLab** - Chukwa: A large-scale monitoring system.
  - 2006 - **Google** - The Chubby lock service for loosely-coupled distributed systems.
  - 2004 - **Google** - MapReduce: Simplified Data Processing on Large Clusters.
  - 2003 - **Google** - The Google File System.
  - 2010 - **Google** - Pregel: A System for Large-Scale Graph Processing.
  - 2007 - **Amazon** - Dynamo: Amazon’s Highly Available Key-value Store.
  - 2006 - **Google** - Bigtable: A Distributed Storage System for Structured Data.
  - 2010 - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
Videos
- 2001 - 2010
  - Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
Books
- 2001 - 2010
  - Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
  - Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
  - Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
  - Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.
  - Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
  - Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
  - Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
  - Spark in Action - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
  - Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
  - Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook!
  - Distributed Systems for fun and profit
- Data Visualization
Embedded Databases
- Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
Security
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
- Apache Sentry - security module for data stored in Hadoop.
- BDA - The vulnerability detector for Hadoop and Spark
- Apache Ranger - Central security admin & fine-grained authorization for Hadoop
- Apache Eagle - real time monitoring solution
Memcached forks and evolutions
- Facebook McDipper - key/value cache for flash storage.
Frameworks
- IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
