data-engineering-collection

A collection of awesome software, libraries, learning tutorials, documents, books, and other resources about big data science and engineering.
https://github.com/exajobs/data-engineering-collection

Distributed Filesystem
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud
- Quantcast File System QFS - open-source distributed file system.
- Tahoe-LAFS - decentralized cloud storage system.
- Ambry - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
- Seaweed-FS - simple and highly scalable distributed file system.
- Baidu File System - distributed filesystem.
- Disco DDFS - distributed filesystem.
- Alluxio - reliable file sharing at memory speed across cluster frameworks.
- Lustre file system - high-performance distributed filesystem.
- Red Hat GlusterFS - scale-out network-attached storage file system.
Distributed Index
- Pilosa
Distributed Programming
- Apache APEX - a unified, enterprise platform for big data stream and batch processing.
- Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
- Apache Gearpump - real-time big data streaming engine based on Akka.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time, in-memory big data computations with as little blocking as possible and minimal overhead and performance impact.
- Google MapReduce - map reduce framework.
- Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Nokia Disco - MapReduce framework developed by Nokia.
- Pinterest Pinlater - asynchronous job execution system.
- Rackerlabs Blueflood - multi-tenant distributed metric processing system
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
- Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
- Google Dataflow - create data pipelines to ingest, transform, and analyze data.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
- Facebook Peregrine - Map Reduce framework.
- AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- Ray - A fast and simple framework for building and running distributed applications.
- Skale - High performance distributed data processing in NodeJS.
- streamsx.topology - Libraries to enable building IBM Streams applications in Java, Python or Scala.
- Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
- Twitter Heron - a real-time, distributed, fault-tolerant stream processing engine from Twitter, built as a replacement for Storm.
- Twitter Scalding - Scala library for MapReduce jobs, built on Cascading.
- Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
- Onyx - Distributed computation for the cloud.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
Document Data Model
- MongoDB - Document-oriented database system.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
- Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB
Embedded Databases
- Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
- HanoiDB - Erlang LSM BTree Storage.
- LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
- BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
Frameworks
- IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
- Bistro - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model that represents data via *functions* and processes data via *column operations*, rather than relying only on set operations as conventional approaches like MapReduce or SQL do.
- Tigon - High Throughput Real-time Stream Processing Framework.
- Polyaxon - A platform for reproducible and scalable machine learning and deep learning.
- Smooks - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
- Pachyderm - a data storage platform built on Docker and Kubernetes that provides reproducible data processing and analysis.
Graph Data Model
- Google Pregel - graph processing framework.
- GraphX - resilient Distributed Graph System on Spark.
- MapGraph - Massively Parallel Graph processing on GPUs.
- NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
- Titan - distributed graph database, built over Cassandra.
- AgensGraph - a new generation multi-model graph database for the modern complex data environment.
- DGraph - A scalable, distributed, low-latency, high-throughput graph database that aims for Google production-level scale and throughput, with latency low enough to serve real-time user queries over terabytes of structured data.
- EliasDB - a lightweight graph based database that does not require any third-party libraries.
- GCHQ Gaffer - a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
- Google Cayley - open-source graph database.
- Gremlin - graph traversal language.
- Infovore - RDF-centric Map/Reduce framework.
- Microsoft Graph Engine - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
- Phoebus - framework for large scale graph processing.
- Twitter FlockDB - distributed graph database.
- Facebook TAO - the distributed data store widely used at Facebook to store and serve the social graph.
- OrientDB - document and graph database.
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
Interesting Papers
- 2001 - 2010
  - 2010 - **Facebook** - Finding a needle in Haystack: Facebook’s photo storage.
  - 2010 - **AMPLab** - Spark: Cluster Computing with Working Sets.
  - 2010 - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
  - 2010 - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets.
  - 2010 - **Yahoo** - S4: Distributed Stream Computing Platform.
  - 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  - 2006 - **Google** - The Chubby lock service for loosely-coupled distributed systems.
  - 2004 - **Google** - MapReduce: Simplified Data Processing on Large Clusters.
  - 2008 - **AMPLab** - Chukwa: A large-scale monitoring system.
  - 2007 - **Amazon** - Dynamo: Amazon’s Highly Available Key-value Store.
- 2011 - 2012
  - 2012 - **Twitter** - The Unified Logging Infrastructure
  - 2012 - **AMPLab** - Blink and It’s Done: Interactive Queries on Very Large Data.
  - 2012 - **AMPLab** - Fast and Interactive Analytics over Hadoop Data with Spark.
  - 2012 - **AMPLab** - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
  - 2012 - **Microsoft** - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
  - 2012 - **AMPLab** - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
  - 2012 - **Google** - Processing a trillion cells per mouse click.
  - 2012 - **Google** - Spanner: Google’s Globally-Distributed Database.
  - 2011 - **AMPLab** - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
  - 2011 - **AMPLab** - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
  - 2011 - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
  - 2012 - **Microsoft** - Paxos Made Parallel.
- 2013 - 2014
  - 2014 - **Stanford** - Mining of Massive Datasets.
  - 2013 - **AMPLab** - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
  - 2013 - **AMPLab** - MLbase: A Distributed Machine-learning System.
  - 2013 - **AMPLab** - Shark: SQL and Rich Analytics at Scale.
  - 2013 - **AMPLab** - GraphX: A Resilient Distributed Graph System on Spark.
  - 2013 - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  - 2013 - **Metamarkets** - Druid: A Real-time Analytical Data Store.
  - 2013 - **Google** - F1: A Distributed SQL Database That Scales.
  - 2013 - **Facebook** - Scaling Memcache at Facebook.
  - 2013 - **Facebook** - Scuba: Diving into Data at Facebook.
  - 2013 - **Google** - Online, Asynchronous Schema Change in F1.
- 2015 - 2016
  - 2015 - **Facebook** - One Trillion Edges: Graph Processing at Facebook-Scale.
Interesting Readings
- Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
- Monitoring Cassandra performance - Guide to monitoring Cassandra, including native methods for metrics collection.
- Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
- Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
Internet of things and sensor data
- Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small-footprint edge devices, enabling local, real-time analytics on those devices.
- TempoIQ - Cloud-based sensor analytics.
- Pubnub - Data stream network
- IFTTT - If this then that
- Evrything - Making products smart
- Ably - Pub/sub messaging platform for IoT
- Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub
- ThingWorx - Rapid development and connection of intelligent systems
- NetLytics - Analytics platform to process network data on Spark.
Key Map Data Model
- Distinguishing two major types of Column Stores
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - a fully managed, schemaless database for storing non-relational data, built on BigTable.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
- Baidu Tera - an Internet-scale database, inspired by BigTable.
- ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
Key-value Data Model
- Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
- Ignite - an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
- LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
- LinkedIn Voldemort - distributed key/value storage system.
- Bolt - an embedded key-value database for Go.
- BTDB - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
- BuntDB - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
- Edis - a protocol-compatible server replacement for Redis.
- ElephantDB - Distributed database specialized in exporting data from Hadoop.
- GhostDB - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
- Graviton - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
- HyperDex - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
- Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
- Riak - a decentralized datastore.
- Storehaus - library to work with asynchronous key value stores, by Twitter.
- SummitDB - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
- Tarantool - an efficient NoSQL database and a Lua application server.
- TiKV - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
- Tile38 - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
- TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
- EventStore - distributed time series database.
- GridDB - suitable for sensor data stored in a timeseries.
Machine Learning
- Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
- DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
- Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
- etcML - text classification with machine learning.
- GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MonkeyLearn - Text mining made easy. Extract and classify data from text.
- ND4J - A matrix library for the JVM. NumPy for Java.
- RL4J - Reinforcement learning for Java and Scala. Includes deep Q-learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
- Sibyl - System for Large Scale Machine Learning at Google.
- Theano - A Python-focused machine learning library supported by the University of Montreal.
- Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
- WEKA - suite of machine learning software.
- brain - Neural networks in JavaScript.
