Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-bigdata

A curated list of awesome big data frameworks, resources and other awesomeness.
https://github.com/oxnr/awesome-bigdata

Last synced: 5 days ago

  • Applications

    • Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
    • Apache Tika - content analysis toolkit.
    • Hunk - Splunk analytics for Hadoop.
    • Imhotep - large scale analytics platform by Indeed.
    • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
    • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
    • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
    • Splunk - analyzer for machine-generated data.
    • Sumo Logic - cloud based analyzer for machine-generated data.
  • RDBMS

  • SQL-like processing

    • Datasalt Splout SQL - full SQL query engine for big datasets.
    • Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
    • Apache HCatalog - table and storage management layer for Hadoop.
    • Aster Database - SQL-like analytic processing for MapReduce.
    • Cloudera Impala - framework for interactive analysis, inspired by Dremel.
    • Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
    • Facebook PrestoDB - distributed SQL query engine.
    • Google BigQuery - framework for interactive analysis, implementation of Dremel.
    • Spark Catalyst - query optimization framework for Spark and Shark.
    • Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
    • Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
    • Materialize - streaming database for real-time applications that uses SQL for queries and supports a large fraction of PostgreSQL.
  • Frameworks

    • Bistro - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via *functions* and processes data via *column operations* as opposed to having only set operations in conventional approaches like MapReduce or SQL.
  • Distributed Programming

    • Apache APEX - a unified, enterprise platform for big data stream and batch processing.
    • Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
    • Apache Gearpump - real-time big data streaming engine based on Akka.
    • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
    • Apache Pig - high level language to express data analysis programs for Hadoop.
    • Apache Spark Streaming - framework for stream processing, part of Spark.
    • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
    • Baidu Bigflow - an interface for writing distributed computing programs, providing simple, flexible, powerful APIs to easily handle data of any scale.
    • Cascalog - data processing and querying library.
    • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
    • DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
    • Google Dataflow - create data pipelines to help ingest, transform and analyze data.
    • Google MapReduce - map reduce framework.
    • Google MillWheel - fault tolerant stream processing framework.
    • Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
    • Nokia Disco - MapReduce framework developed by Nokia.
    • Pinterest Pinlater - asynchronous job execution system.
    • Rackerlabs Blueflood - multi-tenant distributed metric processing system
    • Stratosphere - general purpose cluster computing framework.
    • Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
    • Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
    • Twitter TSAR - TimeSeries AggregatoR by Twitter.
    • Facebook Peregrine - Map Reduce framework.
  • Distributed Filesystem

  • Interesting Papers

    • 2001 - 2010

      • 2003 - **Google** - The Google File System.
      • 2010 - **Google** - Pregel: A System for Large-Scale Graph Processing.
      • 2010 - **Facebook** - Finding a needle in Haystack: Facebook’s photo storage.
      • 2010 - **AMPLab** - Spark: Cluster Computing with Working Sets.
      • 2010 - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
      • 2010 - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets.
      • 2010 - **Yahoo** - S4: Distributed Stream Computing Platform.
      • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
      • 2006 - **Google** - The Chubby lock service for loosely-coupled distributed systems.
      • 2004 - **Google** - MapReduce: Simplified Data Processing on Large Clusters.
    • 2015 - 2016

      • 2015 - **Facebook** - One Trillion Edges: Graph Processing at Facebook-Scale.
    • 2013 - 2014

      • 2014 - **Stanford** - Mining of Massive Datasets.
      • 2013 - **AMPLab** - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
      • 2013 - **AMPLab** - MLbase: A Distributed Machine-learning System.
      • 2013 - **AMPLab** - Shark: SQL and Rich Analytics at Scale.
      • 2013 - **AMPLab** - GraphX: A Resilient Distributed Graph System on Spark.
      • 2013 - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
      • 2013 - **Metamarkets** - Druid: A Real-time Analytical Data Store.
      • 2013 - **Google** - F1: A Distributed SQL Database That Scales.
      • 2013 - **Facebook** - Scaling Memcache at Facebook.
    • 2011 - 2012

      • 2012 - **Twitter** - The Unified Logging Infrastructure
      • 2012 - **AMPLab** - Blink and It’s Done: Interactive Queries on Very Large Data.
      • 2012 - **AMPLab** - Fast and Interactive Analytics over Hadoop Data with Spark.
      • 2012 - **AMPLab** - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
      • 2012 - **Microsoft** - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
      • 2012 - **AMPLab** - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
      • 2012 - **Google** - Processing a trillion cells per mouse click.
      • 2012 - **Google** - Spanner: Google’s Globally-Distributed Database.
      • 2011 - **AMPLab** - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters.
      • 2011 - **AMPLab** - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
      • 2011 - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
  • Document Data Model

    • MongoDB - Document-oriented database system.
    • RavenDB - A transactional, open-source Document Database.
    • RethinkDB - document database that supports queries like table joins and group by.
  • Key Map Data Model

  • Key-value Data Model

    • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
    • Ignite - in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
    • Redis - in memory key value datastore.
    • LinkedIn Krati - simple persistent data store with very low latency and high throughput.
    • LinkedIn Voldemort - distributed key/value storage system.
  • Graph Data Model

    • AgensGraph - a new generation multi-model graph database for the modern complex data environment.
    • MapGraph - Massively Parallel Graph processing on GPUs.
    • Neo4j - graph database written entirely in Java.
    • Titan - distributed graph database, built over Cassandra.
    • NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
  • Columnar Databases

    • MonetDB - column store database.
    • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
    • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
    • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
  • Embedded Databases

    • Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
    • LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  • NewSQL Databases

    • InfiniSQL - infinitely scalable RDBMS.
    • CitusDB - scales out PostgreSQL through sharding and replication.
    • FoundationDB - distributed database, inspired by F1.
    • Google F1 - distributed SQL database built on Spanner.
    • Google Spanner - globally distributed semi-relational database.
    • SAP HANA - in-memory, column-oriented, relational database management system.
    • Map-D - GPU in-memory database, big data analysis and visualization platform.
    • Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
    • Sky - database used for flexible, high performance analysis of behavioral data.
    • yugabyteDB - open source, high-performance, distributed SQL database compatible with PostgreSQL.
  • Time-Series Databases

    • Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
    • InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
    • QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
    • M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
    • Prometheus - a time series database and service monitoring system.
    • Rhombus - series object store for Cassandra that handles all the complexity of building wide row indexes.
  • Data Ingestion

    • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
    • Amazon Kinesis - real-time processing of streaming data at massive scale.
    • Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service
    • Apache NiFi - integrated data logistics platform for automating the movement of data between disparate systems.
    • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
    • Kestrel - distributed message queue system.
    • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
    • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
  • Service Programming

    • Google Chubby - a lock service for loosely-coupled distributed systems.
    • OpenMPI - message passing framework.
    • Serf - decentralized solution for service discovery and orchestration.
  • Machine Learning

    • Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
    • DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
    • Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
    • etcML - text classification with machine learning.
    • GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
    • MLbase - distributed machine learning libraries for the BDAS stack.
    • MonkeyLearn - Text mining made easy. Extract and classify data from text.
    • ND4J - A matrix library for the JVM. Numpy for Java.
    • RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem.
    • Sibyl - System for Large Scale Machine Learning at Google.
    • Theano - A Python-focused machine learning library supported by the University of Montreal.
    • Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
    • WEKA - suite of machine learning software.
  • Scheduling

  • Benchmarking

  • System Deployment

    • Apache YARN - Cluster manager.
    • Brooklyn - library that simplifies application deployment and management.
    • Buildoop - Similar to Apache BigTop based on Groovy language.
    • Google Borg - job scheduling and monitoring system.
    • Google Omega - job scheduling and monitoring system.
    • Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
  • MySQL forks and evolutions

    • Drizzle - evolution of MySQL 6.0.
    • Amazon RDS - MySQL databases in Amazon's cloud.
    • MariaDB - enhanced, drop-in replacement for MySQL.
    • MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
    • TokuDB - storage engine for MySQL and MariaDB.
    • WebScaleSQL - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
  • PostgreSQL forks and evolutions

    • Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
    • RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
    • Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
    • Yahoo Everest - multi-petabyte MPP database derived from PostgreSQL.
  • Business Intelligence

    • BIME Analytics - business intelligence platform in the cloud.
    • GoodData - platform for data products and embedded analytics.
    • Jaspersoft - powerful business intelligence suite.
    • Jedox Palo - customisable Business Intelligence platform.
    • Knowage - open source business intelligence platform (formerly SpagoBI, http://www.spagobi.org/).
    • Jethrodata - Interactive Big Data Analytics.
    • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
    • SparklineData SNAP - modern BI platform powered by Apache Spark.
    • Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
    • Saiku Analytics - Open source analytics platform.
    • Tableau - business intelligence platform.
  • Data Visualization

    • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
    • chartd - responsive, retina-compatible charts with just an img tag.
    • D3 - JavaScript library for manipulating documents.
    • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
    • FnordMetric - write SQL queries that return SVG charts rather than tables
    • Grafana - graphite dashboard frontend, editor and graph composer.
    • Graphite - scalable Realtime Graphing.
    • Highcharts - simple and flexible charting API.
    • Lumify - open source big data analysis and visualization platform
    • MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data.
    • Zing Charts - JavaScript charting library for big data.
  • Internet of things and sensor data

    • Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small-footprint edge devices, enabling local, real-time analytics on the edge devices.
    • TempoIQ - Cloud-based sensor analytics.
    • Pubnub - Data stream network
    • IFTTT - If this then that
    • Evrything - Making products smart
    • Ably - Pub/sub messaging platform for IoT
  • Interesting Readings

  • Books

  • Videos