
fucking-awesome-bigdata

A curated list of awesome big data frameworks, resources and other awesomeness. With repository stars⭐ and forks🍴
https://github.com/correia-jpv/fucking-awesome-bigdata


  • NewSQL Databases

    • NuoDB - SQL/ACID compliant distributed database.
    • Actian Ingres - commercially supported, open-source SQL relational database management system.
    • ActorDB - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
    • BayesDB - statistics-oriented SQL database.
    • Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
    • Comdb2 - a clustered RDBMS built on optimistic concurrency control techniques.
    • Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
    • InfiniSQL - infinitely scalable RDBMS.
    • KarelDB - a relational database backed by Apache Kafka.
    • Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
    • SenseiDB - distributed, realtime, semi-structured database.
    • Sky - database used for flexible, high performance analysis of behavioral data.
    • TiDB - TiDB is a distributed SQL database. Inspired by the design of Google F1.
    • yugabyteDB - open source, high-performance, distributed SQL database compatible with PostgreSQL.
    • SymmetricDS - open source software for both file and database synchronization.
    • H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
  • Data Ingestion

    • LinkedIn White Elephant - log aggregator and dashboard.
    • Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
    • Facebook Scribe - streamed log data aggregator.
    • Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
    • Heka - open source stream processing software system.
    • HIHO - framework for connecting disparate data sources with Hadoop.
    • Kestrel - distributed message queue system.
    • LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
    • Netflix Suro - log aggregator like Storm and Samza based on Chukwa.
    • Pinterest Secor - a service implementing Kafka log persistence.
    • Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
    • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
    • RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
    • Zilla - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.
  • Machine Learning

    • Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
    • nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
    • brain - Neural networks in JavaScript.
    • Oryx - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
    • Concurrent Pattern - machine learning library for Cascading.
    • convnetjs - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
    • DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
    • Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
    • Decider - Flexible and Extensible Machine Learning in Ruby.
    • etcML - text classification with machine learning.
    • Etsy Conjecture - scalable Machine Learning in Scalding.
    • Karate Club - An unsupervised machine learning library for graph structured data. Python
    • Lambdo - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
    • Little Ball of Fur - A subsampling library for graph structured data. Python
    • MLbase - distributed machine learning libraries for the BDAS stack.
    • MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
    • ML Workspace - All-in-one web-based IDE specialized for machine learning and data science.
    • ND4J - A matrix library for the JVM. Numpy for Java.
    • PyTorch Geometric Temporal - a temporal extension library for PyTorch Geometric.
    • RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
    • scikit-learn - scikit-learn: machine learning in Python.
    • Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
    • TensorFlow - Library from Google for machine learning using data flow graphs.
    • Theano - A Python-focused machine learning library supported by the University of Montreal.
    • Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
    • Velox - System for serving machine learning predictions.
    • BidMach - CPU and GPU-accelerated Machine Learning Library.
    • WEKA - suite of machine learning software.
    • Keras - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
    • H2O - statistical, machine learning and math runtime with Hadoop. R and Python.
  • Benchmarking

  • Frameworks

    • Bistro - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via *functions* and processes data via *column operations* as opposed to having only set operations in conventional approaches like MapReduce or SQL.
    • Tigon - High Throughput Real-time Stream Processing Framework.
    • Polyaxon - A platform for reproducible and scalable machine learning and deep learning.
    • Smooks - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
  • Data Visualization

    • Echarts - Baidu's enterprise charts.
    • Airpal - Web UI for PrestoDB.
    • Arbor - graph visualization library using web workers and jQuery.
    • Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
    • Bloomery - Web UI for Impala.
    • CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
    • chartd - responsive, retina-compatible charts with just an img tag.
    • Chartist.js - another open source HTML5 Charts visualization.
    • Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
    • Cubism - JavaScript library for time series visualization.
    • DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
    • D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
    • Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
    • Envisionjs - dynamic HTML5 visualization.
    • Freeboard - open source real-time dashboard builder for IoT and other web mashups.
    • Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
    • Graphite - scalable Realtime Graphing.
    • Lumify - open source big data analysis and visualization platform
    • Matplotlib - plotting with Python.
    • Peity - Progressive SVG bar, line and pie charts.
    • Plotly.js
    • Redash - open-source platform to query and visualize data.
    • Sigma.js - JavaScript library dedicated to graph drawing.
    • Vega - a visualization grammar.
    • Zeppelin - a notebook-style collaborative data analysis.
    • DataSphere Studio - one-stop data application development management portal.
    • Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
    • Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
  • Distributed Filesystem

    • Disco DDFS - distributed filesystem.
    • Alluxio - reliable file sharing at memory speed across cluster frameworks.
    • Ambry - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
    • Google GFS - distributed filesystem.
    • Seaweed-FS - simple and highly scalable distributed file system.
    • Baidu File System - distributed filesystem.
    • Lustre file system - high-performance distributed filesystem.
  • Key Map Data Model

  • System Deployment

    • Linkis - Linkis helps easily connect to various back-end computation/storage engines.
    • Apache Slider - is a YARN application to deploy existing distributed applications on YARN.
    • Brooklyn - library that simplifies application deployment and management.
    • Buildoop - Similar to Apache BigTop based on Groovy language.
    • Marathon - Mesos framework for long-running services.
  • Internet of things and sensor data

    • NetLytics - Analytics platform to process network data on Spark.
    • Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
  • Interesting Papers

    • 2013 - 2014

      • 2013 - **Facebook** - Scuba: Diving into Data at Facebook.
      • 2014 - **Stanford** - Mining of Massive Datasets.
      • 2013 - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
      • 2013 - **Metamarkets** - Druid: A Real-time Analytical Data Store.
      • 2013 - **Google** - F1: A Distributed SQL Database That Scales.
    • 2001 - 2010

      • 2007 - **Amazon** - Dynamo: Amazon’s Highly Available Key-value Store.
      • 2010 - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
      • 2010 - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets.
      • 2010 - **Yahoo** - S4: Distributed Stream Computing Platform.
      • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
      • 2006 - **Google** - The Chubby lock service for loosely-coupled distributed systems.
      • 2004 - **Google** - MapReduce: Simplified Data Processing on Large Clusters.
    • 2015 - 2016

      • 2015 - **Facebook** - One Trillion Edges: Graph Processing at Facebook-Scale.
    • 2011 - 2012

      • 2012 - **Twitter** - The Unified Logging Infrastructure
      • 2012 - **Google** - Processing a trillion cells per mouse click.
      • 2012 - **Google** - Spanner: Google’s Globally-Distributed Database.
      • 2011 - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
      • 2012 - **Microsoft** - Paxos Made Parallel.
  • RDBMS

    • Teradata - high-performance MPP data warehouse platform.
  • Business Intelligence

    • Microsoft - business intelligence software and platform.
    • Blazer - business intelligence made simple.
    • Lightdash - The open source Looker alternative built on dbt
    • Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
    • Pentaho - business intelligence platform.
    • SparklineData SNAP - modern BI platform powered by Apache Spark.
  • Distributed Programming

    • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
    • AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
    • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
    • Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
    • Concurrent Cascading - framework for data management/analytics on Hadoop.
    • Damballa Parkour - MapReduce library for Clojure.
    • Datasalt Pangool - alternative MapReduce paradigm.
    • Facebook Peregrine - Map Reduce framework.
    • Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
    • Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
    • Nokia Disco - MapReduce framework developed by Nokia.
    • Pydoop - Python MapReduce and HDFS API for Hadoop.
    • Ray - A fast and simple framework for building and running distributed applications.
    • Rackerlabs Blueflood - multi-tenant distributed metric processing system
    • Skale - High performance distributed data processing in NodeJS.
    • Stratosphere - general purpose cluster computing framework.
    • streamsx.topology - Libraries to enable building IBM Streams application in Java, Python or Scala.
    • Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
    • Twitter Heron - Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.
    • Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
    • Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
    • Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
    • Onyx - Distributed computation for the cloud.
  • Graph Data Model

    • OrientDB - document and graph database.
    • AgensGraph - a new generation multi-model graph database for the modern complex data environment.
    • DGraph - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
    • EliasDB - a lightweight graph based database that does not require any third-party libraries.
    • GCHQ Gaffer - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
    • Google Cayley - open-source graph database.
    • Google Pregel - graph processing framework.
    • Gremlin - graph traversal Language.
    • Infovore - RDF-centric Map/Reduce framework.
    • Microsoft Graph Engine - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
    • Phoebus - framework for large scale graph processing.
    • Titan - distributed graph database, built over Cassandra.
    • Twitter FlockDB - distributed graph database.
  • Distributed Index

  • Document Data Model

    • jumboDB - document oriented datastore over Hadoop.
  • Key-value Data Model

    • Bolt - an embedded key-value database for Go.
    • BTDB - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
    • BuntDB - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
    • Edis - is a protocol-compatible Server replacement for Redis.
    • ElephantDB - Distributed database specialized in exporting data from Hadoop.
    • GhostDB - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
    • Graviton - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
    • HyperDex - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
    • LinkedIn Krati - is a simple persistent data store with very low latency and high throughput.
    • Linkedin Voldemort - distributed key/value storage system.
    • Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
    • Riak - a decentralized datastore.
    • Storehaus - library to work with asynchronous key value stores, by Twitter.
    • SummitDB - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
    • Tarantool - an efficient NoSQL database and a Lua application server.
    • TiKV - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
    • Tile38 - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
    • TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
  • Columnar Databases

    • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
    • IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
    • LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
  • Time-Series Databases

    • Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
    • Chronix - a time series storage built to store time series highly compressed and for fast access times.
    • Cube - uses MongoDB to store time series data.
    • Kairosdb - similar to OpenTSDB but allows for Cassandra.
    • M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
    • Beringei - Facebook's in-memory time-series database.
    • Akumuli - a time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
    • Rhombus - a time-series object store for Cassandra that handles all the complexity of building wide row indexes.
    • Dalmatiner DB
    • Blueflood
    • Timely
    • VictoriaMetrics - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
    • TDengine - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data
    • Druid
  • SQL-like processing

    • Aster Database - SQL-like analytic processing for MapReduce.
    • Concurrent Lingual - SQL-like query language for Cascading.
    • Materialize - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
    • RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
    • Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
  • Service Programming

    • Hydrosphere Mist - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
    • Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
    • Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
    • Twitter Elephant Bird - libraries for working with LZOP-compressed data.
    • Mara - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
  • Scheduling

    • Apache Airflow - a platform to programmatically author, schedule and monitor workflows.
    • Chronos - distributed and fault-tolerant scheduler.
    • Cronicle - Distributed, easy to install, NodeJS based, task scheduler
    • Dagster - a data orchestrator for machine learning, analytics, and ETL.
    • Schedoscope - Scala DSL for agile scheduling of Hadoop jobs.
    • Sparrow - scheduling platform.
  • Security

    • BDA - The vulnerability detector for Hadoop and Spark
  • Applications

    • 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
    • Adobe spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
    • Argus - Time series monitoring and alerting platform.
    • AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
    • Atlas - a backend for managing dimensional time series data.
    • Eclipse BIRT - Eclipse-based reporting system.
    • ElastAlert - a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
    • Eventhub - open source event analytics platform.
    • Hermes - asynchronous message broker built on top of Kafka.
    • Imhotep - Large scale analytics platform by Indeed.
    • Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
    • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
    • SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
    • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
    • SparkR - R frontend for Spark.
    • Substation - Substation is a cloud native data pipeline and transformation toolkit written in Go.
    • Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
    • Elassandra - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
    • Lily HBase Indexer - quickly and easily search for any content stored in HBase.
    • LinkedIn Bobo - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.
    • LinkedIn Cleo - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
    • LinkedIn Zoie - is a realtime search/indexing system written in Java.
    • MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
    • Sphinx Search Server - fulltext search engine.
    • Facebook Faiss - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
    • Annoy - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
    • Weaviate - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
  • MySQL forks and evolutions

    • Drizzle - evolution of MySQL 6.0.
    • ProxySQL - High Performance Proxy for MySQL.
    • WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
  • PostgreSQL forks and evolutions

    • HadoopDB - hybrid of MapReduce and DBMS.
    • IBM Netezza - high-performance data warehouse appliances.
    • Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
    • RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
  • Memcached forks and evolutions

  • Embedded Databases

    • HanoiDB - Erlang LSM BTree Storage.
    • LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  • Books

  • Source