awesome-bigdata

Just big data
https://github.com/bbauska/awesome-bigdata

  • Distributed Programming

    • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
    • Onyx - Distributed computation for the cloud.
    • Apache APEX - a unified, enterprise platform for big data stream and batch processing.
    • Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
    • Apache Gearpump - real-time big data streaming engine based on Akka.
    • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
    • Apache Pig - high level language to express data analysis programs for Hadoop.
    • Apache Spark Streaming - framework for stream processing, part of Spark.
    • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
    • Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
    • Cascalog - data processing and querying library.
    • DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
    • Google MapReduce - map reduce framework.
    • Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
    • Nokia Disco - MapReduce framework developed by Nokia.
    • Pinterest Pinlater - asynchronous job execution system.
    • Rackerlabs Blueflood - multi-tenant distributed metric processing system.
    • Stratosphere - general purpose cluster computing framework.
    • Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
    • Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
    • Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
    • Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
    • Apache Gora - framework for in-memory data model and persistence.
    • Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
    • Google MillWheel - fault tolerant stream processing framework.
    • Twitter TSAR - TimeSeries AggregatoR by Twitter.
    • Google Dataflow - create data pipelines to ingest, transform and analyze data.
    • Facebook Peregrine - Map Reduce framework.
    • JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
    • AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
    • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
    • Concurrent Cascading - framework for data management/analytics on Hadoop.
    • Damballa Parkour - MapReduce library for Clojure.
    • Datasalt Pangool - alternative MapReduce paradigm.
    • Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
    • Pydoop - Python MapReduce and HDFS API for Hadoop.
    • Ray - A fast and simple framework for building and running distributed applications.
    • Skale - High performance distributed data processing in NodeJS.
    • streamsx.topology - Libraries to enable building IBM Streams application in Java, Python or Scala.
    • Tuktu - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
    • Twitter Heron - realtime, distributed, fault-tolerant stream processing engine from Twitter, replacing Storm.
    • Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
    • Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
  • Distributed Filesystem

  • Key Map Data Model

  • Key-value Data Model

    • EventStore - distributed time series database.
    • Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
    • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
    • Ignite - an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
    • LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
    • LinkedIn Voldemort - distributed key/value storage system.
    • Redis - in-memory key-value datastore.
    • Bolt - an embedded key-value database for Go.
    • BTDB - key-value database in .NET with Object DB Layer, RPC, dynamic IL and much more.
    • BuntDB - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
    • Edis - a protocol-compatible server replacement for Redis.
    • ElephantDB - Distributed database specialized in exporting data from Hadoop.
    • GhostDB - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
    • Graviton - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
    • HyperDex - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
    • Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
    • Riak - a decentralized datastore.
    • Storehaus - library to work with asynchronous key value stores, by Twitter.
    • SummitDB - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
    • Tarantool - an efficient NoSQL database and a Lua application server.
    • TiKV - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
    • Tile38 - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON.
    • TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
  • NewSQL Databases

    • Actian Ingres - commercially supported, open-source SQL relational database management system.
    • CitusDB - scales out PostgreSQL through sharding and replication.
    • FoundationDB - distributed database, inspired by F1.
    • Google F1 - distributed SQL database built on Spanner.
    • Google Spanner - globally distributed semi-relational database.
    • InfiniSQL - infinitely scalable RDBMS.
    • Map-D - GPU in-memory database, big data analysis and visualization platform.
    • Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
    • SAP HANA - an in-memory, column-oriented, relational database management system.
    • Sky - database used for flexible, high performance analysis of behavioral data.
    • Amazon RedShift - data warehouse service, based on PostgreSQL.
    • Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
    • HandlerSocket - NoSQL plugin for MySQL/MariaDB.
    • Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
    • VoltDB - claims to be fastest in-memory database.
    • SymmetricDS - open source software for both file and database synchronization.
    • ActorDB - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
    • BayesDB - statistics-oriented SQL database.
    • Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
    • Comdb2 - a clustered RDBMS built on optimistic concurrency control techniques.
    • Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
    • KarelDB - a relational database backed by Apache Kafka.
    • SenseiDB - distributed, realtime, semi-structured database.
    • TiDB - a distributed SQL database inspired by the design of Google F1.
    • yugabyteDB - open source, high-performance, distributed SQL database compatible with PostgreSQL.
    • H-Store - an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
    • NuoDB - SQL/ACID compliant distributed database.
  • Time-Series Databases

    • IronDB - scalable, general-purpose time series database.
    • TDengine - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data.
    • Druid
    • Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
    • InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
    • QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
    • M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
    • Prometheus - a time series database and service monitoring system.
    • Rhombus - time-series object store for Cassandra that handles all the complexity of building wide row indexes.
    • OpenTSDB - distributed time series database on top of HBase.
    • Chronix - a time series storage built to store time series highly compressed and for fast access times.
    • Cube - uses MongoDB to store time series data.
    • Heroic - a scalable time series database based on Cassandra and Elasticsearch.
    • KairosDB - similar to OpenTSDB but allows for Cassandra.
    • Newts - a time series database based on Apache Cassandra.
    • Beringei - Facebook's in-memory time-series database.
    • Akumuli - time-series database that can capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
    • Dalmatiner DB
    • Blueflood
    • Timely
    • VictoriaMetrics - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included.
  • Data Ingestion

    • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
    • Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
    • Amazon Kinesis - real-time processing of streaming data at massive scale.
    • Amazon Web Services Glue - serverless, fully managed extract, transform, and load (ETL) service.
    • Apache NiFi - an integrated data logistics platform for automating the movement of data between disparate systems.
    • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
    • Kestrel - distributed message queue system.
    • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
    • Apache Flume - service to manage large amounts of log data.
    • Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
    • LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
    • LinkedIn Gobblin - LinkedIn's universal data ingestion framework.
    • Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
    • Facebook Scribe - streamed log data aggregator.
    • Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
    • Heka - open source stream processing software system.
    • HIHO - framework for connecting disparate data sources with Hadoop.
    • Logstash - a tool for managing events and logs.
    • Netflix Suro - log aggregator like Storm and Samza based on Chukwa.
    • Pinterest Secor - a service implementing Kafka log persistence.
    • Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
    • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
    • RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
    • LinkedIn White Elephant - log aggregator and dashboard.
  • MySQL forks and evolutions

    • Google Cloud SQL - MySQL databases in Google's cloud.
    • Amazon RDS - MySQL databases in Amazon's cloud.
    • Drizzle - evolution of MySQL 6.0.
    • MariaDB - enhanced, drop-in replacement for MySQL.
    • MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
    • TokuDB - a storage engine for MySQL and MariaDB.
    • WebScaleSQL - a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
    • Percona Server - enhanced, drop-in replacement for MySQL.
    • ProxySQL - High Performance Proxy for MySQL.
  • Graph Data Model

    • OrientDB - document and graph database.
    • MapGraph - Massively Parallel Graph processing on GPUs.
    • Neo4j - graph database written entirely in Java.
    • Titan - distributed graph database, built over Cassandra.
    • NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
    • AgensGraph - a new generation multi-model graph database for the modern complex data environment.
    • Apache Giraph - implementation of Pregel, based on Hadoop.
    • JanusGraph - open-source, distributed graph database.
    • Facebook TAO - the distributed data store widely used at Facebook to store and serve the social graph.
    • DGraph - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
    • EliasDB - a lightweight graph based database that does not require any third-party libraries.
    • GCHQ Gaffer - a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
    • Google Cayley - open-source graph database.
    • Gremlin - graph traversal language.
    • Infovore - RDF-centric Map/Reduce framework.
    • Microsoft Graph Engine - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
    • Phoebus - framework for large scale graph processing.
    • Twitter FlockDB - distributed graph database.
    • GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
    • Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
  • Interesting Papers

    • 2011 - 2012

      • 2012 - **Microsoft** - Paxos Made Parallel.
      • 2012 - **Twitter** - The Unified Logging Infrastructure.
      • 2012 - **AMPLab** - Blink and It’s Done: Interactive Queries on Very Large Data.
      • 2012 - **AMPLab** - Fast and Interactive Analytics over Hadoop Data with Spark.
      • 2012 - **AMPLab** - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
      • 2012 - **Microsoft** - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
      • 2012 - **AMPLab** - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
      • 2012 - **Google** - Processing a trillion cells per mouse click.
      • 2012 - **Google** - Spanner: Google’s Globally-Distributed Database.
      • 2011 - **AMPLab** - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters.
      • 2011 - **AMPLab** - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
      • 2011 - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
    • 2001 - 2010

      • 2003 - **Google** - The Google File System.
      • 2010 - **Google** - Pregel: A System for Large-Scale Graph Processing.
      • 2010 - **Facebook** - Finding a needle in Haystack: Facebook’s photo storage.
      • 2010 - **AMPLab** - Spark: Cluster Computing with Working Sets.
      • 2010 - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
      • 2010 - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets.
      • 2010 - **Yahoo** - S4: Distributed Stream Computing Platform.
      • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
      • 2006 - **Google** - The Chubby lock service for loosely-coupled distributed systems.
      • 2004 - **Google** - MapReduce: Simplified Data Processing on Large Clusters.
      • 2008 - **AMPLab** - Chukwa: A large-scale monitoring system.
      • 2007 - **Amazon** - Dynamo: Amazon’s Highly Available Key-value Store.
    • 2015 - 2016

      • 2015 - **Facebook** - One Trillion Edges: Graph Processing at Facebook-Scale.
    • 2013 - 2014

      • 2014 - **Stanford** - Mining of Massive Datasets.
      • 2013 - **AMPLab** - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
      • 2013 - **AMPLab** - MLbase: A Distributed Machine-learning System.
      • 2013 - **AMPLab** - Shark: SQL and Rich Analytics at Scale.
      • 2013 - **AMPLab** - GraphX: A Resilient Distributed Graph System on Spark.
      • 2013 - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
      • 2013 - **Metamarkets** - Druid: A Real-time Analytical Data Store.
      • 2013 - **Google** - F1: A Distributed SQL Database That Scales.
      • 2013 - **Facebook** - Scaling Memcache at Facebook.
      • 2013 - **Facebook** - Scuba: Diving into Data at Facebook.
  • PostgreSQL forks and evolutions

    • Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
    • RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
    • Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
    • Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
    • HadoopDB - hybrid of MapReduce and DBMS.
    • IBM Netezza - high-performance data warehouse appliances.
  • Columnar Databases

    • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
    • MonetDB - column store database.
    • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
    • Columnar Storage - an explanation of what columnar storage is and when you might want it.
    • Parquet - columnar storage format for Hadoop.
    • Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
    • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
    • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
    • IndexR - an open-source columnar storage format for fast, realtime analytics on big data.
    • LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
    • ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
  • Service Programming

    • Serf - decentralized solution for service discovery and orchestration.
    • Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
    • Google Chubby - a lock service for loosely-coupled distributed systems.
    • OpenMPI - message passing framework.
    • Apache Curator - Java libraries for Apache ZooKeeper.
    • Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
    • Apache Thrift - framework to build binary protocols.
    • Hydrosphere Mist - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
    • Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
    • Twitter Elephant Bird - libraries for working with LZOP-compressed data.
    • Twitter Finagle - asynchronous network stack for the JVM.
    • Mara - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow.
  • Machine Learning

    • H2O - statistical, machine learning and math runtime with Hadoop. R and Python.
    • Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform.
    • DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
    • Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
    • etcML - text classification with machine learning.
    • GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
    • MLbase - distributed machine learning libraries for the BDAS stack.
    • MonkeyLearn - Text mining made easy. Extract and classify data from text.
    • ND4J - A matrix library for the JVM. Numpy for Java.
    • RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
    • Sibyl - System for Large Scale Machine Learning at Google.
    • Theano - A Python-focused machine learning library supported by the University of Montreal.
    • Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
    • WEKA - suite of machine learning software.
    • Feast - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
    • Mahout - An Apache-backed machine learning library for Hadoop.
    • Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
    • brain - Neural networks in JavaScript.
    • Oryx - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
    • Concurrent Pattern - machine learning library for Cascading.
    • convnetjs - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
    • Decider - Flexible and Extensible Machine Learning in Ruby.
    • Etsy Conjecture - scalable Machine Learning in Scalding.
    • Karate Club - An unsupervised machine learning library for graph structured data, written in Python.
    • Lambdo - a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
    • Little Ball of Fur - A subsampling library for graph structured data, written in Python.
    • MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
    • ML Workspace - All-in-one web-based IDE specialized for machine learning and data science.
    • PyTorch Geometric Temporal - a temporal extension library for PyTorch Geometric.
    • scikit-learn - machine learning in Python.
    • Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
    • TensorFlow - Library from Google for machine learning using data flow graphs.
    • Velox - System for serving machine learning predictions.
    • BidMach - CPU and GPU-accelerated Machine Learning Library.
    • Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
    • Keras - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
    • nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  • Security

    • BDA - The vulnerability detector for Hadoop and Spark.
    • Apache Knox Gateway - single point of secure access for Hadoop clusters.
    • Apache Sentry - security module for data stored in Hadoop.
  • System Deployment

    • Apache Slider - a YARN application to deploy existing distributed applications on YARN.
    • Buildoop - similar to Apache Bigtop, based on the Groovy language.
    • Brooklyn - library that simplifies application deployment and management.
    • Google Borg - job scheduling and monitoring system.
    • Google Omega - job scheduling and monitoring system.
    • Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
    • Apache Bigtop - system deployment framework for the Hadoop ecosystem.
    • Marathon - Mesos framework for long-running services.
    • Linkis - helps easily connect to various back-end computation/storage engines.
  • Applications

    • Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
    • SparkR - R frontend for Spark.
    • Apache Tika - content analysis toolkit.
    • Hunk - Splunk analytics for Hadoop.
    • Imhotep - Large scale analytics platform by Indeed.
    • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
    • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
    • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
    • Splunk - analyzer for machine-generated data.
    • Sumo Logic - cloud based analyzer for machine-generated data.
    • Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
    • HASH - open source simulation and visualization platform.
    • MADlib - data-processing library of an RDBMS to analyze data.
    • Kylin - open source Distributed Analytics Engine from eBay.
    • Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
    • 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
    • Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
    • Argus - Time series monitoring and alerting platform.
    • AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
    • Atlas - a backend for managing dimensional time series data.
    • Eclipse BIRT - Eclipse-based reporting system.
    • ElastAlert - a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
    • Eventhub - open source event analytics platform.
    • Hermes - asynchronous message broker built on top of Kafka.
    • Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
    • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
    • SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
    • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
    • LinkedIn Bobo - a faceted search implementation written purely in Java, an extension to Apache Lucene.
    • LinkedIn Cleo - a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
    • Elasticsearch - Search and analytics engine based on Apache Lucene.
    • Enigma.io
    • Lily HBase Indexer - quickly and easily search for any content stored in HBase.
    • LinkedIn Galene - search architecture at LinkedIn.
    • Sphinx Search Server - fulltext search engine.
    • Apache Solr - Search platform for Apache Lucene.
    • HBase Coprocessor - implementation of Percolator, part of HBase.
    • Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
    • Elassandra - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
    • LinkedIn Zoie - is a realtime search/indexing system written in Java.
    • MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
    • Facebook Faiss - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
    • Annoy - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
    • Weaviate - a GraphQL-based semantic search engine with built-in (word) embeddings.
  • Data Visualization

    • Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
    • Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
    • chartd - responsive, retina-compatible charts with just an img tag.
    • D3 - JavaScript library for manipulating documents.
    • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
    • FnordMetric - write SQL queries that return SVG charts rather than tables
    • Grafana - graphite dashboard frontend, editor and graph composer.
    • Graphite - scalable Realtime Graphing.
    • Highcharts - simple and flexible charting API.
    • MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data
    • Zing Charts - JavaScript charting library for big data.
    • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
    • AnyChart - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.
    • C3 - D3-based reusable chart library
    • Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
    • ReCharts - A composable charting library built on React components
    • Lumify - open source big data analysis and visualization platform
    • Airpal - Web UI for PrestoDB.
    • Arbor - graph visualization library using web workers and jQuery.
    • Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
    • Bloomery - Web UI for Impala.
    • CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
    • Chartist.js - another open source HTML5 Charts visualization.
    • Cubism - JavaScript library for time series visualization.
    • DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
    • D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
    • Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
    • Envisionjs - dynamic HTML5 visualization.
    • Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
    • Freeboard - open source real-time dashboard builder for IoT and other web mashups.
    • Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
    • Google Charts - simple charting API.
    • Kibana - visualize logs and time-stamped data
    • Matplotlib - plotting with Python.
    • Peity - Progressive SVG bar, line and pie charts.
    • Plotly.js
    • Redash - open-source platform to query and visualize data.
    • Sigma.js - JavaScript library dedicated to graph drawing.
    • Vega - a visualization grammar.
    • Zeppelin - a notebook-style collaborative data analysis tool.
    • DataSphere Studio - one-stop data application development management portal.
    • ECharts - Baidu's enterprise charts.
    • Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
  • Business Intelligence

    • Microsoft - business intelligence software and platform.
    • BIME Analytics - business intelligence platform in the cloud.
    • GoodData - platform for data products and embedded analytics.
    • Jaspersoft - powerful business intelligence suite.
    • Jedox Palo - customisable Business Intelligence platform.
    • Jethrodata - Interactive Big Data Analytics.
    • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
    • Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
    • Saiku Analytics - Open source analytics platform.
    • Knowage - open source business intelligence platform (formerly SpagoBI, http://www.spagobi.org/).
    • SparklineData SNAP - modern BI platform powered by Apache Spark.
    • Tableau - business intelligence platform.
    • intermix.io - Performance Monitoring for Amazon Redshift
    • Qlik - business intelligence and analytics platform.
    • Blazer - business intelligence made simple.
    • Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
    • Numeracy - Fast, clean SQL client and business intelligence.
    • Pentaho - business intelligence platform.
    • Zoomdata - Big Data Analytics.
  • RDBMS

  • Document Data Model

    • MongoDB - Document-oriented database system.
    • RavenDB - A transactional, open-source Document Database.
    • RethinkDB - document database that supports queries like table joins and group by.
    • Actian Versant - commercial object-oriented database management system.
    • Facebook Apollo - Facebook’s Paxos-like NoSQL database.
    • jumboDB - document oriented datastore over Hadoop.
    • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
    • MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
    • Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB
  • SQL-like processing

    • Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
    • Apache HCatalog - table and storage management layer for Hadoop.
    • Aster Database - SQL-like analytic processing for MapReduce.
    • Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
    • Facebook PrestoDB - distributed SQL query engine.
    • Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
    • Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
    • Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
    • Cloudera Impala - framework for interactive analysis, inspired by Dremel.
    • Apache Hive - SQL-like data warehouse system for Hadoop.
    • Apache Calcite - framework that allows efficient translation of queries involving heterogeneous and federated data.
    • Datasalt Splout SQL - full SQL query engine for big datasets.
    • Google BigQuery - framework for interactive analysis, implementation of Dremel.
    • RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
    • SparkSQL - Manipulating Structured Data Using Spark.
    • Stinger - interactive query for Hive.
    • Tajo - distributed data warehouse system on Hadoop.
    • Concurrent Lingual - SQL-like query language for Cascading.
    • Materialize - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
    • Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
    • Pivotal HDB - SQL-like data warehouse system for Hadoop.
    • PipelineDB - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
  • Scheduling

    • LinkedIn Azkaban - batch workflow job scheduler.
    • Apache Oozie - workflow job scheduler.
    • Chronos - distributed and fault-tolerant scheduler.
    • Apache Airflow - a platform to programmatically author, schedule and monitor workflows.
    • Cronicle - Distributed, easy to install, NodeJS based, task scheduler
    • Dagster - a data orchestrator for machine learning, analytics, and ETL.
    • Schedoscope - Scala DSL for agile scheduling of Hadoop jobs.
    • Sparrow - scheduling platform.
    • Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
  • Benchmarking

  • Internet of things and sensor data

    • Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small-footprint edge devices, enabling local, real-time analytics on the edge devices.
    • TempoIQ - Cloud-based sensor analytics.
    • Pubnub - Data stream network
    • IFTTT - If this then that
    • Evrything - Making products smart
    • Ably - Pub/sub messaging platform for IoT
    • Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub
    • ThingWorx - Rapid development and connection of intelligent systems
    • NetLytics - Analytics platform to process network data on Spark.
  • Interesting Readings

  • Videos

  • Books

  • Embedded Databases

    • Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
    • BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
    • RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.
    • HanoiDB - Erlang LSM BTree Storage.
    • LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  • Memcached forks and evolutions

  • Frameworks

    • IBM Streams - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
    • Bistro - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via *functions* and processes data via *column operations* as opposed to having only set operations in conventional approaches like MapReduce or SQL.
    • Tigon - High Throughput Real-time Stream Processing Framework.
    • Polyaxon - A platform for reproducible and scalable machine learning and deep learning.
    • Smooks - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
  • Distributed Index