awesome-bigdata

Just big data
https://github.com/bbauska/awesome-bigdata

  • Applications

    • Apache Tika - content analysis toolkit.
    • Hunk - Splunk analytics for Hadoop.
    • Imhotep - Large scale analytics platform by Indeed.
    • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
    • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
    • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
    • Splunk - analyzer for machine-generated data.
    • Sumo Logic - cloud based analyzer for machine-generated data.
    • 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
    • Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
    • Argus - Time series monitoring and alerting platform.
    • AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
    • Atlas - a backend for managing dimensional time series data.
    • Eclipse BIRT - Eclipse-based reporting system.
    • ElastAlert - a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
    • Eventhub - open source event analytics platform.
    • Hermes - asynchronous message broker built on top of Kafka.
    • Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
    • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
    • SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
    • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
    • SparkR - R frontend for Spark.
    • Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
  • Benchmarking

  • Books

  • Business Intelligence

    • BIME Analytics - business intelligence platform in the cloud.
    • GoodData - platform for data products and embedded analytics.
    • Jaspersoft - powerful business intelligence suite.
    • Jedox Palo - customisable Business Intelligence platform.
    • Jethrodata - Interactive Big Data Analytics.
    • MicroStrategy - software platforms for business intelligence, mobile intelligence, and network applications.
    • Redash - Open source business intelligence platform, supporting multiple data sources and scheduled queries.
    • Saiku Analytics - Open source analytics platform.
    • Knowage - open source business intelligence platform (formerly SpagoBI, http://www.spagobi.org/).
    • SparklineData SNAP - modern BI platform powered by Apache Spark.
    • Tableau - business intelligence platform.
    • Blazer - business intelligence made simple.
    • Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
    • Numeracy - Fast, clean SQL client and business intelligence.
    • Pentaho - business intelligence platform.
    • Zoomdata - Big Data Analytics.
    • Microsoft - business intelligence software and platform.
  • Columnar Databases

    • MonetDB - column store database.
    • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
    • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
    • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
    • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
    • IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
    • LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
    • ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
  • Data Ingestion

    • Amazon Kinesis - real-time processing of streaming data at massive scale.
    • Amazon Web Services Glue - serverless, fully managed extract, transform, and load (ETL) service.
    • Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
    • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
    • Kestrel - distributed message queue system.
    • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
    • Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
    • Facebook Scribe - streamed log data aggregator.
    • Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
    • Heka - open source stream processing software system.
    • HIHO - framework for connecting disparate data sources with Hadoop.
    • Logstash - a tool for managing events and logs.
    • Netflix Suro - log aggregator like Storm and Samza based on Chukwa.
    • Pinterest Secor - a service implementing Kafka log persistence.
    • Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
    • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
    • RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
    • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
    • Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
    • LinkedIn White Elephant - log aggregator and dashboard.
    • LinkedIn Gobblin - LinkedIn's universal data ingestion framework.
  • Data Visualization

    • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
    • chartd - responsive, retina-compatible charts with just an img tag.
    • D3 - JavaScript library for manipulating documents.
    • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
    • FnordMetric - write SQL queries that return SVG charts rather than tables.
    • Grafana - Graphite dashboard frontend, editor and graph composer.
    • Graphite - scalable real-time graphing.
    • Highcharts - simple and flexible charting API.
    • MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data.
    • Zing Charts - JavaScript charting library for big data.
    • Lumify - open source big data analysis and visualization platform
    • Airpal - Web UI for PrestoDB.
    • Arbor - graph visualization library using web workers and jQuery.
    • Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
    • Bloomery - Web UI for Impala.
    • CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
    • Chartist.js - another open source HTML5 Charts visualization.
    • Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
    • Cubism - JavaScript library for time series visualization.
    • DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
    • D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
    • Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of Plotly, no JS required.
    • Envisionjs - dynamic HTML5 visualization.
    • Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
    • Freeboard - open source real-time dashboard builder for IoT and other web mashups.
    • Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
    • Google Charts - simple charting API.
    • Kibana - visualize logs and time-stamped data
    • Matplotlib - plotting with Python.
    • Peity - Progressive SVG bar, line and pie charts.
    • Plotly.js
    • Redash - open-source platform to query and visualize data.
    • Sigma.js - JavaScript library dedicated to graph drawing.
    • Vega - a visualization grammar.
    • Zeppelin - a notebook-style collaborative data analysis.
    • DataSphere Studio - one-stop data application development management portal.
    • Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
    • ECharts - Baidu's enterprise charts.
    • Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
    • NVD3 - chart components for d3.js.
  • Distributed Filesystem

  • Distributed Index

  • Distributed Programming

    • Apache APEX - a unified, enterprise platform for big data stream and batch processing.
    • Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
    • Apache Gearpump - real-time big data streaming engine based on Akka.
    • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
    • Apache Pig - high level language to express data analysis programs for Hadoop.
    • Apache Spark Streaming - framework for stream processing, part of Spark.
    • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
    • Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
    • Cascalog - data processing and querying library.
    • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
    • DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
    • Google MapReduce - map reduce framework.
    • Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
    • Nokia Disco - MapReduce framework developed by Nokia.
    • Pinterest Pinlater - asynchronous job execution system.
    • Rackerlabs Blueflood - multi-tenant distributed metric processing system
    • Stratosphere - general purpose cluster computing framework.
    • Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
    • Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
    • Google Dataflow - create data pipelines to help ingest, transform and analyze data.
    • Twitter TSAR - TimeSeries AggregatoR by Twitter.
    • Facebook Peregrine - Map Reduce framework.
    • AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
    • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
    • Concurrent Cascading - framework for data management/analytics on Hadoop.
    • Damballa Parkour - MapReduce library for Clojure.
    • Datasalt Pangool - alternative MapReduce paradigm.
    • Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
    • Pydoop - Python MapReduce and HDFS API for Hadoop.
    • Ray - A fast and simple framework for building and running distributed applications.