awesome-bigdata
Just big data
https://github.com/bbauska/awesome-bigdata
-
Applications
- Apache Tika - content analysis toolkit.
- Hunk - Splunk analytics for Hadoop.
- Imhotep - Large scale analytics platform by Indeed.
- Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
- Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- Splunk - analyzer for machine-generated data.
- Sumo Logic - cloud based analyzer for machine-generated data.
- 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
- Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
- Argus - Time series monitoring and alerting platform.
- AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
- Atlas - a backend for managing dimensional time series data.
- Eclipse BIRT - Eclipse-based reporting system.
- ElastAlert - a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
- Eventhub - open source event analytics platform.
- Hermes - asynchronous message broker built on top of Kafka.
- Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
- SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
- Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
- Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
-
Benchmarking
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.
- Deeplearning4j Benchmarks
- Intel HiBench - a Hadoop benchmark suite.
-
Books
-
2001 - 2010
- Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
- Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
- Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
- Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.
- Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business.
- Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
- Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
- Spark in Action - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
- Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
- Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
- Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook!
- Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.
- Distributed Systems for fun and profit
- Graph-Powered Machine Learning - Alessandro Negro. Combine graph theory and models to improve machine learning projects
- Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
-
Data Visualization
- The beauty of data visualization
- Designing Data Visualizations with Noah Iliinsky
- Hans Rosling's 200 Countries, 200 Years, 4 Minutes
- Ice Bucket Challenge Data Visualization
- awesome-awesomeness
- awesome-public-datasets
- awesome
- list
- awesome-awesome-awesome
- awesome-analytics
- awesome-graph-classification
- awesome-network-embedding
- awesome-community-detection
- awesome-decision-tree-papers
- awesome-fraud-detection-papers
- awesome-gradient-boosting-papers
- awesome-monte-carlo-tree-search-papers
- awesome-kafka
- Google Bigtable
-
-
Business Intelligence
- BIME Analytics - business intelligence platform in the cloud.
- GoodData - platform for data products and embedded analytics.
- Jaspersoft - powerful business intelligence suite.
- Jedox Palo - customisable Business Intelligence platform.
- Jethrodata - Interactive Big Data Analytics.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
- Saiku Analytics - Open source analytics platform.
- Knowage - open source business intelligence platform. (former [SpagoBi](http://www.spagobi.org/))
- SparklineData SNAP - modern B.I platform powered by Apache Spark.
- Tableau - business intelligence platform.
- Blazer - business intelligence made simple.
- Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
- Numeracy - Fast, clean SQL client and business intelligence.
- Pentaho - business intelligence platform.
- Zoomdata - Big Data Analytics.
- Microsoft - business intelligence software and platform.
-
Columnar Databases
- MonetDB - column store database.
- Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
- EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
- Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
- Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
- IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
- LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
- ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
-
Data Ingestion
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service
- Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Kestrel - distributed message queue system.
- Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
- Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
- Facebook Scribe - streamed log data aggregator.
- Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
- Heka - open source stream processing software system.
- HIHO - framework for connecting disparate data sources with Hadoop.
- Logstash - a tool for managing events and logs.
- Netflix Suro - log aggregator like Storm and Samza based on Chukwa.
- Pinterest Secor - a service implementing Kafka log persistence.
- Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
- StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
- RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
- redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
- Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
- LinkedIn White Elephant - log aggregator and dashboard.
- LinkedIn Gobblin - LinkedIn's universal data ingestion framework.
-
Data Visualization
- Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
- chartd - responsive, retina-compatible charts with just an img tag.
- D3 - JavaScript library for manipulating documents.
- DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
- FnordMetric - write SQL queries that return SVG charts rather than tables
- Grafana - Graphite dashboard frontend, editor and graph composer.
- Graphite - scalable Realtime Graphing.
- Highcharts - simple and flexible charting API.
- MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data.
- Zing Charts - JavaScript charting library for big data.
- Lumify - open source big data analysis and visualization platform
- Airpal - Web UI for PrestoDB.
- Arbor - graph visualization library using web workers and jQuery.
- Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
- Bloomery - Web UI for Impala.
- CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
- Chartist.js - another open source HTML5 Charts visualization.
- Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
- Cubism - JavaScript library for time series visualization.
- DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
- D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
- Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
- Envisionjs - dynamic HTML5 visualization.
- Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
- Freeboard - open source real-time dashboard builder for IoT and other web mashups.
- Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
- Google Charts - simple charting API.
- Kibana - visualize logs and time-stamped data
- Matplotlib - plotting with Python.
- Peity - Progressive SVG bar, line and pie charts.
- Plotly.js
- Redash - open-source platform to query and visualize data.
- Sigma.js - JavaScript library dedicated to graph drawing.
- Vega - a visualization grammar.
- Zeppelin - a notebook-style collaborative data analysis.
- DataSphere Studio - one-stop data application development management portal.
- Recline - simple but powerful library for building data applications in pure Javascript and HTML.
- Echarts - Baidu's enterprise charts.
- Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
- NVD3 - chart components for d3.js.
-
Distributed Filesystem
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud
- Quantcast File System QFS - open-source distributed file system.
- Tahoe-LAFS - decentralized cloud storage system.
- Ambry - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
- Seaweed-FS - simple and highly scalable distributed file system.
- Baidu File System - distributed filesystem.
- Disco DDFS - distributed filesystem.
- Alluxio - reliable file sharing at memory speed across cluster frameworks.
- Lustre file system - high-performance distributed filesystem.
-
Distributed Index
-
Distributed Programming
- Apache APEX - a unified, enterprise platform for big data stream and batch processing.
- Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
- Apache Gearpump - real-time big data streaming engine based on Akka.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- Google MapReduce - map reduce framework.
- Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Nokia Disco - MapReduce framework developed by Nokia.
- Pinterest Pinlater - asynchronous job execution system.
- Rackerlabs Blueflood - multi-tenant distributed metric processing system
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
- Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
- Google Dataflow - create data pipelines to help ingest, transform and analyze data.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
- Facebook Peregrine - Map Reduce framework.
- AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- Ray - A fast and simple framework for building and running distributed applications.