An open API service indexing awesome lists of open source software.

awesome-bigdata

A curated list of awesome big data frameworks, resources and other awesomeness.
https://github.com/oxnr/awesome-bigdata

  • Applications

    • Apache Tika - content analysis toolkit.
    • Hunk - Splunk analytics for Hadoop.
    • Imhotep - Large scale analytics platform by Indeed.
    • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
    • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
    • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
    • Splunk - analyzer for machine-generated data.
    • Sumo Logic - cloud based analyzer for machine-generated data.
    • 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
    • Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
    • Argus - Time series monitoring and alerting platform.
    • AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
    • Atlas - a backend for managing dimensional time series data.
    • Comet - Comet provides an end-to-end model evaluation platform for AI developers, with best in class LLM evaluations, experiment tracking, and production monitoring.
    • ElastAlert - a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
    • Eventhub - open source event analytics platform.
    • Hermes - asynchronous message broker built on top of Kafka.
    • Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
    • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
    • Opik - Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
    • SnappyData - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
    • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
    • SparkR - R frontend for Spark.
    • Substation - Substation is a cloud native data pipeline and transformation toolkit written in Go.
    • Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
    • Gigasheet - cloud spreadsheet for exploring and analyzing large datasets.
    • Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
    • MADlib - data-processing library of an RDBMS to analyze data.
    • Kylin - open source Distributed Analytics Engine from eBay.
  • Benchmarking

  • Books

  • Business Intelligence

    • BIME Analytics - business intelligence platform in the cloud.
    • GoodData - platform for data products and embedded analytics.
    • Jaspersoft - powerful business intelligence suite.
    • Jedox Palo - customisable Business Intelligence platform.
    • Jethrodata - Interactive Big Data Analytics.
    • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
    • Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
    • Saiku Analytics - Open source analytics platform.
    • Knowage - open source business intelligence platform (formerly SpagoBI, http://www.spagobi.org/).
    • SparklineData SNAP - modern BI platform powered by Apache Spark.
    • Tableau - business intelligence platform.
    • Blazer - business intelligence made simple.
    • Lightdash - The open source Looker alternative built on dbt
    • Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
    • Numeracy - Fast, clean SQL client and business intelligence.
    • Pentaho - business intelligence platform.
    • Zoomdata - Big Data Analytics.
    • Microsoft - business intelligence software and platform.
    • Datapallas - BI and data platform with AI exploration, dashboards, and pixel-perfect report generation; formerly ReportBurster.
    • Chartio - lean business intelligence platform to visualize and explore your data.
    • datapine - self-service business intelligence tool in the cloud.
  • Columnar Databases

    • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
    • MonetDB - column store database.
    • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
    • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
    • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
    • IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
    • LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
    • ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
    • Actian Vector - column-oriented analytic database.
    • Parquet - columnar storage format for Hadoop.
    • SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
  • Data Ingestion

    • Amazon Kinesis - real-time processing of streaming data at massive scale.
    • Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service
    • Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
    • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
    • Kestrel - distributed message queue system.
    • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
    • Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
    • Facebook Scribe - streamed log data aggregator.
    • Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
    • Heka - open source stream processing software system.
    • HIHO - framework for connecting disparate data sources with Hadoop.
    • Logstash - a tool for managing events and logs.
    • Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
    • Pinterest Secor - a service implementing Kafka log persistence.
    • Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
    • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
    • RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
    • Zilla - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.
    • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
    • Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
    • LinkedIn White Elephant - log aggregator and dashboard.
    • LinkedIn Gobblin - LinkedIn's universal data ingestion framework.
    • Airbyte - open-source data movement platform for ELT pipelines and connector-based replication.
    • Apache SeaTunnel - high-performance, distributed data integration platform for batch and streaming synchronization.
    • Bruin - end-to-end data pipeline tool combining ingestion, transformations, and data quality checks.
    • DataRaven - managed cloud object storage transfers for data ingestion workflows.
    • DBConvert Streams - self-hosted CDC replication and database migration tool.
    • Debezium - open-source distributed platform for change data capture.
    • Flink CDC - streaming data integration tool powered by Apache Flink.
    • Graylog - log management platform for collecting, storing, searching, and alerting on machine data.
    • Hevo - managed data pipeline platform for moving data from databases, SaaS apps, cloud storage, SDKs, and streaming services.
    • Hightouch - reverse ETL platform for syncing warehouse data into business applications.
    • ingestr - CLI tool for copying data between sources and destinations.
    • Metricbeat - lightweight shipper for system and service metrics.
    • Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
  • Data Quality and Observability

    • DataKitchen Open Source Data Observability - open-source data observability for monitoring data journeys, data quality, and pipeline events.
    • Great Expectations - open-source framework for validating, documenting, and testing data quality.
    • OpenLineage - open standard and reference implementation for collecting lineage metadata from data pipelines.
    • Soda Core - open-source Python library and CLI for data quality tests.
  • Data Visualization

    • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
    • chartd - responsive, retina-compatible charts with just an img tag.
    • D3 - JavaScript library for manipulating documents.
    • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
    • FnordMetric - write SQL queries that return SVG charts rather than tables
    • Grafana - Graphite dashboard frontend, editor and graph composer.
    • Graphite - scalable Realtime Graphing.
    • Highcharts - simple and flexible charting API.
    • Lumify - open source big data analysis and visualization platform
    • MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data.
    • Zing Charts - JavaScript charting library for big data.
    • Airpal - Web UI for PrestoDB.
    • Arbor - graph visualization library using web workers and jQuery.
    • Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
    • Bloomery - Web UI for Impala.
    • CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
    • Chartist.js - another open source HTML5 Charts visualization.
    • Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
    • Cubism - JavaScript library for time series visualization.
    • DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
    • D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
    • Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
    • Envisionjs - dynamic HTML5 visualization.
    • Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
    • Freeboard - open source real-time dashboard builder for IoT and other web mashups.
    • Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
    • Google Charts - simple charting API.
    • Kibana - visualize logs and time-stamped data
    • Matplotlib - plotting with Python.
    • Peity - Progressive SVG bar, line and pie charts.
    • Plotly.js
    • Recline - simple but powerful library for building data applications in pure Javascript and HTML.
    • Redash - open-source platform to query and visualize data.
    • Sigma.js - JavaScript library dedicated to graph drawing.
    • Vega - a visualization grammar.
    • Zeppelin - a notebook-style collaborative data analysis.
    • DataSphere Studio - one-stop data application development management portal.
    • Superset - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
    • Echarts - Baidu's enterprise charts.
    • NVD3 - chart components for d3.js.
    • Cytoscape - JavaScript library for visualizing complex networks.
    • Flexmonster Pivot Table & Charts - JavaScript component for pivot tables, charts, and web reporting.
    • WebDataRocks - free web pivot table component for embedding analytics in applications.
    • Bokeh - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
    • Chart.js - open source HTML5 Charts visualizations.
    • D3Plus - A fairly robust set of reusable charts and styles for d3.js.
    • Shiny - a web application framework for R.
  • Distributed Filesystem

    • BeeGFS - formerly FhGFS, parallel distributed file system.
    • Google Megastore - scalable, highly available storage.