data-engineering-collection

A collection of awesome software, libraries, learning tutorials, documents, books, resources and interesting stuff about Big Data Science & Engineering.
https://github.com/exajobs/data-engineering-collection

  • Applications

    • Apache Tika - content analysis toolkit.
    • Hunk - Splunk analytics for Hadoop.
    • Imhotep - Large scale analytics platform by Indeed.
    • Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
    • Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
    • MADlib - data-processing library of an RDBMS to analyze data.
    • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
    • Splunk - analyzer for machine-generated data.
    • Sumo Logic - cloud based analyzer for machine-generated data.
    • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
    • 411 - a web application for alert management resulting from scheduled searches into Elasticsearch.
    • Adobe spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
    • Argus - Time series monitoring and alerting platform.
    • AthenaX - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
    • Atlas - a backend for managing dimensional time series data.
    • ElastAlert - a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
    • Eventhub - open source event analytics platform.
    • Hermes - asynchronous message broker built on top of Kafka.
    • Kapacitor - an open source framework for processing, monitoring, and alerting on time series data.
    • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
    • SparkR - R frontend for Spark.
    • Rakam - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
    • Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
    • Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
    • Eclipse BIRT - Eclipse-based reporting system.
    • HASH - open source simulation and visualization platform.
    • Kylin - open source Distributed Analytics Engine from eBay.
    • Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
  • Benchmarking

  • Books

  • Business Intelligence

    • BIME Analytics - business intelligence platform in the cloud.
    • Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
    • GoodData - platform for data products and embedded analytics.
    • Jaspersoft - powerful business intelligence suite.
    • Jedox Palo - customisable Business Intelligence platform.
    • Jethrodata - Interactive Big Data Analytics.
    • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
    • Qlik - business intelligence and analytics platform.
    • Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
    • Saiku Analytics - Open source analytics platform.
    • Knowage - open source business intelligence platform (formerly SpagoBI, http://www.spagobi.org/).
    • SparklineData SNAP - modern BI platform powered by Apache Spark.
    • Tableau - business intelligence platform.
    • Blazer - business intelligence made simple.
    • Metabase - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
    • Numeracy - Fast, clean SQL client and business intelligence.
    • Pentaho - business intelligence platform.
    • Zoomdata - Big Data Analytics.
    • Microsoft - business intelligence software and platform.
    • datapine - self-service business intelligence tool in the cloud.
    • intermix.io - Performance Monitoring for Amazon Redshift
  • Columnar Databases

    • Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
    • MonetDB - column store database.
    • EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
    • Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
    • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
    • IndexR - an open-source columnar storage format for fast, real-time analytics on big data.
    • LocustDB - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
    • ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
    • Columnar Storage - an explanation of what columnar storage is and when you might want it.
    • Actian Vector - column-oriented analytic database.
    • SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
    • Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
  • Databases

  • Data Ingestion

    • Amazon Kinesis - real-time processing of streaming data at massive scale.
    • Amazon Web Services Glue - serverless, fully managed extract, transform, and load (ETL) service.
    • Apache NiFi - an integrated data logistics platform for automating the movement of data between disparate systems.
    • Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
    • Kestrel - distributed message queue system.
    • Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
    • Apache Pulsar - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
    • Facebook Scribe - streamed log data aggregator.
    • Gazette - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
    • Heka - open source stream processing software system.
    • HIHO - framework for connecting disparate data sources with Hadoop.
    • Logstash - a tool for managing events and logs.
    • Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
    • Pinterest Secor - a service implementing Kafka log persistence.
    • Skizze - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
    • StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
    • RudderStack - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
    • redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
    • Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
    • LinkedIn White Elephant - log aggregator and dashboard.
    • LinkedIn Gobblin - LinkedIn's universal data ingestion framework.
    • Apache Chukwa - data collection system.
    • Apache Flume - service to manage large amounts of log data.
    • Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
    • Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
    • Fluentd - tool to collect events and logs.
    • LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
  • Data Visualization

    • chartd - responsive, retina-compatible charts with just an img tag.
    • D3 - JavaScript library for manipulating documents.
    • DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
    • FnordMetric - write SQL queries that return SVG charts rather than tables
    • Grafana - Graphite dashboard frontend, editor and graph composer.
    • Graphite - scalable real-time graphing.
    • Highcharts - simple and flexible charting API.
    • MetricsGraphics.js - a library built on top of D3 that is optimized for time-series data.
    • Zing Charts - JavaScript charting library for big data.
    • Lumify - open source big data analysis and visualization platform
    • Airpal - Web UI for PrestoDB.
    • Arbor - graph visualization library using web workers and jQuery.
    • Banana - visualize logs and time-stamped data stored in Solr. Port of Kibana.
    • Bloomery - Web UI for Impala.
    • CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
    • Chartist.js - another open source HTML5 Charts visualization.
    • Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
    • Cubism - JavaScript library for time series visualization.
    • DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
    • D3.compose - Compose complex, data-driven visualizations from reusable charts and components.
    • Dash - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of Plotly, no JS required.
    • Envisionjs - dynamic HTML5 visualization.