awesome-data-engineering

A curated list of data engineering tools for software developers
https://github.com/igorbarinov/awesome-data-engineering

Batch Processing
- Hadoop MapReduce - A software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- Spark RDD API Examples - Examples by Zhen He.
- AWS EMR - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
- Data Mechanics - A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.
- Tez - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- Mahout - An environment for quickly creating scalable performant machine learning applications.
- Spark MLlib - Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Giraph - An iterative graph processing system built for high scalability.
- Spark GraphX - Apache Spark's API for graphs and graph-parallel computation.
- Presto - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
- Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
- Deep Spark - Connecting Apache Spark with different data stores. Deprecated.
- Delight - A free & cross platform monitoring tool (Spark UI / Spark History Server alternative).
- Bistro - A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via _functions_ and processes data via _column operations_, as opposed to relying only on set operations as in conventional approaches like MapReduce or SQL.
- Hivemall - Scalable machine learning library for Hive/Hadoop.
- PyHive - Python interface to Hive and Presto.
- Substation - A cloud native data pipeline and transformation toolkit written in Go.
- GraphLab Create - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
- dna-claude-analysis - Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, etc.) and generating a terminal-style single-page HTML visualization.
- Datatrax - Pure-Go classic machine learning toolkit and data engineering utilities. Eight algorithms with zero external dependencies.
- Zingg - Open source Master Data Management platform using machine learning for entity resolution at scale. Native to Databricks, Microsoft Fabric, Snowflake, AWS, and GCP. Golden records are maintained through a persistent Zingg ID across all systems and sources.
Charts and Dashboards
- Highcharts - A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
- ZingChart - Fast JavaScript charts for any data set.
- D3.js - A JavaScript library for manipulating documents based on data.
- SmoothieCharts - A JavaScript Charting Library for Streaming Data.
- Redash - Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.
- PyQtGraph - A pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy, intended for use in mathematics, scientific, and engineering applications.
- PyXley - Python helpers for building dashboards using Flask and React.
- Plotly - Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
- Metabase - The easy, open source way for everyone in your company to ask questions and learn from data.
- QueryGPT - Natural language database query interface with automatic chart generation, supporting Chinese and English queries.
- Seaborn - A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- Apache Superset - A modern, enterprise-ready business intelligence web application.
- AI for Database - Agentic AI platform to connect any database (PostgreSQL, MySQL, MongoDB, etc.) and query in plain English; includes self-refreshing intelligent dashboards and action workflows triggered by data changes.
- Dekart - Open-source SQL to map platform for BigQuery, Snowflake, and PostGIS.
Community
- Books
  - Snowflake Data Engineering - A practical introduction to data engineering on the Snowflake cloud data platform.
  - Best Data Science Books - This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.
  - Architecting an Apache Iceberg Lakehouse - A guide to designing an Apache Iceberg lakehouse from scratch.
  - Learn AI Data Engineering in a Month of Lunches - A fast, friendly guide to integrating large language models into your data workflows.
- Conferences
  - Data Council - The first technical conference that bridges the gap between data scientists, data engineers and data analysts.
- Forums
  - /r/dataengineering - News, tips, and background on Data Engineering.
  - /r/etl - Subreddit focused on ETL.
  - AI Dev Jobs - Job board focused on AI, ML, and data engineering roles with 7,400+ listings, salary data, and a free REST API.
- Podcasts
  - Data Engineering Podcast - The show about modern data infrastructure.
  - The Data Stack Show - Conversations with data engineers, analysts, and data scientists about building and maintaining data infrastructure, delivering data and data products, and driving better business outcomes with data.
  - Latent Space - Technical deep dives on AI engineering, from model training to deployment.
  - Practical AI - Making AI practical, productive, and accessible to everyone.
  - Software Engineering Daily - Daily interviews about technical software topics, including data infrastructure.
  - The Analytics Engineering Podcast - How analytics engineers build and maintain data pipelines at scale.
  - Chain of Thought - Interviews with AI and data infrastructure leaders on building production systems.
Databases
- MySQL - The world's most popular open source database.
- Percona XtraBackup - A free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
- MariaDB - An enhanced, drop-in replacement for MySQL.
- PostgreSQL - The world's most advanced open source database.
- Amazon RDS - Makes it easy to set up, operate, and scale a relational database in the cloud.
- Redis - An open source, BSD licensed, advanced key-value cache and store.
- Riak - A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
- AWS DynamoDB - A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
- SSDB - A high performance NoSQL database supporting many data structures, an alternative to Redis.
- Cassandra Calculator - This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
- HBase - The Hadoop database, a distributed, scalable, big data store.
- AWS Redshift - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
- Elasticsearch - Search & Analyze Data in Real Time.
- Couchbase - The highest performing NoSQL distributed database.
- RethinkDB - The open-source database for the realtime web.
- RavenDB - Fully Transactional NoSQL Document Database.
- Neo4j - The world's leading graph database.
- Titan - A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
- Apache Geode - An open source, distributed, in-memory database for scale-out applications.
- QuestDB - A relational column-oriented database designed for real-time analytics on time series and event data.
- Riak-TS - An enterprise-grade NoSQL time series database optimized specifically for IoT and time series data.
- Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- RQLite - Replicated SQLite using the Raft consensus protocol.
- TiDB - A distributed NewSQL database compatible with MySQL protocol.
- mysql_utils - Pinterest MySQL Management Tools.
- HyperDex - A scalable, searchable key-value store. Deprecated.
- IonDB - A key-value store for microcontroller and IoT applications.
- CCM - A script to easily create and destroy an Apache Cassandra cluster on localhost.
- FiloDB - Distributed. Columnar. Versioned. Streaming. SQL.
- MemDB - Distributed Transactional In-Memory Database (based on MongoDB).
- FlockDB - A distributed, fault-tolerant graph database by Twitter. Deprecated.
- Gaffer - A large-scale graph database.
- InfluxDB - Scalable datastore for metrics, events, and real-time analytics.
- OpenTSDB - A scalable, distributed Time Series Database.
- kairosdb - Fast scalable time series database.
- Heroic - A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
- Akumuli - A numeric time-series database that can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Dalmatiner DB - Fast distributed metrics database.
- Blueflood - A distributed system designed to ingest and process time series data.
- Timely - A time series database application that provides secure access to time series data based on Accumulo and Grafana.
- cayley - An open-source graph database. Google.
- Snappydata - OLTP + OLAP Database built on Apache Spark.
- DuckDB - A fast in-process analytical database that has zero external dependencies, runs on Linux/macOS/Windows, offers a rich SQL dialect, and is free and extensible.
- Vertica - Distributed, MPP columnar database with extensive analytics SQL.
- Percona Server for MongoDB - A free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
- OrientDB - 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
- Tarantool - An in-memory database and application server.
- ScyllaDB - NoSQL data store using the seastar framework, compatible with Apache Cassandra.
- Kyoto Tycoon - A lightweight network server on top of the Kyoto Cabinet key-value database, built for high performance and concurrency.
- TimescaleDB - A time-series SQL database built as an extension on top of PostgreSQL, providing fast analytics, scalability, and automated data management on a proven storage engine.
- ClickHouse - Distributed columnar DBMS for OLAP. SQL.
- Druid - Column oriented distributed data store ideal for powering interactive applications.
- Greenplum - The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse that provides powerful and rapid analytics on petabyte-scale data volumes.
- Actionbase - A database for user interactions (likes, views, follows) represented as graphs, with precomputed reads served in real-time.
- Rivestack - Managed PostgreSQL with pgvector for AI workloads. HNSW indexing, sub-4ms latency, and built-in SQL editor with automatic embedding generation.
- ArcadeDB - Open-source multi-model database with native graph, document, key-value, and vector support. SQL, Cypher, and Gremlin query languages. Apache 2.0 license.
- Omnigraph - Typed graph database where agents branch and merge like Git. S3-native, Rust, traversal + vector + BM25 in one runtime.
- SlothDB - In-process analytical SQL database written in C++20. Reads Parquet, CSV, JSON, Avro, Arrow, SQLite, and Excel directly. Single binary, Python package, and 1.3 MB WASM build for the browser.
- Datomic - The fully transactional, cloud-ready, distributed database.
- chDB - Embedded ClickHouse — full ClickHouse SQL dialect, ~80 data formats, and 12+ source connectors (S3, Postgres, MongoDB, Kafka, Iceberg) in core. Python, Go, Rust, Node, Bun, Zig, and Ruby bindings.
Data Comparison
- datacompy - A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
- dvt - Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.
- koala-diff - A high-performance Python library for comparing large datasets (CSV, Parquet) locally using Rust and Polars. It features zero-copy streaming to prevent OOM errors and generates interactive HTML data quality reports.
- everyrow - AI-powered data operations SDK for Python. Semantic deduplication, fuzzy table merging, and intelligent row ranking using LLM agents.
- FutureSearch SDK - Python SDK that dispatches parallel web-research agents across
Data Ingestion
- Kafka - Publish-subscribe messaging rethought as a distributed commit log.
- AWS Kinesis - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
- RabbitMQ - Robust messaging for applications.
- Nakadi - An open source event messaging platform that provides a REST API on top of Kafka-like queues.
- Pravega - Provides a new storage abstraction - a stream - for continuous and unbounded data.
- Apache Pulsar - An open-source distributed pub-sub messaging system.
- Sling - CLI data integration tool specialized in moving data between databases, as well as storage systems.
- Meltano - CLI & code-first ELT.
- Singer SDK - The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
- BottledWater - Change data capture from PostgreSQL into Kafka. Deprecated.
- kafkat - Simplified command-line administration for Kafka brokers.
- pg-kafka - A PostgreSQL extension to produce messages to Apache Kafka.
- kafka-docker - Kafka in Docker.
- kafka-node - Node.js client for Apache Kafka 0.8.
- Secor - Pinterest's Kafka to S3 distributed consumer.
- Kafka-logger - Kafka-winston logger for Node.js from Uber.
- Heka - Data Acquisition and Processing Made Easy. Deprecated.
- Google Sheets ETL - Live import all your Google Sheets to your data warehouse.
- Artie - Real-time data ingestion tool leveraging change data capture.
- CsvPath Framework - A delimited data preboarding framework that fills the gap between MFT and the data lake.
- kafka-manager - A tool for managing Apache Kafka.
- Estuary Flow - No/low-code data pipeline platform that handles both batch and real-time data ingestion.
- dlt - A fast and simple pipeline-building library for Python data developers; runs in notebooks, cloud functions, Airflow, etc.
- Kroxylicious - A Kafka Proxy, solving problems like encrypting your Kafka data at rest.
- db2lake - Lightweight Node.js ETL framework for databases → data lakes/warehouses.
- AWS Data Wrangler - Utility belt to handle data on AWS.
- Gobblin - Universal data ingestion framework for Hadoop from LinkedIn.
- ingestr - CLI tool to copy data between databases with a single command. Supports 50+ sources including PostgreSQL, MySQL, MongoDB, Salesforce, Shopify to any data warehouse.
- Kreuzberg - Polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Extracts text, tables, and metadata from 62+ document formats for data pipeline ingestion.
- DataRaven - Managed cloud object storage transfers for ingestion workflows.
- Arpe.io - High-speed CLI tools for database export, import, replication and migration with parallel streaming to CSV, Parquet, JSON and cloud storage, supporting PostgreSQL, MySQL, Oracle, SQL Server and 80+ sources.
- Crustdata - A real-time B2B data API for company and people intelligence, providing firmographics, headcount signals, job listings, web traffic, and funding events via REST API and webhooks for data enrichment pipelines.
- crdt-merge - Conflict-free merge for DataFrames, JSON, ML models & distributed agents — powered by CRDTs.
- librdkafka - The Apache Kafka C/C++ library.
- DataSpoc Pipe - Data ingestion engine that connects 400+ Singer taps to Parquet files in cloud buckets (S3, GCS, Azure). Streaming, incremental, with auto-catalog.
