awesome-data-engineering

A curated list of data engineering tools for software developers
https://github.com/igorbarinov/awesome-data-engineering

Last synced: 8 days ago

Data Ingestion
- librdkafka - The Apache Kafka C/C++ library.
- Singer SDK - The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
- DataSpoc Pipe - Data ingestion engine that connects 400+ Singer taps to Parquet files in cloud buckets (S3, GCS, Azure). Streaming, incremental, with auto-catalog.
- Enrich.sh - Managed event ingestion service that converts JSON sent to a REST API into Hive-partitioned Parquet on Cloudflare R2, queryable from DuckDB, ClickHouse, BigQuery, Snowflake, and Python.
- drt - OSS Reverse ETL CLI. Sync data from warehouses to business tools via YAML.
- data-genie - High-performance, streaming-first ETL engine for Node.js and TypeScript with constant memory footprint.
- pdfmux - Python PDF-to-Markdown orchestrator. Classifies each page and routes to the optimal backend (PyMuPDF, Docling, RapidOCR, Gemini Flash), emitting Markdown plus a per-page confidence score so ingestion pipelines can quarantine low-trust pages before feeding LLMs or retrieval.
- LinkedIn Jobs Scraper - Crawlee-based actor extracting structured LinkedIn job listings at scale without API keys.
- CARQ - Context-Aware RAG Processing Queue for high availability and adaptive rate-limiting.
- Duckle - Local-first, open-source desktop ETL/ELT studio: drag a pipeline onto a canvas (or describe it to a built-in on-device AI assistant) and run it at native speed through DuckDB. 290+ connectors, a scheduler, and an MCP server for driving pipelines from an LLM. No cloud, no servers.
- Rawbbit - Open-source self-hosted analytics pipeline that lands raw events as Parquet in your own object storage. Uses NATS JetStream for durable buffering and BigQuery external tables for querying. Designed for teams that want to own their raw event data.
- enrich-companies - CLI tool to enrich CSV files with company data (financials, contacts, metadata) from 250M+ company records. Available on [npm](https://www.npmjs.com/package/enrich-companies).
- faucet-stream - Config-driven data-movement platform for Rust with pluggable source and sink connectors, running ETL, CDC, and streaming pipelines declaratively from YAML or embedded as a library.
Data Lake Management
- lakeFS - An open source platform that delivers resilience and manageability to object-storage based data lakes.
- Project Nessie - A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
- Ilum - A modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.
- Gravitino - Open-source, unified metadata management for data lakes, data warehouses, and external catalogs.
- FlightPath Data - FlightPath is a gateway to a data lake's bronze layer, protecting it from invalid external data file feeds as a trusted publisher.
Datasets
- Data Dumps
  - GitHub Archive - GitHub's public timeline since 2011, updated every hour.
  - Common Crawl - Open source repository of web crawl data.
  - Wikipedia - Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
  - FirstData - The world's most comprehensive authoritative data source knowledge base. 160+ curated sources from governments, international organizations, and research institutions with MCP integration.
  - The Quiet-Broke Index - A 30-metro composite of US household cost burdens (housing, taxes, childcare, healthcare, transport) aggregated from Census ACS, BLS Consumer Expenditure Survey, and HUD Fair Market Rents. Open methodology, free, no email gate.
  - Mindweave Synthetic Business Data - 42-table synthetic SME dataset with double-entry accounting, tax compliance (AU/US/UK), and temporal realism. CSV, SQL, Parquet, SQLite. Ideal for ETL pipeline testing.
  - LatAm Synth - Synthetic financial savings behavior generator for Latin America: users, savings goals, and transactions calibrated on 506K real records (2015–2024). Reproducible by seed, 100% synthetic.
- Realtime
  - Reddit - Real-time data, including comments, submissions, and links posted to Reddit.
  - Eventsim - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
  - Twitter Realtime - The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.
  - Helium MCP - Remote MCP server for real-time financial data, 3.2M+ news articles, ML options pricing, and news bias analysis. Free, no API key. [MCP](https://heliumtrades.com/mcp)
  - Sorsa API - Real-time X (Twitter) data API providing tweets, profiles, search, communities and engagement metrics. Up to 50x cheaper than the official X API with 20 req/sec rate limit, JSON output.
  - Eventum - Data generation platform for producing synthetic event streams with complex correlations.
Docker
- Rancher - RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.
- Kontena - Application Containers for Masses.
- ImageLayers - Visualize Docker images and the layers that compose them.
- Gockerize - Packages Go services into minimal Docker containers.
- Flocker - Easily manage Docker containers & their data.
- Weave - Weaving Docker containers into applications.
- Zodiac - A lightweight tool for easy deployment and rollback of dockerized applications.
- cAdvisor - Analyzes resource usage and performance characteristics of running containers.
- Micro S3 persistence - Docker microservice for saving/restoring volume data to S3.
- Rocker-compose - Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
- Nomad - A cluster manager, designed for both long-lived services and short-lived batch processing workloads.
ELK Elastic Logstash Kibana
- docker-logstash - A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
- elasticsearch-jdbc - JDBC importer for Elasticsearch.
- ZomboDB - PostgreSQL Extension that allows creating an index backed by Elasticsearch.
File System
- HDFS - A distributed file system designed to run on commodity hardware.
- AWS S3 - Object storage built to retrieve any amount of data from anywhere.
- CEPH - A unified, distributed storage system designed for excellent performance, reliability, and scalability.
- OrangeFS - Orange File System is a branch of the Parallel Virtual File System.
- GlusterFS - Gluster Filesystem.
- XtreemFS - Fault-tolerant distributed file system for all storage needs.
- LizardFS - Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.
- Snakebite - A pure Python HDFS client.
- JuiceFS - A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
- SnackFS - A bite-sized, lightweight HDFS compatible file system built over Cassandra.
- smart_open - Utils for streaming large files (S3, HDFS, gzip, bz2).
- SeaweedFS - Seaweed-FS is a simple and highly scalable distributed file system with two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS chooses to implement only a key~file mapping. By analogy with "NoSQL", you can call it "NoFS".
Monitoring
- Prometheus
  - Prometheus.io - An open-source service monitoring system and time series database.
  - HAProxy Exporter - Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.
  - Signals CLI - Intent signal monitoring CLI. Track LinkedIn engagers, keyword posters, job changers, funding events. JSON output for data pipelines.
Profiling
- Data Profiler
  - Data Profiler - The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.
  - YData Profiling - A general-purpose open-source data profiler for high-level analysis of a dataset.
  - Desbordante - An open-source data profiler specifically focused on discovery and validation of complex patterns in data.
Schema
- Data Profiler
  - SchemaCrawler - Open-source and free relational database schema discovery and comprehension tool. Documents and diagrams relational database schemas from your Java programs, build tools and the command line. Find database design issues with lint, and write scripts against the database. Includes an MCP Server for use by AI agents.
Serialization format
- PigZ - A parallel implementation of gzip for modern multi-processor, multi-core machines.
- Apache ORC - The smallest, fastest columnar storage for Hadoop workloads.
- SequenceFile - A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- Snappy - A fast compressor/decompressor. Used with Parquet.
- ProtoBuf - Protocol Buffers - Google's data interchange format.
- Kryo - A fast and efficient object graph serialization framework for Java.
- AKF - The AI native file format. Trust scores, source provenance, and compliance metadata that embed into 20+ formats (DOCX, PDF, images, code). EXIF for AI.
- PFC-JSONL - Specialized JSONL log compressor with block-level timestamp indexing and DuckDB integration. Achieves ~9% compression ratio (better than gzip) with time-range random access queries.
- ParquetKit - Browser-based viewer, SQL workbench and converter for Parquet files powered by DuckDB-WASM. Fully client-side, no upload.
Stream Processing
- Apache Beam - A unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.
- Spark Streaming - Makes it easy to build scalable fault-tolerant streaming applications.
- Apache Flink - A streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data.
- Apache Hudi - An open source framework for managing storage for real-time processing. One of its most interesting features is upsert support.
- Spring Cloud Dataflow - Streaming and tasks execution between Spring Boot apps.
- Bonobo - A data-processing toolkit for Python 3.5+.
- PipelineDB - The Streaming SQL Database.
- Robinhood's Faust - Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
- HStreamDB - The streaming database built for IoT data storage and real-time processing.
- Zilla - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
- SwimOS - A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.
- CocoIndex - An open source ETL framework for building fresh indexes for AI.
- VoltDB - An ACID-compliant RDBMS that uses a [shared-nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture).
- Pathway - Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
- Kuiper - Lightweight IoT data analytics and streaming software for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
Testing
- Data Profiler
  - DataKitchen - Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.
  - Grai - A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
  - DQOps - An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
  - RunSQL - Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.
  - Spark Playground - Write, run, and test PySpark code on Spark Playground's online compiler. Access real-world sample datasets & solve interview questions to enhance your PySpark skills for data engineering roles.
  - Great Expectations - Open source data validation framework for managing data quality. Users define and document "expectations", rules about how data should look and behave.
  - daffy - Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
  - Snowflake Emulator - A Snowflake-compatible emulator for local development and testing.
  - Provero - A vendor-neutral, declarative data quality engine. Define checks in YAML, run anywhere. Includes 16 built-in check types, SQL batch optimizer, anomaly detection, and data contracts.
  - DataScreenIQ - Real-time data quality firewall for pipelines and APIs. Screens rows in milliseconds for schema drift, null spikes, type mismatches, and data anomalies with PASS / WARN / BLOCK decisions.
  - DataDriven - Interview practice with SQL query execution, Python, and data modeling exercises.
  - Aegis DQ - Open-source agentic data quality framework with LLM-powered diagnosis, root-cause analysis, SQL auto-fix proposals, and 31 rule types — DuckDB, Postgres, BigQuery, Databricks, Athena, Snowflake.
  - Scherlok - Zero-config data quality CLI. Profiles every table on first run, then auto-detects anomalies (volume drops, schema drift, freshness misses, distribution shifts) on subsequent runs. No YAML, no rules to write. Works with Postgres, BigQuery, Snowflake, and dbt.
  - Fixzi - JSON/XML validation and API contract monitoring tool for debugging and testing structured data.
Workflow
- CronQ - An application cron-like system. [Used](https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) with Luigi. Deprecated.
- Cascading - Java based application development platform.
- Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
- Oozie - A workflow scheduler system to manage Apache Hadoop jobs.
- Kedro - A framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
- SuprSend - Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox, with workflows that trigger notifications directly from your data warehouse.
- Dataform - An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.
- Kestra - Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
- Luigi - A Python module that helps you build complex pipelines of batch jobs.
- Airflow - A system to programmatically author, schedule, and monitor data pipelines.
- Pinball - DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
- Dagster - An open-source Python library for building data applications.
- RudderStack - A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
- PACE - An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).
- Multiwoven - The open-source reverse ETL, data activation platform for modern data teams.
- Census - A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
- Hamilton - A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
- Bruin - End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.
- Bonnard - Governed, multi-tenant MCP access to your customers' data. Turn your warehouse, dbt, or semantic layer into a secure, per-customer MCP for AI agents.
- Dotflow - A lightweight Python library for building execution pipelines with retry, parallel execution, cron scheduling, and async support.
- OrionBelt Semantic Layer - Open-source semantic sidecar that compiles YAML-defined dimensions, measures, and metrics into optimized SQL across 8 engines (BigQuery, ClickHouse, Databricks, Dremio, DuckDB, MySQL, PostgreSQL, Snowflake). Unified REST, MCP, and Postgres wire protocol; one model powers AI agents, analytics, DQ rules, and KPIs.
- DataFlow - Open-source platform for data preparation, synthetic data generation, and AI/data pipelines. Includes reusable skills for automating workflow steps across data and AI tasks.
- Nika - Intent-as-code workflow engine for AI data pipelines: reviewable YAML DAGs statically checked (schema, permits, cost floor) before execution, with tamper-evident run traces.
- OneQuery - Self-hosted gateway for safe, auditable queries for agents across approved data sources.

Programming Languages

Python 27 Java 18 Go 18 Scala 7 Rust 7 JavaScript 6 C++ 5 C 4 Shell 3 TypeScript 3
