awesome-data-engineering
A curated list of data engineering tools for software developers
https://github.com/igorbarinov/awesome-data-engineering
-
Batch Processing
- AWS EMR - A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
- Spark MLlib - Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
- Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
- Mahout - An environment for quickly creating scalable performant machine learning applications.
- Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- Tez - An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
- Giraph - An iterative graph processing system built for high scalability.
- Spark GraphX - Apache Spark's API for graphs and graph-parallel computation.
- Hadoop MapReduce - A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- Data Mechanics - A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.
- Spark RDD API Examples - Examples by Zhen He.
- Presto - A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
- Deep Spark - Connecting Apache Spark with different data stores. Deprecated.
- Bistro - A light-weight engine for general-purpose data processing including both batch and stream analytics. It is based on a novel data model, which represents data via _functions_ and processes data via _column operations_, as opposed to having only set operations as in conventional approaches like MapReduce or SQL.
- PyHive - Python interface to Hive and Presto.
- Substation - A cloud native data pipeline and transformation toolkit written in Go.
- Delight - A free & cross platform monitoring tool (Spark UI / Spark History Server alternative).
- Hivemall - Scalable machine learning library for Hive/Hadoop.
- GraphLab Create - A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
- dna-claude-analysis - Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, etc.) and generating a terminal-style single-page HTML visualization.
-
Data Ingestion
- AWS Kinesis - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
- Apache Pulsar - An open-source distributed pub-sub messaging system.
- Kafka - Publish-subscribe messaging rethought as a distributed commit log.
- RabbitMQ - Robust messaging for applications.
- Meltano - CLI & code-first ELT.
- Nakadi - An open source event messaging platform that provides a REST API on top of Kafka-like queues.
- Pravega - Provides a new storage abstraction - a stream - for continuous and unbounded data.
- Sling - CLI data integration tool specialized in moving data between databases, as well as storage systems.
- Singer SDK - The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
- Heka - Data Acquisition and Processing Made Easy. Deprecated.
- kafka-node - Node.js client for Apache Kafka 0.8.
- ingestr - CLI tool to copy data between databases with a single command. Supports 50+ sources, including PostgreSQL, MySQL, MongoDB, Salesforce, and Shopify, and loads into any data warehouse.
- Secor - Pinterest's Kafka to S3 distributed consumer.
- kafkat - Simplified command-line administration for Kafka brokers.
- BottledWater - Change data capture from PostgreSQL into Kafka. Deprecated.
- pg-kafka - A PostgreSQL extension to produce messages to Apache Kafka.
- kafka-docker - Kafka in Docker.
- Kafka-logger - Kafka-winston logger for Node.js from Uber.
- Google Sheets ETL - Live import all your Google Sheets to your data warehouse.
- Artie - Real-time data ingestion tool leveraging change data capture.
- Kroxylicious - A Kafka Proxy, solving problems like encrypting your Kafka data at rest.
- CsvPath Framework - A delimited data preboarding framework that fills the gap between MFT and the data lake.
- kafka-manager - A tool for managing Apache Kafka.
- Estuary Flow - No/low-code data pipeline platform that handles both batch and real-time data ingestion.
- dlt - A fast and simple pipeline-building library for Python data devs that runs in notebooks, cloud functions, Airflow, etc.
- db2lake - Lightweight Node.js ETL framework for databases → data lakes/warehouses.
- AWS Data Wrangler - Utility belt to handle data on AWS.
- Kreuzberg - Polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Extracts text, tables, and metadata from 62+ document formats for data pipeline ingestion.
- Gobblin - Universal data ingestion framework for Hadoop from LinkedIn.
- crdt-merge - Conflict-free merge for DataFrames, JSON, ML models & distributed agents — powered by CRDTs.
- kafkacat - Generic command line non-JVM Apache Kafka producer and consumer.
- Airbyte - Open-source data integration for modern data teams.
- DataRaven - Managed cloud object storage transfers for ingestion workflows.
- Arpe.io - High-speed CLI tools for database export, import, replication and migration with parallel streaming to CSV, Parquet, JSON and cloud storage, supporting PostgreSQL, MySQL, Oracle, SQL Server and 80+ sources.
- Crustdata - A real-time B2B data API for company and people intelligence, providing firmographics, headcount signals, job listings, web traffic, and funding events via REST API and webhooks for data enrichment pipelines.
-
Databases
- AWS Redshift - A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
- AWS DynamoDB - A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
- Amazon RDS - Makes it easy to set up, operate, and scale a relational database in the cloud.
- PostgreSQL - The world's most advanced open source database.
- Redis - An open source, BSD licensed, advanced key-value cache and store.
- DuckDB - A fast in-process analytical database that has zero external dependencies, runs on Linux/macOS/Windows, offers a rich SQL dialect, and is free and extensible.
- Apache Geode - An open source, distributed, in-memory database for scale-out applications.
- Elasticsearch - Search & Analyze Data in Real Time.
- MySQL - The world's most popular open source database.
- Couchbase - The highest performing NoSQL distributed database.
- RavenDB - Fully Transactional NoSQL Document Database.
- RethinkDB - The open-source database for the realtime web.
- Neo4j - The world's leading graph database.
- QuestDB - A relational column-oriented database designed for real-time analytics on time series and event data.
- Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- MariaDB - An enhanced, drop-in replacement for MySQL.
- HBase - The Hadoop database, a distributed, scalable, big data store.
- Percona XtraBackup - Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
- Riak - A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
- SSDB - A high performance NoSQL database supporting many data structures, an alternative to Redis.
- Cassandra Calculator - This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
- Titan - A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
- Riak-TS - An enterprise-grade NoSQL time series database optimized specifically for IoT and time series data.
- TiDB - A distributed NewSQL database compatible with MySQL protocol.
- InfluxDB - Scalable datastore for metrics, events, and real-time analytics.
- RQLite - Replicated SQLite using the Raft consensus protocol.
- FiloDB - Distributed. Columnar. Versioned. Streaming. SQL.
- Akumuli - A numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
- cayley - An open-source graph database. Google.
- HyperDex - A scalable, searchable key-value store. Deprecated.
- OpenTSDB - A scalable, distributed Time Series Database.
- Heroic - A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
- kairosdb - Fast scalable time series database.
- FlockDB - A distributed, fault-tolerant graph database by Twitter. Deprecated.
- mysql_utils - Pinterest MySQL Management Tools.
- Timely - A time series database application that provides secure access to time series data based on Accumulo and Grafana.
- Gaffer - A large-scale graph database.
- IonDB - A key-value store for microcontroller and IoT applications.
- CCM - A script to easily create and destroy an Apache Cassandra cluster on localhost.
- MemDB - Distributed Transactional In-Memory Database (based on MongoDB).
- Dalmatiner DB - Fast distributed metrics database.
- Blueflood - A distributed system designed to ingest and process time series data.
- Snappydata - OLTP + OLAP Database built on Apache Spark.
- Vertica - Distributed, MPP columnar database with extensive analytics SQL.
- Percona Server for MongoDB - Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
- OrientDB - 2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
- Tarantool - An in-memory database and application server.
- ScyllaDB - NoSQL data store using the seastar framework, compatible with Apache Cassandra.
- Kyoto Tycoon - Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high-performance and concurrency.
- TimescaleDB - Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.
- ClickHouse - Distributed columnar DBMS for OLAP. SQL.
- Druid - Column oriented distributed data store ideal for powering interactive applications.
- Actionbase - A database for user interactions (likes, views, follows) represented as graphs, with precomputed reads served in real-time.
- GreenPlum - The Greenplum Database (GPDB) is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes.
- ArcadeDB - Open-source multi-model database with native graph, document, key-value, and vector support. SQL, Cypher, and Gremlin query languages. Apache 2.0 license.
- Rivestack - Managed PostgreSQL with pgvector for AI workloads. HNSW indexing, sub-4ms latency, and built-in SQL editor with automatic embedding generation.
- Datomic - The fully transactional, cloud-ready, distributed database.
-
File System
- AWS S3 - Object storage built to retrieve any amount of data from anywhere.
- GlusterFS - Gluster Filesystem.
- LizardFS - Software-defined storage in the form of a distributed, parallel, scalable, fault-tolerant, geo-redundant, and highly available file system.
- XtreemFS - Fault-tolerant distributed file system for all storage needs.
- HDFS - A distributed file system designed to run on commodity hardware.
- CEPH - A unified, distributed storage system designed for excellent performance, reliability, and scalability.
- OrangeFS - Orange File System is a branch of the Parallel Virtual File System.
- JuiceFS - A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
- Snakebite - A pure python HDFS client.
- SnackFS - A bite-sized, lightweight HDFS compatible file system built over Cassandra.
- smart_open - Utils for streaming large files (S3, HDFS, gzip, bz2).
- SeaweedFS - A simple and highly scalable distributed file system with two objectives: to store billions of files and to serve the files fast. Instead of supporting full POSIX file system semantics, SeaweedFS chooses to implement only a key~file mapping. Similar to the word "NoSQL", you can call it "NoFS".
-
Charts and Dashboards
- PyQtGraph - A pure-python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications.
- D3.js - A JavaScript library for manipulating documents based on data.
- Highcharts - A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
- Redash - Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.
- ZingChart - Fast JavaScript charts for any data set.
- SmoothieCharts - A JavaScript Charting Library for Streaming Data.
- Metabase - The easy, open source way for everyone in your company to ask questions and learn from data.
- PyXley - Python helpers for building dashboards using Flask and React.
- Plotly - Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
- QueryGPT - Natural language database query interface with automatic chart generation, supporting Chinese and English queries.
- Seaborn - A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- Apache Superset - A modern, enterprise-ready business intelligence web application.
-
Serialization format
- Apache ORC - The smallest, fastest columnar storage for Hadoop workloads.
- PigZ - A parallel implementation of gzip for modern multi-processor, multi-core machines.
- SequenceFile - A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- ProtoBuf - Protocol Buffers - Google's data interchange format.
- Snappy - A fast compressor/decompressor. Used with Parquet.
- Kryo - A fast and efficient object graph serialization framework for Java.
-
Stream Processing
- Apache Flink - A streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
- Apache Beam - A unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.
- Apache NiFi - An easy to use, powerful, and reliable system to process and distribute data.
- Spark Streaming - Makes it easy to build scalable fault-tolerant streaming applications.
- Bonobo - A data-processing toolkit for python 3.5+.
- Apache Hudi - An open source framework for managing storage for real-time processing. One of its most interesting features is upsert support.
- Spring Cloud Dataflow - Streaming and tasks execution between Spring Boot apps.
- PipelineDB - The Streaming SQL Database.
- SwimOS - A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.
- Robinhood's Faust - Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
- Zilla - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
- HStreamDB - The streaming database built for IoT data storage and real-time processing.
- Pathway - Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
- CocoIndex - An open source ETL framework to build fresh index for AI.
- VoltDB - An ACID-compliant RDBMS which uses a [shared nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture).
- Kuiper - Lightweight IoT data analytics and streaming software for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
-
Workflow
- Cascading - Java based application development platform.
- Oozie - A workflow scheduler system to manage Apache Hadoop jobs.
- Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
- Kestra - Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
- CronQ - An application cron-like system. [Used](https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) w/Luigi. Deprecated.
- Kedro - Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
- SuprSend - Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox with workflows to trigger notifications directly from your data warehouse.
- Dataform - An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.
- RudderStack - A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
- Luigi - A Python module that helps you build complex pipelines of batch jobs.
- Airflow - A system to programmatically author, schedule, and monitor data pipelines.
- Dagster - An open-source Python library for building data applications.
- Pinball - DAG based workflow manager. Job flows are defined programmatically in Python. Support output passing between jobs.
- Multiwoven - The open-source reverse ETL, data activation platform for modern data teams.
- PACE - An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).
- Bruin - End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.
- Census - A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
- Hamilton - A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
- Bonnard - Agent-native semantic layer with governed metrics, React SDK, and multi-warehouse support. Connects AI agents and dashboards to a single source of truth.
-
Community
-
Podcasts
- Data Engineering Podcast - The show about modern data infrastructure.
- Software Engineering Daily - Daily interviews about technical software topics, including data infrastructure.
- The Data Stack Show - A show where they talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
- The Analytics Engineering Podcast - How analytics engineers build and maintain data pipelines at scale.
- Latent Space - Technical deep dives on AI engineering, from model training to deployment.
- Practical AI - Making AI practical, productive, and accessible to everyone.
-
Forums
- /r/dataengineering - News, tips, and background on Data Engineering.
- /r/etl - Subreddit focused on ETL.
-
Conferences
- Data Council - The first technical conference that bridges the gap between data scientists, data engineers and data analysts.
-
Books
- Snowflake Data Engineering - A practical introduction to data engineering on the Snowflake cloud data platform.
- Best Data Science Books - This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.
- Architecting an Apache Iceberg Lakehouse - A guide to designing an Apache Iceberg lakehouse from scratch.
- Learn AI Data Engineering in a Month of Lunches - A fast, friendly guide to integrating large language models into your data workflows.
-
-
Datasets
-
Data Dumps
- GitHub Archive - GitHub's public timeline since 2011, updated every hour.
- Common Crawl - Open source repository of web crawl data.
- Wikipedia - Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
- FirstData - The world's most comprehensive authoritative data source knowledge base. 160+ curated sources from governments, international organizations, and research institutions with MCP integration.
-
Realtime
- Reddit - Real-time data is available including comments, submissions and links posted to reddit.
- Eventsim - Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
- Twitter Realtime - The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.
-
-
Testing
-
Data Profiler
- DataKitchen - Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.
- Great Expectations - Open source data validation framework to manage data quality. Users can define and document “expectations”: rules about how data should look and behave.
- Grai - A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
- DQOps - An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
- RunSQL - Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.
- Spark Playground - Write, run, and test PySpark code on Spark Playground's online compiler. Access real-world sample datasets & solve interview questions to enhance your PySpark skills for data engineering roles.
- Snowflake Emulator - A Snowflake-compatible emulator for local development and testing.
- daffy - Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
- Provero - A vendor-neutral, declarative data quality engine. Define checks in YAML, run anywhere. Includes 16 built-in check types, SQL batch optimizer, anomaly detection, and data contracts.
- DataScreenIQ - Real-time data quality firewall for pipelines and APIs. Screens rows in milliseconds for schema drift, null spikes, type mismatches, and data anomalies with PASS / WARN / BLOCK decisions.
- DataDriven - Interview practice with SQL query execution, Python, and data modeling exercises.
-
-
Docker
- Rancher - RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.
- Kontena - Application Containers for Masses.
- ImageLayers - Visualize Docker images and the layers that compose them.
- cAdvisor - Analyzes resource usage and performance characteristics of running containers.
- Rocker-compose - Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
- Zodiac - A lightweight tool for easy deployment and rollback of dockerized applications.
- Nomad - A cluster manager, designed for both long-lived services and short-lived batch processing workloads.
- Weave - Weaving Docker containers into applications.
- Flocker - Easily manage Docker containers & their data.
- Gockerize - Package golang service into minimal Docker containers.
- Micro S3 persistence - Docker microservice for saving/restoring volume data to S3.
-
Data Lake Management
- Gravitino - An open-source, unified metadata management for data lakes, data warehouses, and external catalogs.
- lakeFS - An open source platform that delivers resilience and manageability to object-storage based data lakes.
- Project Nessie - A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
- Ilum - A modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.
- FlightPath Data - FlightPath is a gateway to a data lake's bronze layer, protecting it from invalid external data file feeds as a trusted publisher.
-
Profiling
-
Data Profiler
- Data Profiler - The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.
- Desbordante - An open-source data profiler specifically focused on discovery and validation of complex patterns in data.
- YData Profiling - A general-purpose open-source data profiler for high-level analysis of a dataset.
-
-
Monitoring
-
Prometheus
- Prometheus.io - An open-source service monitoring system and time series database.
- HAProxy Exporter - Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.
-
-
ELK Elastic Logstash Kibana
- ZomboDB - PostgreSQL Extension that allows creating an index backed by Elasticsearch.
- elasticsearch-jdbc - JDBC importer for Elasticsearch.
- docker-logstash - A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
-
Data Comparison
- dvt - Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.
- datacompy - A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
- everyrow - AI-powered data operations SDK for Python. Semantic deduplication, fuzzy table merging, and intelligent row ranking using LLM agents.
- koala-diff - A high-performance Python library for comparing large datasets (CSV, Parquet) locally using Rust and Polars. It features zero-copy streaming to prevent OOM errors and generates interactive HTML data quality reports.