{"id":102357,"url":"https://github.com/brandonhimpfen/awesome-data-engineering","name":"awesome-data-engineering","description":"A curated list of tools, frameworks, platforms, architectures, and learning resources for data engineering.","projects_count":75,"last_synced_at":"2026-06-06T10:00:21.443Z","repository":{"id":330794769,"uuid":"1123974508","full_name":"brandonhimpfen/awesome-data-engineering","owner":"brandonhimpfen","description":"A curated list of tools, frameworks, platforms, architectures, and learning resources for data engineering.","archived":false,"fork":false,"pushed_at":"2026-05-11T04:05:04.000Z","size":35,"stargazers_count":9,"open_issues_count":0,"forks_count":5,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-20T23:45:46.411Z","etag":null,"topics":["awesome","awesome-list","awesome-lists","big-data","data","data-engineering"],"latest_commit_sha":null,"homepage":"https://lnktr.net/awesome","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brandonhimpfen.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"brandonhimpfen","ko_fi":"brandonhimpfen","buy_me_a_coffee":"brandonhimpfen","custom":["https://paypal.me/brandonhimpfen","https://www.brandonhimpfen.com/#/portal/support"]}},"created_at":"2025-12-28T03:31:42.000Z","updated_at":"2026-05-11T04:05:08.000Z","dependencies_parsed_at":"2025-12-30T16:07:25.485Z","dependency_job_id":null
,"html_url":"https://github.com/brandonhimpfen/awesome-data-engineering","commit_stats":null,"previous_names":["awesomelistsio/awesome-data-engineering","brandonhimpfen/awesome-data-engineering"],"tags_count":2,"template":false,"template_full_name":"brandonhimpfen/awesome-lists-template","purl":"pkg:github/brandonhimpfen/awesome-data-engineering","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brandonhimpfen%2Fawesome-data-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brandonhimpfen%2Fawesome-data-engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brandonhimpfen%2Fawesome-data-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brandonhimpfen%2Fawesome-data-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brandonhimpfen","download_url":"https://codeload.github.com/brandonhimpfen/awesome-data-engineering/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brandonhimpfen%2Fawesome-data-engineering/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33977371,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-06T02:00:07.033Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"crea
ted_at":"2026-01-02T00:00:34.837Z","updated_at":"2026-06-06T10:00:21.444Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["Infrastructure \u0026 Platforms","NoSQL \u0026 Specialized Datastores","Data Engineering on the Cloud","Storage, Warehousing \u0026 Lakehouses","Data Ingestion \u0026 Integration","Workflow Orchestration","Query Engines \u0026 Analytics","Streaming \u0026 Event Processing","Observability \u0026 Reliability","Learning Resources","Data Transformation \u0026 Modeling","License","Foundations \u0026 Concepts","Data Quality, Governance \u0026 Lineage","Related Awesome Lists"],"sub_categories":["Guides","Courses","Tutorials"],"readme":"# Awesome Data Engineering [![Awesome Lists](https://srv-cdn.himpfen.io/badges/awesome-lists/awesomelists-flat.svg)](https://github.com/awesomelistsio/awesome)\n\n[![DOI](https://zenodo.org/badge/1123974508.svg)](https://doi.org/10.5281/zenodo.19673251)  \n[![GitHub Sponsor](https://srv-cdn.himpfen.io/badges/github/github-flat.svg)](https://github.com/sponsors/brandonhimpfen) \u0026nbsp; \n[![Buy Me a Coffee](https://srv-cdn.himpfen.io/badges/buymeacoffee/buymeacoffee-flat.svg)](https://buymeacoffee.com/brandonhimpfen) \u0026nbsp; \n[![Ko-Fi](https://srv-cdn.himpfen.io/badges/kofi/kofi-flat.svg)](https://ko-fi.com/brandonhimpfen) \u0026nbsp; \n[![PayPal](https://srv-cdn.himpfen.io/badges/paypal/paypal-flat.svg)](https://paypal.me/brandonhimpfen)\n\n\u003e A curated list of tools, frameworks, platforms, architectures, and learning resources for **data engineering**, covering data ingestion, transformation, storage, orchestration, and reliable data infrastructure at scale.\n\n_Support ongoing maintenance and curation via [GitHub Sponsors](https://github.com/sponsors/brandonhimpfen)._\n\n## Contents\n\n- [Foundations \u0026 Concepts](#foundations--concepts)\n- [Data Ingestion \u0026 Integration](#data-ingestion--integration)\n- [Streaming \u0026 Event 
Processing](#streaming--event-processing)\n- [Data Transformation \u0026 Modeling](#data-transformation--modeling)\n- [Workflow Orchestration](#workflow-orchestration)\n- [Storage, Warehousing \u0026 Lakehouses](#storage-warehousing--lakehouses)\n- [Query Engines \u0026 Analytics](#query-engines--analytics)\n- [NoSQL \u0026 Specialized Datastores](#nosql--specialized-datastores)\n- [Data Quality, Governance \u0026 Lineage](#data-quality-governance--lineage)\n- [Observability \u0026 Reliability](#observability--reliability)\n- [Infrastructure \u0026 Platforms](#infrastructure--platforms)\n- [Data Engineering on the Cloud](#data-engineering-on-the-cloud)\n- [Learning Resources](#learning-resources)\n- [Related Awesome Lists](#related-awesome-lists)\n\n## Foundations \u0026 Concepts\n\n- [Data Engineering Explained](https://www.ibm.com/topics/data-engineering) – Overview of data engineering roles, responsibilities, and workflows.\n- [Modern Data Stack](https://www.getdbt.com/what-is-the-modern-data-stack/) – Overview of modern analytics and data engineering tooling.\n- [Data Lake vs Data Warehouse](https://www.databricks.com/glossary/data-lakehouse) – Comparison of storage architectures for analytics.\n- [CAP Theorem](https://www.ibm.com/topics/cap-theorem) – Fundamental trade-offs in distributed data systems.\n- [Event-Driven Architecture](https://martinfowler.com/articles/201701-event-driven.html) – Architectural style for real-time data systems.\n\n## Data Ingestion \u0026 Integration\n\n- [Apache Kafka Connect](https://kafka.apache.org/documentation/#connect) – Framework for moving data between Kafka and external systems.\n- [Apache NiFi](https://nifi.apache.org/) – Visual data ingestion and flow automation platform.\n- [Airbyte](https://airbyte.com/) – Open-source data integration platform for ELT pipelines.\n- [Fivetran](https://www.fivetran.com/) – Managed data connectors for analytics and warehousing.\n- [Singer](https://www.singer.io/) – Open-source standard 
for data extraction and loading.\n- [Debezium](https://debezium.io/) – Change data capture (CDC) platform for databases.\n\n## Streaming \u0026 Event Processing\n\n- [Apache Kafka](https://kafka.apache.org/) – Distributed event streaming platform.\n- [Apache Pulsar](https://pulsar.apache.org/) – Cloud-native pub/sub and streaming platform.\n- [Apache Flink](https://flink.apache.org/) – Stream-first processing framework with low latency.\n- [Kafka Streams](https://kafka.apache.org/documentation/streams/) – Stream processing library built on Kafka.\n- [Apache Storm](https://storm.apache.org/) – Real-time computation system for stream processing.\n\n## Data Transformation \u0026 Modeling\n\n- [dbt](https://www.getdbt.com/) – SQL-based transformation and analytics engineering tool.\n- [Apache Spark](https://spark.apache.org/) – Distributed engine for large-scale data processing.\n- [Apache Beam](https://beam.apache.org/) – Unified programming model for batch and streaming pipelines.\n- [Dask](https://www.dask.org/) – Parallel computing library for scalable Python data processing.\n- [SQLMesh](https://sqlmesh.com/) – Versioned, testable SQL transformations.\n\n## Workflow Orchestration\n\n- [Apache Airflow](https://airflow.apache.org/) – Platform for scheduling and monitoring data workflows.\n- [Dagster](https://dagster.io/) – Data orchestration platform with strong observability and testing.\n- [Prefect](https://www.prefect.io/) – Workflow orchestration system for data pipelines.\n- [Luigi](https://github.com/spotify/luigi) – Python package for building complex pipelines.\n- [Argo Workflows](https://argo-workflows.readthedocs.io/) – Kubernetes-native workflow engine.\n\n## Storage, Warehousing \u0026 Lakehouses\n\n- [Amazon S3](https://aws.amazon.com/s3/) – Object storage widely used as a data lake.\n- [Google Cloud Storage](https://cloud.google.com/storage) – Scalable object storage for analytics workloads.\n- [Azure Data Lake 
Storage](https://azure.microsoft.com/products/storage/data-lake-storage/) – Optimized storage for analytics on Azure.\n- [Snowflake](https://www.snowflake.com/) – Cloud-native data warehouse.\n- [BigQuery](https://cloud.google.com/bigquery) – Serverless analytics data warehouse.\n- [Delta Lake](https://delta.io/) – Open-source storage layer enabling lakehouse architecture.\n- [Apache Iceberg](https://iceberg.apache.org/) – Table format for large-scale analytic datasets.\n- [Apache Hudi](https://hudi.apache.org/) – Incremental data processing and lakehouse framework.\n\n## Query Engines \u0026 Analytics\n\n- [Trino](https://trino.io/) – Distributed SQL query engine for large datasets.\n- [Presto](https://prestodb.io/) – High-performance distributed SQL engine.\n- [Spark SQL](https://spark.apache.org/sql/) – SQL analytics module built on Apache Spark.\n- [DuckDB](https://duckdb.org/) – In-process analytical SQL engine.\n- [ClickHouse](https://clickhouse.com/) – Column-oriented OLAP database.\n\n## NoSQL \u0026 Specialized Datastores\n\n- [Apache Cassandra](https://cassandra.apache.org/) – Distributed wide-column NoSQL database.\n- [MongoDB](https://www.mongodb.com/) – Document-oriented NoSQL database.\n- [Apache HBase](https://hbase.apache.org/) – NoSQL database built on HDFS.\n- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) – Managed NoSQL key-value store.\n- [Redis](https://redis.io/) – In-memory data store for caching and streaming use cases.\n\n## Data Quality, Governance \u0026 Lineage\n\n- [Great Expectations](https://greatexpectations.io/) – Data quality validation framework.\n- [Apache Atlas](https://atlas.apache.org/) – Metadata management and data governance platform.\n- [OpenLineage](https://openlineage.io/) – Open standard for capturing data lineage.\n- [DataHub](https://datahubproject.io/) – Open-source metadata and data catalog.\n- [Amundsen](https://www.amundsen.io/) – Data discovery and metadata engine.\n\n## Observability \u0026 
Reliability\n\n- [Monte Carlo](https://www.montecarlodata.com/) – Data observability platform for pipelines.\n- [Bigeye](https://www.bigeye.com/) – Data quality monitoring and alerting.\n- [Prometheus](https://prometheus.io/) – Metrics and monitoring system.\n- [Grafana](https://grafana.com/) – Visualization platform for observability.\n- [OpenTelemetry](https://opentelemetry.io/) – Observability framework for distributed systems.\n\n## Infrastructure \u0026 Platforms\n\n- [Kubernetes](https://kubernetes.io/) – Container orchestration for data workloads.\n- [Ray](https://www.ray.io/) – Distributed computing framework for scalable data processing.\n- [Terraform](https://www.terraform.io/) – Infrastructure as code for data platforms.\n- [Apache Mesos](https://mesos.apache.org/) – Distributed systems kernel for resource management.\n\n## Data Engineering on the Cloud\n\n- [Databricks](https://www.databricks.com/) – Unified analytics platform built on Apache Spark.\n- [AWS EMR](https://aws.amazon.com/emr/) – Managed big data platform on AWS.\n- [Google Dataproc](https://cloud.google.com/dataproc) – Managed Spark and Hadoop service.\n- [Azure Synapse Analytics](https://azure.microsoft.com/products/synapse-analytics/) – Integrated analytics service.\n- [Snowflake Data Cloud](https://www.snowflake.com/) – Platform for data sharing and analytics.\n\n## Learning Resources\n\n### Tutorials\n- [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) – Free hands-on data engineering course.\n- [Apache Spark Documentation](https://spark.apache.org/docs/latest/) – Official Spark guides and examples.\n- [Kafka Documentation](https://kafka.apache.org/documentation/) – Official Kafka tutorials.\n\n### Guides\n- [Designing Data-Intensive Applications](https://dataintensive.net/) – Foundational book on scalable data systems.\n- [Streaming Systems](https://www.oreilly.com/library/view/streaming-systems/9781491983867/) – Concepts and architectures for 
stream processing.\n- [Data Engineering Best Practices](https://www.getdbt.com/blog/) – Modern data engineering workflows.\n\n### Courses\n- *Data Engineering Fundamentals* – Core data pipeline concepts.\n- *Streaming Data Engineering* – Real-time data processing architectures.\n- *Cloud Data Engineering* – Building scalable pipelines in the cloud.\n\n## Related Awesome Lists\n\n- [Awesome Big Data](https://github.com/awesomelistsio/awesome-big-data)\n- [Awesome Data Analytics](https://github.com/awesomelistsio/awesome-data-analytics)\n- [Awesome SQL](https://github.com/awesomelistsio/awesome-sql)\n- [Awesome Cloud](https://github.com/awesomelistsio/awesome-cloud)\n- [Awesome MLOps](https://github.com/awesomelistsio/awesome-mlops)\n\n## Contribute\n\nContributions are welcome. Please ensure your submission fully follows the requirements outlined in [`CONTRIBUTING.md`](CONTRIBUTING.md), including formatting, scope alignment, and category placement.\n\nPull requests that do not adhere to the contribution guidelines may be closed.\n\n## License\n\n[![CC BY-SA 4.0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by-sa.svg)](http://creativecommons.org/licenses/by-sa/4.0/)\n","projects_url":"https://awesome.ecosyste.ms/api/v1/lists/brandonhimpfen%2Fawesome-data-engineering/projects"}