Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-lakehouse
A curated list of lakehouse tools for software developers
https://github.com/zriyansh/awesome-lakehouse
🔑 Core Sections
2. Key Lakehouse Technologies
1. Lakehouse Fundamentals
- Apache Iceberg - A high-performance format for huge analytic tables.
- Apache Hudi - Provides atomic upserts and incremental processing on data lakes.
- Databricks Lakehouse Platform Whitepapers
- Academic Papers on Lakehouse Architecture
- The Lakehouse Paradigm - Databricks Blog
- What is a Lakehouse? - SlideShare Presentation
3. ETL/ELT Tools for Lakehouses
- Airbyte - An open-source data integration platform.
- Meltano - An open-source data integration tool built on Singer.
- dbt (data build tool) - Transform data in your warehouse more effectively.
- Matillion - Data integration and transformation tool.
- Apache Kafka - Distributed event streaming platform.
- Spark Streaming - Scalable stream processing.
- Apache Flink - Stream processing framework.
4. Data Orchestration
- Apache Airflow - Platform to programmatically author, schedule, and monitor workflows.
- Dagster - Data orchestrator for machine learning, analytics, and ETL.
- Prefect - Workflow management system.
5. BI and Analytics on Lakehouses
6. ML and AI Workflows
- MLflow - Open-source platform for managing the ML lifecycle.
- Databricks Machine Learning - Unified environment for ML.
7. Monitoring and Observability
- OpenTelemetry - Observability framework for cloud-native software.
- Grafana - Open-source analytics and monitoring platform.
- Great Expectations - Data validation framework.
- Prometheus - Monitoring system and time series database.
8. Cost Optimization
- DuckDB - An in-process SQL OLAP Database Management System.
- ClickHouse - A fast open-source column-oriented database management system.
- Materialize - Streaming database for real-time applications.
📂 Additional Sections
2. Tutorials and Learning Resources
- Towards Data Science - Lakehouse Articles
- YouTube - Data Engineering with Lakehouse
- Confluent YouTube Channel
- Simplilearn - Lakehouse Tutorials
- edureka! Data Engineering Tutorials
- Udemy: Lakehouse Architecture
- LinkedIn Learning: Modern Data Warehousing with the Lakehouse
- Delta Lake Documentation
- Apache Iceberg Getting Started
- Hudi Documentation
- Coursera: Data Engineering on Google Cloud
- Pluralsight: Data Engineering with Lakehouse
3. Open-source Projects
- lakeFS - Git-like version control for data lakes.
- Great Expectations - Data validation framework.
- Delta Sharing - Open protocol for secure data sharing.
- Materialize - Streaming SQL database for real-time analytics.
- Airbyte - Data integration platform.
- DataHub - Metadata platform for the modern data stack.
- Amundsen - Data discovery and metadata engine.
🌟 Influential Personalities
**1. Wes McKinney**
**2. Matei Zaharia**
**3. Reynold Xin**
**4. Michael Armbrust**
**5. Kostas Tzoumas**
**6. Scott Shenker**
**7. Maxime Beauchemin**
🤝 Contributing
Contribution Guidelines
📢 Acknowledgements