An open API service indexing awesome lists of open source software.

https://github.com/zriyansh/awesome-lakehouse

A curated list of lakehouse tools for software developers
https://github.com/zriyansh/awesome-lakehouse

List: awesome-lakehouse

Last synced: 21 days ago
JSON representation

A curated list of lakehouse tools for software developers

Awesome Lists containing this project

README

        

# 🌊 Awesome Lakehouse

![GitHub stars](https://img.shields.io/github/stars/zriyansh/awesome-lakehouse?style=social)
![License](https://img.shields.io/github/license/zriyansh/awesome-lakehouse)

> **Awesome Lakehouse** is a curated list of resources, tools, and technologies that empower data engineers to build scalable, efficient, and real-time data solutions by combining the best of data lakes and data warehouses.

## Table of Contents 📚

- [🌊 Awesome Lakehouse](#-awesome-lakehouse)
- [Table of Contents 📚](#table-of-contents-)
- [🚀 Introduction](#-introduction)
- [🔑 Core Sections](#-core-sections)
- [1. Lakehouse Fundamentals](#1-lakehouse-fundamentals)
- [2. Key Lakehouse Technologies](#2-key-lakehouse-technologies)
- [3. ETL/ELT Tools for Lakehouses](#3-etlel-tools-for-lakehouses)
- [4. Data Orchestration](#4-data-orchestration)
- [5. BI and Analytics on Lakehouses](#5-bi-and-analytics-on-lakehouses)
- [6. ML and AI Workflows](#6-ml-and-ai-workflows)
- [7. Monitoring and Observability](#7-monitoring-and-observability)
- [8. Cost Optimization](#8-cost-optimization)
- [📂 Additional Sections](#-additional-sections)
- [1. Use Cases and Real-world Applications](#1-use-cases-and-real-world-applications)
- [2. Tutorials and Learning Resources](#2-tutorials-and-learning-resources)
- [3. Open-source Projects](#3-open-source-projects)
- [4. Comparison Tables](#4-comparison-tables)
- [🌟 Influential Personalities](#-influential-personalities)
- [🤝 Contributing](#-contributing)
- [📄 License](#-license)
- [📢 Acknowledgements](#-acknowledgements)

---

## 🚀 Introduction

Welcome to **Awesome Lakehouse**, your go-to repository for everything related to the lakehouse architecture in data engineering. The lakehouse paradigm seamlessly integrates the scalability and flexibility of data lakes with the reliability and performance of data warehouses, enabling real-time analytics, scalable storage, and advanced querying capabilities.

Whether you're a data engineer, architect, or enthusiast, this repository provides a comprehensive collection of tools, technologies, and resources to help you build robust data solutions.

⭐️ If you find this repository helpful, please [star this repo](https://github.com/zriyansh/awesome-lakehouse) to show your support!

---

## 🔑 Core Sections

### 1. Lakehouse Fundamentals

#### Introduction to Lakehouse
The lakehouse architecture merges the best features of data lakes and data warehouses, providing a unified platform for storing, managing, and analyzing large-scale data. Key characteristics include:

- **Unified Storage**: Combines structured and unstructured data.
- **Scalability**: Handles petabytes of data with ease.
- **Real-time Analytics**: Supports real-time data processing and querying.
- **Reliability**: Ensures data consistency and integrity.

#### Foundational Resources
- [Delta Lake](https://delta.io/) - An open-source storage layer that brings ACID transactions to data lakes.
- [Apache Iceberg](https://iceberg.apache.org/) - A high-performance format for huge analytic tables.
- [Apache Hudi](https://hudi.apache.org/) - Provides atomic upserts and incremental processing on data lakes.
- [Databricks Lakehouse Platform Whitepapers](https://databricks.com/learn/lakehouse)
- [Academic Papers on Lakehouse Architecture](https://scholar.google.com/scholar?q=lakehouse+architecture)
- [The Lakehouse Paradigm](https://databricks.com/blog/2020/06/22/introducing-the-lakehouse-platform.html) - Databricks Blog
- [What is a Lakehouse?](https://www.slideshare.net/Databricks/what-is-a-lakehouse-databricks-spark-summit-2021) - SlideShare Presentation

### 2. Key Lakehouse Technologies

#### Storage Layers
- [Delta Lake](https://delta.io/)
- [Apache Iceberg](https://iceberg.apache.org/)
- [Apache Hudi](https://hudi.apache.org/)

#### Query Engines
- [Presto](https://prestodb.io/)
- [Apache Spark](https://spark.apache.org/)
- [Trino](https://trino.io/)
- [Dremio](https://dremio.com/)
- [Starburst](https://www.starburst.io/)

#### Metadata Management
- [Hive Metastore](https://cwiki.apache.org/confluence/display/Hive/Hive+Metastore)
- [Amundsen](https://www.amundsen.io/)
- [DataHub](https://datahubproject.io/)
- [Apache Atlas](https://atlas.apache.org/)

### 3. ETL/ELT Tools for Lakehouses

#### Open-source
- [Airbyte](https://airbyte.com/) - An open-source data integration platform.
- [Meltano](https://meltano.com/) - An open-source data integration tool built on Singer.
- [dbt (data build tool)](https://www.getdbt.com/) - Transform data in your warehouse more effectively.

#### Commercial
- [Fivetran](https://fivetran.com/) - Automated data integration.
- [Matillion](https://www.matillion.com/) - Data integration and transformation tool.

#### Stream Processing
- [Apache Kafka](https://kafka.apache.org/) - Distributed event streaming platform.
- [Apache Flink](https://flink.apache.org/) - Stream processing framework.
- [Spark Streaming](https://spark.apache.org/streaming/) - Scalable stream processing.

### 4. Data Orchestration
- [Apache Airflow](https://airflow.apache.org/) - Platform to programmatically author, schedule, and monitor workflows.
- [Dagster](https://dagster.io/) - Data orchestrator for machine learning, analytics, and ETL.
- [Prefect](https://www.prefect.io/) - Workflow management system.

### 5. BI and Analytics on Lakehouses
- [Superset](https://superset.apache.org/) - Modern data exploration and visualization platform.
- [Looker](https://looker.com/) - Business intelligence software.
- [Tableau](https://www.tableau.com/) - Interactive data visualization.
- [Power BI](https://powerbi.microsoft.com/) - Business analytics service by Microsoft.
- [Hex](https://hex.tech/) - Collaborative data workspace.

### 6. ML and AI Workflows
- [MLflow](https://mlflow.org/) - Open-source platform for managing the ML lifecycle.
- [Kubeflow](https://kubeflow.org/) - Machine learning toolkit for Kubernetes.
- [H2O.ai](https://www.h2o.ai/) - Open-source AI platform.
- [Databricks Machine Learning](https://www.databricks.com/product/machine-learning) - Unified environment for ML.

### 7. Monitoring and Observability
- [OpenTelemetry](https://opentelemetry.io/) - Observability framework for cloud-native software.
- [Prometheus](https://prometheus.io/) - Monitoring system and time series database.
- [Grafana](https://grafana.com/) - Open-source analytics and monitoring platform.
- [Great Expectations](https://greatexpectations.io/) - Data validation framework.

### 8. Cost Optimization
- [DuckDB](https://duckdb.org/) - An in-process SQL OLAP Database Management System.
- [ClickHouse](https://clickhouse.com/) - A fast open-source column-oriented database management system.
- [Materialize](https://materialize.com/) - Streaming database for real-time applications.

---

## 📂 Additional Sections

### 1. Use Cases and Real-world Applications

Lakehouse architectures are transforming various industries by providing scalable and efficient data solutions. Here are some examples:

- **E-commerce**: Real-time inventory management and personalized recommendations.
- **Healthcare**: Patient data integration and predictive analytics for treatment outcomes.
- **Finance**: Fraud detection, risk management, and real-time trading analytics.
- **Media & Entertainment**: Content recommendation engines and audience analytics.
- **Manufacturing**: Predictive maintenance and supply chain optimization.
- **Telecommunications**: Network performance monitoring and customer behavior analysis.
- **Energy**: Smart grid data management and predictive maintenance for infrastructure.

### 2. Tutorials and Learning Resources

Enhance your knowledge with these beginner-friendly guides, tutorials, and courses:

#### **Guides & Blogs**
- [Databricks Lakehouse Guide](https://databricks.com/lakehouse)
- [Delta Lake Documentation](https://docs.delta.io/latest/index.html)
- [Apache Iceberg Getting Started](https://iceberg.apache.org/getting-started/)
- [Hudi Documentation](https://hudi.apache.org/docs/overview/)
- [Towards Data Science - Lakehouse Articles](https://towardsdatascience.com/tagged/lakehouse)

#### **Video Tutorials**
- [YouTube - Data Engineering with Lakehouse](https://www.youtube.com/results?search_query=data+engineering+lakehouse)
- [Databricks YouTube Channel](https://www.youtube.com/c/Databricks)
- [Confluent YouTube Channel](https://www.youtube.com/user/confluentinc)
- [Simplilearn - Lakehouse Tutorials](https://www.youtube.com/results?search_query=simplilearn+lakehouse)
- [edureka! Data Engineering Tutorials](https://www.youtube.com/user/edurekaIN/search?query=lakehouse)

#### **Courses**
- [Udemy: Lakehouse Architecture](https://www.udemy.com/topic/lakehouse/)
- [Coursera: Data Engineering on Google Cloud](https://www.coursera.org/specializations/gcp-data-engineering)
- [LinkedIn Learning: Modern Data Warehousing with the Lakehouse](https://www.linkedin.com/learning/topics/data-lakehouse)
- [Pluralsight: Data Engineering with Lakehouse](https://www.pluralsight.com/search?q=lakehouse)

### 3. Open-source Projects

Explore trending repositories and libraries in the lakehouse ecosystem:

- [lakeFS](https://github.com/treeverse/lakeFS) - Git-like version control for data lakes.
- [Great Expectations](https://github.com/great-expectations/great_expectations) - Data validation framework.
- [Delta Sharing](https://github.com/delta-io/delta-sharing) - Open protocol for secure data sharing.
- [Materialize](https://github.com/MaterializeInc/materialize) - Streaming SQL database for real-time analytics.
- [Airbyte](https://github.com/airbytehq/airbyte) - Data integration platform.
- [Amundsen](https://github.com/amundsen-io/amundsen) - Data discovery and metadata engine.
- [DataHub](https://github.com/datahub-project/datahub) - Metadata platform for the modern data stack.

### 4. Comparison Tables

#### Delta Lake vs. Apache Iceberg vs. Apache Hudi

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|-----------------------------|-------------------------------------------------|------------------------------------------------|-----------------------------------------------|
| **ACID Transactions** | ✅ | ✅ | ✅ |
| **Schema Evolution** | ✅ | ✅ | ✅ |
| **Time Travel** | ✅ | ✅ | ✅ |
| **Upserts/Merge** | ✅ | ✅ | ✅ |
| **Partitioning** | Dynamic and static partitioning | Hidden partitioning | Globally sorted partitioning |
| **Performance Optimization**| Data skipping, Z-order clustering | Partition pruning, bloom filters | Indexing, clustering |
| **Integration** | Strong integration with Databricks and Spark | Broad integration with various query engines | Integration with Spark and Flink |
| **Community & Adoption** | Large community, backed by Databricks | Growing community, part of Apache Foundation | Active community, part of Apache Foundation |
| **Use Cases** | Real-time analytics, data lakes with ACID support| Large-scale data warehousing, data lake management| Streaming data ingestion, incremental processing|

---

## 🌟 Influential Personalities

Stay updated with insights and trends from leading experts in the lakehouse and data engineering space:

### **1. Wes McKinney**
- **Role**: Creator of Pandas, Chief Architect at Anaconda
- **Twitter**: [@wesm](https://twitter.com/wesm)
- **LinkedIn**: [Wes McKinney](https://www.linkedin.com/in/wesm/)
- **Blog**: [wesmckinney.com](https://wesmckinney.com/)

### **2. Matei Zaharia**
- **Role**: Founder of Databricks, Creator of Apache Spark
- **Twitter**: [@mateizaharia](https://twitter.com/mateizaharia)
- **LinkedIn**: [Matei Zaharia](https://www.linkedin.com/in/mateizaharia/)
- **Blog**: [Matei's Publications](https://mateizaharia.org/)

### **3. Reynold Xin**
- **Role**: Co-Founder and CTO at Databricks, Contributor to Apache Spark
- **Twitter**: [@reynoldx](https://twitter.com/reynoldx)
- **LinkedIn**: [Reynold Xin](https://www.linkedin.com/in/reynoldxin/)
- **Blog**: [Reynold's Medium](https://medium.com/@reynold.xin)

### **4. Michael Armbrust**
- **Role**: Co-Founder at Databricks, Contributor to Apache Spark and Delta Lake
- **Twitter**: [@michaelarmbrust](https://twitter.com/michaelarmbrust)
- **LinkedIn**: [Michael Armbrust](https://www.linkedin.com/in/michael-armbrust/)
- **Blog**: [Databricks Blog](https://databricks.com/blog/author/michael-armbrust)

### **5. Kostas Tzoumas**
- **Role**: VP of Data at Confluent, Contributor to Apache Kafka
- **Twitter**: [@kostastzoumas](https://twitter.com/kostastzoumas)
- **LinkedIn**: [Kostas Tzoumas](https://www.linkedin.com/in/kostastzoumas/)
- **Blog**: [Confluent Blog](https://www.confluent.io/blog/author/kostas-tzoumas/)

### **6. Scott Shenker**
- **Role**: Professor at UC Berkeley, Co-Founder of Coraid and other startups
- **Twitter**: [@scottshenker](https://twitter.com/scottshenker)
- **LinkedIn**: [Scott Shenker](https://www.linkedin.com/in/scottshenker/)
- **Publications**: [Google Scholar](https://scholar.google.com/citations?user=F3GafRkAAAAJ&hl=en)

### **7. Maxime Beauchemin**
- **Role**: Creator of Apache Airflow and Apache Superset
- **Twitter**: [@m_beauchemin](https://twitter.com/m_beauchemin)
- **LinkedIn**: [Maxime Beauchemin](https://www.linkedin.com/in/maximebeauchemin/)
- **Blog**: [Maxime's Blog](https://maximebeauchemin.com/)

---

## 🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

1. **Fork the Project**
2. **Create your Feature Branch** (`git checkout -b feature/AmazingFeature`)
3. **Commit your Changes** (`git commit -m 'Add some AmazingFeature'`)
4. **Push to the Branch** (`git push origin feature/AmazingFeature`)
5. **Open a Pull Request**

Please ensure your pull request adheres to the [repository guidelines](CONTRIBUTING.md).

### Contribution Guidelines

- **Code of Conduct**: Please adhere to the [Code of Conduct](CODE_OF_CONDUCT.md).
- **Issue Reporting**: Use [Issues](https://github.com/zriyansh/awesome-lakehouse/issues) to report bugs or suggest enhancements.
- **Style Guide**: Follow the [Markdown Style Guide](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) for consistency.

---

## 📄 License

Distributed under the MIT License. See `LICENSE` for more information.

---

## 📢 Acknowledgements

- Inspired by the [Awesome](https://github.com/sindresorhus/awesome) repository guidelines.
- Special thanks to all the contributors and the open-source community.
- Resources and content curated from leading industry experts and authoritative sources.
- [Databricks](https://databricks.com/), [Apache Software Foundation](https://apache.org/), and other foundational organizations in the lakehouse ecosystem.

---

## 📚 Additional Resources

### **Emojis and Styling**
Emojis are used to enhance visual appeal. Adjust as necessary to fit your style preferences.

### **Contribution Files**
- **[CONTRIBUTING.md](CONTRIBUTING.md)**: Detailed guidelines for contributors.
- **[CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)**: Community standards and expectations.

### **Badges**
Replace `yourusername` with your actual GitHub username in the badge URLs to ensure they display correctly.

---

Feel free to customize the content further to better fit your vision for the **Awesome Lakehouse** repository!