https://github.com/leehuwuj/olh
Open source stack lakehouse
bigdata dataplatform deltalake kubernetes lakehouse spark
- Host: GitHub
- URL: https://github.com/leehuwuj/olh
- Owner: leehuwuj
- Created: 2022-11-15T02:09:07.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-02T03:24:15.000Z (10 months ago)
- Last Synced: 2024-12-23T06:06:20.707Z (16 days ago)
- Topics: bigdata, dataplatform, deltalake, kubernetes, lakehouse, spark
- Language: Python
- Homepage:
- Size: 4.57 MB
- Stars: 25
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
**[WIP]**
# Open source stack for lakehouse
This project is a proof of concept (POC) of a simple lakehouse architecture. It aims to support:
- Learning: If you are a student or a beginner who works with data every day, this project can help you understand the tools you are working with.
- Cloud stand-in for testing: Nowadays, cloud services are easy to plug and play, but there is a wide variety of tools, and each has its own advantages and disadvantages that you need to be aware of. Almost all of them are built on top of an open source stack, so this project is itself a cloud at your home! There is no fixed deployment type, but each service is built around a cloud-native (containerized) application that you can easily integrate or test with your current platform.
*Note*:
- The deployment is for testing purposes only. The project scope does not cover lakehouse security features such as access control (data, table, row, ...) or resource management.
- If your machine does not have enough resources, try the Docker or single-service deployment instead.

# Architecture
![high-level-architecture](resources/images/architecture.png)

# Setup:
## Hive metastore:
- [Hive metastore quick setup](https://github.com/leehuwuj/olh/blob/main/hive-metastore)
## Trino
- [Trino quick setup](https://github.com/leehuwuj/olh/blob/main/trino)
## Spark
- [Spark simple setup for Kubernetes](https://github.com/leehuwuj/olh/blob/main/spark)
## Jupyter
- [Jupyter spark docker setup](https://github.com/leehuwuj/olh/blob/main/jupyter)
## Dagster
- [Dagster hackernews example project](https://github.com/leehuwuj/olh/blob/main/dagster)

# Practices
## Tweets Champions
- [Tweets Data](https://github.com/leehuwuj/olh/blob/main/resources/data/README.md)
- Examples:
  - [Pyspark - Tweets Fact ingestion](https://github.com/leehuwuj/olh/tree/main/resources/practices/tweetschampions)

## Dagster example project
An example that uses Dagster to orchestrate a data workflow: [Arrow -> (PyDelta + Trino or PySpark Delta) -> dbt]
- [Dagster hackernews](https://github.com/leehuwuj/olh/tree/main/dagster)
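
The workflow above is a chain of stages in which each stage depends on the one before it. As a rough illustration of how an orchestrator resolves that ordering, here is a minimal sketch using only the Python standard library. It does not use Dagster's API and is not taken from the linked project: the stage names (`arrow_extract`, `delta_write`, `dbt_transform`) and the placeholder handlers are assumptions made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names mirroring Arrow -> Delta -> dbt.
# Each key maps to the set of stages it depends on.
PIPELINE = {
    "arrow_extract": set(),
    "delta_write": {"arrow_extract"},  # PyDelta + Trino, or PySpark Delta
    "dbt_transform": {"delta_write"},
}


def execution_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run(graph, handlers):
    """Run each stage's handler in dependency order, passing upstream results."""
    results = {}
    for stage in execution_order(graph):
        upstream = {dep: results[dep] for dep in graph[stage]}
        results[stage] = handlers[stage](upstream)
    return results


# Placeholder handlers (assumptions): real stages would read Arrow data,
# write a Delta table, and invoke dbt. These only pass small dicts along.
HANDLERS = {
    "arrow_extract": lambda up: [{"id": 1, "title": "example"}],
    "delta_write": lambda up: {"rows_written": len(up["arrow_extract"])},
    "dbt_transform": lambda up: {"source_rows": up["delta_write"]["rows_written"]},
}

if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(run(PIPELINE, HANDLERS))
```

In an orchestrator such as Dagster, the dependency graph is declared alongside the stage definitions rather than in a separate dictionary, but the idea is the same: a stage runs only after everything it depends on has produced its output.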