{"id":26673288,"url":"https://github.com/turakulov/datalakehouse","last_synced_at":"2026-04-15T09:31:33.890Z","repository":{"id":273061400,"uuid":"918595944","full_name":"Turakulov/datalakehouse","owner":"Turakulov","description":"A project to create a Data Lake House","archived":false,"fork":false,"pushed_at":"2025-03-23T21:38:53.000Z","size":157987,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-23T22:32:19.706Z","etag":null,"topics":["airflow","datavault","dbt","docker","dwh","etl","iceberg","kafka","s3","spark","trino"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Turakulov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-18T10:56:12.000Z","updated_at":"2025-03-23T21:38:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"5c318153-dc95-4dc9-baf7-21c7418fed51","html_url":"https://github.com/Turakulov/datalakehouse","commit_stats":null,"previous_names":["turakulov/datalakehouse"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turakulov%2Fdatalakehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turakulov%2Fdatalakehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turakulov%2Fdatalakehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Turakulov%2Fdatalakehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Turakulov","download_url":"https://codeload.github.com/Turakulov/datalakehouse/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245568570,"owners_count":20636803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","datavault","dbt","docker","dwh","etl","iceberg","kafka","s3","spark","trino"],"created_at":"2025-03-26T01:19:21.223Z","updated_at":"2026-04-15T09:31:33.871Z","avatar_url":"https://github.com/Turakulov.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🏗️ Data Lakehouse with Modern Technologies  \n\n![Architecture](https://github.com/user-attachments/assets/ef98d59f-56f9-41ad-a450-5818f9237a55)\n\n\n## 📌 Overview  \nThis project demonstrates how to build a **Data Lakehouse** using modern open-source technologies.  \nIt integrates **Kafka**, **Spark Streaming**, **Apache Iceberg**, **Apache Nessie**, **Trino**, and **DBT** to create an efficient data pipeline.  \n\n🔹 **Streaming Layer:** Spark Streaming / Flink processes real-time Kafka events.  \n🔹 **Batch Layer:** Apache Spark and AirByte handle batch ingestion into an **S3 Minio** data lake.  \n🔹 **Metadata Layer:** Apache Nessie is used as the Iceberg catalog, backed by PostgreSQL.  \n🔹 **Transformation Layer:** Trino and DBT transform and store data in **Vertica** for analytics.  \n🔹 **Orchestration:** Apache Airflow schedules and manages all ETL/ELT workflows.  \n🔹 **Synthetic Data:** Generated with `sdv` Python library from **SalesDB_v1** dataset.  \n🔹 **Visualization:** Dashboards built in **Tableau, Superset, or Power BI**.  \n\n## 🚀 Tech Stack  \n| Category          | Technology |\n|------------------|------------|\n| **Streaming**    | Kafka, Spark Streaming, Flink |\n| **Storage**      | S3 Minio (Apache Iceberg) |\n| **Metadata**     | Apache Nessie, PostgreSQL |\n| **ETL/ELT**      | Airflow, DBT, Apache Spark |\n| **Query Engine** | Trino, Vertica |\n| **Orchestration**| Apache Airflow |\n| **Dashboarding** | Tableau, Superset, Power BI |\n\n---\n\n## 🎯 Architecture  \n\n### 1️⃣ **Data Ingestion**  \n- Streaming events flow from **Kafka** into **Spark Streaming / Flink**.  \n- Batch data is ingested via **AirByte** into **S3 Minio (Iceberg format)**.  \n\n### 2️⃣ **Storage \u0026 Metadata Management**  \n- Raw (`raw`) and Operational (`ods`) data are stored in **S3 Minio (Apache Iceberg)**.  \n- Apache Nessie acts as the metadata catalog, tracking schema versions and changes.  \n\n### 3️⃣ **Transformation \u0026 Querying**  \n- **DBT + Trino** handle transformations.  \n- Processed marts (`marts`) are stored in **Vertica** for BI consumption.  \n\n### 4️⃣ **Orchestration \u0026 Visualization**  \n- **Apache Airflow** schedules all ETL/ELT jobs.  \n- Dashboards are created using **Tableau, Superset, or Power BI**.  \n\n---\n\n## 📸 Screenshots  \n🔹 **Kafka UI** (Monitor real-time events)  \n![kafka](https://github.com/user-attachments/assets/2a6224d8-c22a-4a64-8196-7c9ce80ebc0d)\n\n🔹 **S3 Minio Browser** (View stored Iceberg tables)  \n🔹 **Apache Nessie UI** (Track schema changes)  \n🔹 **Airflow DAGs** (Monitor ETL workflows)  \n🔹 **DBT Lineage Graph** (View transformation dependencies)  \n🔹 **BI Dashboards** (Analytics \u0026 insights)  \n\n---\n\n## 🏗️ **Setup \u0026 Deployment**  \nThis project is fully containerized using **Docker Compose**.  \n\n### 🔧 **Prerequisites**  \n- Docker \u0026 Docker Compose  \n- Python (for data generation)  \n\n### 📥 **Clone Repository**  \n```bash\ngit clone https://github.com/Turakulov/datalakehouse.git\ncd datalakehouse\n```\n\n### ▶️ **Start the Environment**\n```bash\ndocker-compose build\n\ndocker-compose up -d\n```\n\n### 🔎 **Verify Services**  \n- **Minio Console:** [http://localhost:9001](http://localhost:9001)  \n- **Trino UI:** [http://localhost:8083](http://localhost:8083)  \n- **Kafka UI:** [http://localhost:9999](http://localhost:9999)  \n- **Airflow UI:** [http://localhost:8090](http://localhost:8090)  \n- **Vertica (SQL Access):** `jdbc:vertica://localhost:5433/db`  \n\n---\n\n## 🛠️ **Project Structure**  \n```bash\n📂 datalakehouse\n├── 📂 airflow/        # Apache Airflow DAGs\n├── 📂 spark/          # Spark Streaming jobs\n├── 📂 kafka/          # Kafka configurations\n├── 📂 trino/          # Trino catalogs\n├── 📂 minio/          # Minio storage setup\n├── 📂 nessie/         # Apache Nessie metadata\n├── 📂 postgres/       # Postgres storage setup for Apache Iceberg metadata\n├── 📂 vertica/        # Vertica storage setup for datamarts and OLAP queries\n├── 📜 docker-compose.yaml  # Docker environment\n└── 📜 README.md       # Project documentation\n```\n\n## 📌 **Contributing**\n🔹 Fork the repo \u0026 create a feature branch.  \n🔹 Submit a pull request with detailed changes.  \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fturakulov%2Fdatalakehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fturakulov%2Fdatalakehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fturakulov%2Fdatalakehouse/lists"}