# Introduction

Developed a Lakehouse-based data pipeline using the Sakila dataset to analyze movie sales and rental trends. The lakehouse was designed according to the `Delta` architecture

- Extracted change events from the source database with CDC (Debezium), then published them to Kafka, which ensures `scalable and fault-tolerant` message processing
- Processed the streaming events with Spark Streaming and wrote them to Delta tables in MinIO, combined with the Trino query engine to provide `real-time insights` via Superset dashboards (see the sketch below the architecture diagram)
- Periodically transformed event data in Delta tables into staging and mart tables with DBT for `deep analytics and machine learning`

![image](https://github.com/user-attachments/assets/cc379c24-93b1-4a58-b719-d70221026769)
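
The streaming leg of this architecture can be sketched in a few lines of PySpark. The snippet below is only an illustration, not the project's actual `stream_events.py`: it reads Debezium change events from Kafka and appends them to a Delta table on MinIO. The Kafka topic, bucket paths, and endpoints are placeholders, and it assumes the Delta Lake and Kafka connector packages are available on the Spark cluster.

```python
# Minimal sketch (not the project's stream_events.py): read CDC events from Kafka
# and append them to a Delta table on MinIO. Topic, bucket, and endpoint names
# are placeholders -- adjust them to your deployment.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("sakila-streaming-sketch")
    # Delta Lake support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # MinIO (S3-compatible) access, matching the credentials used in delta.properties
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the raw change events that Debezium publishes to Kafka
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "sakila.sakila.rental")   # placeholder topic name
    .option("startingOffsets", "earliest")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
)

# Append the events to a bronze Delta table stored in MinIO
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://lakehouse/checkpoints/rental")  # placeholder path
    .start("s3a://lakehouse/bronze/rental")                              # placeholder path
)
query.awaitTermination()
```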

# Platform setup
## Apache Spark cluster
Config files are in these folders: `spark`, `notebook`, `hive-metastore`

Run this command to create the Docker containers for the Apache Spark cluster
```bash
docker-compose -f ./docker-compose.yaml up
```

## Apache Kafka
Config files are in this folder: `kafka`

Run this command to create the Apache Kafka containers
```bash
docker-compose -f ./kafka/docker-compose.yaml up
```
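
The CDC step in the architecture relies on Debezium publishing change events to Kafka. If the Kafka stack includes a Kafka Connect worker with the Debezium MySQL plugin (Sakila is MySQL's sample database), a connector can be registered through Connect's REST API. The sketch below is only an illustration: the Connect URL, database credentials, and topic prefix are placeholders, and the exact configuration keys depend on the Debezium version.

```python
# Hypothetical example: register a Debezium MySQL connector for the Sakila database
# via the Kafka Connect REST API. All hosts, credentials, and names are placeholders.
import json
import requests

connector = {
    "name": "sakila-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",          # placeholder host
        "database.port": "3306",
        "database.user": "debezium",           # placeholder credentials
        "database.password": "dbz",
        "database.server.id": "184054",
        "database.include.list": "sakila",
        "topic.prefix": "sakila",              # Debezium 2.x; older versions use database.server.name
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.sakila",
    },
}

# Kafka Connect's REST API listens on port 8083 by default
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```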

## Trino and Superset
Config files are in this folder: `trino-superset`

In **trino-superset/trino-conf/catalog**, create `delta.properties` with the following parameters
```properties
connector.name=delta-lake
hive.metastore.uri=thrift://160.191.244.13:9083
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.endpoint=http://160.191.244.13:9000
hive.s3.path-style-access=true
```

Then run this command
```bash
docker-compose up --build
```
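
Once the containers are up, the `delta` catalog can be sanity-checked with any Trino client before building Superset dashboards. The snippet below uses the `trino` Python package; the host, user, schema, and table names are placeholders.

```python
# Hypothetical check that Trino can read the Delta tables exposed by the `delta` catalog.
# Host, user, schema, and table names are placeholders.
from trino.dbapi import connect

conn = connect(
    host="localhost",   # Trino coordinator
    port=8080,
    user="admin",
    catalog="delta",
)
cur = conn.cursor()

# List the schemas registered in the Hive metastore for this catalog
cur.execute("SHOW SCHEMAS")
print(cur.fetchall())

# Query one of the streamed tables (placeholder name)
cur.execute("SELECT count(*) FROM delta.bronze.rental")
print(cur.fetchone())
```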

## Apache Airflow
> Note: this version runs sequentially, not in parallel

Config files are in this folder: `dbt-airflow`

In **dbt-airflow**, run this command to create the Airflow container

```bash
docker-compose up --build
```

# Run the project

To start the streaming process, run this command in the Jupyter notebook's terminal running inside the `spark-notebook` container
```bash
python3 stream_events.py
```

![image](https://github.com/user-attachments/assets/52fa73fa-1906-42c1-8d77-2d3cff33d216)

To run the data warehouse transformations, trigger this DAG in Airflow's UI, or let it run automatically every day at `23:00`

![image](https://github.com/user-attachments/assets/50f9aff8-4823-4bb9-b83c-3539e22e6462)
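
The actual DAG lives in the `dbt-airflow` folder. For reference, a daily 23:00 run of the DBT transformations could be wired up roughly as in the sketch below; the DAG id, paths, and dbt command are placeholders and do not reflect the project's real DAG.

```python
# Hypothetical sketch of a daily DAG that runs the DBT transformations at 23:00.
# The DAG id, project path, and profiles directory are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sakila_dbt_transformations",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 23 * * *",   # every day at 23:00
    catchup=False,
) as dag:
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )
```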

# Real-time dashboard for trend analysis

![dashboard](https://github.com/user-attachments/assets/dd5e52ce-eb1b-48da-b70d-f5a3ec184703)