https://github.com/narius2030/sakila-lakehouse
Developed a Lakehouse-based data pipeline using the Sakila dataset to analyze movie sales and rental trends. The project was designed according to the Delta architecture.
- Host: GitHub
- URL: https://github.com/narius2030/sakila-lakehouse
- Owner: Narius2030
- Created: 2024-12-26T09:42:56.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-03-18T03:23:05.000Z (8 months ago)
- Last Synced: 2025-03-18T04:26:48.519Z (8 months ago)
- Topics: apache-kafka, delta-lake, hive-metastore, lakehouse, real-time-analytics, spark-streaming, trino-dbt
- Language: Python
- Homepage:
- Size: 3.38 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Introduction
Developed a Lakehouse-based data pipeline using the Sakila dataset to analyze movie sales and rental trends. The lakehouse was designed according to the `Delta` architecture:
- Extracted change events from the source database with CDC (Debezium) and published them to Kafka, which ensures `scalable and fault-tolerant` message processing (a registration sketch follows this list)
- Processed the streaming events with Spark Streaming and wrote them to Delta tables in MinIO, combined with the Trino query engine to provide `real-time insights` via Superset dashboards
- Periodically transformed event data in the Delta tables into staging and mart tables for `deep analytics and machine learning` using dbt
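
As a reference for the CDC step, the sketch below registers a Debezium MySQL connector for the Sakila database through the Kafka Connect REST API. The Connect endpoint, database host, credentials, `topic.prefix`, and table list are assumptions for illustration, not values taken from this repository.
```python
# register_debezium.py -- minimal sketch, not the repository's actual setup.
# Registers a Debezium MySQL connector for the Sakila database via the
# Kafka Connect REST API. Hosts, ports, credentials and names are assumptions.
import json
import requests

KAFKA_CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect endpoint

connector_config = {
    "name": "sakila-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",        # assumed MySQL host
        "database.port": "3306",
        "database.user": "debezium",         # assumed CDC user
        "database.password": "dbz",
        "database.server.id": "184054",
        "topic.prefix": "sakila",            # topics become sakila.<db>.<table>
        "database.include.list": "sakila",
        "table.include.list": "sakila.payment,sakila.rental",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.sakila",
    },
}

resp = requests.post(
    KAFKA_CONNECT_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
)
resp.raise_for_status()
print("Connector registered:", resp.json()["name"])
```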

# Setup platforms
## Apache Spark cluster
Config files are in these folders: `spark`, `notebook`, `hive-metastore`.
Run this command to create the Docker containers for the Apache Spark cluster:
```bash
docker-compose -f ./docker-compose.yaml up
```
## Apache Kafka
Config files are in this folder: `kafka`.
Run this command to create the Apache Kafka containers:
```bash
docker-compose -f ./kafka/docker-compose.yaml up
```
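Once the brokers are up, it can be handy to confirm that the CDC topics exist. A minimal check, assuming the `kafka-python` package and a broker reachable on `localhost:9092` (both assumptions):
```python
# list_topics.py -- quick sanity check, assuming kafka-python is installed
# (pip install kafka-python) and a broker is reachable on localhost:9092.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
# Print every topic the broker currently knows about; the CDC topics created
# by Debezium should appear here once the connector is registered.
for topic in sorted(consumer.topics()):
    print(topic)
consumer.close()
```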
## Trino and Superset
Config files are in this folder: `trino-superset`.
In **trino-superset/trino-conf/catalog**, create `delta.properties` with the following parameters:
```properties
connector.name=delta-lake
hive.metastore.uri=thrift://160.191.244.13:9083
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.endpoint=http://160.191.244.13:9000
hive.s3.path-style-access=true
```
Then run this command:
```bash
docker-compose up --build
```
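
With the containers running, the Delta tables registered in the Hive Metastore can be queried through the `delta` catalog defined above. Below is a small sketch using the `trino` Python client; the host, port, user, schema, and table name are illustrative assumptions.
```python
# query_trino.py -- minimal sketch, assuming the `trino` Python client is
# installed (pip install trino). Host, port, schema and table are illustrative.
import trino

conn = trino.dbapi.connect(
    host="localhost",      # assumed Trino coordinator host
    port=8080,             # default Trino HTTP port
    user="admin",
    catalog="delta",       # matches delta.properties above
    schema="default",      # assumed schema registered in the Hive Metastore
)

cur = conn.cursor()
# Hypothetical Delta table written by the streaming job.
cur.execute("SELECT payment_date, amount FROM payments LIMIT 10")
for row in cur.fetchall():
    print(row)
```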
## Apache Airflow
> Note: this version runs the transformations sequentially, not in parallel

Config files are in this folder: `dbt-airflow`.
In **dbt-airflow**, run this command to create the Airflow container:
```bash
docker-compose up --build
```
# Run project
To start the streaming process, run this command in the Jupyter notebook terminal running inside the `spark-notebook` container:
```bash
python3 stream_events.py
```
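
The repository's `stream_events.py` is the authoritative job; the sketch below only illustrates the general shape of such a script (Kafka source, JSON parsing, Delta sink on MinIO). The broker address, topic, MinIO endpoint, credentials, event schema, and table paths are assumptions.
```python
# Sketch of a Kafka -> Delta streaming job; not the repository's stream_events.py.
# Endpoints, credentials, topic and schema below are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = (
    SparkSession.builder.appName("sakila-stream-events")
    # Delta Lake + S3A (MinIO) settings; adjust to your cluster configuration.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")   # assumed MinIO endpoint
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Hypothetical payload of a payment event after unwrapping the CDC envelope.
event_schema = StructType([
    StructField("payment_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("payment_date", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
    .option("subscribe", "sakila.sakila.payment")      # assumed CDC topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3a://lakehouse/checkpoints/payments")  # assumed bucket
    .outputMode("append")
    .start("s3a://lakehouse/bronze/payments")                              # assumed table path
)
query.awaitTermination()
```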

To run the data warehouse transformations, trigger this DAG in Airflow's UI, or let it run automatically every day at `23:00`.
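
For reference, a DAG with that behavior could look roughly like the sketch below: a `BashOperator` shelling out to dbt on a `0 23 * * *` schedule. The DAG id and project paths are hypothetical, not taken from the repository.
```python
# Sketch of a daily dbt DAG; the id and paths are assumptions, not the repo's DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_transformations",   # hypothetical DAG id
    start_date=datetime(2024, 12, 1),
    schedule_interval="0 23 * * *",       # run daily at 23:00
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        # Assumed dbt project location inside the Airflow container.
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )
```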

# Real-time dashboard for trend analysis
