https://github.com/dain55788/end-to-end-streaming-big-data
End-To-End Streaming Big Data Project makes processing big data easy.
- Host: GitHub
- URL: https://github.com/dain55788/end-to-end-streaming-big-data
- Owner: dain55788
- License: apache-2.0
- Created: 2025-01-01T11:27:22.000Z (10 months ago)
- Default Branch: master
- Last Pushed: 2025-03-24T15:44:28.000Z (7 months ago)
- Last Synced: 2025-03-24T16:37:46.576Z (7 months ago)
- Topics: airflow, dbt, docker, kafka, minio, postgresql, spark, trino
- Language: Python
- Homepage:
- Size: 12.7 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# End-To-End-Streaming-Big-Data
## End-To-End Streaming Big Data Project makes big data processing easy with Airflow, Kafka, Spark, MinIO and much more!!

## Top Contents:
+ Streaming large volumes of data using Kafka and Spark Streaming.
+ Managing Apache Kafka with Confluent Control Center, Apache Zookeeper and Schema Registry.
+ Data lake processing with Delta Lake and object storage with MinIO.
+ ELT Pipeline:
+ Automated Medallion Architecture implementation on the dataset with Airflow (see the DAG sketch after this list).
+ Data Modeling and Data Warehousing with PostgreSQL and dbt.
+ Distributed query engine Trino (accessed through DBeaver) for high query performance.
+ Data Visualization with Superset.
+ Project Report.
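For the ELT pipeline item above, here is a minimal sketch of what a medallion-style Airflow DAG can look like. Task names, callables and the schedule are illustrative placeholders, not the DAG shipped in this repository (the `schedule` argument assumes Airflow 2.4+).

```python
# Minimal sketch of a medallion-style Airflow DAG (illustrative only; task and
# callable names are placeholders, not the DAG in this repository).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_bronze():
    """Land raw sales events from the lake into the bronze layer."""
    ...


def transform_silver():
    """Clean and conform bronze data into the silver layer."""
    ...


def build_gold():
    """Aggregate silver data into gold, star-schema-ready tables."""
    ...


with DAG(
    dag_id="medallion_sales_pipeline",   # placeholder DAG id
    start_date=datetime(2025, 1, 1),
    schedule="@daily",                   # assumes Airflow 2.4+; use schedule_interval on older versions
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="ingest_bronze", python_callable=ingest_bronze)
    silver = PythonOperator(task_id="transform_silver", python_callable=transform_silver)
    gold = PythonOperator(task_id="build_gold", python_callable=build_gold)

    # Bronze -> silver -> gold, the usual medallion ordering.
    bronze >> silver >> gold
```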
## Dataset:
This project uses the Amazon Sales Report dataset; you can find the data here: https://github.com/AshaoluV/Amazon-Sales-Project/blob/main/Amazon%20Sales.csv

## Star Schema Model
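Before settling on fact and dimension tables, it helps to inspect the raw columns of the dataset above. A minimal sketch; the raw-file URL is inferred from the GitHub link in the Dataset section and may change.

```python
# Quick inspection of the raw Amazon Sales CSV (the raw URL is inferred from
# the GitHub link above; download the file locally if the URL changes).
import pandas as pd

RAW_URL = (
    "https://raw.githubusercontent.com/AshaoluV/Amazon-Sales-Project/"
    "main/Amazon%20Sales.csv"
)

df = pd.read_csv(RAW_URL)
print(df.shape)    # rows x columns
print(df.dtypes)   # candidate dimension and measure columns
print(df.head())
```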
## Tools & Technologies
+ Streaming and Batch Data Processing: Apache Kafka, Apache Spark.
+ IDE: PyCharm.
+ Programming Languages: Python.
+ Data Orchestration Tool: Apache Airflow.
+ Data Lake / Data Lakehouse: Delta Lake, MinIO.
+ Data Visualization Tool: Superset.
+ Containerization: Docker, Docker Compose.
+ Query Engine: Trino (with DBeaver as the SQL client).
+ Data Transformation, Data Modeling and Data Warehousing: dbt, PostgreSQL.

## Architecture
## Setup
### Pre-requisites:
+ First, make sure you have your PyCharm IDE, Docker, Apache Kafka, Apache Spark and Apache Airflow set up in your project.
+ In your terminal, create a Python virtual environment to work with and run (if you are using Windows):
1. ```python -m venv venv```
2. ```venv\Scripts\activate```
3. ```python -m pip install -r requirements.txt``` (installs all required libraries for the project)
+ Launch Docker: ```docker compose up -d```
+ Run the event_streaming Python file in Kafka events to start producing data (see the producer sketch below).
+ Run the command ```python spark_streaming/sales_delta_spark_to_minio.py``` (submitting the Spark job and streaming the data to MinIO; see the Spark Structured Streaming sketch below).
+ Access the services:
+ Confluent Control Center for Kafka is accessible at `http://localhost:9021`.
+ MinIO is accessible at `http://localhost:9001`.

+ Trino is accessible at `http://localhost:8084`.
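For the event-streaming step above, here is a rough idea of what a producer for this dataset can look like. This is a minimal sketch using kafka-python, not the repository's actual event_streaming script; the broker address, topic name and CSV path are assumptions.

```python
# Minimal sketch of a sales-event producer (illustrative only; the real script
# is the event_streaming file in this repository and may differ).
import json
import time

import pandas as pd
from kafka import KafkaProducer  # pip install kafka-python

BOOTSTRAP_SERVERS = "localhost:9092"   # assumed broker address from the compose setup
TOPIC = "amazon_sales"                 # placeholder topic name

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("data/Amazon Sales.csv")  # adjust to wherever you saved the dataset

for record in df.to_dict(orient="records"):
    producer.send(TOPIC, value=record)     # each CSV row becomes one JSON event
    time.sleep(0.1)                        # throttle to simulate a live stream

producer.flush()
```

If the Schema Registry mentioned above is used, Avro messages with registered schemas would replace the plain JSON serialization shown here.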
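For the Spark job step, a minimal sketch of reading that topic with Spark Structured Streaming and appending it to a Delta table on MinIO over S3A. The endpoint, credentials, topic and bucket names are assumed defaults, not necessarily what sales_delta_spark_to_minio.py uses.

```python
# Minimal sketch of Kafka -> Spark Structured Streaming -> Delta on MinIO.
# Requires the Delta Lake and spark-sql-kafka packages on the classpath
# (e.g. via spark-submit --packages).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("sales_delta_spark_to_minio_sketch")
    # Delta Lake support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # MinIO (S3A) access -- endpoint and credentials are assumed defaults
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read raw JSON events from Kafka (topic name is a placeholder).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "amazon_sales")
    .option("startingOffsets", "earliest")
    .load()
)

events = raw.select(col("value").cast("string").alias("json_value"))

# Append the stream into a Delta table in a MinIO bucket (bucket path is a placeholder).
query = (
    events.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/sales")
    .start("s3a://warehouse/bronze/sales")
)
query.awaitTermination()
```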
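Once the services are up, a quick way to check Trino from Python. This is a hedged sketch: the catalog and schema are placeholders, so adjust them to whatever the project's Trino configuration exposes.

```python
# Minimal sketch of querying Trino from Python (pip install trino).
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8084,          # Trino port from the setup notes above
    user="admin",       # any user name works for an unsecured local Trino
    catalog="delta",    # placeholder catalog
    schema="default",   # placeholder schema
)

cur = conn.cursor()
cur.execute("SHOW TABLES")  # sanity check that the connection works
for row in cur.fetchall():
    print(row)
```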

### How can I make this better?!
A lot can still be done :)
+ Choose managed infrastructure:
+ Cloud Composer for Airflow; managed Kafka and Spark services on AWS.
+ Kafka Streaming process monitoring with Prometheus and Grafana.
+ Include CI/CD Operations.
+ Write data quality tests.
+ Storage Layer Deployment with AWS S3 and Terraform.

---
© 2025 Nguyen Dai