https://github.com/msnzmt/spotify-bigdata-streaming
Spotify BigData Streaming is a real-time data streaming and analytics pipeline that processes event data using Kafka, Spark, and Hadoop HDFS. It follows a Star Schema approach, transforming raw data into structured formats with dbt and storing business-ready insights in ClickHouse. Finally, Metabase provides interactive visualizations for analytics.
dbt docker-compose hdfs kafka spark-streaming
- Host: GitHub
- URL: https://github.com/msnzmt/spotify-bigdata-streaming
- Owner: MsnzmT
- License: mit
- Created: 2025-03-07T18:29:43.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T11:51:55.000Z (over 1 year ago)
- Last Synced: 2025-03-11T12:35:44.004Z (over 1 year ago)
- Topics: dbt, docker-compose, hdfs, kafka, spark-streaming
- Language: Python
- Homepage:
- Size: 41 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# **Spotify Big Data Streaming**

## 📌 **Overview**
Spotify Big Data Streaming is a data pipeline project that processes, transforms, and analyzes event data using big data tools. It models the data as a **star schema** (fact and dimension tables), which keeps storage and retrieval efficient for analytics.
## ⚙️ **Tech Stack**
- **Event Sim** → Generates simulated event data and produces it to Kafka.
- **Kafka** → Acts as a message broker to handle real-time event streaming.
- **Apache Spark** → Consumes Kafka messages, processes them, and stores raw data in HDFS.
- **Hadoop HDFS** → Stores raw and transformed data across different layers (Bronze, Silver, Gold).
- **dbt (data build tool)** → Transforms data stored in HDFS through the Spark adapter (`dbt-spark`).
- **ClickHouse** → A high-performance columnar database for data warehousing.
- **Metabase** → A business intelligence tool to create visualizations and charts.
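
The first hop in the stack (Event Sim producing to Kafka) can be sketched roughly as below. This is only an illustration: the event fields (`ts`, `user_id`, `song`, `event_type`) and the helper names are hypothetical, since the README does not show the real Event Sim schema. The `send` callable is injected so the sketch runs without a broker.

```python
import json
import random
import time
from typing import Any, Callable, Dict


def make_event(rng: random.Random, now_ms: int) -> Dict[str, Any]:
    """Build one simulated event. Field names are hypothetical."""
    return {
        "ts": now_ms,  # event time in epoch milliseconds
        "user_id": rng.randint(1, 1000),
        "song": rng.choice(["song_a", "song_b", "song_c"]),
        "event_type": "play",
    }


def serialize(event: Dict[str, Any]) -> bytes:
    """Encode an event as UTF-8 JSON, the payload sent to Kafka."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def produce(send: Callable[[str, bytes], Any], topic: str, n: int, seed: int = 0) -> int:
    """Generate n events and hand each one to `send(topic, payload)`."""
    rng = random.Random(seed)
    for _ in range(n):
        send(topic, serialize(make_event(rng, int(time.time() * 1000))))
    return n
```

With the `kafka-python` package installed, `KafkaProducer(bootstrap_servers="localhost:9092").send` could be passed as `send`; the broker address here is an assumption, not a value taken from this repository.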
## 🔄 **Data Processing Workflow**
### 1️⃣ **Ingestion & Storage (Bronze Layer)**
- The pipeline starts with **Kafka**, receiving raw event data generated by **Event Sim**.
- **Apache Spark** consumes this data from Kafka and stores it in **HDFS (Bronze Layer)** in Parquet format.
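
A minimal sketch of the Bronze ingestion step with Spark Structured Streaming might look like the following. The topic name, paths, and helper function names are assumptions made for illustration, and the actual job in the repository may be organized differently. It also assumes the Spark Kafka connector (`spark-sql-kafka`) is on the classpath.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Options for Spark's built-in "kafka" streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def bronze_sink_options(base_path: str) -> Dict[str, str]:
    """Output and checkpoint locations under a (hypothetical) Bronze base path."""
    base = base_path.rstrip("/")
    return {
        "path": f"{base}/events",
        "checkpointLocation": f"{base}/_checkpoints/events",
    }


def start_bronze_stream(spark, bootstrap_servers: str, topic: str, base_path: str):
    """Read raw Kafka records and append them to Parquet files on HDFS."""
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as binary, so cast it to a string column.
    events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
    return (
        events.writeStream.format("parquet")
        .options(**bronze_sink_options(base_path))
        .outputMode("append")
        .start()
    )
```

The file sink requires a checkpoint location, which is why the sketch derives one next to the output path. Calling `start_bronze_stream(spark, "kafka:9092", "events", "hdfs://namenode:9000/bronze")` would start the query; those host names are placeholders.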
### 2️⃣ **Transformation (Silver Layer)**
- Using **dbt**, the raw data is cleaned, transformed, and structured into **Fact** and **Dimension tables** based on a **Star Schema**.
- This processed data is stored in the **HDFS Silver Layer**.
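
In the project this step is done by dbt models, which are not shown in this README. The plain-Python sketch below only illustrates the shape of a star-schema split: repeated descriptive values move into a dimension table with a surrogate key, and the fact table keeps one row per event pointing at that key. The column names (`song`, `song_key`, `ts`, `user_id`) are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def build_star(
    events: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split raw events into a song dimension and a plays fact table."""
    song_keys: Dict[str, int] = {}  # song name -> surrogate key
    fact_plays: List[Dict[str, Any]] = []
    for e in events:
        # Assign the next integer key the first time a song is seen.
        key = song_keys.setdefault(e["song"], len(song_keys) + 1)
        fact_plays.append({"ts": e["ts"], "user_id": e["user_id"], "song_key": key})
    dim_song = [{"song_key": k, "song": name} for name, k in song_keys.items()]
    return dim_song, fact_plays
```

For example, three events for songs `a`, `b`, `a` would yield two dimension rows and three fact rows whose `song_key` values are `1, 2, 1`.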
### 3️⃣ **Business-Ready Data (Gold Layer)**
- Further transformations are applied using **dbt** to create **aggregated, business-ready data** in the **HDFS Gold Layer**.
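
As an illustration of what "aggregated, business-ready" can mean, the sketch below rolls fact rows up to daily play counts per song. The metric and the column names are assumptions (the README does not list the Gold models), and in the project itself this logic would live in dbt rather than Python.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Any, Dict, List


def daily_plays(fact_plays: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Count plays per (UTC day, song_key); `ts` is epoch milliseconds."""
    counts: Counter = Counter()
    for row in fact_plays:
        day = datetime.fromtimestamp(row["ts"] / 1000, tz=timezone.utc).date().isoformat()
        counts[(day, row["song_key"])] += 1
    # Sort by (day, song_key) so the output order is deterministic.
    return [
        {"day": day, "song_key": key, "plays": n}
        for (day, key), n in sorted(counts.items())
    ]
```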
### 4️⃣ **Data Warehousing & Analytics**
- The **Gold Layer** data is loaded into **ClickHouse**, enabling fast analytical queries.
- **Metabase** is connected to **ClickHouse** to build insightful dashboards and visualizations.
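
The README does not say how Gold data is loaded into ClickHouse. One possible approach, sketched here under that caveat, uses the `clickhouse-connect` client; the table name, column layout, and helper functions are hypothetical.

```python
from datetime import date
from typing import Any, Dict, List

COLUMNS = ["day", "song_key", "plays"]  # hypothetical Gold table layout


def create_table_sql(table: str) -> str:
    """DDL for a MergeTree table ordered by the main query keys."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(day Date, song_key UInt32, plays UInt64) "
        "ENGINE = MergeTree ORDER BY (day, song_key)"
    )


def to_rows(records: List[Dict[str, Any]]) -> List[List[Any]]:
    """Convert dict records to row lists in COLUMNS order."""
    return [[date.fromisoformat(r["day"]), r["song_key"], r["plays"]] for r in records]


def load_gold(client, table: str, records: List[Dict[str, Any]]) -> int:
    """Create the table if needed, insert the records, return the row count."""
    client.command(create_table_sql(table))
    rows = to_rows(records)
    client.insert(table, rows, column_names=COLUMNS)
    return len(rows)
```

A client could be obtained with `clickhouse_connect.get_client(host="localhost")` (host is a placeholder). Once the table exists, Metabase can query it through its ClickHouse connection.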
## 🚀 **Key Features**
- ✔️ **Real-time Data Streaming** with Kafka
- ✔️ **Scalable Data Storage** using HDFS
- ✔️ **Transformations with dbt** following Star Schema
- ✔️ **Fast Querying** with ClickHouse
- ✔️ **Intuitive Data Visualizations** with Metabase

This project covers **end-to-end data management**, from ingestion to analytics.