https://github.com/msnzmt/spotify-bigdata-streaming

Spotify BigData Streaming is a real-time data streaming and analytics pipeline that processes event data using Kafka, Spark, and Hadoop HDFS. It follows a Star Schema approach, transforming raw data into structured formats with dbt and storing business-ready insights in ClickHouse. Finally, Metabase provides interactive visualizations for analytics.
dbt docker-compose hdfs kafka spark-streaming

# **Spotify Big Data Streaming**

![Data Flow](docs/image.png)

## 📌 **Overview**
Spotify Big Data Streaming is a data pipeline project that processes, transforms, and analyzes event data using big data tools. It models the data as a **star schema**, so analytical queries join a central fact table to a small set of dimension tables.

## ⚙️ **Tech Stack**
- **Event Sim** → Generates simulated event data and produces it to Kafka.
- **Kafka** → Acts as a message broker to handle real-time event streaming.
- **Apache Spark** → Consumes Kafka messages, processes them, and stores raw data in HDFS.
- **Hadoop HDFS** → Stores raw and transformed data across different layers (Bronze, Silver, Gold).
- **dbt (Data Build Tool)** → Transforms data in HDFS using the Spark adapter.
- **ClickHouse** → A high-performance columnar database for data warehousing.
- **Metabase** → A business intelligence tool to create visualizations and charts.
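
To make the first two components more concrete, here is a minimal sketch of the kind of record an event simulator might emit and how it could be serialized into a Kafka key/value pair. The field names (`user_id`, `artist`, `song`, `ts`), the catalogue, and the helper names are assumptions for illustration, not the project's actual schema, and the sketch stops short of sending anything to a broker.

```python
import json
import random
from datetime import datetime, timezone

# Hypothetical catalogue; the simulator's real data is not shown in this README.
CATALOGUE = [("Artist A", "Song 1"), ("Artist B", "Song 2")]


def make_event(rng: random.Random, now: datetime) -> dict:
    """Build one simulated listening event (assumed schema)."""
    artist, song = rng.choice(CATALOGUE)
    return {
        "user_id": rng.randint(1, 100),
        "artist": artist,
        "song": song,
        "ts": int(now.timestamp() * 1000),  # epoch milliseconds
    }


def to_kafka_record(event: dict) -> tuple[bytes, bytes]:
    """Serialize an event into the (key, value) byte pair a producer would send."""
    key = str(event["user_id"]).encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


rng = random.Random(42)  # fixed seed so runs are repeatable
event = make_event(rng, datetime(2024, 1, 1, tzinfo=timezone.utc))
key, value = to_kafka_record(event)
```

Keying each message by user is one possible choice: with Kafka's default partitioner, messages that share a key go to the same partition, which preserves per-user ordering.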

## 🔄 **Data Processing Workflow**
### 1️⃣ **Ingestion & Storage (Bronze Layer)**
- The pipeline starts with **Kafka**, receiving raw event data generated by **Event Sim**.
- **Apache Spark** consumes this data from Kafka and stores it in **HDFS (Bronze Layer)** in Parquet format.
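
As a rough illustration of this step, the sketch below decodes a Kafka message value and derives a date-partitioned Bronze directory for it. The JSON payload, the `ts` field in epoch milliseconds, the `/bronze/events` root, and the `event_date=` partition layout are all assumptions; in the actual pipeline Spark performs the read and writes the Parquet files.

```python
import json
from datetime import datetime, timezone

BRONZE_ROOT = "/bronze/events"  # hypothetical HDFS directory


def decode_event(value: bytes) -> dict:
    """Decode a Kafka message value, assumed to be UTF-8 encoded JSON."""
    return json.loads(value.decode("utf-8"))


def bronze_partition_path(event: dict) -> str:
    """Return the date-partitioned directory an event would land in."""
    event_time = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
    return f"{BRONZE_ROOT}/event_date={event_time:%Y-%m-%d}"


raw = b'{"user_id": 7, "song": "Song 1", "ts": 1704067200000}'
event = decode_event(raw)
path = bronze_partition_path(event)
```

Partitioning raw data by event date is a common layout because later jobs can read a single day without scanning the whole layer; whether this project partitions its Bronze data this way is not stated here.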

### 2️⃣ **Transformation (Silver Layer)**
- Using **dbt**, the raw data is cleaned, transformed, and structured into **Fact** and **Dimension tables** based on a **Star Schema**.
- This processed data is stored in the **HDFS Silver Layer**.
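
The split into fact and dimension tables can be pictured with a small Python sketch. In the project this work is done by dbt models running on Spark, so the code below is only an illustration of the resulting shape; the table names (`dim_user`, `dim_song`, `fact_plays`), the surrogate-key scheme, and the event fields are assumptions.

```python
def build_star_schema(events: list[dict]) -> tuple[dict, dict, list[dict]]:
    """Split flat events into user and song dimensions plus a fact table."""
    dim_user: dict = {}  # natural user_id -> surrogate key
    dim_song: dict = {}  # (artist, song) -> surrogate key
    fact_plays: list[dict] = []
    for e in events:
        # setdefault assigns the next surrogate key only for unseen values.
        user_key = dim_user.setdefault(e["user_id"], len(dim_user) + 1)
        song_key = dim_song.setdefault((e["artist"], e["song"]), len(dim_song) + 1)
        # The fact row keeps only keys and measures, not descriptive text.
        fact_plays.append({"user_key": user_key, "song_key": song_key, "ts": e["ts"]})
    return dim_user, dim_song, fact_plays


events = [
    {"user_id": 7, "artist": "Artist A", "song": "Song 1", "ts": 1},
    {"user_id": 7, "artist": "Artist B", "song": "Song 2", "ts": 2},
    {"user_id": 9, "artist": "Artist A", "song": "Song 1", "ts": 3},
]
dim_user, dim_song, fact_plays = build_star_schema(events)
```

Each distinct user and song appears once in its dimension, while the fact table holds one narrow row per event.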

### 3️⃣ **Business-Ready Data (Gold Layer)**
- Further transformations are applied using **dbt** to create **aggregated, business-ready data** in the **HDFS Gold Layer**.
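
One example of an aggregated, business-ready table is a daily play count per song. The sketch below computes it in plain Python from fact rows; the metric itself, the `song_key` and `ts` fields, and the function name are assumptions, since the README does not list the Gold models, and the real transformation would be a dbt model.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_play_counts(fact_plays: list[dict]) -> dict:
    """Count plays per (UTC date, song_key) from fact rows with epoch-ms timestamps."""
    counts: Counter = Counter()
    for row in fact_plays:
        day = datetime.fromtimestamp(row["ts"] / 1000, tz=timezone.utc).date().isoformat()
        counts[(day, row["song_key"])] += 1
    return dict(counts)


fact_plays = [
    {"song_key": 1, "ts": 1704067200000},  # 2024-01-01 00:00 UTC
    {"song_key": 1, "ts": 1704070800000},  # 2024-01-01 01:00 UTC
    {"song_key": 1, "ts": 1704153600000},  # 2024-01-02 00:00 UTC
]
gold = daily_play_counts(fact_plays)
```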

### 4️⃣ **Data Warehousing & Analytics**
- The **Gold Layer** data is loaded into **ClickHouse**, enabling fast analytical queries.
- **Metabase** is connected to **ClickHouse** to build insightful dashboards and visualizations.
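
For the warehouse side, a Gold table needs a matching ClickHouse table before it can be loaded. The helper below is a hypothetical sketch that renders a `CREATE TABLE` statement using the `MergeTree` engine; the table name, columns, and sort key are assumptions, and the project's actual DDL and loading mechanism are not described in this README.

```python
def clickhouse_create_table(table: str, columns: dict[str, str], order_by: list[str]) -> str:
    """Render a CREATE TABLE statement for a ClickHouse MergeTree table."""
    cols = ",\n".join(f"    {name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n(\n{cols}\n)\n"
        f"ENGINE = MergeTree\nORDER BY ({', '.join(order_by)})"
    )


# Hypothetical Gold table for daily play counts.
ddl = clickhouse_create_table(
    "gold_daily_plays",
    {"play_date": "Date", "song_key": "UInt32", "plays": "UInt64"},
    ["play_date", "song_key"],
)
```

The `ORDER BY` clause defines the table's sorting key, so dashboard queries that filter on `play_date` can skip unrelated data. Because the statement is built by string interpolation, the names passed in should come from trusted configuration rather than user input.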

## 🚀 **Key Features**
- ✔️ **Real-time Data Streaming** with Kafka
- ✔️ **Scalable Data Storage** using HDFS
- ✔️ **Transformations with dbt** following Star Schema
- ✔️ **Fast Querying** with ClickHouse
- ✔️ **Intuitive Data Visualizations** with Metabase

This project covers **end-to-end data management**, from ingestion through transformation and warehousing to analytics.