{"id":26277985,"url":"https://github.com/msnzmt/spotify-bigdata-streaming","last_synced_at":"2026-05-17T12:33:23.007Z","repository":{"id":281840797,"uuid":"944658184","full_name":"MsnzmT/Spotify-BigData-Streaming","owner":"MsnzmT","description":"Spotify BigData Streaming is a real-time data streaming and analytics pipeline that processes event data using Kafka, Spark, and Hadoop HDFS. It follows a Star Schema approach, transforming raw data into structured formats with dbt and storing business-ready insights in ClickHouse. Finally, Metabase provides interactive visualizations for analytics","archived":false,"fork":false,"pushed_at":"2025-03-11T11:51:55.000Z","size":42,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T12:35:44.004Z","etag":null,"topics":["dbt","docker-compose","hdfs","kafka","spark-streaming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MsnzmT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-07T18:29:43.000Z","updated_at":"2025-03-11T11:51:59.000Z","dependencies_parsed_at":"2025-03-11T12:47:28.912Z","dependency_job_id":null,"html_url":"https://github.com/MsnzmT/Spotify-BigData-Streaming","commit_stats":null,"previous_names":["msnzmt/spotify-bigdata-streaming"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MsnzmT/Spotify-BigData-Streaming","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MsnzmT%2FSpotify-Big
Data-Streaming","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MsnzmT%2FSpotify-BigData-Streaming/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MsnzmT%2FSpotify-BigData-Streaming/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MsnzmT%2FSpotify-BigData-Streaming/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MsnzmT","download_url":"https://codeload.github.com/MsnzmT/Spotify-BigData-Streaming/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MsnzmT%2FSpotify-BigData-Streaming/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33138371,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T09:28:26.183Z","status":"ssl_error","status_checked_at":"2026-05-17T09:27:52.702Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dbt","docker-compose","hdfs","kafka","spark-streaming"],"created_at":"2025-03-14T12:31:17.977Z","updated_at":"2026-05-17T12:33:22.991Z","avatar_url":"https://github.com/MsnzmT.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Spotify Big Data Streaming**\n\n![Data Flow](docs/image.png)\n\n## 📌 **Overview**\nSpotify Big Data Management is a 
data pipeline project designed to process, transform, and analyze event data using big data tools and technologies. It follows a **star schema** approach, ensuring efficient data storage and retrieval for analytics.\n\n## ⚙️ **Tech Stack**\n- **Event Sim** → Generates simulated event data and produces it to Kafka.\n- **Kafka** → Acts as a message broker to handle real-time event streaming.\n- **Apache Spark** → Consumes Kafka messages, processes them, and stores raw data in HDFS.\n- **Hadoop HDFS** → Stores raw and transformed data across different layers (Bronze, Silver, Gold).\n- **dbt (Data Build Tool)** → Transforms data in HDFS using the Spark adapter.\n- **ClickHouse** → A high-performance columnar database for data warehousing.\n- **Metabase** → A business intelligence tool to create visualizations and charts.\n\n## 🔄 **Data Processing Workflow**\n### 1️⃣ **Ingestion \u0026 Storage (Bronze Layer)**\n- The pipeline starts with **Kafka**, receiving raw event data generated by **Event Sim**.\n- **Apache Spark** consumes this data from Kafka and stores it in **HDFS (Bronze Layer)** in Parquet format.\n\n### 2️⃣ **Transformation (Silver Layer)**\n- Using **dbt**, the raw data is cleaned, transformed, and structured into **Fact** and **Dimension tables** based on a **Star Schema**.\n- This processed data is stored in the **HDFS Silver Layer**.\n\n### 3️⃣ **Business-Ready Data (Gold Layer)**\n- Further transformations are applied using **dbt** to create **aggregated, business-ready data** in the **HDFS Gold Layer**.\n\n### 4️⃣ **Data Warehousing \u0026 Analytics**\n- The **Gold Layer** data is loaded into **ClickHouse**, enabling fast analytical queries.\n- **Metabase** is connected to **ClickHouse** to build insightful dashboards and visualizations.\n\n## 🚀 **Key Features**\n✔️ **Real-time Data Streaming** with Kafka  \n✔️ **Scalable Data Storage** using HDFS  \n✔️ **Transformations with dbt** following Star Schema  \n✔️ **Fast Querying** with ClickHouse  \n✔️ 
**Intuitive Data Visualizations** with Metabase  \n\nThis project covers **end-to-end data management**, from event ingestion in Kafka to dashboards in Metabase.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsnzmt%2Fspotify-bigdata-streaming","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsnzmt%2Fspotify-bigdata-streaming","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsnzmt%2Fspotify-bigdata-streaming/lists"}