https://github.com/topefolorunso/musicaly-project
An end-to-end data pipeline that ingests simulated music stream data, structures, cleans and models the raw data, and visualizes clean data.
airflow bigquery data-pipeline dbt google-cloud-platform kafka python spark-streaming
- Host: GitHub
- URL: https://github.com/topefolorunso/musicaly-project
- Owner: topefolorunso
- Created: 2022-05-05T07:12:00.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-01-24T21:32:28.000Z (over 3 years ago)
- Last Synced: 2025-05-16T10:45:27.434Z (about 1 year ago)
- Topics: airflow, bigquery, data-pipeline, dbt, google-cloud-platform, kafka, python, spark-streaming
- Language: Python
- Homepage:
- Size: 97.2 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# musicaly
An end-to-end data pipeline that ingests simulated music stream data, structures, cleans and models the raw data, and performs analytics on clean data.
## background
Eventsim is a top music streaming company. Its management is working on a new feature tailored to users' preferences. To support the development of this feature, the developers need to understand users' streaming habits, so they came up with use cases and questions that need to be answered:
1. What is the total number of active users, their total stream hours and their geographic distribution?
2. What is the overall gender composition of users, and how does it break down among listeners of the top artists?
3. What are the top songs and who are the top artists that users listen to?
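
To make the three questions concrete, here is a minimal plain-Python sketch of the aggregations they imply. It is only an illustration: the record layout (`user_id`, `gender`, `state`, `artist`, `song`, `duration_s`) and the sample rows are hypothetical, and in this project such aggregations would more plausibly live in the warehouse models than in a script like this.

```python
from collections import Counter, defaultdict

# Hypothetical structured listen records; the field names are assumptions
# made for this sketch and are not taken from the project's schema.
LISTENS = [
    {"user_id": 1, "gender": "F", "state": "CA", "artist": "Artist A", "song": "Song 1", "duration_s": 1800.0},
    {"user_id": 1, "gender": "F", "state": "CA", "artist": "Artist B", "song": "Song 2", "duration_s": 1800.0},
    {"user_id": 2, "gender": "M", "state": "NY", "artist": "Artist A", "song": "Song 1", "duration_s": 3600.0},
    {"user_id": 3, "gender": "F", "state": "CA", "artist": "Artist A", "song": "Song 3", "duration_s": 3600.0},
]


def user_activity(listens):
    """Question 1: active users, total stream hours and users per state."""
    users_by_state = defaultdict(set)
    for row in listens:
        users_by_state[row["state"]].add(row["user_id"])
    return {
        "active_users": len({row["user_id"] for row in listens}),
        "stream_hours": sum(row["duration_s"] for row in listens) / 3600,
        "users_by_state": {state: len(users) for state, users in users_by_state.items()},
    }


def gender_by_artist(listens):
    """Question 2: number of distinct listeners per gender for each artist."""
    listeners = defaultdict(lambda: defaultdict(set))
    for row in listens:
        listeners[row["artist"]][row["gender"]].add(row["user_id"])
    return {
        artist: {gender: len(users) for gender, users in genders.items()}
        for artist, genders in listeners.items()
    }


def top_n(listens, field, n=3):
    """Question 3: the n most played values of `field` ("song" or "artist")."""
    return Counter(row[field] for row in listens).most_common(n)


if __name__ == "__main__":
    print(user_activity(LISTENS))
    print(gender_by_artist(LISTENS))
    print(top_n(LISTENS, "song"))
    print(top_n(LISTENS, "artist"))
```

Counting distinct users per gender, rather than plays, keeps one heavy listener from skewing the gender split for an artist.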
## data flow
* The Eventsim API produces the streaming data, which is ingested by Kafka.
* Spark Streaming reads the stream data from Kafka.
* Spark Streaming structures the data and writes it to the data lake (Cloud Storage) as flat files.
* dbt performs the ELT from the data lake (Cloud Storage) to the data warehouse (BigQuery), orchestrated with Airflow.
* Stream analytics are built and published with Google Data Studio.
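
The "structure and write as flat file" step can be pictured with the small standard-library sketch below. It is not the project's Spark Streaming job: it only mimics, on a single machine, the idea of parsing raw JSON events, coercing types, and emitting flat CSV rows. The event fields (`ts`, `userId`, `artist`, `song`, `duration`, `city`, `state`) and the output column names are assumptions made for this example.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Output columns of the flat file (hypothetical names chosen for this sketch).
COLUMNS = ["event_time", "user_id", "artist", "song", "duration_s", "city", "state"]


def structure_event(raw):
    """Parse one raw JSON event and return a flat, typed record."""
    event = json.loads(raw)
    return {
        # Assumption: `ts` is an epoch timestamp in milliseconds.
        "event_time": datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc).isoformat(),
        "user_id": int(event["userId"]),
        "artist": event["artist"].strip(),
        "song": event["song"].strip(),
        "duration_s": float(event["duration"]),
        "city": event["city"],
        "state": event["state"],
    }


def to_flat_file(raw_events):
    """Structure every raw event and serialise the result as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for raw in raw_events:
        writer.writerow(structure_event(raw))
    return buffer.getvalue()


if __name__ == "__main__":
    sample = json.dumps({
        "ts": 1651734720000, "userId": "7", "artist": " Artist A ", "song": "Song 1",
        "duration": 215.5, "city": "San Jose", "state": "CA",
    })
    print(to_flat_file([sample]))
```

Normalising types and timestamps before the data reaches the lake keeps the downstream warehouse models simpler, since they can rely on consistent column types.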

## cloud architecture

## data source
[Eventsim](https://github.com/Interana/eventsim) is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from [viirya's fork](https://github.com/viirya/eventsim) of it, as the original project has gone without maintenance for a few years now.
Eventsim uses song data from the [Million Song Dataset](http://millionsongdataset.com) to generate events. I have used a [subset](http://millionsongdataset.com/pages/getting-dataset/#subset) of 10,000 songs.
## dashboard
Click [here](https://datastudio.google.com/embed/reporting/1085eb37-b359-4613-90e2-71e54a82ff87/page/vYvuC) to view the latest version of the dashboard on Data Studio.

## how to set up
:warning: [**Note that GCP resources (which incur cost) are provisioned in this project**](https://cloud.google.com/pricing)
:warning: Also, this setup assumes you are using a Linux or bash environment
1. clone this repo to the `~/musicaly-project` directory
```bash
git clone https://github.com/topefolorunso/musicaly-project.git ~/musicaly-project && \
cd ~/musicaly-project
```
2. [set up a GCP account](gcp/README.md)
3. [provision infrastructure](terraform/README.md)
4. [SSH into and set up the VMs](vm_setup/README.md)
5. [proceed to run](#how-to-run)
## how to run
1. start up the Kafka service and start streaming (see [kafka/README.md](kafka/README.md))
2. start up the Spark Streaming service (see [spark_streaming/README.md](spark_streaming/README.md))
3. start up the Airflow service (see [airflow/README.md](airflow/README.md))
4. connect BigQuery to Data Studio for analytics