https://github.com/kellyjadams/spotify-data-analyze
A serverless data pipeline that logs my Spotify listening history to BigQuery using Cloud Run, then visualizes trends with Looker Studio. Built with Python, Flask, Docker, and GCP.
- Host: GitHub
- URL: https://github.com/kellyjadams/spotify-data-analyze
- Owner: kellyjadams
- Created: 2024-12-10T20:39:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-06T03:58:41.000Z (about 1 year ago)
- Last Synced: 2025-05-06T04:33:58.850Z (about 1 year ago)
- Topics: data-analysis, data-engineering
- Language: Python
- Homepage: https://lookerstudio.google.com/reporting/e2f6d5f3-c3cf-4687-ba01-d3a47a15998c
- Size: 49.8 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Spotify Listening Logger
This is a personal data project focusing on **data engineering** and **analytics engineering** skills.
- **Data engineering**: ETL pipelines, orchestration, and cloud infrastructure
- **Analytics engineering**: **BigQuery SQL**, data modeling, building analysis-ready datasets
It automatically logs my Spotify listening history every minute and stores it in **BigQuery** for analysis.
Links:
- [Blog Post](https://www.kellyjadams.com/post/spotify-listening-logger)
## Why I Built This
At the end of the year, I want to compare what's logged in BigQuery with my annual **Spotify Wrapped**.
I wanted to explore how to:
- Build a serverless data pipeline using **Cloud Run**, **Docker**, and **GitHub Actions**
- Manage secrets securely with `.env` files and **GitHub Secrets**
- Schedule and orchestrate updates using **Cloud Scheduler**
- Structure and transform data for analysis in **BigQuery**
- Design pipelines that are scalable and support near real-time insights
This project brings together core **cloud** and **analytics engineering** tools to build something end-to-end—from ingestion to analysis.
## Key Features
- **Data Ingestion & Deployment**
  - Polls Spotify’s now-playing endpoint every minute using a serverless Cloud Run app
  - Built as a containerized Python app using Flask + Spotipy
  - Deployed manually using a shell script; Cloud Scheduler handles orchestration by triggering the Cloud Run endpoint every minute
- **Data Storage & Modeling**
  - Streams listening history into **BigQuery**
  - Cleans and deduplicates plays for session-level analysis
  - Stores detailed metadata: artist, album, genre, popularity, duration
- **Analysis & Visualization**
  - Designed analysis-ready datasets using **BigQuery SQL**
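Because the endpoint is polled every minute, a single song will usually be logged several times. The following is a minimal sketch of the deduplication idea in Python, not the logic of the repository's SQL view. The field names `track_id` and `played_at` and the helper `dedupe_plays` are hypothetical, and timestamps are assumed to be UTC ISO 8601 strings in one consistent format so that they sort correctly as text.

```python
from itertools import groupby


def dedupe_plays(polls):
    """Collapse consecutive polls of the same track into one play each.

    `polls` is a list of dicts with hypothetical keys `track_id` and
    `played_at` (UTC ISO 8601 strings in a single consistent format).
    """
    # Order the polls by time so repeats of one track sit next to each other.
    ordered = sorted(polls, key=lambda p: p["played_at"])

    plays = []
    # groupby only merges *adjacent* items, so a track heard again later
    # in the day still counts as a separate play.
    for track_id, group in groupby(ordered, key=lambda p: p["track_id"]):
        group = list(group)
        plays.append(
            {
                "track_id": track_id,
                "first_seen": group[0]["played_at"],
                "last_seen": group[-1]["played_at"],
                "poll_count": len(group),
            }
        )
    return plays
```

One limitation of this approach: a track played twice back to back is merged into a single play, so a stricter version would also need to look at playback progress.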
## Technical Skills
- **Data Pipeline Design**: Built a serverless ETL pipeline from Spotify API to BigQuery using Python and Cloud Run
- **Cloud-Native ETL**: Extracted, transformed, and loaded data on a schedule using Cloud Scheduler and a containerized Flask app
- **Python**: Wrote ingestion and transformation logic using the `Spotipy` library
- **REST API**: Created a lightweight endpoint to trigger ingestion using Flask
- **BigQuery**: Designed table schema and streamed structured data for analysis
- **Docker + Cloud Run**: Packaged and deployed the app as a scalable container
- **Environment variable management**: Handled secrets securely using `.env` and GitHub Secrets
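To illustrate the transformation step, here is a sketch of how a now-playing response could be flattened into a single row before it is streamed to BigQuery. It assumes the payload has the shape returned by Spotify's currently-playing endpoint (for example via Spotipy's `current_user_playing_track()`). The function name `build_row` and the output column names are hypothetical and may not match `main.py`. Genre is left out because Spotify attaches genres to artist objects rather than tracks, so it would need a separate lookup.

```python
from datetime import datetime, timezone


def build_row(payload, logged_at=None):
    """Flatten a now-playing payload into one row, or return None.

    Returns None when nothing is playing, so the caller can skip the insert.
    """
    if not payload or not payload.get("is_playing") or not payload.get("item"):
        return None

    item = payload["item"]
    # Default to the current UTC time when the caller supplies no timestamp.
    logged_at = logged_at or datetime.now(timezone.utc)

    return {
        "logged_at": logged_at.isoformat(),
        "track_id": item["id"],
        "track": item["name"],
        # A track can have several artists; join them into one string.
        "artist": ", ".join(artist["name"] for artist in item["artists"]),
        "album": item["album"]["name"],
        "popularity": item.get("popularity"),
        "duration_ms": item["duration_ms"],
        "progress_ms": payload.get("progress_ms"),
    }
```

Returning `None` for paused or empty playback keeps idle minutes out of the table without raising an error on every poll.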
## Project Structure
```
spotify-data-analyze/
├── analysis/
│   ├── queries/
│   └── views/
│       └── deduped_plays_pacific.sql
├── cloud/
│   └── playback/
│       ├── main.py
│       ├── Dockerfile
│       ├── requirements.txt
│       └── deploy.sh
├── scripts/
│   ├── create_bigquery_table.py
│   ├── delete_bigquery_table.py
│   └── load_env.py
├── .env
└── .github/workflows/
    └── cloud-deploy.yml
```
## Environment & Deployment
This project uses a `.env` file for Spotify and GCP credentials. These variables are injected during Cloud Run deployment.
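As a rough illustration of reading such a file, the sketch below parses `KEY=VALUE` lines and reports any missing credentials. It is not the contents of `scripts/load_env.py`. The helpers `parse_env` and `missing_keys` are hypothetical, and the required key names are an assumption based on the environment variables Spotipy reads by default.

```python
# Assumed key names: these are the variables Spotipy looks for by default,
# not necessarily the ones this repository's .env file defines.
REQUIRED_KEYS = [
    "SPOTIPY_CLIENT_ID",
    "SPOTIPY_CLIENT_SECRET",
    "SPOTIPY_REDIRECT_URI",
]


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Checking for missing keys before deploying gives a clear error locally instead of a failed authentication inside the running container.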
I deployed the app manually using:
```bash
cd cloud/playback
./deploy.sh
```
I automated ingestion by setting up a **Cloud Scheduler** job to hit the app endpoint every minute.
The data is stored in a **BigQuery** table with fields like `track`, `artist`, `genre`, and `popularity`. See `create_bigquery_table.py` for schema setup.
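For orientation, a table definition along these lines can be written as a list of field descriptors in BigQuery's JSON schema format (`name`, `type`, `mode`). This is only a sketch: `track`, `artist`, `genre`, and `popularity` come from the description above, while `logged_at`, the types, the modes, and the helper `missing_required` are assumptions; the actual schema lives in `create_bigquery_table.py`.

```python
# Field descriptors in BigQuery's JSON schema format. With the
# google-cloud-bigquery client, each entry would map onto a
# bigquery.SchemaField(name, type, mode=...).
SCHEMA = [
    {"name": "logged_at", "type": "TIMESTAMP", "mode": "REQUIRED"},  # assumed column
    {"name": "track", "type": "STRING", "mode": "REQUIRED"},
    {"name": "artist", "type": "STRING", "mode": "NULLABLE"},
    {"name": "genre", "type": "STRING", "mode": "NULLABLE"},
    {"name": "popularity", "type": "INTEGER", "mode": "NULLABLE"},
]


def missing_required(row, schema=SCHEMA):
    """List REQUIRED fields that are absent or None in a candidate row."""
    return [
        field["name"]
        for field in schema
        if field["mode"] == "REQUIRED" and row.get(field["name"]) is None
    ]
```

Validating rows against the schema before streaming them helps surface malformed records in the application logs rather than as rejected inserts.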
## Next Steps
Below are my next steps:
- Automate CI/CD deployment via GitHub Actions
- Finalize my Looker Studio dashboard, which automatically retrieves data from my BigQuery tables
- Analyze stats using BigQuery SQL queries