https://github.com/kellyjadams/spotify-data-analyze

A serverless data pipeline that logs my Spotify listening history to BigQuery using Cloud Run, then visualizes trends with Looker Studio. Built with Python, Flask, Docker, and GCP.

data-analysis data-engineering

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/kellyjadams/spotify-data-analyze
Owner: kellyjadams
Created: 2024-12-10T20:39:05.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-05-06T03:58:41.000Z (about 1 year ago)
Last Synced: 2025-05-06T04:33:58.850Z (about 1 year ago)
Topics: data-analysis, data-engineering
Language: Python
Homepage: https://lookerstudio.google.com/reporting/e2f6d5f3-c3cf-4687-ba01-d3a47a15998c
Size: 49.8 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Spotify Listening Logger

This is a personal data project focusing on **data engineering** and **analytics engineering** skills.

- **Data engineering**: ETL pipelines, orchestration, and cloud infrastructure
- **Analytics engineering**: **BigQuery SQL**, data modeling, building analysis-ready datasets

It automatically logs my Spotify listening history every minute and stores it in **BigQuery** for analysis.

Links:
- [Blog Post](https://www.kellyjadams.com/post/spotify-listening-logger)

## Why I Built This

At the end of the year, I want to compare what's logged in BigQuery with my annual **Spotify Wrapped**.

I wanted to explore how to:

- Build a serverless data pipeline using **Cloud Run**, **Docker**, and **GitHub Actions**
- Manage secrets securely with `.env` files and **GitHub Secrets**
- Schedule and orchestrate updates using **Cloud Scheduler**
- Structure and transform data for analysis in **BigQuery**
- Design pipelines that are scalable and support near real-time insights

This project brings together core **cloud** and **analytics engineering** tools to build something end-to-end, from ingestion to analysis.

## Key Features

- **Data Ingestion & Deployment**
  - Polls Spotify’s now-playing endpoint every minute using a serverless Cloud Run app
  - Built as a containerized Python app using Flask + Spotipy
  - Deployed manually using a shell script; Cloud Scheduler handles orchestration by triggering the Cloud Run endpoint every minute
- **Data Storage & Modeling**
  - Streams listening history into **BigQuery**
  - Cleans and deduplicates plays for session-level analysis
  - Stores detailed metadata: artist, album, genre, popularity, duration
- **Analysis & Visualization**
  - Designed analysis-ready datasets using **BigQuery SQL**
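
Because the app polls once a minute, a single play of a song shows up as several rows, which is why deduplication matters. The project does this in BigQuery SQL; the Python sketch below only illustrates the idea under my own assumptions. The function name `dedupe_polls`, the tuple layout, and the two-minute `max_gap` are hypothetical and not taken from the repo.

```python
from datetime import datetime, timedelta


def dedupe_polls(polls, max_gap=timedelta(minutes=2)):
    """Collapse consecutive polls of the same track into one play each.

    `polls` is a list of (polled_at, track_id) tuples sorted by time.
    A new play starts when the track changes, or when the gap since the
    previous poll exceeds `max_gap` (so a replay hours later still counts).
    """
    plays = []
    prev_time, prev_track = None, None
    for polled_at, track_id in polls:
        same_play = (
            track_id == prev_track
            and prev_time is not None
            and polled_at - prev_time <= max_gap
        )
        if not same_play:
            plays.append({"track_id": track_id, "started_at": polled_at, "polls": 0})
        plays[-1]["polls"] += 1
        prev_time, prev_track = polled_at, track_id
    return plays


# Five polls: track "a" three times in a row, then "b", then "b" again hours later.
t0 = datetime(2025, 1, 1, 9, 0)
polls = [
    (t0, "a"),
    (t0 + timedelta(minutes=1), "a"),
    (t0 + timedelta(minutes=2), "a"),
    (t0 + timedelta(minutes=3), "b"),
    (t0 + timedelta(hours=3), "b"),
]
plays = dedupe_polls(polls)
```

One known limitation of this approach: a track played twice back-to-back is merged into a single play, since nothing in these inputs separates the two. Comparing playback progress between polls would be one way to split them.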

## Technical Skills

- **Data Pipeline Design**: Built a serverless ETL pipeline from Spotify API to BigQuery using Python and Cloud Run
- **Cloud-Native ETL**: Extracted, transformed, and loaded data on a schedule using Cloud Scheduler and a containerized Flask app
- **Python**: Wrote ingestion and transformation logic using the `Spotipy` library
- **REST API**: Created a lightweight endpoint to trigger ingestion using Flask
- **BigQuery**: Designed table schema and streamed structured data for analysis
- **Docker + Cloud Run**: Packaged and deployed the app as a scalable container
- **Environment variable management**: Handled secrets securely using `.env` and GitHub Secrets
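
As a rough illustration of the transformation step, here is a minimal sketch that flattens a now-playing response into one flat row. It assumes the payload follows the shape of Spotify's currently-playing response (which, as far as I know, Spotipy returns from `current_user_playing_track()`). The function name `to_row` and the output column names are hypothetical, not the repo's actual code.

```python
from datetime import datetime, timezone


def to_row(payload, polled_at):
    """Flatten a now-playing payload into a flat dict, or None if idle.

    Only music tracks are handled; anything paused, empty, or without an
    `item` is skipped. Genre is not part of the track object, so it would
    need a separate artist lookup and is left out here.
    """
    if not payload or not payload.get("is_playing") or not payload.get("item"):
        return None
    item = payload["item"]
    return {
        "polled_at": polled_at.isoformat(),
        "track_id": item["id"],
        "track": item["name"],
        "artist": ", ".join(a["name"] for a in item.get("artists", [])),
        "album": item["album"]["name"],
        "popularity": item.get("popularity"),
        "duration_ms": item["duration_ms"],
        "progress_ms": payload.get("progress_ms"),
    }


# Made-up sample payload, trimmed to the fields used above.
sample = {
    "is_playing": True,
    "progress_ms": 42000,
    "item": {
        "id": "track123",
        "name": "Example Song",
        "duration_ms": 180000,
        "popularity": 55,
        "artists": [{"name": "Artist One"}, {"name": "Artist Two"}],
        "album": {"name": "Example Album"},
    },
}
row = to_row(sample, datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc))
idle = to_row({"is_playing": False, "item": None}, datetime.now(timezone.utc))
```

Returning `None` for idle polls keeps the caller simple: it can skip the BigQuery insert whenever there is nothing to log.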

## Project Structure

```
spotify-data-analyze/
├── analysis/
│   ├── queries/
│   └── views/
│       └── deduped_plays_pacific.sql
├── cloud/
│   └── playback/
│       ├── main.py
│       ├── Dockerfile
│       ├── requirements.txt
│       └── deploy.sh
├── scripts/
│   ├── create_bigquery_table.py
│   ├── delete_bigquery_table.py
│   └── load_env.py
├── .env
└── .github/workflows/
    └── cloud-deploy.yml
```

## Environment & Deployment

This project uses a `.env` file for Spotify and GCP credentials. These variables are injected during Cloud Run deployment.
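
To make the `.env` flow concrete, here is a minimal sketch of parsing `KEY=VALUE` lines, checking for missing values, and formatting the result for `gcloud run deploy --set-env-vars`. This is not the contents of `load_env.py` or `deploy.sh`; the variable names (`SPOTIFY_CLIENT_ID` and so on) and helper names are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blanks, comments, and malformed lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_keys(env, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


def to_set_env_vars(env):
    """Format non-empty variables as KEY=VALUE pairs joined by commas.

    Values that themselves contain commas would need extra handling.
    """
    return ",".join(f"{key}={value}" for key, value in env.items() if value)


# Hypothetical variable names, for illustration only.
REQUIRED = ["SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET", "GCP_PROJECT_ID"]

sample = """
# Spotify credentials
SPOTIFY_CLIENT_ID=abc123
SPOTIFY_CLIENT_SECRET="s3cret"
GCP_PROJECT_ID=
"""
env = parse_env(sample)
missing = missing_keys(env, REQUIRED)
flag_value = to_set_env_vars(env)
```

Checking for missing keys before deploying is cheap insurance: a blank credential is easier to catch locally than in Cloud Run logs.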

I deployed the app manually using:

```bash
cd cloud/playback
./deploy.sh
```

I automated ingestion by setting up a **Cloud Scheduler** job to hit the app endpoint every minute.

The data is stored in a **BigQuery** table with fields like `track`, `artist`, `genre`, and `popularity`. See `create_bigquery_table.py` for schema setup.
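
For a sense of what such a schema might look like, here is a sketch using the `name` / `type` / `mode` layout of BigQuery's JSON schema format, plus a small pre-insert check. Only `track`, `artist`, `genre`, and `popularity` come from this README; the `polled_at` column, the types and modes, and the `validate_row` helper are my assumptions. The real definition lives in `create_bigquery_table.py`.

```python
# Assumed schema; only track, artist, genre, and popularity are named in the README.
SCHEMA = [
    {"name": "polled_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "track", "type": "STRING", "mode": "REQUIRED"},
    {"name": "artist", "type": "STRING", "mode": "REQUIRED"},
    {"name": "genre", "type": "STRING", "mode": "NULLABLE"},
    {"name": "popularity", "type": "INTEGER", "mode": "NULLABLE"},
]

# Python types accepted for each column type (timestamps as ISO 8601 strings).
PYTHON_TYPES = {"STRING": str, "INTEGER": int, "TIMESTAMP": str}


def validate_row(row, schema=SCHEMA):
    """Return a list of problems found in `row`; an empty list means it fits."""
    problems = []
    known = {field["name"] for field in schema}
    for field in schema:
        value = row.get(field["name"])
        if value is None:
            if field["mode"] == "REQUIRED":
                problems.append(f"missing required field: {field['name']}")
            continue
        if not isinstance(value, PYTHON_TYPES[field["type"]]):
            problems.append(f"wrong type for {field['name']}")
    problems.extend(f"unknown field: {name}" for name in row if name not in known)
    return problems


good = {
    "polled_at": "2025-01-01T09:00:00+00:00",
    "track": "Example Song",
    "artist": "Artist One",
    "popularity": 55,
}
bad = {"polled_at": "2025-01-01T09:00:00+00:00", "artist": "Artist One", "popularity": "high"}

good_problems = validate_row(good)
bad_problems = validate_row(bad)
```

Validating rows before streaming them makes schema mismatches show up as readable messages rather than failed inserts.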

## Next Steps

Below are my next steps:
- Automate CI/CD deployment via GitHub Actions
- Finalize my Looker Studio Dashboard that automatically retrieves data from my BigQuery tables
- Analyze stats using BigQuery SQL queries
