{"id":28151687,"url":"https://github.com/khdevops/spotify_data_pipeline","last_synced_at":"2025-10-28T02:39:07.584Z","repository":{"id":261530466,"uuid":"874160524","full_name":"KHDevOps/spotify_data_pipeline","owner":"KHDevOps","description":"Spotify Top 50 Data Pipeline","archived":false,"fork":false,"pushed_at":"2024-11-07T09:14:54.000Z","size":987,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-15T04:13:01.354Z","etag":null,"topics":["apache-airflow","bash","bigquery","docker","git","google-cloud-platform","pandas","python","sql","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KHDevOps.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-17T11:01:06.000Z","updated_at":"2024-11-10T14:51:57.000Z","dependencies_parsed_at":"2025-05-06T14:45:31.196Z","dependency_job_id":"9177ba76-ab9d-4494-a41b-d0d93e209a44","html_url":"https://github.com/KHDevOps/spotify_data_pipeline","commit_stats":null,"previous_names":["leomendoza13/spotify_data_pipeline","khdevops/spotify_data_pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/KHDevOps/spotify_data_pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KHDevOps%2Fspotify_data_pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KHDevOps%2Fspotify_data_pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/KHDevOps%2Fspotify_data_pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KHDevOps%2Fspotify_data_pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KHDevOps","download_url":"https://codeload.github.com/KHDevOps/spotify_data_pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KHDevOps%2Fspotify_data_pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281375304,"owners_count":26490213,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-28T02:00:06.022Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","bash","bigquery","docker","git","google-cloud-platform","pandas","python","sql","terraform"],"created_at":"2025-05-15T04:12:59.744Z","updated_at":"2025-10-28T02:39:07.570Z","avatar_url":"https://github.com/KHDevOps.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spotify Top 50 Data Pipeline Project\n\nThis project automates the extraction, processing, and loading of Spotify playlist data for global rankings using Google Cloud services and Terraform for infrastructure deployment.\n\n## Project Overview\n\nThis pipeline:\n\n1. Extracts Spotify data (Top 50 songs by country) via the Spotify API.\n2. 
Processes the data and organizes it into separate tables (tracks, albums, artists, etc.).\n3. Loads the processed data into Google Cloud Storage and BigQuery for analysis.\n\n## Architecture Overview\n\nA high-level view of the data flow:\n\n![Spotify Data Pipeline Architecture](assets/Spotify-pipeline.svg) \n\n- **Compute Engine** runs the Airflow instance that triggers Spotify data extraction and processing tasks.\n- **Google Cloud Storage** holds raw and processed data.\n- **BigQuery** is used to store and analyze Spotify data in structured tables.\n\n## Project Structure\n\n```\n.\n├── .gitignore\n├── LICENSE\n├── README.md\n├── assets\n│   └── Spotify-pipeline.svg\n├── config\n│   ├── __init__.py\n│   ├── config.py\n│   └── spotify_api_ids.json\n├── dags\n│   ├── extraction.py\n│   └── process_load.py\n├── terraform\n│   ├── bigquery.tf\n│   ├── compute_instance.tf\n│   ├── example.tfvars\n│   ├── main.tf\n│   ├── provider.tf\n│   ├── scripts\n│   │   └── startup-script.sh\n│   ├── service_account.tf\n│   ├── storage.tf\n│   └── variables.tf\n└── utils\n    ├── __init__.py\n    ├── extraction_utils.py\n    └── process_load_utils.py\n```\n\n## Prerequisites\n\n- **Google Cloud Platform** with API access to Storage, BigQuery, and Compute Engine.\n- **gcloud CLI** installed on your machine.\n- **Spotify Developer Account** with access to client credentials.\n- **Terraform** installed on your machine.\n\n## Setup Instructions\n\n### Step 1: Clone the Repository\n\nClone the repository and navigate to the project folder:\n\n```bash\ngit clone git@github.com:Leomendoza13/spotify_data_pipeline\ncd spotify_data_pipeline\n```\n\n### Step 2: Create your new project on Google Cloud Platform Console\n\n1. Create a [Google Cloud Platform Account](https://console.cloud.google.com/) if you haven’t already. New users get a free 3-month trial.\n\n2. Go to your [console](https://console.cloud.google.com/) and create a new project using the \"Create Project\" button.\n\n3. 
Go to the **Compute Engine** tab and enable the **Compute Engine API**. Repeat this for the **BigQuery API** so that both services are enabled.\n\n### Step 3: Configure GCloud CLI\n\n1. Install [gcloud CLI](https://cloud.google.com/sdk/docs/install) if it’s not already installed.\n\n2. Connect to your Google Cloud account and authenticate:\n\n```bash\ngcloud auth application-default login\n```\n\nThis prints a URL in your terminal. Open it and log in to your Google Cloud account.\n\n3. Set the project ID:\n\n```bash\ngcloud config set project [PROJECT_ID]\n```\n\n### Step 4: Configure Spotify Credentials\n\n1. Create an account on the [Spotify API](https://developer.spotify.com/) if needed, and get your Spotify client credentials.\n\n2. Open `config/spotify_api_ids.json` and replace `\"your_spotify_client_id\"` and `\"your_spotify_client_secret\"` with your actual Spotify client credentials:\n\n```bash\ncat config/spotify_api_ids.json\n```\n\nNote: These credentials are sensitive and should be kept secure. Do not share or commit them publicly.\n\n### Step 5: Configure Terraform Variables\n\n1. Install [Terraform](https://developer.hashicorp.com/terraform/tutorials/gcp-get-started/install-cli) if it’s not already installed.\n\n2. Create a `terraform.tfvars` file based on `example.tfvars`:\n\n```bash\ncp terraform/example.tfvars terraform/terraform.tfvars\n```\n\n3. Edit `terraform/terraform.tfvars` to add your specific values:\n\n```\nproject_id       = \"your-project-id\"  \nssh_user         = \"your-ssh-username\"  \nssh_pub_key_path = \"~/.ssh/id_rsa.pub\"  \nsource_folder    = \"../dags/\"  \nids_path         = \"../config/\"\n```\n\n### Step 6: Deploy the Infrastructure\n\nNavigate to the `terraform` folder, initialize Terraform, and apply the configuration:\n\n```bash\ncd terraform  \nterraform init  \nterraform apply\n```\n\nConfirm the resources to be deployed. 
This command will set up:\n\n- A Compute Engine instance to run the extraction and processing scripts.\n- A Google Cloud Storage bucket for storing raw and processed data.\n- BigQuery tables for storing and analyzing Spotify data.\n\n### Step 7: Actions After `terraform apply`\n\nAfter running the `terraform apply` command, your Google Cloud infrastructure is fully set up to support the extraction, processing, and loading of Spotify data. Here’s an overview of what happens next:\n\n1. **Infrastructure Setup**: Once `terraform apply` completes, the following resources are created:\n   - A Compute Engine instance is deployed to run the data extraction and transformation scripts.\n   - A Google Cloud Storage bucket is configured to store both raw data extracted from Spotify and processed data.\n   - BigQuery tables are created to organize and analyze Spotify data.\n\n2. **Loading Scripts and Credentials**: Using the configurations specified in `terraform.tfvars`, the folders containing your Airflow DAGs (`dags/`) and your Spotify credentials (`config/`) are copied to the Compute Engine instance.\n\n3. **Airflow Initialization**: The Compute Engine instance is configured to start Airflow and automatically load the DAGs in the `dags/` folder. Airflow is now set up to periodically trigger the Spotify data extraction and processing workflows.\n\n4. **Pipeline Triggering**: Airflow initiates the pipeline automatically according to the schedule. After Airflow starts, it may take around 5 minutes for the pipeline to fully initialize and begin processing data.\n\n5. **BigQuery Analysis**: Once the data is loaded, you can use BigQuery to query and analyze the Spotify data. 
For example, you can create visualizations or reports on music popularity trends by country and over time.\n\n### **Step 8: ⚠️ DON'T FORGET TO `terraform destroy` WHEN IT IS DONE ⚠️**\n\n```bash\nterraform destroy\n```\n\nRunning `terraform destroy` is essential after you’re done to prevent unnecessary costs. Google Cloud resources like Compute Engine instances and BigQuery storage incur charges as long as they’re active. By running `terraform destroy`, you ensure that all deployed resources are deleted, helping to avoid unexpected expenses.\n\n### Usage\n\n- **Extract and Load Data**: Use Airflow or a similar task orchestrator to trigger the DAGs in `dags/` for periodic data extraction and loading.\n- **Analyze Data in BigQuery**: Use BigQuery SQL queries to analyze top Spotify tracks across countries. Here's a sample query to get started:\n\n```sql\nSELECT\n    ft.track_name,\n    da.artist_name,\n    dp.playlist_name,\n    ft.position\nFROM\n    `your_project_id.spotify_country_rankings.top_tracks` ft\nJOIN `your_project_id.spotify_country_rankings.artists` da ON ft.artist_id = da.artist_id\nJOIN `your_project_id.spotify_country_rankings.playlists` dp ON ft.playlist_id = dp.playlist_id\nWHERE\n    dp.playlist_name = 'Usa'\nORDER BY\n    ft.position ASC\n```\n\n### Contributing\n\nContributions to this project are welcome! By submitting a pull request, contributors agree to license their work under the same MIT License.\n\n### License\n\nThis project is licensed under the MIT License. 
See the LICENSE file for more details.\n\n### Author\n\nThis project was created and developed by me :) **Léo Mendoza**.\n\nFeel free to reach out for questions, contributions, or feedback at [leo.mendoza@epita.com](mailto:leo.mendoza@epita.com).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkhdevops%2Fspotify_data_pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkhdevops%2Fspotify_data_pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkhdevops%2Fspotify_data_pipeline/lists"}