{"id":16085687,"url":"https://github.com/iamraphson/de-2024-project-spotify","last_synced_at":"2025-10-16T00:41:25.011Z","repository":{"id":229051527,"uuid":"775598346","full_name":"iamraphson/DE-2024-project-spotify","owner":"iamraphson","description":"FInal project for data zoom camp 2024","archived":false,"fork":false,"pushed_at":"2024-03-31T20:07:13.000Z","size":2199,"stargazers_count":19,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T17:39:19.799Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iamraphson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-21T17:25:50.000Z","updated_at":"2025-03-10T07:42:38.000Z","dependencies_parsed_at":"2024-03-31T20:42:49.912Z","dependency_job_id":null,"html_url":"https://github.com/iamraphson/DE-2024-project-spotify","commit_stats":null,"previous_names":["iamraphson/de-2024-project-spotify"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamraphson%2FDE-2024-project-spotify","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamraphson%2FDE-2024-project-spotify/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamraphson%2FDE-2024-project-spotify/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iamraphson%2FDE-2024-project-spotify/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/iamraphson","download_url":"https://codeload.github.com/iamraphson/DE-2024-project-spotify/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244169220,"owners_count":20409682,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-09T13:08:57.061Z","updated_at":"2025-10-16T00:41:24.943Z","avatar_url":"https://github.com/iamraphson.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Pipeline Project For Some Spotify Data\n\n\u003cdetails\u003e\n    \u003csummary\u003eTable of Contents\u003c/summary\u003e\n    \u003col\u003e\n        \u003cli\u003e\n            \u003ca href=\"#introduction\"\u003eIntroduction\u003c/a\u003e\n            \u003cul\u003e\n                \u003cli\u003e\u003ca href=\"#built-with\"\u003eBuilt With\u003c/a\u003e\u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/li\u003e\n        \u003cli\u003e\n            \u003ca href=\"#project-architecture\"\u003eProject Architecture\u003c/a\u003e\n        \u003c/li\u003e\n        \u003cli\u003e\n             \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e\n             \u003cul\u003e\n                \u003cli\u003e\n                \u003ca href=\"#create-a-google-cloud-project\"\u003eCreate a Google Cloud Project\u003c/a\u003e\n                \u003c/li\u003e\n                \u003cli\u003e\n                \u003ca href=\"#set-up-kaggle\"\u003eSet up Kaggle\u003c/a\u003e\n                \u003c/li\u003e\n                
\u003cli\u003e\n                    \u003ca href=\"#set-up-the-infrastructure-on-gcp-with-terraform\"\u003eSet up the infrastructure on GCP with Terraform\u003c/a\u003e\n                \u003c/li\u003e\n                \u003cli\u003e\n                    \u003ca href=\"#set-up-airflow-and-metabase\"\u003eSet up Airflow and Metabase\u003c/a\u003e\n                \u003c/li\u003e\n            \u003c/ul\u003e\n        \u003c/li\u003e\n        \u003cli\u003e\n            \u003ca href=\"#data-ingestion\"\u003eData Ingestion\u003c/a\u003e\n        \u003c/li\u003e\n        \u003cli\u003e\n            \u003ca href=\"#data-transformation\"\u003eData Transformation\u003c/a\u003e\n        \u003c/li\u003e\n        \u003cli\u003e\n            \u003ca href=\"#data-visualization\"\u003eData Visualization\u003c/a\u003e\n        \u003c/li\u003e\n        \u003cli\u003e\n            \u003ca href=\"#contact\"\u003eContact\u003c/a\u003e\n        \u003c/li\u003e\n         \u003cli\u003e\n            \u003ca href=\"#acknowledgments\"\u003eAcknowledgments\u003c/a\u003e\n        \u003c/li\u003e\n    \u003c/ol\u003e\n\u003c/details\u003e\n\n## Introduction\n\nThis is my final project for the [2024 Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). I built a data pipeline that loads and processes 2023 Spotify data from a Kaggle dataset. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/tonygordonjr/spotify-dataset-2023).\n\nThe dataset is an API extraction covering artists and their music: genres, albums, tracks, and audio features. 
The dataset consists of four files: Album.csv, which contains information about the albums created by the artists; Artist.csv, which provides comprehensive details about the artists; Feature.csv, which includes the audio features of the tracks; and Track.csv, which contains details about the tracks and their popularity. For more information, please see the dataset page on Kaggle.\n\nWhat is the objective of this project? The objective is to construct a data pipeline that retrieves, stores, cleans, and presents the data through a straightforward dashboard for visualization. With this objective in mind, we can analyze various aspects such as the types of albums preferred by certain artists, the number of track releases over the years, albums with a high number of tracks, artists with a significant volume of tracks, and the audio features of all tracks in the dataset, among others.\n\n### Built With\n\n- Dataset repo: [Kaggle](https://www.kaggle.com)\n- Infrastructure as Code: [Terraform](https://www.terraform.io/)\n- Workflow Orchestration: [Airflow](https://airflow.apache.org)\n- Data Lake: [Google Cloud Storage](https://cloud.google.com/storage)\n- Data Warehouse: [Google BigQuery](https://cloud.google.com/bigquery)\n- Transformation: [DBT](https://www.getdbt.com/)\n- Visualisation: [Metabase](https://www.metabase.com/)\n- Programming Language: Python and SQL\n\n## Project Architecture\n\n![architecture](./screenshots/architecture.png)\n\nThe cloud infrastructure is provisioned with Terraform, while Airflow runs in a local Docker container.\n\n## Getting Started\n\n### Prerequisites\n\n1. A [Google Cloud Platform](https://cloud.google.com/) account.\n2. A [Kaggle](https://www.kaggle.com/) account.\n3. Install VSCode or [Zed](https://zed.dev/) or any other IDE that works for you.\n4. [Install Terraform](https://www.terraform.io/downloads)\n5. [Install Docker Desktop](https://docs.docker.com/get-docker/)\n6. [Install Google Cloud SDK](https://cloud.google.com/sdk)\n7. 
Clone this repository onto your local machine.\n\n### Create a Google Cloud Project\n\n- Go to [Google Cloud](https://console.cloud.google.com/) and create a new project.\n- Retrieve the project ID and define the environment variable `GCP_PROJECT_ID` in the .env file located in the root directory.\n- Create a [Service account](https://cloud.google.com/iam/docs/service-account-overview) with the following roles:\n  - `BigQuery Admin`\n  - `Storage Admin`\n  - `Storage Object Admin`\n  - `Viewer`\n- Download the Service Account credentials and store them in `$HOME/.google/credentials/`.\n- Enable the following APIs [here](https://console.cloud.google.com/apis/library/browse):\n  - Cloud Storage API\n  - BigQuery API\n- Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your JSON credentials file, that is, `$HOME/.google/credentials/\u003cauthkeys_filename\u003e.json`.\n  - Add this line to the end of your `.bashrc` file, replacing the placeholder with the name of your key file:\n  ```bash\n  export GOOGLE_APPLICATION_CREDENTIALS=\"${HOME}/.google/credentials/\u003cauthkeys_filename\u003e.json\"\n  ```\n  - Load the environment variable by running `source ~/.bashrc`.\n\n### Set up Kaggle\n\n- A detailed description of how to authenticate is available [here](https://www.kaggle.com/docs/api).\n- Specify the environment variables `KAGGLE_USERNAME` and `KAGGLE_TOKEN` in the .env file situated in the root directory. 
Please note that `KAGGLE_TOKEN` is interchangeable with `KAGGLE_KEY`.\n\n### Set up the infrastructure on GCP with Terraform\n\n- Using either Zed or VSCode, open the cloned project `DE-2024-project-spotify`.\n- To change the default values of `variable \"project\"` and `variable \"region\"` to your preferred project ID and region, you have two options: either edit the values directly in the variables.tf file in the terraform folder, or set the environment variables `TF_VAR_project` and `TF_VAR_region`.\n- Open the terminal and navigate to the root directory of the project.\n- Change the directory to the terraform folder by running the command `cd terraform`.\n- Set an alias with the command `alias tf='terraform'`.\n- Initialise Terraform by executing `tf init`.\n- Plan the infrastructure using `tf plan`.\n- Apply the changes with `tf apply`.\n\n### Set up Airflow and Metabase\n\n- Please confirm that the following environment variables are configured in `.env` in the root directory of the project.\n  - `AIRFLOW_UID`. You can run `echo -e \"AIRFLOW_UID=$(id -u)\" \u003e\u003e .env` on your CLI; `\u003e\u003e` appends, so variables already defined in `.env` are kept.\n  - `KAGGLE_USERNAME`. Set in the [Set up Kaggle](#set-up-kaggle) section.\n  - `KAGGLE_TOKEN`. Also set in the [Set up Kaggle](#set-up-kaggle) section.\n  - `GCP_PROJECT_ID`. Set in the [Create a Google Cloud Project](#create-a-google-cloud-project) section.\n  - `GCP_SPOTIFY_BUCKET=spotify_project_datalake_\u003cGCP project id\u003e`\n  - `GCP_SPOTIFY_WH_DATASET=spotify_warehouse`\n  - `GCP_SPOTIFY_WH_EXT_DATASET=spotify_warehouse_ext`\n- Run `docker-compose up`.\n- Access the Airflow dashboard by navigating to `http://localhost:8080/` in your web browser. The interface will appear similar to the following screenshot. Log in using `airflow` as both the username and the password.\n\n![Airflow](./screenshots/airflow_home.png)\n\n- To access the Metabase dashboard, open your web browser and visit `http://localhost:1460`. 
The interface will look similar to the following screenshot. You will need to sign up to use the UI.\n\n![Metabase](./screenshots/metabase_home.png)\n\n## Data Ingestion\n\nOnce you've completed all the steps outlined in the previous section, you should be able to view the Airflow dashboard in your web browser. The screenshot below shows the list of available DAGs.\n![DAGS](./screenshots/airflow_index.png)\nBelow is the DAG's graph.\n![DAG Graph](./screenshots/airflow_graph.png)\nTo run the DAG, click the play button.\n![Run Graph](./screenshots/airflow_run.png)\n\n## Data Transformation\n\n- Navigate to the root directory of the project in the terminal, and then change the directory to the \"data_dbt\" folder using the command `cd data_dbt`.\n- Create a \"profiles.yml\" file within `${HOME}/.dbt`, and define a profile for this project as shown below.\n\n```yaml\ndata_dbt_spotify:\n  outputs:\n    dev:\n      dataset: spotify_warehouse\n      fixed_retries: 1\n      keyfile: \u003clocation_google_auth_key\u003e\n      location: \u003cpreferred project region\u003e\n      method: service-account\n      priority: interactive\n      project: \u003cpreferred project id\u003e\n      threads: 6\n      timeout_seconds: 300\n      type: bigquery\n  target: dev\n```\n\n- To run all models, run `dbt run -t dev`.\n- Open your project in Google [BigQuery](https://console.cloud.google.com/bigquery). There, you'll find all the tables and views created by DBT.\n  ![Big Query Schema](./screenshots/BigQuery.png)\n\n## Data Visualization\n\nPlease watch the [provided video tutorial](https://youtu.be/BnLkrA7a6gM\u0026) for guidance on configuring your Metabase database connection with BigQuery. You can customize your dashboard to suit your preferences. 
Additionally, you can view the complete screenshot of the dashboard I created in this [PDF](./screenshots/DE_2024_spotify_1.pdf).\n\n![Dashboard](./screenshots/dashboard_1.png)\n\n![Dashboard](./screenshots/dashboard_2.png)\n\n## Contact\n\nTwitter: [@iamraphson](https://twitter.com/iamraphson)\n\n## Acknowledgments\n\nI want to express my deepest appreciation to the organizers, especially [Alex](https://www.linkedin.com/in/agrigorev/), for offering the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) course. It has been an incredibly valuable learning experience for me.\n\n🦅\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamraphson%2Fde-2024-project-spotify","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiamraphson%2Fde-2024-project-spotify","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamraphson%2Fde-2024-project-spotify/lists"}