{"id":22148530,"url":"https://github.com/mitgar14/etl-workshop-2","last_synced_at":"2025-03-24T12:42:37.876Z","repository":{"id":261678526,"uuid":"854315033","full_name":"mitgar14/etl-workshop-2","owner":"mitgar14","description":"Workshop #2 (ETL process using Airflow) for the ETL course using Apache Airflow to build a data pipeline.","archived":false,"fork":false,"pushed_at":"2024-11-07T20:05:28.000Z","size":7951,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-29T17:44:54.660Z","etag":null,"topics":["airflow","data-engineer","data-engineering","data-visualization","etl","pandas","postgresql","powerbi","python","sqlalchemy"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mitgar14.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-08T23:55:55.000Z","updated_at":"2024-11-07T18:36:17.000Z","dependencies_parsed_at":"2024-11-07T21:33:00.706Z","dependency_job_id":null,"html_url":"https://github.com/mitgar14/etl-workshop-2","commit_stats":null,"previous_names":["mitgar14/etl-workshop-2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-2/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mitgar14","download_url":"https://codeload.github.com/mitgar14/etl-workshop-2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245274517,"owners_count":20588801,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","data-engineer","data-engineering","data-visualization","etl","pandas","postgresql","powerbi","python","sqlalchemy"],"created_at":"2024-12-01T23:28:18.265Z","updated_at":"2025-03-24T12:42:37.855Z","avatar_url":"https://github.com/mitgar14.png","language":"Jupyter Notebook","readme":"# Workshop #2: Data Engineer \u003cimg src=\"https://cdn-icons-png.flaticon.com/512/8618/8618924.png\" alt=\"Data Icon\" width=\"30px\"/\u003e\n\nMade by **Martín García** ([@mitgar14](https://github.com/mitgar14)).\n\n## Overview ✨\n\u003e [!NOTE]\n\u003e The raw Grammys dataset must be stored in a database in order to be read correctly.\n\nIn this workshop we will use two datasets (*spotify_dataset* and *the_grammy_awards*) that will be processed through Apache Airflow, applying data cleaning, transformation, loading, and storage, including a merge of both datasets. 
The result culminates in a dashboard whose visualizations yield important conclusions about the data.\n\nThe tools used are:\n\n* Python 3.10 ➜ [Download site](https://www.python.org/downloads/)\n* Jupyter Notebook ➜ [VS Code tool for using notebooks](https://youtu.be/ZYat1is07VI?si=BMHUgk7XrJQksTkt)\n* PostgreSQL ➜ [Download site](https://www.postgresql.org/download/)\n* Power BI (Desktop version) ➜ [Download site](https://www.microsoft.com/es-es/power-platform/products/power-bi/desktop)\n\n\u003e [!WARNING]\n\u003e Apache Airflow only runs correctly in Linux environments. If you have Windows, we recommend using a virtual machine or WSL.\n\nThe Python dependencies needed are:\n\n* Apache Airflow\n* Dotenv\n* Pandas\n* Matplotlib\n* Seaborn\n* SQLAlchemy\n* PyDrive2\n\nThese dependencies are included in the `requirements.txt` file of the Python project. The step-by-step installation will be described later.\n\n## Dataset Information \u003cimg src=\"https://github.com/user-attachments/assets/5fa5298c-e359-4ef1-976d-b6132e8bda9a\" alt=\"Dataset\" width=\"30px\"/\u003e\n\n\nThe datasets used (*spotify_dataset* and *the_grammy_awards*) are crucial for analyzing music trends, comparing track features, and understanding the relationship between track characteristics and award recognition.\n\nHere’s an overview of the two datasets that were provided:\n\n### 1. **Spotify Dataset** (`spotify_dataset.csv`) \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/Spotify.png/1200px-Spotify.png\" alt=\"Spotify\" width=\"22px\"/\u003e\n\nThis dataset contains a wide variety of information about songs available on Spotify. Each row represents a single track with multiple attributes describing both the track's metadata and musical characteristics. 
The most important columns are:\n\n- **Unnamed: 0**: Acts as an index for the dataset.\n- **track_id**: A unique identifier for each track on Spotify.\n- **artists**: Name(s) of the artist(s) associated with the track.\n- **album_name**: The name of the album the track is from.\n- **track_name**: The title of the track.\n- **popularity**: A score between 0 and 100 indicating the popularity of the track on Spotify, where higher values mean more popularity.\n- **duration_ms**: The duration of the track in milliseconds.\n- **danceability**: A measure of how suitable a track is for dancing, where higher values indicate better danceability.\n- **energy**: A measure of intensity and activity in the track.\n- **key**: The musical key of the track (0 = C, 1 = C#, etc.).\n- **loudness**: The overall loudness of the track in decibels.\n- **mode**: Whether the track is in a major (1) or minor (0) mode.\n- **explicit**: Indicates if the track contains explicit content (True/False).\n- **tempo**: The speed of the track measured in beats per minute (BPM).\n- **valence**: A measure of the musical positiveness of the track.\n- **time_signature**: An estimated overall time signature of a track.\n- **track_genre**: The genre associated with the track.\n\n### 2. **Grammy Awards Dataset** (`the_grammy_awards.csv`) \u003cimg src=\"https://www.pngall.com/wp-content/uploads/9/Grammy-Awards-PNG-Download-Image.png\" alt=\"Grammys\" width=\"22px\"/\u003e\n\nThis dataset contains information about Grammy Awards, with each row representing a nomination for a particular award. 
Key columns include:\n\n- **year**: The year the Grammy Awards took place.\n- **title**: The name of the Grammy event.\n- **published_at**: The date when the Grammy event details were published.\n- **category**: The category of the Grammy award (e.g., Record Of The Year, Best Pop Solo Performance).\n- **nominee**: The name of the nominated song or album.\n- **artist**: The artist(s) associated with the nominated song or album.\n- **workers**: Contributors (such as producers, engineers) involved in the nominated work.\n- **img**: URL linking to the image of the Grammy event or nominee.\n- **winner**: A boolean indicating whether the nominee won the award (True/False).\n\n## Data flow \u003cimg src=\"https://cdn-icons-png.flaticon.com/512/1953/1953319.png\" alt=\"Data flow\" width=\"22px\"/\u003e\n\n![Data flow](https://github.com/user-attachments/assets/6e9d34c0-8611-4f1a-b283-87029d2621da)\n\n## Run the project \u003cimg src=\"https://github.com/user-attachments/assets/99bffef1-2692-4cb8-ba13-d6c8c987c6dd\" alt=\"Running code\" width=\"30px\"/\u003e\n\n\n### 🛠️ Clone the repository\n\nExecute the following command to clone the repository:\n\n```bash\ngit clone https://github.com/mitgar14/etl-workshop-2.git\n```\n\n#### Demonstration of the process\n\n![git clone](https://github.com/user-attachments/assets/b1b6c169-1935-4683-832f-87d627163928)\n\n---\n\n### 🔐 Generate your Google Drive Auth file (`client_secrets.json`)\n\n* To learn how to generate a `client_secrets.json` file, [you can follow this guide](https://github.com/mitgar14/etl-workshop-2/blob/main/docs/guides/drive_api.md). 
This guide explains step by step how to generate the authentication key to use the Google Drive API via PyDrive2 in your *Store* script.\n\n* In case you receive an **error 400 - redirect_uri_mismatch**, [see this page](https://elcuaderno.notion.site/Solucionado-Acceso-bloqueado-La-solicitud-de-esta-app-no-es-v-lida-Google-Drive-API-106a9368866a8037b597ecdec3346405?pvs=4).\n\n---\n\n### ⚙️ Configure PyDrive2 (`settings.yaml`)\n\nTo properly configure this project and ensure it works as expected, please follow the detailed instructions provided in the **PyDrive2 configuration guide**. This guide walks you through setting up the necessary variables, OAuth credentials, and project settings for Google Drive API integration using PyDrive2.\n\n* You will configure your `settings.yaml` file for authentication and authorization. [You can find the step-by-step guide here](https://github.com/mitgar14/etl-workshop-2/blob/main/docs/guides/drive_settings.md).\n\n---\n\n### 🌍 Environment variables\n\n\u003e [!IMPORTANT]\n\u003e Remember that you must use absolute paths.\n\nThis project uses environment variables stored in a file named ***.env***. To create this file:\n\n1. Create a directory named ***env*** inside the cloned repository.\n\n2. Inside it, create a file called ***.env***.\n\n3. In that file, declare the following environment variables. Note that most values go without double quotes, i.e. the string notation (`\"`). 
Only the absolute paths use this notation:\n  ```python\n  # PostgreSQL Variables\n  \n  # PG_HOST: Specifies the hostname or IP address of the PostgreSQL server.\n  PG_HOST = # db-server.example.com\n  \n  # PG_PORT: Defines the port used to connect to the PostgreSQL database.\n  PG_PORT = # 5432 (default PostgreSQL port)\n  \n  # PG_USER: The username for authenticating with the PostgreSQL database.\n  PG_USER = # your-postgresql-username\n  \n  # PG_PASSWORD: The password for authenticating with the PostgreSQL database.\n  PG_PASSWORD = # your-postgresql-password\n  \n  # PG_DATABASE: The name of the PostgreSQL database to connect to.\n  PG_DATABASE = # your-database-name\n  \n  # Google Drive Variables\n  \n  # CLIENT_SECRETS_PATH: Path to the client secrets file used for Google Drive authentication.\n  CLIENT_SECRETS_PATH = \"/path/to/your/credentials/client_secrets.json\"\n  \n  # SETTINGS_PATH: Path to the settings file for the application configuration.\n  SETTINGS_PATH = \"/path/to/your/env/settings.yaml\"\n  \n  # SAVED_CREDENTIALS_PATH: Path to the file where Google Drive saved credentials are stored.\n  SAVED_CREDENTIALS_PATH = \"/path/to/your/credentials/saved_credentials.json\"\n\n  # FOLDER_ID: The ID of your Google Drive folder. You can get it from the link in your folder.\n  FOLDER_ID = # your-drive-folder-id\n  ```\n\n#### Demonstration of the process\n\n![env variables](https://github.com/user-attachments/assets/1ace0df1-3313-4e59-b73b-8f5b280dbaed)\n\n---\n\n### 🐍 Creating the virtual environment\n\nTo install the dependencies, first create a Python virtual environment by running the following command:\n\n```bash\npython3 -m venv venv\n```\n\nOnce created, run this command to activate the environment. 
It is important that you are inside the project directory:\n\n```bash\nsource venv/bin/activate\n```\n\n#### Demonstration of the process\n\n![activate environment](https://github.com/user-attachments/assets/e9a8eab0-0e6a-4093-8992-aaa6f6abff6c)\n\n---\n\n### 📦 Installing the dependencies with *pip*\n\nOnce inside the virtual environment, execute `pip install -r requirements.txt` to install the dependencies. After that, you can execute both the notebooks and the Airflow pipeline.\n\n#### Demonstration of the process\n\n![pip install](https://github.com/user-attachments/assets/99ab96f9-4782-46b5-80ec-d5653bb0103d)\n\n---\n\n### 📔 Running the notebooks\n\nBefore anything else, it's necessary to **execute the *00-grammy_raw_load* notebook**, which loads the Grammy Awards dataset into a PostgreSQL database.\n\nAfter that notebook has run, execute the others in the following order. Remember that you can run all the cells in a notebook using the “Run All” button:\n\n   1. *01-EDA_Spotify.ipynb*\n   2. *02-EDA_Grammys.ipynb*\n   3. *03-data_pipeline.ipynb*\n\n![Run all](https://github.com/user-attachments/assets/23855432-fe8f-49ca-9cac-6175b5ba84de)\n  \nRemember to choose **the right Python kernel** when running the notebooks.\n\n![Python kernel](https://github.com/user-attachments/assets/b22bc16d-028a-4b0d-8565-7dde8434d7bf)\n\n---\n\n### ☁ Deploy the Database at a Cloud Provider\n\nTo perform the Airflow tasks related to Data Extraction and Loading, we recommend **making use of a cloud database service**. 
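Wherever the database lives, the scripts reach it through SQLAlchemy using the `PG_*` variables from your `.env` file. As a minimal sketch (the helper name and default values below are illustrative, not part of the repo's code):

```python
import os

# Illustrative helper (not from the repo): assemble a SQLAlchemy
# connection URL from the PG_* variables declared in the .env file.
def build_pg_url(host, port, user, password, database):
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

# In the actual scripts these values come from the environment,
# e.g. after `from dotenv import load_dotenv; load_dotenv("env/.env")`.
# The fallback defaults here are placeholders for the sketch.
url = build_pg_url(
    os.getenv("PG_HOST", "localhost"),
    os.getenv("PG_PORT", "5432"),
    os.getenv("PG_USER", "postgres"),
    os.getenv("PG_PASSWORD", "secret"),
    os.getenv("PG_DATABASE", "grammys"),
)
# The resulting URL is what you would pass to sqlalchemy.create_engine(url).
```

Note that if the password contains special characters, it must be URL-escaped (e.g. with `urllib.parse.quote_plus`) before being embedded in the URL.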
Here are some guidelines for deploying your database in the cloud:\n\n* [Microsoft Azure - Guide](https://github.com/mitgar14/etl-workshop-2/blob/main/docs/guides/azure_postgres.md)\n* [Google Cloud Platform (GCP) - Guide](https://github.com/mitgar14/etl-workshop-2/blob/main/docs/guides/gcp_postgres.md)\n\n---\n\n### 🚀 Running the Airflow pipeline\n\nTo run Apache Airflow, first export the `AIRFLOW_HOME` environment variable. It determines the project directory where Airflow will work.\n\n```bash\nexport AIRFLOW_HOME=\"$(pwd)/airflow\"\n```\n\nThen you can run Apache Airflow with the following command:\n\n```bash\nairflow standalone\n```\n\nAllow Apache Airflow to read the modules contained in `src` by setting the absolute path to that directory in the `plugins_folder` variable of the `airflow.cfg` file:\n\n![plugins_path](https://github.com/user-attachments/assets/4b8cd7e0-1648-4c87-bc5d-596e1ac8ec43)\n\n#### Demonstration of the process\n\n\u003e [!IMPORTANT]\n\u003e Open [http://localhost:8080](http://localhost:8080/) to access the Airflow GUI and run the DAG corresponding to the project (*workshop2_dag*).\n\n![airflow](https://github.com/user-attachments/assets/2cea557b-391a-4385-818b-8c3822e00076)\n\n\n## Thank you! 💕\n\nThanks for visiting my project. Any suggestion or contribution is always welcome 🐍.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitgar14%2Fetl-workshop-2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmitgar14%2Fetl-workshop-2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitgar14%2Fetl-workshop-2/lists"}