{"id":20804335,"url":"https://github.com/mitgar14/etl-workshop-3","last_synced_at":"2026-04-16T05:04:43.292Z","repository":{"id":262908118,"uuid":"868515210","full_name":"mitgar14/etl-workshop-3","owner":"mitgar14","description":"Workshop #3 (Machine Learning and Data Streaming) for the ETL course using scikit-learn to develop the ML model and Apache Kafka to manage the data streaming process.","archived":false,"fork":false,"pushed_at":"2024-11-15T23:19:15.000Z","size":2213,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-27T13:30:09.634Z","etag":null,"topics":["data-enginner","data-science","data-streaming","etl","kafka","machine-learning","pandas","python","sklearn","sqlalchemy"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mitgar14.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-06T15:27:07.000Z","updated_at":"2025-02-05T18:22:45.000Z","dependencies_parsed_at":"2025-03-12T03:38:47.389Z","dependency_job_id":null,"html_url":"https://github.com/mitgar14/etl-workshop-3","commit_stats":null,"previous_names":["mitgar14/etl-workshop-3"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mitgar14/etl-workshop-3","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-3/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mitgar14","download_url":"https://codeload.github.com/mitgar14/etl-workshop-3/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitgar14%2Fetl-workshop-3/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31872036,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"online","status_checked_at":"2026-04-16T02:00:06.042Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-enginner","data-science","data-streaming","etl","kafka","machine-learning","pandas","python","sklearn","sqlalchemy"],"created_at":"2024-11-17T19:08:46.522Z","updated_at":"2026-04-16T05:04:43.277Z","avatar_url":"https://github.com/mitgar14.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Workshop #3: Machine Learning and Data Streaming \u003cimg src=\"https://cdn-icons-png.flaticon.com/512/2980/2980560.png\" alt=\"Data Icon\" width=\"30px\"/\u003e\n\nCreated by **Martín García** ([@mitgar14](https://github.com/mitgar14)).\n\n## Overview ✨\n\nIn this workshop, the [World Happiness Report 
dataset](https://www.kaggle.com/datasets/unsdsn/world-happiness) will be used, comprising four CSV files with data from 2015 to 2019. A streaming data pipeline will be implemented using Apache Kafka. Once processed, the data will be fed into a Random Forest regression model to estimate the Happiness Score based on other scores in the dataset. The results will then be uploaded to a database, where the information will be analyzed to assess the accuracy and insights of the predictions.\n\n**The tools used are:**\n\n* Python 3.10 ➜ [Download site](https://www.python.org/downloads/)\n* Jupyter Notebook ➜ [VS Code tool for using notebooks](https://youtu.be/ZYat1is07VI?si=BMHUgk7XrJQksTkt)\n* Docker ➜ [Download site for Docker Desktop](https://www.docker.com/products/docker-desktop/)\n* PostgreSQL ➜ [Download site](https://www.postgresql.org/download/)\n* Power BI (Desktop version) ➜ [Download site](https://www.microsoft.com/es-es/power-platform/products/power-bi/desktop)\n\n---\n\n**The dependencies needed for Python are:**\n\n- python-dotenv\n- kafka-python-ng\n- country-converter\n- pandas\n- matplotlib\n- seaborn\n- plotly\n- nbformat\n- scikit-learn\n- sqlalchemy\n- psycopg2-binary\n\nThese libraries are included in the Poetry project config file ([`pyproject.toml`](https://github.com/mitgar14/etl-workshop-3/blob/main/pyproject.toml)). The step-by-step installation will be described later.\n\n---\n\n**The images used in Docker are:**\n\n* confluentinc/cp-zookeeper\n* confluentinc/cp-kafka\n\nThe configuration and installation of these images are facilitated by the Docker Compose config file ([`docker-compose.yml`](https://github.com/mitgar14/etl-workshop-3/blob/main/docker-compose.yml)). 
The reasons for using these images are explained later.\n\n## Dataset Information \u003cimg src=\"https://github.com/user-attachments/assets/5fa5298c-e359-4ef1-976d-b6132e8bda9a\" alt=\"Dataset\" width=\"30px\"/\u003e\n\nAfter performing several transformations on the data, the columns to be analyzed in this workshop are as follows:\n\n| Column                 | Description                                       | Data Type   |\n|------------------------|---------------------------------------------------|-------------|\n| **country**            | The country name, representing each nation        | Object      |\n| **continent**          | The continent to which each country belongs       | Object      |\n| **year**               | The year the data was recorded                    | Integer     |\n| **economy**            | A measure of each country's economic status       | Float       |\n| **health**             | Health index indicating general well-being        | Float       |\n| **social_support**     | Perceived social support within each country      | Float       |\n| **freedom**            | Citizens' perception of freedom                   | Float       |\n| **corruption_perception** | Level of corruption as perceived by citizens  | Float       |\n| **generosity**         | Level of generosity within the country            | Float       |\n| **happiness_rank**     | Global ranking based on happiness score           | Integer     |\n| **happiness_score**    | Overall happiness score for each country          | Float       |\n\n## Data flow \u003cimg src=\"https://cdn-icons-png.flaticon.com/512/1953/1953319.png\" alt=\"Data flow\" width=\"22px\"/\u003e\n\n![Data flow #3](https://github.com/user-attachments/assets/aad878b6-6955-47de-aa6e-a69a21fa1d6d)\n\n## Run the project \u003cimg src=\"https://github.com/user-attachments/assets/99bffef1-2692-4cb8-ba13-d6c8c987c6dd\" alt=\"Running code\" width=\"30px\"/\u003e\n\n### 🛠️ Clone the 
repository\n\nExecute the following command to clone the repository:\n\n```bash\n  git clone https://github.com/mitgar14/etl-workshop-3.git\n```\n\n#### Demonstration of the process\n\n![git clone](https://github.com/user-attachments/assets/0e717e57-3636-4c97-9093-d41a8e884df4)\n\n---\n\n### 🌍 Environment variables\n\nThis project uses some environment variables stored in a file named ***.env***. To create this file:\n\n1. Create a directory named ***env*** inside the cloned repository.\n\n2. There, create a file called ***.env***.\n\n3. In that file, declare the following 5 environment variables. Remember that some variables in this case go without double quotes, i.e. the string notation (`\"`):\n  ```python\n  # PostgreSQL Variables\n  \n  # PG_HOST: Specifies the hostname or IP address of the PostgreSQL server.\n  PG_HOST = # db-server.example.com\n  \n  # PG_PORT: Defines the port used to connect to the PostgreSQL database.\n  PG_PORT = # 5432 (default PostgreSQL port)\n  \n  # PG_USER: The username for authenticating with the PostgreSQL database.\n  PG_USER = # your-postgresql-username\n  \n  # PG_PASSWORD: The password for authenticating with the PostgreSQL database.\n  PG_PASSWORD = # your-postgresql-password\n  \n  # PG_DATABASE: The name of the PostgreSQL database to connect to.\n  PG_DATABASE = # your-database-name\n  ```\n\n#### Demonstration of the process\n\n![env variables](https://github.com/user-attachments/assets/06c8fd3d-2aea-45ef-8b63-34d261c9ad67)\n\n---\n\n### 📦 Installing the dependencies with *Poetry*\n\n\u003e To install Poetry follow [this link](https://elcuaderno.notion.site/Poetry-8f7b23a0f9f340318bbba4ef36023d60?pvs=4).\n\n1. Enter the Poetry shell with `poetry shell`.\n\n2. Once the virtual environment is created, execute `poetry install` to install the dependencies. If an error related to the *.lock* file occurs, execute `poetry lock` to fix it.\n\n3. 
Now you can execute the notebooks!\n\n#### Demonstration of the process\n\n![poetry](https://github.com/user-attachments/assets/37e64017-e874-478e-8702-b7c9dff3c661)\n\n---\n\n### 📔 Running the notebooks\n\nRun the 3 notebooks in the following order. You can run each one by pressing the \"Execute All\" button:\n\n   1. *01-EDA.ipynb*\n   2. *02-model_training.ipynb*\n   3. *03-metrics.ipynb*\n\n![Running the notebooks](https://github.com/user-attachments/assets/f50a3cc2-90eb-48d3-b452-7bec0e5022c5)\n  \nRemember to choose **the right Python kernel** when running each notebook.\n\n![Python kernel](https://github.com/user-attachments/assets/3b3c57ca-a07e-4a42-aa1d-4fdd9ea8187e)\n\n---\n\n### ☁ Deploy the Database at a Cloud Provider\n\nTo perform the Airflow tasks related to Data Extraction and Loading we recommend **making use of a cloud database service**. Here are some guides for deploying your database in the cloud:\n\n* [Microsoft Azure - Guide](https://github.com/mitgar14/etl-workshop-3/blob/main/docs/guides/azure_postgres.md)\n* [Google Cloud Platform (GCP) - Guide](https://github.com/mitgar14/etl-workshop-3/blob/main/docs/guides/gcp_postgres.md)\n\n---\n\n### 🐳 Run Kafka in Docker\n\n\u003e [!IMPORTANT]\n\u003e Make sure that Docker is installed on your system.\n\nTo set up Kafka using Docker and run your `producer.py` and `consumer.py` scripts located in the `./kafka` directory, follow these steps:\n\n1. 🚀 **Start Kafka and Zookeeper Services**\n\n   Open your terminal or command prompt and navigate to the root directory of your cloned repository:\n\n   ```bash\n   cd etl-workshop-3\n   ```\n\n   Use the provided `docker-compose.yml` file to start the Kafka and Zookeeper services:\n\n   ```bash\n   docker-compose up -d\n   ```\n\n   This command will start the services in detached mode. 
Docker will pull the necessary images if they are not already available locally.\n\n   Check if the Kafka and Zookeeper containers are up and running:\n\n   ```bash\n   docker ps\n   ```\n\n   You should see `kafka_docker` and `zookeeper_docker` in the list of running containers.\n\n   #### Demonstration of the process\n\n   ![docker_1](https://github.com/user-attachments/assets/a6fa6cfd-d880-469f-a7fe-1a5315be0513)\n\n2. 📌 **Create a Kafka Topic**\n\n   Create a Kafka topic that your producer and consumer will use. Make sure to name it `whr_kafka_topic` so that it matches the topic name used by the Python scripts:\n\n   ```bash\n   docker exec -it kafka_docker kafka-topics --create --topic whr_kafka_topic --bootstrap-server localhost:9092\n   ```\n\n   List the available topics to confirm that the `whr_kafka_topic` has been created:\n\n   ```bash\n   docker exec -it kafka_docker kafka-topics --list --bootstrap-server localhost:9092\n   ```\n\n   ![docker_2](https://github.com/user-attachments/assets/7abbd046-24a2-4aeb-8007-e30a2a956c8e)\n\n3. 🏃 **Run the Producer Script**\n\n   In Visual Studio Code, navigate to the `./kafka` directory and run the `producer.py` script **in a dedicated terminal**. The producer will start sending messages to the `whr_kafka_topic`.\n\n   ![docker_kafka_producer](https://github.com/user-attachments/assets/cb368364-f67f-47a8-91d1-ecb1ff89de77)\n\n4. 👂 **Run the Consumer Script**\n\n    Now navigate to the `./kafka` directory, and run the `consumer.py` script **in a dedicated terminal**. You should now see the consumer receiving the messages in real time.\n\n   ![docker_kafka_consumer](https://github.com/user-attachments/assets/a5c264b3-dde2-46c8-a6a4-92784d5e5a89)\n\n5. 🛑 **Shut Down the Services**\n\n    When you're finished, you can stop and remove the Kafka and Zookeeper containers:\n\n    ```bash\n    docker-compose down\n    ```\n    \n![docker_compose_down](https://github.com/user-attachments/assets/7fdfd18b-a220-47f6-bb6e-0cfa0e26e95b)\n    \n## Thank you! 
💕\n\nThanks for visiting my project. Any suggestion or contribution is always welcome 🐍.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitgar14%2Fetl-workshop-3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmitgar14%2Fetl-workshop-3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitgar14%2Fetl-workshop-3/lists"}