{"id":24780601,"url":"https://github.com/silvano315/etl-project-using-api","last_synced_at":"2025-03-24T04:41:03.615Z","repository":{"id":254150875,"uuid":"845497962","full_name":"Silvano315/ETL-project-using-API","owner":"Silvano315","description":"This project involves selecting an API from RapidAPI and using Python to extract data in JSON format. The extracted data is then cleaned and transformed by parsing only the necessary information. Finally, the cleaned data is loaded into an SQL database for further analysis.","archived":false,"fork":false,"pushed_at":"2024-09-08T15:32:07.000Z","size":1386,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-29T10:40:15.487Z","etag":null,"topics":["air-quality","etl","python","rapid-api","sql","sqlite3"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Silvano315.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-21T11:16:42.000Z","updated_at":"2024-09-10T15:28:15.000Z","dependencies_parsed_at":"2024-08-26T19:00:45.426Z","dependency_job_id":"4025eb65-f252-402c-8cad-b43b8e032395","html_url":"https://github.com/Silvano315/ETL-project-using-API","commit_stats":null,"previous_names":["silvano315/etl-project-using-api"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Silvano315%2FETL-project-using-API","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/Silvano315%2FETL-project-using-API/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Silvano315%2FETL-project-using-API/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Silvano315%2FETL-project-using-API/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Silvano315","download_url":"https://codeload.github.com/Silvano315/ETL-project-using-API/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245212112,"owners_count":20578439,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["air-quality","etl","python","rapid-api","sql","sqlite3"],"created_at":"2025-01-29T10:35:19.677Z","updated_at":"2025-03-24T04:41:03.595Z","avatar_url":"https://github.com/Silvano315.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ETL-project-using-API\n\n## Table of Contents\n1. [Introduction](#introduction)\n2. [Data Source](#data-source)\n3. [Python Pipeline for Air Quality Monitoring](#python-pipeline-for-air-quality-monitoring)\n   - [1. Data Extraction](#1-data-extraction)\n   - [2. Data Cleaning](#2-data-cleaning)\n   - [3. Data Transformation](#3-data-transformation)\n   - [4. Data Loading](#4-data-loading)\n   - [5. Data Visualization](#5-data-visualization)\n   - [6. Automation and Scheduling](#6-automation-and-scheduling)\n4. [Requirements](#requirements)\n\n\n## Introduction\n\nETL stands for **Extract, Transform, Load**, which is a process used in data warehousing and analytics pipelines. 
It involves three main stages:\n\n1. **Extract**: Collecting data from various sources, such as databases, APIs, or files.\n2. **Transform**: Cleaning, normalizing, and transforming the extracted data to ensure consistency, quality, and relevance.\n3. **Load**: Loading the transformed data into a data warehouse, database, or data lake for further analysis and visualization.\n\nThe main advantages of the ETL process are:\n- **Data Centralization**: ETL consolidates data from various sources into a single, unified repository, making it easier to analyze and draw insights.\n- **Data Quality**: By transforming and cleaning data, ETL processes help ensure that the data is consistent, accurate, and reliable.\n- **Automation**: ETL pipelines can be scheduled to run at regular intervals, keeping the data up to date without manual intervention.\n\nETL processes are widely used in business intelligence (BI), data warehousing, data integration, and data migration.\n\nThis project uses an ETL process to monitor air quality in Milan with data from the Weatherbit Air Quality API, available on [RapidAPI](https://rapidapi.com/weatherbit/api/air-quality). The ETL pipeline performs three main tasks:\n\n1. **Extraction**: Air quality data is extracted from the Weatherbit Air Quality API on RapidAPI using Python's `requests` library.\n2. **Transformation**: The extracted data is then cleaned, formatted, and transformed to create additional features and remove any inconsistencies. This includes handling missing values, removing duplicates, converting data types, and generating new derived features.\n3. 
**Loading**: The transformed data is saved to a CSV file and an SQLite database, then used to create visualizations such as histograms, scatter plots, and time series that provide insights into the air quality in Milan.\n\nThe advantage of this ETL approach is that it automates the entire data collection, transformation, and visualization process, allowing for continuous monitoring and analysis of air quality. The integration with the `schedule` library lets the pipeline run at regular intervals (e.g., every 24 hours), so the data stays up to date.\n\n\n## Data Source\n\nThe data used in this project comes from the Weatherbit Air Quality API, available through RapidAPI. This API gives access to current air quality data, 3-day (hourly) air quality forecasts, and 24-hour historical air quality conditions for any location worldwide. This project retrieves the 24-hour historical air quality data for Milan, Italy.\n\nThe Weatherbit Air Quality API provides comprehensive information about air quality conditions, including both current and historical data. 
With this API, users can retrieve:\n\n- **3-Day Hourly Forecasts**: Provides a forecast of air quality for the next 72 hours, broken down by the hour.\n- **Current Air Quality + Pollen Levels**: Offers real-time data on air quality and pollen levels.\n- **24-Hour Historical Data**: Supplies hourly historical air quality data for the past 24 hours for any location.\n\nThe dataset obtained from the API includes the following fields, which provide a detailed view of air quality conditions:\n\n- **lat**: Latitude (Degrees) of the location.\n- **lon**: Longitude (Degrees) of the location.\n- **timezone**: Local IANA timezone for the location.\n- **city_name**: Name of the nearest city.\n- **country_code**: Country abbreviation.\n- **state_code**: State abbreviation or code.\n- **timestamp_local**: Local time of the measurement.\n- **timestamp_utc**: Coordinated Universal Time (UTC) of the measurement.\n- **ts**: Unix timestamp at UTC time.\n- **aqi**: Air Quality Index (AQI) following the US EPA standard (ranges from 0 to 500).\n- **o3**: Concentration of surface Ozone (O3) in micrograms per cubic meter (µg/m³).\n- **so2**: Concentration of surface Sulfur Dioxide (SO2) in micrograms per cubic meter (µg/m³).\n- **no2**: Concentration of surface Nitrogen Dioxide (NO2) in micrograms per cubic meter (µg/m³).\n- **co**: Concentration of Carbon Monoxide (CO) in micrograms per cubic meter (µg/m³).\n- **pm25**: Concentration of Particulate Matter (PM2.5) less than 2.5 micrometers in diameter (µg/m³).\n- **pm10**: Concentration of Particulate Matter (PM10) less than 10 micrometers in diameter (µg/m³).\n\nThe data provided by the Weatherbit Air Quality API, accessed via RapidAPI, is sourced from reputable monitoring stations worldwide and updated regularly. 
This ensures high-quality, reliable data suitable for both real-time monitoring applications and long-term trend analysis.\n\n\n## Python Pipeline for Air Quality Monitoring\n\nThe [`Python pipeline`](ETL_pipeline_air_quality.py) implemented in this project is designed to extract, transform, and load (ETL) data related to air quality monitoring in Milan, Italy. This pipeline uses Python to automate data processing and visualization tasks, providing valuable insights into air quality trends and pollution levels. The key components of the pipeline include:\n\n### 1. Data Extraction\n\nThe extraction process uses the Weatherbit Air Quality API on RapidAPI to retrieve historical air quality data for Milan.\nThe extraction step involves:\n- Making HTTP requests to the RapidAPI endpoint.\n- Handling the API response to ensure data integrity.\n- Saving the raw data in JSON format for further processing.\n\n### 2. Data Cleaning\n\nOnce the data is extracted, it undergoes a cleaning process to ensure its quality and usability. The cleaning steps include:\n- Removing missing values and duplicates to prevent data inconsistencies.\n- Converting timestamps to a uniform datetime format for proper time series analysis.\n- Dropping unnecessary columns, including those with only one unique value, to streamline the dataset.\n\n### 3. Data Transformation\n\nThe transformation process involves reshaping and enriching the dataset to facilitate deeper insights. Key transformations include:\n- Adding new columns derived from existing ones, such as pollutant ratios (e.g., PM10/PM2.5 ratio).\n- Reformatting the DataFrame to include separate columns for year, month, day, and hour.\n- Setting `timestamp_local` as the index for time series analysis.\n\n### 4. Data Loading\n\nAfter the data is transformed, it is saved in CSV format. This ensures that the cleaned and transformed data is stored in a structured format for future use or analysis. 
The loading step includes:\n- Combining new transformed data with existing datasets to create a comprehensive and up-to-date record.\n- Saving the updated data to a designated file location for easy access.\n\nThe pipeline also saves the data to an SQLite database:\n- The transformed data is appended to an existing `SQLite` database using the `sqlite3` library. This allows for persistent storage and efficient querying of large datasets.\n- The save function writes the DataFrame (`df`) to the SQLite database [file](Data/air_quality.db). If the table already exists, new data is appended to it, so existing rows are preserved.\n\n### 5. Data Visualization\n\nTo provide a more intuitive understanding of air quality trends and pollution levels, the pipeline includes a visualization component. Using the Plotly library, the following visualizations are generated and saved:\n- **Histogram with Mean and Standard Deviation**: Visualizes the distribution of key air quality metrics.\n- **Box Plot**: Displays the distribution and outliers of air quality data.\n- **Correlation Matrix Heatmap**: Highlights relationships between different pollutants.\n- **Time Series Plot**: Shows changes in air quality over time with an interactive range slider.\n- **Scatter Plot with Regression Line**: Visualizes relationships between different pollutant concentrations.\n- **Distribution Plot with KDE and Histogram**: Provides a dynamic, interactive view of the distributions of various pollutants.\n\nThese plots are saved as HTML files in this [folder](Images/Air_Quality/).\n\n### 6. Automation and Scheduling\n\nTo keep the data up to date, the ETL pipeline is automated using the `schedule` library. The pipeline is set to run at regular intervals (e.g., every 24 hours) to fetch the latest air quality data and update the visualizations accordingly. 
This automation is implemented using Python's `threading` module to run the scheduler continuously in the background, allowing for uninterrupted data processing.\n\n\n## Requirements\n\nTo run the Python pipeline, ensure that all the required libraries are installed. You can use the `requirements.txt` file generated by `pipreqs` to install them. To do this, run:\n\n```bash\npip install -r requirements.txt\n```\n\nAfter that, to use the Python Pipeline:\n\n1. **Configure API Keys**: Place your RapidAPI keys in the `API_keys.json` file in the correct format.\n2. **Run the Pipeline**: Execute the ETL pipeline by running the Python script.\n3. **Access Visualizations**: Visualizations will be saved in the `Images/Air_Quality/` directory for easy access and review.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsilvano315%2Fetl-project-using-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsilvano315%2Fetl-project-using-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsilvano315%2Fetl-project-using-api/lists"}