https://github.com/imamaaa/mlops-air-quality-prediction-pipeline
https://github.com/imamaaa/mlops-air-quality-prediction-pipeline
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/imamaaa/mlops-air-quality-prediction-pipeline
- Owner: imamaaa
- Created: 2025-03-18T22:22:58.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-18T22:40:24.000Z (2 months ago)
- Last Synced: 2025-03-18T23:28:43.957Z (2 months ago)
- Language: Jupyter Notebook
- Size: 1.84 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# MLOps Environmental Monitoring & Pollution Prediction System
## Project Overview
This project implements an **MLOps pipeline** for monitoring **air pollution** and predicting **Air Quality Index (AQI)** using machine learning. The system integrates **data collection, model development, deployment, and monitoring** into a streamlined workflow.### Key Objectives
- Automate **data collection & versioning** using **DVC**
- Develop a **time-series prediction model** (ARIMA/LSTM) for **AQI forecasting**
- Deploy the model as an **API** using **Flask/FastAPI**
- Set up **monitoring & alerting** using **Prometheus & Grafana**
- Perform **live testing** using real-time data from **OpenWeather API**### Key Tools Used
- **DVC (Data Version Control)** → Tracks & manages datasets via **Amazon S3**
- **Flask/FastAPI** → Serves model predictions as an **API**
- **Prometheus & Grafana** → Monitors model performance & system health---
## System Architecture
```mermaid
graph TD
subgraph Data Collection
A1[OpenWeather API] -->|Fetch Data| B1[Data Collection Script]
B1 -->|Store| C1[DVC Storage]
end
subgraph Model Training & Deployment
C1 -->|Versioned Data| D1[Preprocessing and Feature Engineering]
D1 -->|Train| E1[Time-Series Model - ARIMA/LSTM]
E1 -->|Deploy| F1[Flask FastAPI API]
end
subgraph Monitoring
F1 -->|Expose Metrics| G1[Prometheus]
G1 -->|Visualize| H1[Grafana]
end```
---
## Features Implemented
| Feature | Description |
|-----------------------|-------------|
| **Data Collection** | Fetches real-time AQI & weather data via OpenWeather API |
| **Automated Scheduling** | **Windows Task Scheduler** executes the batch file every 4 hours for continuous data collection |
| **Data Versioning** | Uses **DVC** with **Amazon S3** for dataset tracking |
| **Model Training** | Develops **ARIMA & LSTM** for AQI forecasting |
| **Model Deployment** | Deploys predictions via **Flask/FastAPI API** |
| **Live Monitoring** | Uses **Grafana & Prometheus** for tracking performance |---
## System Architecture for Monitoring
- **Flask Application**: Serves the LSTM model for real-time predictions based on live data from OpenWeather API.
- **Prometheus**: Collects and stores metrics exposed by the Flask app’s `/metrics` endpoint.
- **Grafana**: Visualizes Prometheus metrics in customizable dashboards for real-time performance monitoring.
- **Live Data Streams**: Fetch weather and pollution data continuously to feed the prediction pipeline.This architecture ensures that the Flask API, metrics collection, and visualization tools work seamlessly to provide insights into system behavior.
---
## Workflow
1. **Data Collection**: The system fetches real-time AQI & weather data from **OpenWeather API**. Data collection is automated using **Windows Task Scheduler**, which runs every 4 hours.
2. **Data Preprocessing & Feature Engineering**:
- Various preprocessing and feature engineering steps were performed, including handling missing values, outlier detection, scaling, and feature extraction. For a detailed breakdown, please refer to the **Project Report** included in the repository.
3. **Data Versioning**: The collected and preprocessed data is stored and tracked using **DVC (Data Version Control)** with **Amazon S3** as remote storage.
4. **Model Development**:
- Two time-series models were developed: **ARIMA (AutoRegressive Integrated Moving Average)** and **LSTM (Long Short-Term Memory Neural Network)**.
- **ARIMA** was used for short-term AQI forecasting, while **LSTM** captured long-term dependencies.
- The best-trained model was selected for deployment. More details on the model evaluation and hyperparameter tuning can be found in the **Project Report**.
5. **Model Deployment**: The trained model is deployed as an API using **Flask/FastAPI**, allowing users to make real-time predictions.
6. **Monitoring & Logging**: The deployed API and model performance are continuously monitored using **Prometheus & Grafana**, providing real-time metrics and visualizations.---
## API Endpoints
| Method | Endpoint | Description |
|--------|-------------------|-------------|
| GET | `/predict` | Get pollution level prediction |
| POST | `/predict` | Send input data for model inference |
| GET | `/metrics` | API & model performance metrics |---
## Challenges & Key Learnings
### Challenges Faced
- Real-time data ingestion while ensuring dataset versioning with DVC
- Optimizing time-series models (ARIMA/LSTM) for air quality forecasting
- Issues with MLflow setup and tracking, including configuration difficulties and experiment logging inconsistencies### Key Learnings
- Configuring remote storage and using DVC for dataset versioning and tracking
- Automating data pipelines with Windows Task Scheduler for continuous data ingestion
- Deploying ML models as APIs with Flask/FastAPI for real-time inference
- Improving model interpretability and performance tracking through advanced monitoring tools like Prometheus & Grafana---
## Future Improvements
- **Enhance MLflow integration** for **experiment tracking and continuous training**
- **Implement automated hyperparameter tuning** using MLflow
- Improve LSTM & ARIMA model accuracy through hyperparameter optimization and additional feature engineering
- Expand Grafana & Prometheus setup to track more detailed model performance metrics and enhance dashboard visualization
- Deploy to cloud-based services (AWS, GCP, Azure)
- **Integrate alerting system** for high pollution days---