https://github.com/drkbluescience/autogluon_cameroon_air_quality
Finished 5th in the Cameroon Air Quality Prediction competition, later refining the model to achieve a score better than the 1st place submission using AutoGluon.
autogluon automl feature-engineering machine-learning regression-analysis tabular-data
- Host: GitHub
- URL: https://github.com/drkbluescience/autogluon_cameroon_air_quality
- Owner: drkbluescience
- Created: 2024-11-13T12:16:29.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-14T10:06:17.000Z (over 1 year ago)
- Last Synced: 2025-01-14T11:33:07.255Z (over 1 year ago)
- Topics: autogluon, automl, feature-engineering, machine-learning, regression-analysis, tabular-data
- Language: Jupyter Notebook
- Homepage:
- Size: 4.16 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# **Cameroon Air Quality Prediction - AutoGluon**
## Introduction
This study focuses on predicting air quality in Cameroon, specifically the concentration of particulate matter (**PM2.5**), using various machine learning techniques. The dataset includes weather and air quality features collected from different cities across Cameroon.
## Leaderboard Achievement
The submission finished the **Cameroon Air Quality Prediction** competition in 5th place on the leaderboard. After further model improvements made once the competition had ended, the 1st-place score was surpassed.

## Methodology
The analysis began with an exploration of the dataset, during which **data inconsistencies** were addressed. Constant features, meaning those holding a **single value** across the whole dataset, such as **'sunrise'**, **'sunset'**, and **'snowfall_sum'**, were removed. The **city**, **longitude**, and **latitude** variables were treated as redundant and were also dropped to reduce unnecessary complexity in the models.
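A minimal sketch of the constant-feature removal described above, assuming the data is loaded into a pandas DataFrame. The helper name `drop_constant_columns`, the toy values, and the column names other than `snowfall_sum` are hypothetical and are not taken from the repository.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that hold a single unique value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also treated as constant.
    n_unique = df.nunique(dropna=False)
    constant = n_unique[n_unique <= 1].index
    return df.drop(columns=constant)


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "snowfall_sum": [0.0, 0.0, 0.0],             # constant, so it is dropped
        "temperature_2m_mean": [24.1, 25.3, 23.8],   # hypothetical column name
        "pm2_5": [31.0, 40.5, 28.2],                 # hypothetical target name
    }
)
cleaned = drop_constant_columns(toy)
print(list(cleaned.columns))
```

Detecting constant columns programmatically rather than listing them by hand keeps the step valid if the dataset changes.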
### Feature Engineering
To improve predictive power, the distribution of **PM2.5** concentrations across the cities was analyzed, which led to the creation of a new feature:
- **Distance from Bafoussam**, the city with the highest PM2.5 levels.
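One way such a distance feature could be computed is the great-circle (haversine) distance between each observation's coordinates and Bafoussam. This is only a sketch: the README does not say which distance measure was used, the coordinates below are approximate values assumed for illustration, and the function names are hypothetical. The coordinates would have to be read before the latitude and longitude columns are dropped.

```python
import math

# Approximate coordinates of Bafoussam in decimal degrees (an assumption;
# the repository may use different values).
BAFOUSSAM_LAT, BAFOUSSAM_LON = 5.48, 10.42
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_from_bafoussam_km(lat: float, lon: float) -> float:
    """Hypothetical feature: distance of a location from Bafoussam."""
    return haversine_km(lat, lon, BAFOUSSAM_LAT, BAFOUSSAM_LON)
```

Applied row by row to the latitude and longitude columns, this yields a single numeric feature that can stand in for the raw coordinates.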
## Models
Several machine learning models were initially employed to predict **PM2.5** levels, including:
- **CatBoost**
- **LightGBM (LGBM)**
- **XGBoost (XGB)**
- **GradientBoostingRegressor**
- **ExtraTreesRegressor**
- **RandomForestRegressor**
- **AdaBoostRegressor**
- **MLPRegressor**
These models were evaluated using a **9-split RepeatedKFold cross-validation** strategy, so that each reported score is an average over several train/validation partitions rather than the result of a single split.
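The evaluation loop can be sketched with scikit-learn as follows. The README does not say how the nine splits are composed, so 3 folds repeated 3 times is an assumption here, as are the synthetic data, the choice of `ExtraTreesRegressor` as the example model, and all hyperparameters.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data standing in for the real features and PM2.5 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Assumption: 3 folds repeated 3 times, giving 9 train/validation splits in total.
cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=42)

model = ExtraTreesRegressor(n_estimators=50, random_state=42)

# scikit-learn maximises scores, so RMSE is exposed as a negated value.
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
rmse_per_split = -scores
mean_rmse = rmse_per_split.mean()
print(f"mean RMSE over {len(rmse_per_split)} splits: {mean_rmse:.4f}")
```

Any of the listed regressors could be passed in place of `model`, which keeps the comparison between them on the same splits.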
However, after initial testing, **AutoGluon** was introduced and ultimately provided the best performance, surpassing all other models in predictive accuracy.
## Results
Among all models tested, **CatBoost** performed well, achieving a **root mean squared error (RMSE)** of **3.11078**. However, **AutoGluon** outperformed every other model, achieving the **lowest RMSE of 2.97008**.
### Key Result Comparison
| Model | RMSE |
| -------------------------- | --------- |
| **CatBoost** | 3.11078 |
| **AutoGluon** | **2.97008** |
This result is an RMSE reduction of **0.14070** (about **4.5%**) relative to **CatBoost**, indicating that **AutoGluon** had the stronger predictive performance for this particular task.
While other models, such as **CatBoost** and **ExtraTrees**, provided competitive results, **AutoGluon’s automatic model selection and hyperparameter tuning** led to the best performance, further validating its effectiveness as an **AutoML tool** for air quality prediction.
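For reference, the metric and the size of the gap between the two reported scores can be worked out in a few lines. The two RMSE values come from the table above. The helper names are hypothetical, and the `rmse` function is only a plain statement of the metric's definition, not the competition's scoring code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


catboost_rmse = 3.11078   # from the results table
autogluon_rmse = 2.97008  # from the results table

absolute_gain = catboost_rmse - autogluon_rmse
relative_gain = relative_reduction(catboost_rmse, autogluon_rmse)
print(f"absolute: {absolute_gain:.5f}, relative: {relative_gain:.2%}")
```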
## Conclusion
This study highlights the critical role of **feature engineering** in improving model performance, as well as the superiority of **AutoGluon** over other machine learning models for the task of predicting **PM2.5** concentrations. AutoGluon’s automated approach to model selection and optimization resulted in a more accurate prediction, achieving the best RMSE score and outperforming traditional models like **CatBoost** and **XGBoost**.