https://github.com/drkbluescience/autogluon_cameroon_air_quality

Finished 5th in the Cameroon Air Quality Prediction competition, later refining the model to achieve a score better than the 1st place submission using AutoGluon.




Host: GitHub
URL: https://github.com/drkbluescience/autogluon_cameroon_air_quality
Owner: drkbluescience
Created: 2024-11-13T12:16:29.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-11-14T10:06:17.000Z (over 1 year ago)
Last Synced: 2025-01-14T11:33:07.255Z (over 1 year ago)
Topics: autogluon, automl, feature-engineering, machine-learning, regression-analysis, tabular-data
Language: Jupyter Notebook
Homepage:
Size: 4.16 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# **Cameroon Air Quality Prediction - AutoGluon**

## Introduction

This study focuses on predicting air quality in Cameroon, specifically the concentration of particulate matter (**PM2.5**), using various machine learning techniques. The dataset includes weather and air quality features collected from different cities across Cameroon.

## Leaderboard Achievement

The snapshot below shows the final leaderboard of the **Cameroon Air Quality Prediction** competition, where this solution finished in 5th place. After the competition ended, further refinement of the model with AutoGluon produced a score better than the 1st-place submission.
![Leaderboard](images/leaderboard.png)

## Methodology

The analysis began with an exploration of the dataset, during which **data inconsistencies** were addressed. Features with a **single value**, such as **'sunrise'**, **'sunset'**, and **'snowfall_sum'**, were removed because a constant column carries no information for the model. Redundant variables, including **city**, **longitude**, and **latitude**, were also removed to avoid unnecessary model complexity.
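The cleaning step described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the helper name `drop_uninformative_columns`, the toy values, and the `temperature` and `pm2_5` column names are assumptions made for the example.

```python
import pandas as pd


def drop_uninformative_columns(df: pd.DataFrame, redundant: list[str]) -> pd.DataFrame:
    """Drop columns with a single distinct value, then the named redundant columns."""
    # nunique(dropna=False) counts NaN as a value, so an all-NaN column is also constant.
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    # errors="ignore" lets the call succeed if a listed column was already removed.
    return df.drop(columns=constant).drop(columns=redundant, errors="ignore")


# Toy frame; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "city": ["A", "B", "C"],
        "latitude": [1.0, 2.0, 3.0],
        "longitude": [9.0, 10.0, 11.0],
        "snowfall_sum": [0.0, 0.0, 0.0],
        "temperature": [24.1, 25.3, 23.8],
        "pm2_5": [30.0, 42.5, 28.1],
    }
)
cleaned = drop_uninformative_columns(toy, redundant=["city", "latitude", "longitude"])
print(list(cleaned.columns))
```

Detecting constant columns programmatically, rather than listing them by hand, keeps the step valid if the dataset changes.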

### Feature Engineering

To improve predictive power, the distribution of **PM2.5** concentrations was analyzed across cities, which led to the creation of a new feature:
- **Distance from Bafoussam**, the city with the highest PM2.5 levels.
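One way to build such a distance feature is the haversine (great-circle) formula. The README does not say which distance measure was used, so the sketch below is an assumption; the Bafoussam coordinates are approximate and are not taken from the repository.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate coordinates of Bafoussam (assumption for illustration only).
BAFOUSSAM = (5.48, 10.42)  # (latitude, longitude) in degrees


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_from_bafoussam(lat, lon):
    """Distance in kilometres from a point to the reference city."""
    return haversine_km(lat, lon, *BAFOUSSAM)
```

If the latitude and longitude columns are later dropped, this feature would need to be computed first, for example with `[distance_from_bafoussam(la, lo) for la, lo in zip(df["latitude"], df["longitude"])]`.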

## Models

Several machine learning models were initially employed to predict **PM2.5** levels, including:

- **CatBoost**
- **LightGBM (LGBM)**
- **XGBoost (XGB)**
- **GradientBoostingRegressor**
- **ExtraTreesRegressor**
- **RandomForestRegressor**
- **AdaBoostRegressor**
- **MLPRegressor**

These models were evaluated using a **9-split RepeatedKFold cross-validation** strategy to obtain more stable performance estimates than a single train/validation split would give.
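The evaluation loop could look like the following scikit-learn sketch. It runs on synthetic data, and it reads "9-split" as 3 folds repeated 3 times; the README does not state the exact `n_splits` and `n_repeats`, so that configuration is an assumption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data standing in for the competition features and PM2.5 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=120)

# Assumption: 3 folds x 3 repeats = 9 train/validation splits in total.
cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=42)

model = ExtraTreesRegressor(n_estimators=50, random_state=42)
# scikit-learn maximises scores, so RMSE is exposed as a negated value.
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
rmse_per_split = -scores
print(f"mean RMSE over {len(rmse_per_split)} splits: {rmse_per_split.mean():.3f}")
```

Any of the listed regressors can be substituted for `ExtraTreesRegressor`, which makes the per-split RMSE values directly comparable across models.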

However, after initial testing, **AutoGluon** was introduced and ultimately provided the best performance, surpassing all other models in predictive accuracy.
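For readers unfamiliar with AutoGluon, a typical tabular regression setup is sketched below. The function name `fit_autogluon`, the `presets` value, and the `time_limit` default are illustrative assumptions, not the settings used in this repository.

```python
def fit_autogluon(train_df, label, time_limit=600):
    """Fit an AutoGluon tabular regressor scored by RMSE and return the predictor."""
    # Imported lazily so this module loads even where AutoGluon is not installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,
        problem_type="regression",
        eval_metric="root_mean_squared_error",
    )
    # presets and time_limit are illustrative choices, not the repository's settings.
    predictor.fit(train_df, presets="best_quality", time_limit=time_limit)
    return predictor
```

Assuming a training DataFrame with a target column named `pm2_5` (a hypothetical name), usage would be `predictor = fit_autogluon(train_df, label="pm2_5")`, followed by `predictor.predict(test_df)` for predictions and `predictor.leaderboard()` to inspect the models AutoGluon trained.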

## Results

Among all models tested, **CatBoost** performed well, achieving a **root mean squared error (RMSE)** of **3.11078**. However, **AutoGluon** outperformed every other model, achieving the **lowest RMSE of 2.97008**.

### Key Result Comparison

| Model | RMSE |
| -------------------------- | --------- |
| **CatBoost** | 3.11078 |
| **AutoGluon** | **2.97008** |

This corresponds to an RMSE reduction of about 0.14 (roughly 4.5%) relative to **CatBoost**, indicating stronger predictive performance from **AutoGluon** on this particular task.
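The size of the gap between the two scores in the table can be checked with a few lines of Python. The `rmse` helper shows how the metric is defined; the two helper names are chosen for this example, and only the two reported scores come from the README.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


catboost_rmse = 3.11078   # reported in the table above
autogluon_rmse = 2.97008  # reported in the table above

absolute_gain = catboost_rmse - autogluon_rmse
relative_gain = relative_reduction(catboost_rmse, autogluon_rmse)
print(f"absolute: {absolute_gain:.5f}, relative: {relative_gain:.2%}")
```

Because RMSE is expressed in the units of the target, the absolute difference is the more direct measure of how much closer the predictions are to the observed PM2.5 values.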

While other models, such as **CatBoost** and **ExtraTrees**, provided competitive results, **AutoGluon’s automatic model selection and hyperparameter tuning** led to the best performance, further validating its effectiveness as an **AutoML tool** for air quality prediction.

## Conclusion

This study highlights the critical role of **feature engineering** in improving model performance, as well as the superiority of **AutoGluon** over other machine learning models for the task of predicting **PM2.5** concentrations. AutoGluon’s automated approach to model selection and optimization resulted in a more accurate prediction, achieving the best RMSE score and outperforming traditional models like **CatBoost** and **XGBoost**.
