https://github.com/santhin/air-pollution

Preprocessing air pollution data using distributed computing and ML

air-pollution coiled dask ml optuna pandas plotly python xgboost


---
## 🧐 About

The project was created for academic purposes. It unifies archival measurement data from 2000-2019 with an existing database.

Measurement data were combined with meteorological data using Dask, with distributed computing provided by Coiled.
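
To make the combining step concrete, here is a minimal sketch of the join logic in plain Python, not the project's actual Dask code. It assumes both datasets share a station identifier and a timestamp; the field names (`station`, `timestamp`, `pm25`, `temperature_c`, `wind_speed_ms`) and the sample values are hypothetical and chosen only for illustration. With Dask DataFrames the same step would typically be a `merge` on the two key columns.

```python
from datetime import datetime


def combine(measurements, weather):
    """Left-join pollution rows with weather rows on (station, timestamp)."""
    # Index weather rows by the join key so each lookup is a dict access.
    weather_by_key = {(w["station"], w["timestamp"]): w for w in weather}
    combined = []
    for m in measurements:
        match = weather_by_key.get((m["station"], m["timestamp"]), {})
        # Keep every measurement; weather fields are None when nothing matches.
        combined.append({
            **m,
            "temperature_c": match.get("temperature_c"),
            "wind_speed_ms": match.get("wind_speed_ms"),
        })
    return combined


# Invented sample rows (hypothetical schema, not taken from the project data).
measurements = [
    {"station": "A", "timestamp": datetime(2019, 1, 1, 12), "pm25": 41.0},
    {"station": "B", "timestamp": datetime(2019, 1, 1, 12), "pm25": 18.5},
]
weather = [
    {"station": "A", "timestamp": datetime(2019, 1, 1, 12),
     "temperature_c": -2.0, "wind_speed_ms": 1.5},
]

rows = combine(measurements, weather)
```

A left join is used here so that measurements without a matching weather record are kept rather than silently dropped; whether the project made the same choice is not stated in this README.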

The model was trained using XGBoost, with Optuna for hyperparameter tuning, and reached an RMSE of 6.932 µg/m³.
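
For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the mean of the squared differences between observed and predicted values, so it is expressed in the same unit as the target. The sketch below shows the calculation in plain Python; the `rmse` helper and the sample numbers are illustrative assumptions, not the project's evaluation code or data.

```python
import math


def rmse(y_true, y_pred):
    """Return the root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented concentrations, for illustration only.
observed = [10.0, 20.0, 30.0]
predicted = [13.0, 16.0, 30.0]
score = rmse(observed, predicted)
```

Because the residuals are squared before averaging, a few large misses raise RMSE more than many small ones.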

### Links to data:
- https://powietrze.gios.gov.pl/pjp/archives
- https://powietrze.gios.gov.pl/pjp/content/api
- https://danepubliczne.imgw.pl/data/

## 🏁 Installing

- Python 3.8.3

```
git clone https://github.com/Santhin/air-pollution.git
```

Installing dependencies:

```
pip install -r requirements.txt
# or, if you use Poetry:
poetry install
```
Start Jupyter Notebook with:
```
jupyter notebook
```
To create the Coiled software environment, run the following in Python:
```
import coiled

# Build a Coiled software environment from the conda spec in the repository root.
coiled.create_software_environment(
    name="my-software-env",
    conda="coiled-environment-py38.yml",
)
```

### Project structure
```
├── coiled-environment-py38.yml
├── data
│   ├── dictionaries
│   │   ├── IndeksJakosciPowietrza.csv
│   │   ├── Indeks\ jakości\ powietrza\ gioś.xlsx
│   │   ├── IndeksJakosciPowietrza.xlsx
│   │   ├── Kody_stacji_pomiarowych.xlsx
│   │   ├── Matching_stations
│   │   │   ├── SmogoliczkaStacje.csv
│   │   │   └── SynopStacje.csv
│   │   ├── Metadane\ -\ stacje\ i\ stanowiska\ pomiarowe.xlsx
│   │   ├── Normy.pkl
│   │   ├── PomiarySample.pkl
│   │   ├── response_api_gios.json
│   │   ├── RodzajeParametrow.csv
│   │   ├── rodzaje_parametrow.pkl
│   │   ├── SensoryPomiarowe.csv
│   │   ├── SensoryPomiarowe.pkl
│   │   ├── stacje_pom_api.json
│   │   ├── StacjePomiarowe.xlsx
│   │   └── stacjeSmogoliczka.csv
│   ├── IndeksJakosciPowietrza.csv
│   └── train_data.csv
├── LICENSE
├── notebooks
│   ├── Air\ Quality\ Index\ Gios.ipynb
│   ├── eda\ without\ progress-Copy4.ipynb
│   ├── Filtering\ excel\ files\ and\ picking\ right\ parameters.ipynb
│   ├── Fixing\ missing\ lat\ and\ lon\ in\ stations\ .ipynb
│   ├── loader_sql.py
│   ├── Matching\ stations\ synop\ with\ Smogoliczka\ .ipynb
│   ├── Matching\ synop\ data\ with\ smogoliczka.ipynb
│   ├── Matching\ Synop\ with\ Smogoliczka\ final.ipynb
│   ├── ML\ PM2.5.ipynb
│   ├── New\ Strategy\ script\ for\ excel\ files.ipynb
│   ├── __pycache__
│   │   └── loader_sql.cpython-38.pyc
│   ├── Repairing\ stations\ names\ and\ merging\ into\ one\ .ipynb
│   └── Smogoliczka\ API\ to\ pomiary_pivot.ipynb
├── poetry.lock
├── pyproject.toml
├── README.md
└── requirements.txt
```

## ⛏️ Built Using

- [MS SQL Server](https://www.microsoft.com/pl-pl/sql-server/sql-server-downloads) - Database
- [Amazon S3](https://aws.amazon.com/s3/) - Cloud storage
- [Coiled](https://coiled.io/) - Distributed computing
- [Dask](https://dask.org/) - Data preprocessing
- [Optuna](https://optuna.org/) - Hyperparameter optimization
- [XGBoost](https://xgboost.readthedocs.io/en/latest/) - Machine learning model