https://github.com/se-jaeger/conformal-data-cleaning

Code for the AISTATS 2024 Paper "From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance"

Host: GitHub
URL: https://github.com/se-jaeger/conformal-data-cleaning
Owner: se-jaeger
License: mit
Created: 2024-02-07T14:28:06.000Z (almost 1 year ago)
Default Branch: main
Last Pushed: 2024-02-14T10:39:08.000Z (12 months ago)
Last Synced: 2024-02-14T11:37:23.374Z (12 months ago)
Language: Python
Size: 1000 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Conformal Data Cleaning

This repository contains the source code for the experiments conducted in the AISTATS 2024 paper "From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance".

## Run Experiments

First, use [`load_corrupt_and_test_datasets.ipynb`](./notebooks/load_corrupt_and_test_datasets.ipynb) to download and corrupt the datasets and to set up the expected structure of the [`data`](./data/) directory.

[`run_experiment.py`](./scripts/run_experiment.py) implements a simple CLI script (`run-experiment`) that makes it easy to run experiments.

**Conformal Data Cleaning:**
```bash
run-experiment \
--task_id \
"42493" \
--error_fractions \
"0.01" \
"0.05" \
"0.1" \
"0.3" \
"0.5" \
--num_repetitions \
"3" \
--results_path \
"/conformal-data-cleaning/results/final-experiments" \
--models_path \
"/conformal-data-cleaning/models/final-experiments" \
--how_many_hpo_trials \
"50" \
experiment \
--confidence_level \
"0.999"
```

**ML Baseline:**
```bash
run-experiment \
--task_id \
"42493" \
--error_fractions \
"0.01" \
"0.05" \
"0.1" \
"0.3" \
"0.5" \
--num_repetitions \
"3" \
--results_path \
"/conformal-data-cleaning/results/final-experiments" \
--models_path \
"/conformal-data-cleaning/models/final-experiments" \
--how_many_hpo_trials \
"50" \
baseline \
--method \
"AutoGluon" \
--method_hyperparameter \
"0.999"
```

**PyOD Baseline (not included in the paper):**
```bash
run-experiment \
--task_id \
"42493" \
--error_fractions \
"0.01" \
"0.05" \
"0.1" \
"0.3" \
"0.5" \
--num_repetitions \
"3" \
--results_path \
"/conformal-data-cleaning/results/final-experiments" \
--models_path \
"/conformal-data-cleaning/models/final-experiments" \
--how_many_hpo_trials \
"50" \
baseline \
--method \
"PyodECOD" \
--method_hyperparameter \
"0.3"
```
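
The three `run-experiment` calls above share the same global options and differ only in the subcommand (`experiment` or `baseline`) and its options. As a minimal sketch of that structure, the following Python snippet assembles the same argument lists. The helper `build_command` and the `COMMON_OPTIONS` dictionary are hypothetical and not part of this repository; only the flags and values are taken from the commands above.

```python
import shlex

# Global options shared by the three `run-experiment` calls shown above.
COMMON_OPTIONS = {
    "--task_id": ["42493"],
    "--error_fractions": ["0.01", "0.05", "0.1", "0.3", "0.5"],
    "--num_repetitions": ["3"],
    "--results_path": ["/conformal-data-cleaning/results/final-experiments"],
    "--models_path": ["/conformal-data-cleaning/models/final-experiments"],
    "--how_many_hpo_trials": ["50"],
}


def build_command(subcommand, subcommand_options, common_options=COMMON_OPTIONS):
    """Return the argument list for one `run-experiment` call (hypothetical helper)."""
    args = ["run-experiment"]
    # Global options come first, each flag followed by its value(s).
    for flag, values in common_options.items():
        args.append(flag)
        args.extend(values)
    # The subcommand and its own options follow the global options.
    args.append(subcommand)
    for flag, value in subcommand_options.items():
        args.extend([flag, value])
    return args


conformal = build_command("experiment", {"--confidence_level": "0.999"})
autogluon = build_command(
    "baseline", {"--method": "AutoGluon", "--method_hyperparameter": "0.999"}
)
ecod = build_command(
    "baseline", {"--method": "PyodECOD", "--method_hyperparameter": "0.3"}
)

# Print each command as a single shell-quoted line.
for command in (conformal, autogluon, ecod):
    print(shlex.join(command))
```

Assuming the `run-experiment` entry point is installed, each list could be passed to `subprocess.run(command, check=True)` to start the corresponding experiment.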

For Garf, please use [`main.py`](./garf/main.py):
```bash
python \
main.py \
--task_id \
"42493" \
--error_fractions \
"0.01" \
"0.05" \
"0.1" \
"0.3" \
"0.5" \
--num_repetitions \
"3" \
--results_path \
"/conformal-data-cleaning/results/final-experiments" \
--models_path \
"/conformal-data-cleaning/models/final-experiments"
```

## Run our Experimental Setup

We ran our experiments on Kubernetes using Helm. Please check out the [Helm charts](./infrastructure/helm/) and change the `image` and `imagePullSecrets` settings in the `values.yaml` files according to your setup.
In addition, some read-write-many volumes are necessary to store the experiment results. Please check out the [`infrastructure/k8s`](./infrastructure/k8s/) directory (and don't forget to set up the data directory as described above).

Running `make docker` builds and pushes the necessary Docker images, and `make helm-install` uses [`deploy_experiments.py`](./scripts/deploy_experiments.py) to start our experimental setup.

## Evaluation

[`notebooks/evaluation`](./notebooks/evaluation/) contains the notebooks we used to evaluate the results; [`5_plotting.ipynb`](./notebooks/evaluation/5_plotting.ipynb) produces the plots shown in the paper.