https://github.com/csinva/imodels-data

Preprocessed data for various popular tabular datasets to go along with imodels.

ai classification data data-science dataset explainability imodels interpretability machine-learning ml rule-based xai


Host: GitHub
URL: https://github.com/csinva/imodels-data
Owner: csinva
Created: 2021-11-20T19:53:53.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2023-11-15T21:16:58.000Z (about 1 year ago)
Last Synced: 2024-10-07T13:41:39.971Z (3 months ago)
Topics: ai, classification, data, data-science, dataset, explainability, imodels, interpretability, machine-learning, ml, rule-based, xai
Language: Jupyter Notebook
Homepage: https://csinva.io/imodels/
Size: 50.7 MB
Stars: 4
Watchers: 4
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: readme.md

# imodels🔍 data

 Tabular data for various problems, especially for high-stakes rule-based modeling with the imodels package.

 See also https://huggingface.co/imodels 


Includes the following datasets and more (see notebooks for more details on the datasets).

To download, use the "Name" field as the key: e.g. `imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')`.

| Name                  |   Samples |   Features |   Class 0 |   Class 1 |   Majority class % |
|:----------------------|----------:|-----------:|----------:|----------:|-------------------:|
| heart                 |       270 |         15 |       150 |       120 |               55.6 |
| breast_cancer         |       277 |         17 |       196 |        81 |               70.8 |
| haberman              |       306 |          3 |        81 |       225 |               73.5 |
| credit_g              |      1000 |         60 |       300 |       700 |               70   |
| csi_pecarn_prop       |      3313 |         97 |      2773 |       540 |               83.7 |
| csi_pecarn_pred       |      3313 |         39 |      2773 |       540 |               83.7 |
| juvenile_clean        |      3640 |        286 |      3153 |       487 |               86.6 |
| compas_two_year_clean |      6172 |         20 |      3182 |      2990 |               51.6 |
| enhancer              |      7809 |         80 |      7115 |       694 |               91.1 |
| fico                  |     10459 |         23 |      5000 |      5459 |               52.2 |
| iai_pecarn_prop       |     12044 |         73 |     11841 |       203 |               98.3 |
| iai_pecarn_pred       |     12044 |         58 |     11841 |       203 |               98.3 |
| credit_card_clean     |     30000 |         33 |     23364 |      6636 |               77.9 |
| tbi_pecarn_prop       |     42428 |        223 |     42052 |       376 |               99.1 |
| tbi_pecarn_pred       |     42428 |        121 |     42052 |       376 |               99.1 |
| readmission_clean     |    101763 |        150 |     54861 |     46902 |               53.9 |
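The "Majority class %" column follows directly from the two class counts: it is the larger count divided by the number of samples. A minimal sketch of that arithmetic is below; the helper name `majority_class_pct` is illustrative and is not part of `imodels`.

```python
from collections import Counter


def majority_class_pct(labels):
    """Return the share of the most frequent label as a percentage (one decimal)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    majority_count = max(counts.values())
    return round(100 * majority_count / sum(counts.values()), 1)


# The `heart` row of the table: 150 samples of class 0 and 120 of class 1.
heart_labels = [0] * 150 + [1] * 120
print(majority_class_pct(heart_labels))
```

This number is also the accuracy of a classifier that always predicts the majority class, so it is a useful baseline to compare against, particularly for the heavily imbalanced PECARN datasets. When `y` is a numpy array, passing `y.tolist()` keeps the label types plain.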

# Data usage

First, install the `imodels` package: `pip install imodels`. Then, use the `imodels.get_clean_dataset` function.

```python
imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources including imodels, pmlb, openml, and sklearn.
If the data is not already downloaded, it is downloaded and cached; otherwise it is loaded locally.

Parameters
----------
dataset_name: str
    unique dataset identifier
data_source: str
    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
    path to load/save data (default: 'data')

Returns
-------
X: np.ndarray
    features
y: np.ndarray
    outcome
feature_names: list
"""
```
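The docstring above describes a download-then-cache behavior: fetch the data once, store it under `data_path`, and read the local copy afterwards. The sketch below illustrates that general pattern only. It is not the `imodels` implementation, and the names `load_or_download` and `fetch`, as well as the one-CSV-per-dataset file layout, are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


def load_or_download(name: str, fetch: Callable[[str], str], data_path: str = "data") -> str:
    """Return the CSV text for `name`, calling `fetch` only if no cached copy exists.

    `fetch` is a hypothetical callable that takes a dataset name and returns CSV text.
    """
    cache_file = Path(data_path) / f"{name}.csv"
    if cache_file.exists():
        # Cache hit: read the local copy and skip the download.
        return cache_file.read_text()
    # Cache miss: fetch the data, then store it for the next call.
    text = fetch(name)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text)
    return text
```

With this pattern, deleting the cached file under `data_path` is enough to force a fresh download on the next call.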

   

## Example

```python
import imodels

# download the COMPAS dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')

# download the ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')

# download the liver dataset from openml (the name is the OpenML dataset id, passed as a string)
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')

# download the California housing dataset from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')
```

# Data info

Data comes from various sources - please cite those sources appropriately.

> [notebooks_fetch_data](notebooks_fetch_data) contains notebooks which download and preprocess the data

> 

> [data_cleaned](data_cleaned) contains the cleaned csv file for each dataset

## Clinical decision-rule (PECARN) datasets

To use any of the clinical decision-rule datasets, you must first accept the research data use agreement [here](https://pecarn.org/datasets/).

There are two versions of each PECARN (TBI, IAI, and CSI) dataset.

- `prop`: missing values have not been imputed

- `pred`: missing values have been imputed

Note on `csi_pecarn_pred.csv`: unlike the rest of the datasets in this repo, which are fully cleaned, `csi_pecarn_pred.csv` contains a variable ("SITE") that should be removed before fitting models.
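One way to remove such a variable from the arrays returned by `imodels.get_clean_dataset` is sketched below. The helper `drop_feature` is hypothetical, and the sketch assumes the column appears in `feature_names` under the exact name `"SITE"`, which is worth checking against the loaded data. The toy array and the names `feat_a` and `feat_b` are placeholders, not real columns.

```python
import numpy as np


def drop_feature(X, feature_names, name):
    """Return copies of X and feature_names without the column called `name`."""
    keep = [i for i, f in enumerate(feature_names) if f != name]
    if len(keep) == len(feature_names):
        raise KeyError(f"{name!r} not found in feature_names")
    # Indexing with a list of column indices returns a new array.
    return X[:, keep], [feature_names[i] for i in keep]


# Toy stand-in for (X, feature_names) as returned by get_clean_dataset.
X = np.array([[1.0, 7.0, 0.0],
              [0.0, 3.0, 1.0]])
feature_names = ["feat_a", "SITE", "feat_b"]

X_clean, names_clean = drop_feature(X, feature_names, "SITE")
print(names_clean, X_clean.shape)
```

Raising an error when the name is absent, rather than returning the data unchanged, avoids silently fitting a model with the unwanted column still in place.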

| Dataset | Task | Size | References |
| ---------- | ----- | ----------------------------------------------------------- | :-------------------------------: |
| iai_pecarn | Predict intra-abdominal injury requiring acute intervention before CT | 12,044 patients, 203 with IAI-I | [📄](https://pubmed.ncbi.nlm.nih.gov/23375510/), [🔗](https://pecarn.org/datasets/) |
| tbi_pecarn | Predict traumatic brain injuries before CT | 42,412 patients, 376 with ciTBI | [📄](https://pecarn.org/studyDatasets/documents/Kuppermann_2009_The-Lancet_000.pdf), [🔗](https://pecarn.org/datasets/) |
| csi_pecarn | Predict cervical spine injury in children | 3,314 patients, 540 with CSI | [📄](https://pecarn.org/studyDatasets/documents/Kuppermann_2009_The-Lancet_000.pdf), [🔗](https://pecarn.org/datasets/) |

## Miscellaneous notes

The `breast_cancer` dataset here is not the extremely common Wisconsin breast-cancer dataset but rather [this dataset](https://www.openml.org/search?type=data&sort=runs&id=13&status=active) from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.

Some other cool datasets:

- [moleculenet](https://moleculenet.org/datasets-1) - benchmarks for molecular datasets

- [srbench](https://github.com/cavalab/srbench) - benchmarking for symbolic regression

- [big-bench](https://github.com/google/BIG-bench) - language modeling benchmarks