# Penn Machine Learning Benchmarks
This repository contains the code and data for a large, curated set of benchmark datasets for evaluating and comparing supervised machine learning algorithms.
These data sets cover a broad range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features.

Please go to our [home page](https://epistasislab.github.io/pmlb/) to interactively browse the datasets, vignette, and contribution guide!
## Breaking changes in PMLB 1.0
*This repository has been restructured, and several dataset names have been changed!*
If you have an older version of PMLB, we highly recommend you upgrade it to v1.0 for updated URLs and names of datasets:
```
pip install pmlb --upgrade
```

## Datasets
Datasets are tracked with Git Large File Storage (LFS).
If you would like to clone the entire repository, please [install and set up Git LFS](https://git-lfs.github.com/) for your user account.
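A full clone with LFS content enabled might look like the following (a minimal sketch; `git lfs install` sets up LFS for your user account before cloning):

```
git lfs install
git clone https://github.com/EpistasisLab/pmlb.git
```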
Alternatively, you can download the `.zip` file from GitHub.

All data sets are stored in a common format:
* First row is the column names
* Each following row corresponds to one row of the data
* The target column is named `target`
* All columns are tab (`\t`) separated
* All files are compressed with `gzip` to conserve space

![Dataset_Sizes](datasets/dataset_sizes.svg)
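Given this common format, a dataset file can be read directly with pandas. The sketch below assumes a local clone and a per-dataset path of the form `datasets/<name>/<name>.tsv.gz` (an assumption about the repository layout; adjust the path as needed):

```python
import pandas as pd

# Minimal sketch: read one benchmark file by hand.
# Path assumes a local clone laid out as datasets/<name>/<name>.tsv.gz.
adult = pd.read_csv('datasets/adult/adult.tsv.gz', sep='\t', compression='gzip')

print(adult.columns)           # one column is named `target`
print(adult['target'].head())  # labels for the supervised task
```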
The [complete table](pmlb/all_summary_stats.tsv) of dataset characteristics is also available for download.
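Since the summary table is itself tab-separated, it can be loaded the same way (again assuming a local clone or download):

```python
import pandas as pd

# Load the dataset characteristics table from a local clone.
summary_stats = pd.read_csv('pmlb/all_summary_stats.tsv', sep='\t')
print(summary_stats.head())
```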
Please note, in our documentation, a feature is considered:
* "binary" if it is of type integer and has 2 unique values (equivalent to pandas profiling's "boolean")
* "categorical" if it is of type integer and has *more than* 2 unique values (equivalent to pandas profiling's "categorical")
* "continuous" if it is of type float (equivalent to pandas profiling's "numeric").## Python wrapper
## Python wrapper

For easy access to the benchmark data sets, we have provided a Python wrapper named `pmlb`. The wrapper can be installed via `pip`:
```
pip install pmlb
```

and used in Python scripts as follows:
```python
from pmlb import fetch_data

# Returns a pandas DataFrame
adult_data = fetch_data('adult')
print(adult_data.describe())
```

The `fetch_data` function has two additional parameters:
* `return_X_y` (True/False): Whether to return the data in scikit-learn format, with the features and labels stored in separate NumPy arrays.
* `local_cache_dir` (string): The directory on your local machine to store the data files so you don't have to fetch them over the web again. By default, the wrapper does not use a local cache directory.

For example:
```python
from pmlb import fetch_data

# Returns NumPy arrays
adult_X, adult_y = fetch_data('adult', return_X_y=True, local_cache_dir='./')
print(adult_X)
print(adult_y)
```

You can also list all of the available data sets as follows:
```python
from pmlb import dataset_names

print(dataset_names)
```

Or if you only want a list of available classification or regression datasets:
```python
from pmlb import classification_dataset_names, regression_dataset_names

print(classification_dataset_names)
print('')
print(regression_dataset_names)
```

## Example usage: Compare two classification algorithms with PMLB
PMLB is designed to make it easy to benchmark machine learning algorithms against each other. Below is a Python code snippet showing the most basic way to use PMLB to compare two algorithms.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sb

from pmlb import fetch_data, classification_dataset_names

logit_test_scores = []
gnb_test_scores = []

for classification_dataset in classification_dataset_names:
    X, y = fetch_data(classification_dataset, return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y)

    logit = LogisticRegression()
    gnb = GaussianNB()

    logit.fit(train_X, train_y)
    gnb.fit(train_X, train_y)

    logit_test_scores.append(logit.score(test_X, test_y))
    gnb_test_scores.append(gnb.score(test_X, test_y))

sb.boxplot(data=[logit_test_scores, gnb_test_scores], notch=True)
plt.xticks([0, 1], ['LogisticRegression', 'GaussianNB'])
plt.ylabel('Test Accuracy')
```

## Contributing
See our [Contributing Guide](https://epistasislab.github.io/pmlb/contributing.html).
We're looking for help with documentation, and also appreciate new dataset and functionality contributions.

## Citing PMLB
If you use PMLB in a scientific publication, please consider citing one of the following papers:
Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore.
[PMLB v1.0: an open source dataset collection for benchmarking machine learning methods](https://arxiv.org/abs/2012.00058).
_arXiv preprint arXiv:2012.00058_ (2020).

```bibtex
@article{romano2021pmlb,
title={PMLB v1.0: an open source dataset collection for benchmarking machine learning methods},
author={Romano, Joseph D and Le, Trang T and La Cava, William and Gregg, John T and Goldberg, Daniel J and Chakraborty, Praneel and Ray, Natasha L and Himmelstein, Daniel and Fu, Weixuan and Moore, Jason H},
journal={arXiv preprint arXiv:2012.00058v2},
year={2021}
}
```

Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). [PMLB: a large benchmark suite for machine learning evaluation and comparison](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4). *BioData Mining* **10**, page 36.
BibTeX entry:
```bibtex
@article{Olson2017PMLB,
author="Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. and Moore, Jason H.",
title="PMLB: a large benchmark suite for machine learning evaluation and comparison",
journal="BioData Mining",
year="2017",
month="Dec",
day="11",
volume="10",
number="1",
pages="36",
issn="1756-0381",
doi="10.1186/s13040-017-0154-4",
url="https://doi.org/10.1186/s13040-017-0154-4"
}
```

## Support for PMLB
PMLB was developed in the [Computational Genetics Lab](http://epistasis.org/) at the [University of Pennsylvania](https://www.upenn.edu/) with funding from the [NIH](http://www.nih.gov/) under grants AI117694, LM010098, and LM012601. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.