{"id":20644612,"url":"https://github.com/epistasislab/pmlb","last_synced_at":"2025-05-15T01:04:19.901Z","repository":{"id":42033965,"uuid":"73504562","full_name":"EpistasisLab/pmlb","owner":"EpistasisLab","description":"PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.","archived":false,"fork":false,"pushed_at":"2025-02-25T21:26:14.000Z","size":246784,"stargazers_count":820,"open_issues_count":17,"forks_count":139,"subscribers_count":29,"default_branch":"master","last_synced_at":"2025-04-06T18:12:35.924Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://epistasislab.github.io/pmlb/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EpistasisLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-11T19:16:44.000Z","updated_at":"2025-03-18T02:21:28.000Z","dependencies_parsed_at":"2025-03-09T13:37:18.357Z","dependency_job_id":"88ee3143-13e9-4d28-8fe7-944617d2ba7b","html_url":"https://github.com/EpistasisLab/pmlb","commit_stats":null,"previous_names":["epistasislab/penn-ml-benchmarks"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fpmlb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fpmlb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fpmlb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fpmlb/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EpistasisLab","download_url":"https://codeload.github.com/EpistasisLab/pmlb/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248785009,"owners_count":21161218,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T16:16:59.703Z","updated_at":"2025-04-13T21:29:12.853Z","avatar_url":"https://github.com/EpistasisLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Penn Machine Learning Benchmarks\n\nThis repository contains the code and data for a large, curated set of benchmark datasets for evaluating and comparing supervised machine learning algorithms.\nThese data sets cover a broad range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features.\n\nPlease go to our [home page](https://epistasislab.github.io/pmlb/) to interactively browse the datasets, vignette, and contribution guide!\n\n## Breaking changes in PMLB 1.0\n\n*This repository has been restructured, and several dataset names have been changed!*\n\nIf you have an older version of PMLB, we highly recommend you upgrade it to v1.0 for updated URLs and names of datasets:\n\n```\npip install pmlb --upgrade\n```\n\n## Datasets\n\nDatasets are tracked with Git Large File Storage (LFS).\nIf you would like to clone the entire repository, please [install and set up Git LFS](https://git-lfs.github.com/) for your 
user account. \nAlternatively, you can download the `.zip` file from GitHub.\n\nAll data sets are stored in a common format:\n\n* First row is the column names\n* Each following row corresponds to one row of the data\n* The target column is named `target`\n* All columns are tab (`\\t`) separated\n* All files are compressed with `gzip` to conserve space\n\n![Dataset_Sizes](datasets/dataset_sizes.svg)\n\nThe [complete table](pmlb/all_summary_stats.tsv) of dataset characteristics is also available for download.\nPlease note, in our documentation, a feature is considered:\n* \"binary\" if it is of type integer and has 2 unique values (equivalent to pandas profiling's \"boolean\")\n* \"categorical\" if it is of type integer and has *more than* 2 unique values (equivalent to pandas profiling's \"categorical\")\n* \"continuous\" if it is of type float (equivalent to pandas profiling's \"numeric\").\n\n## Python wrapper\n\nFor easy access to the benchmark data sets, we have provided a Python wrapper named `pmlb`. The wrapper can be installed on Python via `pip`:\n\n```\npip install pmlb\n```\n\nand used in Python scripts as follows:\n\n```python\nfrom pmlb import fetch_data\n\n# Returns a pandas DataFrame\nadult_data = fetch_data('adult')\nprint(adult_data.describe())\n```\n\nThe `fetch_data` function has two additional parameters:\n* `return_X_y` (True/False): Whether to return the data in scikit-learn format, with the features and labels stored in separate NumPy arrays.\n* `local_cache_dir` (string): The directory on your local machine to store the data files so you don't have to fetch them over the web again. 
By default, the wrapper does not use a local cache directory.\n\nFor example:\n\n```python\nfrom pmlb import fetch_data\n\n# Returns NumPy arrays\nadult_X, adult_y = fetch_data('adult', return_X_y=True, local_cache_dir='./')\nprint(adult_X)\nprint(adult_y)\n```\n\nYou can also list all of the available data sets as follows:\n\n```python\nfrom pmlb import dataset_names\n\nprint(dataset_names)\n```\n\nOr if you only want a list of available classification or regression datasets:\n\n```python\nfrom pmlb import classification_dataset_names, regression_dataset_names\n\nprint(classification_dataset_names)\nprint('')\nprint(regression_dataset_names)\n```\n\n## Example usage: Compare two classification algorithms with PMLB\n\nPMLB is designed to make it easy to benchmark machine learning algorithms against each other. Below is a Python code snippet showing the most basic way to use PMLB to compare two algorithms.\n\n```python\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.model_selection import train_test_split\n\nimport matplotlib.pyplot as plt\nimport seaborn as sb\n\nfrom pmlb import fetch_data, classification_dataset_names\n\nlogit_test_scores = []\ngnb_test_scores = []\n\nfor classification_dataset in classification_dataset_names:\n    X, y = fetch_data(classification_dataset, return_X_y=True)\n    train_X, test_X, train_y, test_y = train_test_split(X, y)\n\n    logit = LogisticRegression()\n    gnb = GaussianNB()\n\n    logit.fit(train_X, train_y)\n    gnb.fit(train_X, train_y)\n\n    logit_test_scores.append(logit.score(test_X, test_y))\n    gnb_test_scores.append(gnb.score(test_X, test_y))\n\nsb.boxplot(data=[logit_test_scores, gnb_test_scores], notch=True)\nplt.xticks([0, 1], ['LogisticRegression', 'GaussianNB'])\nplt.ylabel('Test Accuracy')\n```\n\n## Contributing\n\nSee our [Contributing Guide](https://epistasislab.github.io/pmlb/contributing.html). 
\nWe're looking for help with documentation, and also appreciate new dataset and functionality contributions.\n\n## Citing PMLB\n\nIf you use PMLB in a scientific publication, please consider citing one of the following papers:\n\nJoseph D. Romano, Le, Trang T., William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore.\n[PMLB v1.0: an open source dataset collection for benchmarking machine learning methods](https://arxiv.org/abs/2012.00058).\n_arXiv preprint arXiv:2012.00058_ (2020).\n\n```bibtex\n@article{romano2021pmlb,\n  title={PMLB v1.0: an open source dataset collection for benchmarking machine learning methods},\n  author={Romano, Joseph D and Le, Trang T and La Cava, William and Gregg, John T and Goldberg, Daniel J and Chakraborty, Praneel and Ray, Natasha L and Himmelstein, Daniel and Fu, Weixuan and Moore, Jason H},\n  journal={arXiv preprint arXiv:2012.00058v2},\n  year={2021}\n}\n```\n\nRandal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). [PMLB: a large benchmark suite for machine learning evaluation and comparison](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4). *BioData Mining* **10**, page 36.\n\nBibTeX entry:\n\n```bibtex\n@article{Olson2017PMLB,\n    author=\"Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. 
and Moore, Jason H.\",\n    title=\"PMLB: a large benchmark suite for machine learning evaluation and comparison\",\n    journal=\"BioData Mining\",\n    year=\"2017\",\n    month=\"Dec\",\n    day=\"11\",\n    volume=\"10\",\n    number=\"1\",\n    pages=\"36\",\n    issn=\"1756-0381\",\n    doi=\"10.1186/s13040-017-0154-4\",\n    url=\"https://doi.org/10.1186/s13040-017-0154-4\"\n}\n```\n\n## Support for PMLB\n\nPMLB was developed in the [Computational Genetics Lab](http://epistasis.org/) at the [University of Pennsylvania](https://www.upenn.edu/) with funding from the [NIH](http://www.nih.gov/) under grants AI117694, LM010098, and LM012601. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepistasislab%2Fpmlb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepistasislab%2Fpmlb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepistasislab%2Fpmlb/lists"}