{"id":15906798,"url":"https://github.com/csinva/imodels-data","last_synced_at":"2025-08-11T22:35:21.988Z","repository":{"id":56748244,"uuid":"430196526","full_name":"csinva/imodels-data","owner":"csinva","description":"Preprocessed data for various popular tabular datasets to go along with imodels.","archived":false,"fork":false,"pushed_at":"2023-11-15T21:16:58.000Z","size":53123,"stargazers_count":4,"open_issues_count":1,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-03T23:41:29.685Z","etag":null,"topics":["ai","classification","data","data-science","dataset","explainability","imodels","interpretability","machine-learning","ml","rule-based","xai"],"latest_commit_sha":null,"homepage":"https://csinva.io/imodels/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/csinva.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-20T19:53:53.000Z","updated_at":"2024-10-05T06:26:32.000Z","dependencies_parsed_at":"2023-11-15T22:41:33.566Z","dependency_job_id":null,"html_url":"https://github.com/csinva/imodels-data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/csinva/imodels-data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Fimodels-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Fimodels-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Fimodels-data/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Fimodels-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/csinva","download_url":"https://codeload.github.com/csinva/imodels-data/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Fimodels-data/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269969197,"owners_count":24505424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-11T02:00:10.019Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","classification","data","data-science","dataset","explainability","imodels","interpretability","machine-learning","ml","rule-based","xai"],"created_at":"2024-10-06T13:41:47.964Z","updated_at":"2025-08-11T22:35:21.962Z","avatar_url":"https://github.com/csinva.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e imodels🔍 data\u003c/h1\u003e\n\u003cp align=\"center\"\u003e Tabular data for various problems, especially for high-stakes rule-based modeling with the \u003ca href=\"https://github.com/csinva/imodels\"\u003eimodels package.\u003c/a\u003e\n\u003cp align=\"center\"\u003e See also https://huggingface.co/imodels \u003c/p\u003e\n\u003c/p\u003e\n\n\nIncludes the following datasets and more (see notebooks for more details 
on the datasets).\n\nTo download, use the \"Name\" field as the key: e.g. `imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')`.\n\n\n| Name                  |   Samples |   Features |   Class 0 |   Class 1 |   Majority class % |\n|:----------------------|----------:|-----------:|----------:|----------:|-------------------:|\n| heart                 |       270 |         15 |       150 |       120 |               55.6 |\n| breast_cancer         |       277 |         17 |       196 |        81 |               70.8 |\n| haberman              |       306 |          3 |        81 |       225 |               73.5 |\n| credit_g              |      1000 |         60 |       300 |       700 |               70   |\n| csi_pecarn_prop       |      3313 |         97 |      2773 |       540 |               83.7 |\n| csi_pecarn_pred       |      3313 |         39 |      2773 |       540 |               83.7 |\n| juvenile_clean        |      3640 |        286 |      3153 |       487 |               86.6 |\n| compas_two_year_clean |      6172 |         20 |      3182 |      2990 |               51.6 |\n| enhancer              |      7809 |         80 |      7115 |       694 |               91.1 |\n| fico                  |     10459 |         23 |      5000 |      5459 |               52.2 |\n| iai_pecarn_prop       |     12044 |         73 |     11841 |       203 |               98.3 |\n| iai_pecarn_pred       |     12044 |         58 |     11841 |       203 |               98.3 |\n| credit_card_clean     |     30000 |         33 |     23364 |      6636 |               77.9 |\n| tbi_pecarn_prop       |     42428 |        223 |     42052 |       376 |               99.1 |\n| tbi_pecarn_pred       |     42428 |        121 |     42052 |       376 |               99.1 |\n| readmission_clean     |    101763 |        150 |     54861 |     46902 |               53.9 |\n\n# Data usage\n\nFirst, install the `imodels` package: `pip install imodels`. 
Then, use the `imodels.get_clean_dataset` function.\n\n```python\nimodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -\u003e Tuple[numpy.ndarray, numpy.ndarray, list]\n\"\"\"\nFetch clean data (as numpy arrays) from various sources including imodels, pmlb, openml, and sklearn. If the data has not been downloaded yet, it is downloaded and cached; otherwise it is loaded locally.\n\nParameters\n----------\ndataset_name: str\n    dataset_name - unique dataset identifier\ndata_source: str\n    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'\ndata_path: str\n    path to load/save data (default: 'data')\n\nReturns\n-------\nX: np.ndarray\n    features\ny: np.ndarray\n    outcome\nfeature_names: list\n\"\"\"\n\n```\n\n## Example\n\n```python\nimport imodels\n\n# download compas dataset from imodels\nX, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')\n# download ionosphere dataset from pmlb\nX, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')\n# download liver dataset from openml\nX, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')\n# download ca housing from sklearn\nX, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')\n```\n\n# Data info\n\nData comes from various sources - please cite those sources appropriately.\n\n\u003e [notebooks_fetch_data](notebooks_fetch_data) contains notebooks that download and preprocess the data\n\u003e \n\u003e [data_cleaned](data_cleaned) contains the cleaned CSV file for each dataset\n\n\n## Clinical decision-rule (PECARN) datasets\nTo use any of the clinical decision-rule datasets, you must first accept the research data use agreement [here](https://pecarn.org/datasets/).\n\nThere are two versions of each PECARN (TBI, IAI, and CSI) dataset.\n- `prop`: missing values have not been imputed\n- `pred`: missing values have been imputed\n\n`csi_pecarn_pred.csv` note: unlike the rest of the 
datasets in this repo, which are fully cleaned, `csi_pecarn_pred.csv` contains a variable (\"SITE\") that should be removed before fitting models.\n\n\n| Dataset | Task | Size | References |\n| ---------- | ---------- | ---------- | :----------: |\n| iai_pecarn | Predict intra-abdominal injury requiring acute intervention before CT | 12,044 patients, 203 with IAI-I | [📄](https://pubmed.ncbi.nlm.nih.gov/23375510/), [🔗](https://pecarn.org/datasets/) |\n| tbi_pecarn | Predict traumatic brain injuries before CT | 42,412 patients, 376 with ciTBI | [📄](https://pecarn.org/studyDatasets/documents/Kuppermann_2009_The-Lancet_000.pdf), [🔗](https://pecarn.org/datasets/) |\n| csi_pecarn | Predict cervical spine injury in children | 3,314 patients, 540 with CSI | [📄](https://pecarn.org/studyDatasets/documents/Kuppermann_2009_The-Lancet_000.pdf), [🔗](https://pecarn.org/datasets/) |\n\n## Miscellaneous notes\nThe `breast_cancer` dataset here is not the extremely common Wisconsin breast-cancer dataset but rather [this dataset](https://www.openml.org/search?type=data\u0026sort=runs\u0026id=13\u0026status=active) from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.\n\nSome other cool datasets:\n\n- [moleculenet](https://moleculenet.org/datasets-1) - benchmarks for molecular datasets\n- [srbench](https://github.com/cavalab/srbench) - benchmarking for symbolic regression\n- [big-bench](https://github.com/google/BIG-bench) - language modeling benchmarks\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcsinva%2Fimodels-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcsinva%2Fimodels-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcsinva%2Fimodels-data/lists"}