{"id":19259028,"url":"https://github.com/mlfoundations/tableshift","last_synced_at":"2025-04-21T16:30:38.151Z","repository":{"id":173462632,"uuid":"650335981","full_name":"mlfoundations/tableshift","owner":"mlfoundations","description":"A benchmark for distribution shift in tabular data","archived":false,"fork":false,"pushed_at":"2024-06-06T17:55:39.000Z","size":741,"stargazers_count":50,"open_issues_count":11,"forks_count":12,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-01T14:21:04.950Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://tableshift.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlfoundations.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-06T21:18:08.000Z","updated_at":"2025-02-10T00:01:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"671ad9ba-1236-46b8-8c68-12a561fa5b12","html_url":"https://github.com/mlfoundations/tableshift","commit_stats":null,"previous_names":["mlfoundations/tableshift"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Ftableshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Ftableshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Ftableshift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Ftableshift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/mlfoundations","download_url":"https://codeload.github.com/mlfoundations/tableshift/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250090636,"owners_count":21373221,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T19:15:06.941Z","updated_at":"2025-04-21T16:30:37.594Z","avatar_url":"https://github.com/mlfoundations.png","language":"Python","funding_links":[],"categories":["Benchmarks \u0026 Comparisons"],"sub_categories":["Benchmark Repositories"],"readme":"![status](https://github.com/mlfoundations/tableshift/actions/workflows/python-package-conda.yml/badge.svg)\n![status](https://github.com/mlfoundations/tableshift/actions/workflows/run-example-script.yml/badge.svg)\n![status](https://github.com/mlfoundations/tableshift/actions/workflows/docker.yml/badge.svg)\n\n![tableshift logo](img/tableshift.png)\n\n# TableShift\n\nTableShift is a benchmarking library for machine learning with tabular data under distribution shift.\n\nYou can read more about TableShift at [tableshift.org](https://tableshift.org/index.html) or read the full paper (published in NeurIPS 2023 Datasets \u0026 Benchmarks Track) on [arxiv](https://arxiv.org/abs/2312.07577). 
If you use the benchmark in your research, please cite the paper:\n\n```\n@article{gardner2023tableshift,\n  title={Benchmarking Distribution Shift in Tabular Data with TableShift},\n  author={Gardner, Josh and Popovic, Zoran and Schmidt, Ludwig},\n  journal={Advances in Neural Information Processing Systems},\n  year={2023}\n}\n```\n\nIf you find an issue, please file a GitHub [issue](https://github.com/mlfoundations/tableshift/issues/new/choose).\n\n# Quickstart\n\n**Environment setup:** We recommend using Docker with TableShift. Our dataset construction and model pipelines have a diverse set of dependencies, including non-Python files required to make some libraries work. As a result, we recommend you use the provided Docker image to run the benchmark, and suggest forking this Docker image for your own development.\n\n```bash\n# fetch the docker image\ndocker pull ghcr.io/jpgard/tableshift:latest\n\n# run it to test your setup; this automatically launches examples/run_expt.py\ndocker run ghcr.io/jpgard/tableshift:latest --model xgb\n\n# optionally, use the container interactively\ndocker run -it --entrypoint=/bin/bash ghcr.io/jpgard/tableshift:latest\n\n```\n\n**Conda:** We recommend using Docker with TableShift when running training or using any of the pretrained modeling code, as the libraries used for training have a complex and subtle set of dependencies that can be difficult to configure outside Docker. However, Conda might provide a more lightweight environment for basic development and exploration with TableShift, so we describe how to set up Conda here. 
\n\nTo create a conda environment, simply clone this repo, enter the root directory, and run the following commands to create and test a local execution environment:\n\n```bash\n# set up the environment\nconda env create -f environment.yml\nconda activate tableshift\n# test the install by running the training script\npython examples/run_expt.py\n```\n\nThe final line above will print some detailed logging output as the script executes. When you see `training completed! test accuracy: 0.6221` your environment is ready to go! (Accuracy may vary slightly due to randomness.)\n\n**Accessing datasets:** If you simply want to load and use a standard version of\none of the public TableShift datasets, it's as simple as:\n\n```python\nfrom tableshift import get_dataset\n\ndataset_name = \"diabetes_readmission\"\ndset = get_dataset(dataset_name)\n```\n\nThe full list of identifiers for all available datasets is below; simply swap any of these for `dataset_name` to access the relevant data.\n\nIf you would like to use a dataset *without* a domain split, replace `get_dataset()` with `get_iid_dataset()`.\n\nThe call to `get_dataset()` returns a `TabularDataset` that you can use to\neasily load tabular data in several formats, including Pandas DataFrame and\nPyTorch DataLoaders:\n\n```python\n# Fetch a pandas DataFrame of the training set\nX_tr, y_tr, _, _ = dset.get_pandas(\"train\")\n\n# Fetch and use a pytorch DataLoader\ntrain_loader = dset.get_dataloader(\"train\", batch_size=1024)\n\nfor X, y, _, _ in train_loader:\n    ...\n```\n\nFor all TableShift datasets, the following splits are\navailable: `train`, `validation`, `id_test`, `ood_validation`, `ood_test`.\n\nFor IID datasets (those without a domain split) these splits are available: `train`, `validation`, `test`.\n\nThere is a complete example of a training script in `examples/run_expt.py`.\n\n# Benchmark Dataset Availability\n\n*tl;dr: if you want to get started exploring ASAP, use datasets marked as \"\npublic\" 
below.*\n\nAll of the datasets used in the TableShift benchmark are either publicly available or provide open credentialized\naccess.\nThe datasets with open credentialized access require signing a data use agreement; as a result,\nsome datasets must be manually fetched and stored locally. TableShift makes this process as simple as possible.\n\nA list of datasets, their names in TableShift, and the corresponding access\nlevels are below. The string identifier is the value that should be passed as the `experiment` parameter\nto `get_dataset()` or the `--experiment` flag of `run_expt.py` and other training scripts.\n\n| Dataset                 | String Identifier         | Availability                                                                                                 | Source                                                                                                                 |\n|-------------------------|---------------------------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|\n| Voting                  | `anes`                    | Public Credentialized Access ([source](https://electionstudies.org))                                         | [American National Election Studies (ANES)](https://electionstudies.org)                                               |\n| ASSISTments             | `assistments`             | Public                                                                                                       | [Kaggle](https://www.kaggle.com/datasets/nicolaswattiez/skillbuilder-data-2009-2010)                                   |\n| Childhood Lead          | `nhanes_lead`             | Public                                                                                                       | [National Health and Nutrition Examination Survey 
(NHANES)](https://www.cdc.gov/nchs/nhanes/index.htm)                 |\n| College Scorecard       | `college_scorecard`       | Public                                                                                                       | [College Scorecard](http://collegescorecard.ed.gov)                                                                    |\n| Diabetes                | `brfss_diabetes`          | Public                                                                                                       | [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/index.html)                             |\n| Food Stamps             | `acsfoodstamps`           | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |\n| HELOC                   | `heloc`                   | Public Credentialized Access ([source](https://community.fico.com/s/explainable-machine-learning-challenge)) | [FICO](https://community.fico.com/s/explainable-machine-learning-challenge)                                            |\n| Hospital Readmission    | `diabetes_readmission`    | Public                                                                                                       | [UCI](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)                           |\n| Hypertension            | `brfss_blood_pressure`    | Public                                                                                                       | [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss/index.html)                             |\n| ICU Length of Stay      | `mimic_extract_los_3`     | Public Credentialized Access ([source](https://mimic.mit.edu/docs/gettingstarted/))                          | 
[MIMIC-iii](https://physionet.org/content/mimiciii/) via [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract) |\n| ICU Mortality           | `mimic_extract_mort_hosp` | Public Credentialized Access ([source](https://mimic.mit.edu/docs/gettingstarted/))                          | [MIMIC-iii](https://physionet.org/content/mimiciii/) via [MIMIC-Extract](https://github.com/MLforHealth/MIMIC_Extract) |\n| Income                  | `acsincome`               | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |\n| Public Health Insurance | `acspubcov`               | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |\n| Sepsis                  | `physionet`               | Public                                                                                                       | [Physionet](https://physionet.org/content/challenge-2019/)                                                             |\n| Unemployment            | `acsunemployment`         | Public                                                                                                       | [American Community Survey](https://www.census.gov/programs-surveys/acs) (via [folktables](http://folktables.org))     |\n\nNote that details on the data source, which files to load, and the feature\ncodings are provided in the TableShift source code for each dataset and data\nsource (see `data_sources.py` and the `tableshift.datasets` module).\n\nFor additional, non-benchmark datasets (possibly with only IID splits, not a distribution shift),\nsee `tableshift.configs.non_benchmark_configs.py`.\n\n# Dataset Details\n\nMore information about the 
tasks, datasets, splitting variables, data sources, and motivation is available in the\nTableShift paper; we provide a summary below.\n\n| Task                    | Target                                                       | Shift                       | Domain   | Baseline | Total Observations |\n|-------------------------|--------------------------------------------------------------|-----------------------------|----------|----------|--------------------|\n| ASSISTments             | Next Answer Correct                                          | School                      | \u0026#10003; | -34.5%   | 2,667,776          |\n| College Scorecard       | Low Degree Completion Rate                                   | Institution Type            | \u0026#10003; | -11.2%   | 124,699            |\n| ICU Mortality           | ICU patient expires in hospital during current visit         | Insurance Type              | \u0026#10003; | -6.3%    | 23,944             |\n| Hospital Readmission    | 30-day readmission of diabetic hospital patients             | Admission source            | \u0026#10003; | -5.9%    | 99,493             |\n| Diabetes                | Diabetes diagnosis                                           | Race                        | \u0026#10003; | -4.5%    | 1,444,176          |\n| ICU Length of Stay      | Length of stay \u003e= 3 hrs in ICU                               | Insurance Type              | \u0026#10003; | -3.4%    | 23,944             |\n| Voting                  | Voted in U.S. 
presidential election                          | Geographic Region           | \u0026#10003; | -2.6%    | 8,280              |\n| Food Stamps             | Food stamp recipiency in past year for households with child | Geographic Region           | \u0026#10003; | -2.4%    | 840,582            |\n| Unemployment            | Unemployment for non-social security-eligible adults         | Education Level             | \u0026#10003; | -1.3%    | 1,795,434          |\n| Income                  | Income \u003e= 56k for employed adults                            | Geographic Region           | \u0026#10003; | -1.3%    | 1,664,500          |\n| HELOC                   | Repayment of Home Equity Line of Credit loan                 | Est. third-party risk level |          | -22.6%   | 10,459             |\n| Public Health Insurance | Coverage of non-Medicare eligible low-income individuals     | Disability Status           |          | -14.5%   | 5,916,565          |\n| Sepsis                  | Sepsis onset within next 6hrs for hospital patients          | Length of Stay              |          | -6.0%    | 1,552,210          |\n| Childhood Lead          | Blood lead levels above CDC Blood Level Reference Value      | Poverty level               |          | -5.1%    | 27,499             |\n| Hypertension            | Hypertension diagnosis for high-risk age (50+)               | BMI Category                |          | -4.4%    | 846,761            |\n\n# A Self-Contained Training Example\n\nA sample training script is located at `examples/run_expt.py`. 
However, training a scikit-learn model is as simple as:\n\n```python\nfrom tableshift import get_dataset\nfrom sklearn.ensemble import GradientBoostingClassifier\n\ndset = get_dataset(\"diabetes_readmission\")\nX_train, y_train, _, _ = dset.get_pandas(\"train\")\n\n# Train a gradient-boosted classifier on the training split\nestimator = GradientBoostingClassifier()\nestimator.fit(X_train, y_train)\n\n# Evaluate on the in-distribution and out-of-distribution test splits\nfor split in ('id_test', 'ood_test'):\n    X, y, _, _ = dset.get_pandas(split)\n    preds = estimator.predict(X)\n    acc = (preds == y).mean()\n    print(f'accuracy on split {split} is: {acc:.3f}')\n```\n\nThe code should output the following:\n\n```\naccuracy on split id_test is: 0.655\naccuracy on split ood_test is: 0.619\n```\n\nNow, please close that domain gap!\n\n# Non-benchmark datasets\n\nTableShift also includes several tabular datasets that are not part of the official TableShift benchmark\nbut may still be useful for tabular data research. We are continuously adding datasets to the package. These\ndatasets support all of the same functionality provided for the TableShift benchmark datasets, but they are not\nan official part of the TableShift benchmark or package and are mostly intended for convenience and for our own\ninternal use.\n\nFor a list of the non-benchmark datasets, see the file `tableshift.configs.non_benchmark_configs.py`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlfoundations%2Ftableshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlfoundations%2Ftableshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlfoundations%2Ftableshift/lists"}