{"id":23450766,"url":"https://github.com/basedrhys/ood-generalization","last_synced_at":"2025-10-30T20:30:22.744Z","repository":{"id":62908762,"uuid":"468009402","full_name":"basedrhys/ood-generalization","owner":"basedrhys","description":"Code for the paper \"When More is Less: Incorporating Additional Datasets Can Hurt Performance By Introducing Spurious Correlations\"","archived":false,"fork":false,"pushed_at":"2023-08-10T20:42:46.000Z","size":3922,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-24T00:16:15.823Z","etag":null,"topics":["chest-xray-images","chest-xrays","computer-vision","healthcare","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/basedrhys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-09T16:36:02.000Z","updated_at":"2024-03-14T11:46:57.000Z","dependencies_parsed_at":"2024-12-24T00:15:08.705Z","dependency_job_id":"014657e9-3e2a-4af5-9216-78d7ad7b9e5c","html_url":"https://github.com/basedrhys/ood-generalization","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Food-generalization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Food-generalization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Food-generalization/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/basedrhys%2Food-generalization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/basedrhys","download_url":"https://codeload.github.com/basedrhys/ood-generalization/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239048217,"owners_count":19573186,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chest-xray-images","chest-xrays","computer-vision","healthcare","machine-learning"],"created_at":"2024-12-24T00:15:02.695Z","updated_at":"2025-10-30T20:30:22.076Z","avatar_url":"https://github.com/basedrhys.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# When More is Less: Incorporating Additional Datasets Can Hurt Performance By Introducing Spurious Correlations\n\nCode for the paper _\"When More is Less: Incorporating Additional Datasets Can Hurt Performance By Introducing Spurious Correlations\"_ by [Rhys Compton](https://www.rhyscompton.co.nz), [Lily Zhang](https://lhz1029.github.io), [Aahlad Puli](https://aahladmanas.github.io), and [Rajesh Ranganath](https://cims.nyu.edu/~rajeshr/)\n\n[ArXiv Preprint](https://arxiv.org/abs/2308.04431)\n\n![High Level Overview](img/high_level_overview.png)\n\n## Acknowledgements\n\nThe training harness is heavily based on the excellent [ClinicalDG](https://github.com/MLforHealth/ClinicalDG) repo which is in turn a modified version of [DomainBed](https://github.com/facebookresearch/DomainBed).\n\n## To replicate the 
experiments in the paper:\n\n### Step 0: Environment and Prerequisites\nRun the following commands to clone this repo and create the Conda environment:\n\n\n### Step 1: Obtaining the Data\nSee [DataSources.md](DataSources.md) for detailed instructions.\n\n### Step 2: Running Experiments\n\nExperiments can be run using the same procedure as for the [DomainBed framework](https://github.com/facebookresearch/DomainBed), with a few additional adjustable data hyperparameters, which should be passed in as a JSON-formatted dictionary.\n\nFor example, to train a single model:\n```\npython -m clinicaldg.scripts.train\\\n\t--algorithm ERM\\\n\t--dataset eICUSubsampleUnobs\\\n\t--es_method val\\\n\t--hparams  '{\"eicu_architecture\": \"GRU\", \"eicu_subsample_g1_mean\": 0.5, \"eicu_subsample_g2_mean\": 0.05}'\\\n\t--output_dir /path/to/output\n```\n\nA detailed list of `hparams` available for each dataset can be found [here](hparams.md).\n\nWe provide the bash scripts used for our main experiments in the `bash_scripts` directory. 
You will likely need to customize them, along with the launcher, to your compute environment.\n\n## W+B Sweeps\n\nThis codebase relies heavily on [W+B](https://wandb.ai/site), both for tracking and recording results and for running experiments via the [Sweeps](https://docs.wandb.ai/guides/sweeps) feature (alongside Slurm arrays).\n\nThe process for this is as follows:\n\n* Define your sweep hyperparameters in a `.yaml` file (e.g., [this YAML file for image size experimentation](./sweeps/4_image_model_size.yaml))\n* Start the sweep: `wandb sweep \u003cyaml filename\u003e`\n* Create your Slurm array script (e.g., [sweep.sbatch](./sweeps/sweep.sbatch)) -- the key parameter is the `--array=` option, which should be set to `0-\u003cnum parameter configs\u003e`\n* Start the Slurm array job: `sbatch sweeps/sweep.sbatch`\n* Sit back and watch GPUs go brrrr\n\n## Loading the Datasets\n\nThe following steps walk through the process for loading datasets and applying hospital-label balancing:\n\n* Example wrapper script: `sweeps/sweep.sbatch` (this contains paths to the dataset files / loading them via `singularity`)\n* Main entrypoint: `clinicaldg/scripts/train.py`\n* Corresponding W+B `.yaml` file for hospital-label balancing: `sweeps/2d_nurd_fix.yaml`\n  * The key parameter is `--balance_resample \"label_notest,under\"`, which performs label balancing with no target test hospital (i.e., it balances the labels of the two hospitals against each other) via undersampling. In hindsight, `\"hospital_label,under\"` would be a more appropriate name.\n  * The `--test_env` argument is not used here; it is provided only as a placeholder\n  * This file gives a range of example invocations of the script, e.g. 
the following will train a model for **Pneumonia** prediction on **MIMIC** and **CXP**, balancing them such that `P(Y = 1 | Hospital = MIMIC) == P(Y = 1 | Hospital = CXP)`:\n\n    ```bash\n    python ./clinicaldg/scripts/train.py --max_steps 20001 --train_envs MIMIC,CXP --test_env MIMIC --balance_resample \"label_notest,under\" --binary_label Pneumonia\n    ```\n  \n* The codebase is very general (it supports both the eICU task and CXR classification), so there is a lot of code to sift through. The path to dataset loading (and ultimately balancing) starts in `train.py`, then goes through:\n  * `dataset = ds_class(hparams, args)`, which invokes...\n  * `CXRBase.__init__()` (in `clinicaldg/datasets.py`)\n  * Lines `411` to `447` in `clinicaldg/datasets.py`, which do the actual hospital-label balancing\n\nTo use this data within another codebase, you can save the train/val/test DataFrames to CSV files after processing is finished, i.e., at line `496`; these contain the file paths / labels and can be easily loaded into another Python project.\n\n### Dataloader\n\nThe PyTorch dataset can be found in `clinicaldg/cxr/data.py`, along with the `get_dataset()` function.\n\nThe other key class is the `InfiniteDataLoader`, used at line `303` of `train.py`. One of these is created for each hospital we're training on, and equal-sized batches are sampled from each at every iteration.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasedrhys%2Food-generalization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbasedrhys%2Food-generalization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbasedrhys%2Food-generalization/lists"}