{"id":13577667,"url":"https://github.com/MadryLab/data-transfer","last_synced_at":"2025-04-05T12:31:02.099Z","repository":{"id":45764182,"uuid":"513207479","full_name":"MadryLab/data-transfer","owner":"MadryLab","description":null,"archived":false,"fork":false,"pushed_at":"2022-09-23T16:47:17.000Z","size":72,"stargazers_count":34,"open_issues_count":3,"forks_count":1,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-05T14:46:52.414Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MadryLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-12T16:03:27.000Z","updated_at":"2024-04-22T03:04:14.000Z","dependencies_parsed_at":"2023-01-19T00:46:05.044Z","dependency_job_id":null,"html_url":"https://github.com/MadryLab/data-transfer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdata-transfer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdata-transfer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdata-transfer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdata-transfer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MadryLab","download_url":"https://codeload.github.com/MadryLab/data-transfer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247338632,"owners_coun
t":20922989,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T15:01:23.396Z","updated_at":"2025-04-05T12:30:57.078Z","avatar_url":"https://github.com/MadryLab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# A Data-Based Perspective on Transfer Learning.\n\nThis repository contains the code for our paper:\n\n**A Data-Based Perspective on Transfer Learning** \u003cbr\u003e\n*Saachi Jain\\*, Hadi Salman\\*, Alaa Khaddaj\\*, Eric Wong, Sung Min Park, Aleksander Madry*  \u003cbr\u003e\n[Paper](https://arxiv.org/abs/2207.05739) - [Blog post](http://gradientscience.org/data-transfer/)\n\n\n```bibtex\n@article{jain2022data,\n  title={A Data-Based Perspective on Transfer Learning},\n  author={Jain, Saachi and Salman, Hadi and Khaddaj, Alaa and Wong, Eric and Park, Sung Min and Madry, Aleksander},\n  journal={arXiv preprint arXiv:2207.05739},\n  year={2022}\n}\n```\n\nThe main contents of our repo are:\n\n* [src/](src): Contains all our code for running the full transfer pipeline.\n* [configs/](configs): Contains the config files that the training code expects. These config files contain the hyperparameters for each transfer task.\n* [analysis/](analysis): Contains code for all the analysis we do in our paper.\n\n## Getting started\n*Our code relies on the [FFCV Library](https://ffcv.io/). 
To install this library along with other dependencies including PyTorch, follow the instructions below.*\n\n```\nconda create -n ffcv python=3.9 cupy pkg-config compilers libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge \nconda activate ffcv\npip install ffcv\n```\n\n## Full pipeline: Train source model and transfer to various downstream tasks\n\nTo train an ImageNet model and transfer it to all the datasets we consider in the paper, simply run:\n\n```\npython src/train_imagenet_class_subset.py \\\n                        --config-file configs/base_config.yaml \\\n                        --training.data_root $PATH_TO_DATASETS \\\n                        --out.output_pkl_dir $OUTDIR\n\n```\nwhere `$OUTDIR` is the output directory of your choice, and `$PATH_TO_DATASETS` is the path where the datasets exist (see below).\n\nThe config file `configs/base_config.yaml` contains all the hyperparameters needed for this experiment. For example, you can specify which downstream tasks you want to transfer to, or how many ImageNet classes to train the source model on.\n\n## Calculating influences\nUse `analysis/data_compressors/2_20_compressor.py` to compress model results into a summary file. Then use `analysis/compute_influences.py` to compute the influences. In a notebook, simply run the following code:\n\n```python\nsf = \u003cSUMMARY FILE FOLDER\u003e\nds = compute_influences.SummaryFileDataSet(sf, dataset, INFLUENCE_KEY, keyword)\ndl = torch.utils.data.DataLoader(ds, batch_size=1024, shuffle=False, drop_last=False)\ninfl = compute_influences.batch_calculate_influence(dl, len(val_labels), 1000, div=True)\n```\n\n## Running counterfactual experiments\nOnce influences have been computed, we can run counterfactual experiments by removing the most or least influential classes from the source dataset (ImageNet), and then applying transfer learning again. 
This can be done by running:\n```\npython src/counterfactuals_main.py\\\n            --config-file configs/base_config.yaml\\\n            --training.transfer_task ${TASK}\\\n            --out.output_pkl_dir ${OUT_DIR}\\\n            --counterfactual.cf_target_dataset ${DATASET}\\\n            --counterfactual.cf_infl_order_file ${INFL_ORDER_FILE} \\\n            --data.num_classes -1 \\\n            --counterfactual.cf_order TOP \\\n            --counterfactual.cf_num_classes_min ${MIN_STEPS} \\\n            --counterfactual.cf_num_classes_max ${MAX_STEPS} \\\n            --counterfactual.cf_num_classes_step ${STEP_SIZE} \\\n            --counterfactual.cf_type CLASS\n```\n\n## Datasets that we use (see our paper for citations) \n* aircraft ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/fgvc-aircraft-2013b.tar.gz?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* birds ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/birdsnap.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* caltech101 ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/caltech101.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* caltech256 ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/caltech256.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* 
cifar10 **(Automatically downloaded when you run the code)**\n* cifar100 **(Automatically downloaded when you run the code)**\n\n* flowers ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/flowers.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* food ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/food.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* pets ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/pets.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* stanford_cars ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/stanford_cars.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n* SUN397 ([Download]( https://robustnessws4285631339.blob.core.windows.net/public-datasets/SUN397.tar?sv=2020-08-04\u0026ss=bfqt\u0026srt=sco\u0026sp=rwdlacupitfx\u0026se=2051-10-06T07:09:59Z\u0026st=2021-10-05T23:09:59Z\u0026spr=https,http\u0026sig=U69sEOSMlliobiw8OgiZpLTaYyOA5yt5pHHH5%2FKUYgI%3D\n))\n\nWe have created an [FFCV](https://ffcv.io/) version of each of these datasets to enable super fast training. 
We will make these datasets available soon!\n\n## Download our data\nComing soon!\n\n## Download our pretrained models\nComing soon!\n\n## A detailed demo\nComing soon!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMadryLab%2Fdata-transfer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMadryLab%2Fdata-transfer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMadryLab%2Fdata-transfer/lists"}