{"id":13737957,"url":"https://github.com/MadryLab/datamodels-data","last_synced_at":"2025-05-08T15:32:07.773Z","repository":{"id":97137803,"uuid":"453554775","full_name":"MadryLab/datamodels-data","owner":"MadryLab","description":"Data for \"Datamodels: Predicting Predictions with Training Data\"","archived":false,"fork":false,"pushed_at":"2023-05-25T18:17:09.000Z","size":23,"stargazers_count":90,"open_issues_count":0,"forks_count":3,"subscribers_count":7,"default_branch":"main","last_synced_at":"2024-11-15T06:32:56.046Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MadryLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-01-30T00:44:08.000Z","updated_at":"2024-11-08T13:10:18.000Z","dependencies_parsed_at":"2023-07-04T07:17:54.333Z","dependency_job_id":null,"html_url":"https://github.com/MadryLab/datamodels-data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdatamodels-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdatamodels-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdatamodels-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MadryLab%2Fdatamodels-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MadryLab","download_url":"https://codeload.github.com/MadryLab/datamodels-data/tar.gz/r
efs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253096290,"owners_count":21853571,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T03:02:07.114Z","updated_at":"2025-05-08T15:32:07.515Z","avatar_url":"https://github.com/MadryLab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Data from \"Datamodels: Predicting Predictions with Training Data\"\n\nHere we provide the data used in the paper \"Datamodels: Predicting Predictions with Training Data\" ([arXiv](https://arxiv.org/abs/2202.00622), [Blog](https://gradientscience.org/datamodels-1)).\n\nLooking for the code to make your own datamodels? 
It's now been released [here](https://github.com/MadryLab/datamodels)!\n\n*Note that all of the data below is stored on Amazon S3 using the “requester pays” option to avoid a blowup in our data transfer costs (estimated AWS costs are given below). If you are on a budget and do not mind waiting a bit longer, please contact us at datamodels@mit.edu and we can try to arrange a free (but slower) transfer.*\n\n## Citation\nTo cite this data, please use the following BibTeX entry:\n```\n@inproceedings{ilyas2022datamodels,\n  title = {Datamodels: Predicting Predictions from Training Data},\n  author = {Andrew Ilyas and Sung Min Park and Logan Engstrom and Guillaume Leclerc and Aleksander Madry},\n  booktitle = {ArXiv preprint arXiv:2202.00622},\n  year = {2022}\n}\n```\n\n## Overview\nWe provide the data used in our paper to analyze two image classification datasets: CIFAR-10 and (a modified version of) [FMoW](https://wilds.stanford.edu/datasets/#fmow).\n\nFor each dataset, the data consists of two parts:\n1. *Training data* for datamodeling, which consists of:\n     * Training subsets or \"training masks\", which are the independent variables of the regression tasks; and\n     * Model outputs (correct-class margins and logits), which are the dependent variables of the regression tasks.\n2. 
*Datamodels* estimated from this data using LASSO.\n\nFor each dataset, there are multiple versions of the data depending on the choice of the hyperparameter \u0026alpha;, the subsampling fraction (this is the random fraction of training examples on which each model is trained; see Section 2 of our paper for more information).\n\nThe following table shows the number of models we trained and used for estimating datamodels (see also Table 1 in the paper):\n| Subsampling \u0026alpha; (%) | CIFAR-10  | FMoW    |\n|-----------------------|-----------|---------|\n| 10                   | 1,500,000 | N/A     |\n| 20                   | 750,000   | 375,000 |\n| 50                   | 300,000   | 150,000 |\n| 75                  | 600,000   | 300,000 |\n\n\n### Training data\nFor each dataset and $\\alpha$, we provide the following data:\n\n```python\n# M is the number of models trained\n/{DATASET}/data/train_masks_{PCT}pct.npy  # [M x N_train] boolean\n/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_test] np.float16\n/{DATASET}/data/train_margins_{PCT}pct.npy # [M x N_train] np.float16\n```\n(The files live in the Amazon S3 bucket `madrylab-datamodels`; we provide instructions for access in the \u003ca href=\"#downloading\"\u003enext section\u003c/a\u003e.)\n\nEach row of the above matrices corresponds to one trained model; each column corresponds to a training or test example.\nCIFAR-10 examples are organized in the default order; for FMoW, see \u003ca href=\"#fmow-data\"\u003ehere\u003c/a\u003e.\nFor example, a train mask for CIFAR-10 has the shape [M x 50,000].\n\nFor CIFAR-10, we also provide the full logits for all ten classes:\n```python\n/cifar/data/train_logits_{PCT}pct.npy  # [M x N_train x 10] np.float16\n/cifar/data/test_logits_{PCT}pct.npy   # [M x N_test x 10] np.float16\n```\nNote that you can also compute the margins from these logits.\n\nWe include an additional 10,000 models for each setting that we used for evaluation; the total number of models in 
each matrix is `M` as indicated in the above table plus 10,000.\n\n### Datamodels\nAll estimated datamodels for each split (`train` or `test`) are provided as a dictionary in a `.pt` file (load with `torch.load`):\n```python\n/{DATASET}/datamodels/train_{PCT}pct.pt\n/{DATASET}/datamodels/test_{PCT}pct.pt\n```\n\nEach dictionary contains:\n* `weight`: matrix of shape `N_train x N`, where `N` is either `N_train` or `N_test` depending on the group of target examples\n* `bias`: vector of length `N`, corresponding to biases for each datamodel\n* `lam`: vector of length `N`, regularization \u0026lambda; chosen by CV for each datamodel\n\n## Downloading\nWe make all of our data available via Amazon S3.\nTotal sizes of the training data files are as follows:\n| Dataset, \t\u0026alpha; (%) | masks, margins (GB) |  logits (GB) |\n|-----------------------|-----------|---------|\n| CIFAR-10, 10           | 245 | 1688 |\n| CIFAR-10, 20           | 123 | 849 |\n| CIFAR-10, 50           | 49 | 346 |\n| CIFAR-10, 75           | 98 | 682 |\n| FMoW, 20           | 25.4 | -  |\n| FMoW, 50           | 10.6 | -  |\n| FMoW, 75           | 21.2 | -  |\n\nTotal sizes of datamodels data (the model weights) are 16.9 GB for CIFAR-10 and 0.75 GB for FMoW.\n\n### Setting up AWS\n1. Make an AWS account\n2. Download the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)\n3. 
Run `aws configure` and add your access key ID and secret access key (you can get these by clicking on your account in the top-right corner -\u003e Security credentials)\n\n\n### API\nYou can download the files using the AWS CLI with the requester pays option as follows (replacing the fields {...} as appropriate):\n```bash\naws s3api get-object --bucket madrylab-datamodels \\\n                     --key {DATASET}/data/{SPLIT}_{DATA_TYPE}_{PCT}pct.npy \\\n                     --request-payer requester \\\n                     [OUT_FILE]\n```\n\nFor example, to retrieve the test set margins for CIFAR-10 models trained on 50% subsets, use:\n```bash\naws s3api get-object --bucket madrylab-datamodels \\\n                     --key cifar/data/test_margins_50pct.npy \\\n                     --request-payer requester \\\n                     test_margins_50pct.npy\n```\n\n### Pricing\nThe total data transfer fee (from AWS to internet) for all of the data is around $374 (= 4155 GB x 0.09 USD per GB).\n\nIf you download everything except the logits (which is sufficient to reproduce all of our analysis), the fee is around $53.\n\n## Loading data\n\nThe data matrices are in `numpy` array format (`.npy`).\nAs some of these are quite large, you can read small segments without reading the entire file into memory\nby additionally specifying the `mmap_mode` argument in `np.load`:\n```python\nimport numpy as np\n\nX = np.load('train_masks_10pct.npy', mmap_mode='r')\nY = np.load('test_margins_10pct.npy', mmap_mode='r')\n...\n# Use segments, e.g., X[:100], as appropriate\n# Run regress(X, Y[:]) using choice of estimation algorithm.\n```\n\n## FMoW data\n\nWe use a customized version of the FMoW dataset from [WILDS](https://wilds.stanford.edu/datasets/#fmow) (derived from this [original dataset](https://arxiv.org/abs/1711.07846)) that restricts the year of the training set to 2012. 
Our code is adapted from [here](https://github.com/p-lambda/wilds/blob/main/wilds/datasets/fmow_dataset.py).\n\nTo use the dataset, first install WILDS using:\n```bash\npip install wilds\n```\n(see [here](https://github.com/p-lambda/wilds#installation) for more detailed instructions).\n\nIn our paper, we only use the in-distribution training and test splits in our analysis (the original version from WILDS also has out-of-distribution as well as validation splits).\nOur dataset splits can be constructed as follows and used like a PyTorch dataset:\n```python\nfrom torchvision import transforms\nfrom fmow import FMoWDataset\n\nds = FMoWDataset(root_dir='/mnt/nfs/datasets/wilds/',\n                     split_scheme='time_after_2016')\n\ntransform_steps = [\n    transforms.ToTensor(),\n    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])\n    ]\ntransform = transforms.Compose(transform_steps)\n\nds_train = ds.get_subset('train', transform=transform)\nds_test = ds.get_subset('id_test', transform=transform)\n```\n\nThe columns of the matrix data \u003ca href=\"#training-data\"\u003edescribed above\u003c/a\u003e are ordered according to the default ordering of examples given by the above constructors.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMadryLab%2Fdatamodels-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMadryLab%2Fdatamodels-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMadryLab%2Fdatamodels-data/lists"}