{"id":17596021,"url":"https://github.com/nimarb/pytorch_influence_functions","last_synced_at":"2025-04-04T18:09:10.130Z","repository":{"id":36460468,"uuid":"225033158","full_name":"nimarb/pytorch_influence_functions","owner":"nimarb","description":"This is a PyTorch reimplementation of Influence Functions from the ICML2017 best paper: Understanding Black-box Predictions via Influence Functions by Pang Wei Koh and Percy Liang.","archived":false,"fork":false,"pushed_at":"2023-10-29T07:10:32.000Z","size":459,"stargazers_count":332,"open_issues_count":27,"forks_count":71,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-28T17:09:19.588Z","etag":null,"topics":["deep-learning","influence-functions","pytorch","pytorch-implementation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nimarb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-30T15:38:20.000Z","updated_at":"2025-03-12T21:52:33.000Z","dependencies_parsed_at":"2024-10-30T06:34:57.778Z","dependency_job_id":null,"html_url":"https://github.com/nimarb/pytorch_influence_functions","commit_stats":{"total_commits":13,"total_committers":4,"mean_commits":3.25,"dds":0.3076923076923077,"last_synced_commit":"7b7c7b6589e949c412e882883a569ee85766630a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nimarb%2Fpytorch_influence_functions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nimarb%2Fpytorch_influence_functions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nimarb%2Fpytorch_influence_functions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nimarb%2Fpytorch_influence_functions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nimarb","download_url":"https://codeload.github.com/nimarb/pytorch_influence_functions/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247226215,"owners_count":20904465,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","influence-functions","pytorch","pytorch-implementation"],"created_at":"2024-10-22T08:07:11.888Z","updated_at":"2025-04-04T18:09:10.109Z","avatar_url":"https://github.com/nimarb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Influence Functions for PyTorch\n\nThis is a PyTorch reimplementation of Influence Functions from the ICML2017 best paper:\n[Understanding Black-box Predictions via Influence Functions](https://arxiv.org/abs/1703.04730) by Pang Wei Koh and Percy Liang.\nThe reference implementation can be found here: [link](https://github.com/kohpangwei/influence-release).\n\n- [Why Use Influence Functions?](#why-use-influence-functions)\n- [Requirements](#requirements)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Background and Documentation](#background-and-documentation)\n  - [config](#config)\n    - [Misc parameters](#misc-parameters)\n    - [Calculation parameters](#calculation-parameters)\n      - [s_test](#stest)\n  - [Modes of computation](#modes-of-computation)\n  - [Output variables](#output-variables)\n    - [Influences](#influences)\n    - [Harmful](#harmful)\n    - [Helpful](#helpful)\n- [Roadmap](#roadmap)\n  - [v0.2](#v02)\n  - [v0.3](#v03)\n  - [v0.4](#v04)\n\n## Why Use Influence Functions?\n\nInfluence functions help you to debug the results of your deep learning model\nin terms of the dataset. When testing for a single test image, you can then\ncalculate which training images had the largest result on the classification\noutcome. Thus, you can easily find mislabeled images in your dataset, or\ncompress your dataset slightly to the most influential images important for\nyour individual test dataset. That can increase prediction accuracy, reduce\ntraining time, and reduce memory requirements. For more details please see\nthe original paper linked [here](https://arxiv.org/abs/1703.04730).\n\nInfluence functions can of course also be used for data other than images,\nas long as you have a supervised learning problem.\n\n## Requirements\n\n* Python 3.6 or later\n* PyTorch 1.0 or later\n* NumPy 1.12 or later\n\nTo run the tests, further requirements are:\n\n* torchvision 0.3 or later\n* PIL\n\n## Installation\n\nYou can either install this package directly through pip:\n\n```bash\npip3 install --user pytorch-influence-functions\n```\n\nOr you can clone the repo and \n\n* import it as a package after it's in your `PATH`.\n* install it using `python setup.py install`\n* install it using `python setup.py develop` (if you want to edit the code)\n\n## Usage\n\nCalculating the influence of the individual samples of your training dataset\non the final predictions is straight forward.\n\nThe most barebones way of getting the code to run is like this:\n\n```python\nimport pytorch_influence_functions as ptif\n\n# Supplied by the user:\nmodel = get_my_model()\ntrainloader, testloader = get_my_dataloaders()\n\nptif.init_logging()\nconfig = ptif.get_default_config()\n\ninfluences, harmful, helpful = ptif.calc_img_wise(config, model, trainloader, testloader)\n\n# do someting with influences/harmful/helpful\n```\n\nHere, `config` contains default values for the influence function calculation\nwhich can of course be changed. For details and examples, look [here](#config).\n\n## Background and Documentation\n\nThe precision of the output can be adjusted by using more iterations and/or\nmore recursions when approximating the influence.\n\n### config\n\n`config` is a dict which contains the parameters used to calculate the\ninfluences. You can get the default `config` by calling `ptif.get_default_config()`.\n\nI recommend you to change the following parameters to your liking. The list\nbelow is divided into parameters affecting the calculation and parameters\naffecting everything else.\n\n#### Misc parameters\n\n* `save_pth`: Default `None`, folder where to save `s_test` and `grad_z` files\n  if saving is desired\n* `outdir`: folder name to which the result json files are written\n* `log_filename`: Default `None`, if set the output will be logged to this\n  file in addition to `stdout`.\n\n#### Calculation parameters\n\n* `seed`: Default = 42, random seed for numpy, random, pytorch\n* `gpu`: Default = -1, `-1` for calculation on the CPU otherwise GPU id\n* `calc_method`: Default = img_wise, choose between the two calculation methods\n  outlined [here](#modes-of-computation).\n* `DataLoader` object for the desired dataset\n  * `train_loader` and `test_loader`\n* `test_sample_start_per_class`: Default = False, per class index from where to\n  start to calculate the influence function. If `False`, it will start from `0`.\n  This is useful if you want to calculate the influence function of a whole\n  test dataset and manually split the calculation up over multiple threads/\n  machines/gpus. Then, you can start at various points in the dataset.\n* `test_sample_num`: Default = False, number of samples per class\n  starting from the `test_sample_start_per_class` to calculate the influence\n  function for. E.g. if your dataset has 10 classes and you set this value to\n  `1`, then the influence functions will be calculated for `10 * 1` test\n  samples, one per class. If `False`, calculates the influence for all images.\n\n##### s_test\n\n* `recursion_depth`: Default = 5000, recursion depth for the `s_test` calculation.\n  Greater recursion depth improves precision.\n* `r`: Default = 1, number of `s_test` calculations to take the average of.\n  Greater r averaging improves precision.\n* Combined, the original paper suggests that `recursion_depth * r` should equal\n  the training dataset size, thus the above values of `r = 10` and\n  `recursion_depth = 5000` are valid for CIFAR-10 with a training dataset size\n  of 50000 items.\n* `damp`: Default = 0.01, damping factor during `s_test` calculation.\n* `scale`: Default = 25, scaling factor during `s_test` calculation.\n\n### Modes of computation\n\nThis packages offers two modes of computation to calculate the influence\nfunctions. The first mode is called `calc_img_wise`, during which the two\nvalues `s_test` and `grad_z` for each training image are computed on the fly\nwhen calculating the influence of that single image. The algorithm moves then\non to the next image. The second mode is called `calc_all_grad_then_test` and\ncalculates the `grad_z` values for all images first and saves them to disk.\nThen, it'll calculate all `s_test` values and save those to disk. Subsequently,\nthe algorithm will then calculate the influence functions for all images by\nreading both values from disk and calculating the influence base on them. This\ncan take significant amounts of disk space (100s of GBs) but with a fast SSD\ncan speed up the calculation significantly as no duplicate calculations take\nplace. This is the case because `grad_z` has to be calculated twice, once for\nthe first approximation in `s_test` and once to combine with the `s_test`\nvector to calculate the influence. Most importantnly however, `s_test` is only\ndependent on the test sample(s). While one `grad_z` is used to estimate the\ninitial value of the Hessian during the `s_test` calculation, this is\ninsignificant. `grad_z` on the other hand is only dependent on the training\nsample. Thus, in the `calc_img_wise` mode, we throw away all `grad_z`\ncalculations even if we could reuse them for all subsequent `s_test`\ncalculations, which could potentially be 10s of thousands. However, as stated\nabove, keeping the `grad_z`s only makes sense if they can be loaded faster/\nkept in RAM than calculating them on-the-fly.\n\n**TL;DR**: The recommended way is using `calc_img_wise` unless you have a crazy\nfast SSD, lots of free storage space, and want to calculate the influences on\nthe prediction outcomes of an entire dataset or even \u003e1000 test samples.\n\n### Output variables\n\nVisualised, the output can look like this:\n\n![influences for ship on cifar10-resnet](figs/inf_resnet_basic_110_ship_1.png)\n\nThe test image on the top left is test image for which the influences were\ncalculated. To get the correct test outcome of _ship_, the Helpful images from\nthe training dataset were the most helpful, whereas the Harmful images were the\nmost harmful. Here, we used CIFAR-10 as dataset. The model was ResNet-110. The\nnumbers above the images show the actual influence value which was calculated.\n\nThe next figure shows the same but for a different model, DenseNet-100/12.\nThus, we can see that different models learn more from different images.\n\n![influences for ship on cifar10-densenet](figs/inf_densenet_BC_100_12_ship_1.png)\n\n#### Influences\n\nIs a dict/json containting the influences calculated of all training data\nsamples for each test data sample. The dict structure looks similiar to this:\n\n```python\n{\n    \"0\": {\n        \"label\": 3,\n        \"num_in_dataset\": 0,\n        \"time_calc_influence_s\": 129.6417362689972,\n        \"influence\": [\n            -0.00016939856868702918,\n            4.3426321099104825e-06,\n            -9.501376189291477e-05,\n            ...\n        ],\n        \"harmful\": [\n            31527,\n            5110,\n            47217,\n            ...\n        ],\n        \"helpful\": [\n            5287,\n            22736,\n            3598,\n            ...\n        ]\n    },\n    \"1\": {\n        \"label\": 8,\n        \"num_in_dataset\": 1,\n        \"time_calc_influence_s\": 121.8709237575531,\n        \"influence\": [\n            3.993639438704122e-06,\n            3.454859779594699e-06,\n            -3.5805194329441292e-06,\n            ...\n```\n\n#### Harmful\n\nHarmful is a list of numbers, which are the IDs of the training data samples\nordered by harmfulness. If the influence function is calculated for multiple\ntest images, the harmfulness is ordered by average harmfullness to the\nprediction outcome of the processed test samples.\n\n#### Helpful\n\nHelpful is a list of numbers, which are the IDs of the training data samples\nordered by helpfulness. If the influence function is calculated for multiple\ntest images, the helpfulness is ordered by average helpfulness to the\nprediction outcome of the processed test samples.\n\n## Roadmap\n\n### v0.2\n\n* [x] makes variable names etc. dataset independent\n* [x] remove all dataset name checks from the code\n* [ ] ability to disable shell output eg for `display_progress` from the config\n* [ ] add proper result plotting support\n* [ ] add a dataloader for training on the most influential samples only\n* [x] add some visualisation of the outcome\n* [ ] add recreation of some graphs of the original paper to verify\n  implementation\n* [ ] allow custom save name for the influence json\n\n### v0.3\n\n* [ ] make the config a class, so that it can readjust itself, for example\n  when the `r` and `recursion_depth` values can be lowered without big impact\n* [ ] check killing data augmentation!?\n* [ ] in `calc_influence_function.py` in `load_s_test`, `load_grad_z` don't\n  hard code the filenames\n\n### v0.4\n\n* [ ] integrate myPy type annotations (static type checking)\n* [ ] Use multiprocessing to calc the influence\n* [ ] use `r\"doc\"` docstrings like pytorch\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnimarb%2Fpytorch_influence_functions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnimarb%2Fpytorch_influence_functions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnimarb%2Fpytorch_influence_functions/lists"}