{"id":13779463,"url":"https://github.com/aai-institute/pyDVL","last_synced_at":"2025-05-11T13:30:45.547Z","repository":{"id":60996162,"uuid":"354117916","full_name":"aai-institute/pyDVL","owner":"aai-institute","description":"pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation","archived":false,"fork":false,"pushed_at":"2024-08-15T20:17:19.000Z","size":353044,"stargazers_count":92,"open_issues_count":109,"forks_count":8,"subscribers_count":4,"default_branch":"develop","last_synced_at":"2024-08-17T21:42:41.646Z","etag":null,"topics":["banzhaf-index","data-centric-ai","data-cleaning","data-pruning","data-quality","data-valuation","game-theory","influence-functions","least-core","machine-learning","robust-machine-learning","shapley-value","transferlab"],"latest_commit_sha":null,"homepage":"https://pydvl.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aai-institute.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"COPYING.LESSER","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-04-02T19:46:57.000Z","updated_at":"2024-08-21T07:38:30.037Z","dependencies_parsed_at":"2023-09-01T13:58:13.347Z","dependency_job_id":"490c42a9-ca1e-407f-b48e-0c7307b03258","html_url":"https://github.com/aai-institute/pyDVL","commit_stats":null,"previous_names":["appliedai-initiative/pydvl"],"tags_count":47,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aai-institute%2FpyDVL","tags_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/aai-institute%2FpyDVL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aai-institute%2FpyDVL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aai-institute%2FpyDVL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aai-institute","download_url":"https://codeload.github.com/aai-institute/pyDVL/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225056698,"owners_count":17414187,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["banzhaf-index","data-centric-ai","data-cleaning","data-pruning","data-quality","data-valuation","game-theory","influence-functions","least-core","machine-learning","robust-machine-learning","shapley-value","transferlab"],"created_at":"2024-08-03T18:01:05.465Z","updated_at":"2025-05-11T13:30:45.530Z","avatar_url":"https://github.com/aai-institute.png","language":"Python","funding_links":[],"categories":["Libraries"],"sub_categories":["Task Agnostic"],"readme":"\u003cp align=\"center\" style=\"text-align:center;\"\u003e\n    \u003cimg alt=\"pyDVL Logo\" src=\"https://raw.githubusercontent.com/aai-institute/pyDVL/develop/logo.svg\" width=\"200\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\" style=\"text-align:center;\"\u003e\n    A library for data valuation.\n\u003c/p\u003e\n\n\u003cp align=\"center\" style=\"text-align:center;\"\u003e\n    \u003ca href=\"https://pypi.org/project/pydvl/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/pydvl.svg\" 
alt=\"PyPI\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/pydvl/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/pydvl.svg\" alt=\"Version\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://pydvl.org\"\u003e\u003cimg src=\"https://img.shields.io/badge/docs-All%20versions-009485\" alt=\"documentation\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://raw.githubusercontent.com/aai-institute/pyDVL/master/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/pypi/l/pydvl\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml\"\u003e\u003cimg src=\"https://github.com/aai-institute/pyDVL/actions/workflows/main.yaml/badge.svg\" alt=\"Build status\" \u003e\u003c/a\u003e\n    \u003ca href=\"https://codecov.io/gh/aai-institute/pyDVL\"\u003e\u003cimg src=\"https://codecov.io/gh/aai-institute/pyDVL/graph/badge.svg?token=VN7DNDE0FV\"/\u003e\u003c/a\u003e\n    \u003ca href=\"https://zenodo.org/badge/latestdoi/354117916\"\u003e\u003cimg src=\"https://zenodo.org/badge/354117916.svg\" alt=\"DOI\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**\ncomputation. Here is the list of [all methods implemented](https://pydvl.org/devel/getting-started/methods/).\n\n**Data Valuation** for machine learning is the task of assigning a scalar\nto each element of a training set which reflects its contribution to the final\nperformance or outcome of some model trained on it. 
Some concepts of\nvalue depend on a specific model of interest, while others are model-agnostic.\npyDVL focuses on model-dependent methods.\n\n\u003cdiv align=\"center\" style=\"text-align:center;\"\u003e\n    \u003cimg\n        width=\"60%\"\n        align=\"center\"\n        style=\"display: block; margin-left: auto; margin-right: auto;\"\n        src=\"https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg\"\n        alt=\"best sample removal\"\n    /\u003e\n    \u003cp align=\"center\" style=\"text-align:center;\"\u003e\n        Comparison of different data valuation methods\n        on best sample removal.\n    \u003c/p\u003e\n\u003c/div\u003e\n\nThe **Influence Function** is an infinitesimal measure of the effect that single\ntraining points have over the parameters of a model, or any function thereof.\nIn particular, in machine learning they are also used to compute the effect\nof training samples over individual test points.\n\n\u003cdiv align=\"center\" style=\"text-align:center;\"\u003e\n    \u003cimg\n        width=\"60%\"\n        align=\"center\"\n        style=\"display: block; margin-left: auto; margin-right: auto;\"\n        src=\"https://pydvl.org/devel/examples/img/influence_functions_example.png\"\n        alt=\"best sample removal\"\n    /\u003e\n    \u003cp align=\"center\" style=\"text-align:center;\"\u003e\n        Influences of input points with corrupted data.\n        Highlighted points have flipped labels.\n    \u003c/p\u003e\n\u003c/div\u003e\n\n# Installation\n\nTo install the latest release use:\n\n```shell\n$ pip install pyDVL\n```\n\nYou can also install the latest development version from\n[TestPyPI](https://test.pypi.org/project/pyDVL/):\n\n```shell\npip install pyDVL --index-url https://test.pypi.org/simple/\n```\n\npyDVL has also extra dependencies for certain functionalities, \ne.g. 
for using influence functions run\n```shell\n$ pip install pyDVL[influence]\n```\n\nFor more instructions and information refer to [Installing\npyDVL](https://pydvl.org/stable/getting-started/#installation) in the documentation.\n\n# Usage\n\nPlease read [Getting\nStarted](https://pydvl.org/stable/getting-started/first-steps/) in the\ndocumentation for more instructions. We provide several examples for data\nvaluation and for influence functions in our [Example\nGallery](https://pydvl.org/stable/examples/).\n\n## Influence Functions\n\n1. Import the necessary packages (the exact ones depend on your specific use case).\n2. Create PyTorch data loaders for your train and test splits.\n3. Instantiate your neural network model and define your loss function.\n4. Instantiate an `InfluenceFunctionModel` and fit it to the training data.\n5. For small input data, you can call the `influences()` method on the fitted\n   instance. The result is a tensor of shape `(training samples, test samples)`\n   that contains at index `(i, j)` the influence of training sample `i` on\n   test sample `j`.\n6. For larger datasets, wrap the model into a \"calculator\" and call methods on\n   it. This splits the computation into smaller chunks and allows for lazy\n   evaluation and out-of-core computation.\n\nThe higher the absolute value of the influence of a training sample\non a test sample, the more influential it is for the chosen test sample, model\nand data loaders. The sign of the influence determines whether it is\nuseful (positive) or harmful (negative).\n\n\u003e **Note** pyDVL currently only supports PyTorch for Influence Functions. 
We plan\n\u003e to add support for Jax next.\n\n```python\nimport torch\nfrom torch import nn\nfrom torch.utils.data import DataLoader, TensorDataset\n\nfrom pydvl.influence import SequentialInfluenceCalculator\nfrom pydvl.influence.torch import DirectInfluence\nfrom pydvl.influence.torch.util import (\n   NestedTorchCatAggregator,\n   TorchNumpyConverter,\n   )\n\ninput_dim = (5, 5, 5)\noutput_dim = 3\ntrain_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))\ntest_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))\ntrain_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)\ntest_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)\nmodel = nn.Sequential(\n  nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),\n  nn.Flatten(),\n  nn.Linear(27, 3),\n  )\nloss = nn.MSELoss()\n\ninfl_model = DirectInfluence(model, loss, hessian_regularization=0.01)\ninfl_model = infl_model.fit(train_data_loader)\n\n# For small datasets, instantiate the full influence matrix:\ninfluences = infl_model.influences(test_x, test_y, train_x, train_y)\n\n# For larger datasets, use the Influence calculators:\ninfl_calc = SequentialInfluenceCalculator(infl_model)\n\n# Lazy object providing arrays batch-wise in a sequential manner\nlazy_influences = infl_calc.influences(test_data_loader, train_data_loader)\n\n# Trigger computation and pull results to memory\ninfluences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())\n\n# Trigger computation and write results batch-wise to disk\nlazy_influences.to_zarr(\"influences_result\", TorchNumpyConverter())\n```\n\n## Data Valuation\n\nThe steps required to compute data values for your samples are:\n\n1. Import the necessary packages (the exact ones will depend on your specific\n   use case, but most of the interface is exposed through `pydvl.valuation`).\n2. Create two `Dataset` objects with your train and test splits. 
There are\n   some factories to do this from arrays or scikit-learn toy datasets.\n3. Create an instance of a `SupervisedScorer`, with any sklearn scorer and a\n   \"valuation set\" over which your model will be scored.\n4. Wrap model and scorer in a `ModelUtility`.\n5. Use one of the methods defined in the library to compute the values. In the\n   example below, we use the most basic *Monte Carlo Shapley* with uniform\n   sampling, an approximate method for computing Data Shapley values.\n6. Call `fit` in a joblib parallel context. The result is a variable of type\n   `ValuationResult` that contains the indices and their values as well as other\n   attributes. This object can be sliced, sorted and inspected directly, or you\n   can convert it to a dataframe for convenience.\n\nThe higher the value for an index, the more important it is for the chosen\nmodel, dataset and scorer. Conversely, low-value points could be mislabelled\nor out-of-distribution, and dropping them can improve the model's performance.\n\n```python\nfrom joblib import parallel_config\nfrom sklearn.datasets import load_iris\nfrom sklearn.svm import SVC\nfrom pydvl.valuation import (\n    Dataset,\n    MinUpdates,\n    ModelUtility,\n    ShapleyValuation,\n    SupervisedScorer,\n    UniformSampler,\n)\n\nseed = 42\nmodel = SVC(kernel=\"linear\", probability=True, random_state=seed)\n\ntrain, val = Dataset.from_sklearn(load_iris(), train_size=0.6, random_state=24)\nscorer = SupervisedScorer(model, val, default=0.0)\nutility = ModelUtility(model, scorer)\nsampler = UniformSampler(batch_size=2 ** 6, seed=seed)\nstopping = MinUpdates(1000)\nvaluation = ShapleyValuation(utility, sampler, stopping, progress=True)\n\nwith parallel_config(n_jobs=32):\n    valuation.fit(train)\n\nresult = valuation.result\n```\n\n### Deprecation notice\n\nUp until v0.9.2 valuation methods were available through the `pydvl.value`\nmodule, which is now deprecated in favour of the design showcased above,\navailable under `pydvl.valuation`. 
The old module will be removed in a future\nrelease.\n\n# Contributing\n\nPlease open new issues for bugs, feature requests and extensions. You can read\nabout the structure of the project, the toolchain and workflow in the [guide for\ncontributions](CONTRIBUTING.md).\n\n# License\n\npyDVL is distributed under\n[LGPL-3.0](https://www.gnu.org/licenses/lgpl-3.0.html). A complete version can\nbe found in two files: [here](LICENSE) and [here](COPYING.LESSER).\n\nAll contributions will be distributed under this license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faai-institute%2FpyDVL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faai-institute%2FpyDVL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faai-institute%2FpyDVL/lists"}