
awesome-data-valuation

💱 A curated list of data valuation (DV) resources to design your next data marketplace
https://github.com/daviddao/awesome-data-valuation

  • Towards Efficient Data Valuation Based on the Shapley Value | [💻](https://github.com/sunblaze-ucb/data-valuation) | |
  • Data Shapley: Equitable Valuation of Data for Machine Learning | <details><summary>Bibtex</summary><pre>@inproceedings{ghorbani2019data,<br> title={Data shapley: Equitable valuation of data for machine learning},<br> author={Ghorbani, Amirata and Zou, James},<br> booktitle={International Conference on Machine Learning},<br> pages={2242--2251},<br> year={2019},<br> organization={PMLR}<br>}</pre></details> | [💻](https://github.com/amiratag/DataShapley) | |
  • A Distributional Framework for Data Valuation | <details><summary>Bibtex</summary><pre>@inproceedings{ghorbani2020distributional,<br> title={A Distributional Framework for Data Valuation},<br> author={Ghorbani, Amirata and Kim, Michael P. and Zou, James},<br> booktitle={International Conference on Machine Learning},<br> year={2020}<br>}</pre></details> | [💻](https://github.com/amiratag/DistributionalShapley) | |
  • Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability | <details><summary>Bibtex</summary><pre>@article{frye2020asymmetric,<br> title={Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability},<br> author={Frye, Christopher and Rowat, Colin and Feige, Ilya},<br> journal={Advances in Neural Information Processing Systems},<br> volume={33},<br> year={2020}<br>}</pre></details> | | [🎥](https://www.youtube.com/watch?v=7d13f4UaAn0) |
  • Collaborative Machine Learning with Incentive-Aware Model Rewards | <details><summary>Bibtex</summary><pre>@inproceedings{sim2020collaborative,<br> title={Collaborative machine learning with incentive-aware model rewards},<br> author={Sim, Rachael Hwee Ling and Zhang, Yehong and Chan, Mun Choon and Low, Bryan Kian Hsiang},<br> booktitle={International Conference on Machine Learning},<br> pages={8927--8936},<br> year={2020},<br> organization={PMLR}<br>}</pre></details> | | |
  • Validation free and replication robust volume-based data valuation | <details><summary>Bibtex</summary><pre>@article{xu2021validation,<br> title={Validation free and replication robust volume-based data valuation},<br> author={Xu, Xinyi and Wu, Zhaoxuan and Foo, Chuan Sheng and Low, Bryan Kian Hsiang},<br> journal={Advances in Neural Information Processing Systems},<br> volume={34},<br> year={2021}<br>}</pre></details> | [💻](https://github.com/ZhaoxuanWu/VolumeBased-DataValuation) | |
  • Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning | <details><summary>Bibtex</summary><pre>@article{kwon2021beta,<br> title={Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning},<br> author={Kwon, Yongchan and Zou, James},<br> journal={arXiv preprint arXiv:2110.14049},<br> year={2021}<br>}</pre></details> | | |
  • Gradient-Driven Rewards to Guarantee Fairness in Collaborative Machine Learning | <details><summary>Bibtex</summary><pre>@article{xu2021gradient,<br> title={Gradient driven rewards to guarantee fairness in collaborative machine learning},<br> author={Xu, Xinyi and Lyu, Lingjuan and Ma, Xingjun and Miao, Chenglin and Foo, Chuan Sheng and Low, Bryan Kian Hsiang},<br> journal={Advances in Neural Information Processing Systems},<br> volume={34},<br> pages={16104--16117},<br> year={2021}<br>}</pre></details> | | |
  • Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning | <details><summary>Summary</summary> They propose learning to predict the performance of a learning algorithm (denoted data utility learning) and using this predictor to estimate learning performance without retraining, for cheaper Shapley value (SV) and Least core (LC) estimation. </details> | <details><summary>Bibtex</summary><pre>@article{wang2021improving,<br> title={Improving cooperative game theory-based data valuation via data utility learning},<br> author={Wang, Tianhao and Yang, Yu and Jia, Ruoxi},<br> journal={arXiv preprint arXiv:2107.06336},<br> year={2021}<br>}</pre></details> | | |
  • Data Banzhaf: A Robust Data Valuation Framework for Machine Learning | <details><summary>Bibtex</summary><pre>@inproceedings{pmlr-v206-wang23e,<br> title={Data Banzhaf: A Robust Data Valuation Framework for Machine Learning},<br> author={Wang, Jiachen T. and Jia, Ruoxi},<br> booktitle={Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},<br> pages={6388--6421},<br> year={2023},<br> editor={Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},<br> volume={206},<br> series={Proceedings of Machine Learning Research},<br> month={25--27 Apr},<br> publisher={PMLR},<br> pdf={https://proceedings.mlr.press/v206/wang23e/wang23e.pdf},<br> url={https://proceedings.mlr.press/v206/wang23e.html}<br>}</pre></details> | [💻](https://github.com/Jiachen-T-Wang/data-banzhaf) | |
  • A Multilinear Sampling Algorithm to Estimate Shapley Values
  • If You Like Shapley Then You’ll Love the Core
  • CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification | <details><summary>Summary</summary> Decomposes a training point's value into in-class and out-of-class contributions. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{schoch2022csshapley,<br> title={{CS}-Shapley: Class-wise Shapley Values for Data Valuation in Classification},<br> author={Stephanie Schoch and Haifeng Xu and Yangfeng Ji},<br> booktitle={Advances in Neural Information Processing Systems},<br> year={2022}<br>}</pre></details> | [💻](https://github.com/uvanlp/valda) | |
  • Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms | <details><summary>Summary</summary> They derive exact Shapley value computation in quasi-linear time and approximations in sublinear time for k-nearest-neighbor models. They empirically evaluate their algorithms at scale and extend them to several other settings. </details> | <details><summary>Bibtex</summary><pre>@article{jia12efficient,<br> title={Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms},<br> author={Jia, Ruoxi and Dao, David and Wang, Boxin and Hubis, Frances Ann and Gurel, Nezihe Merve and Li, Bo and Zhang, Ce and Spanos, Costas and Song, Dawn},<br> journal={Proceedings of the VLDB Endowment},<br> volume={12},<br> number={11}<br>}</pre></details> | [💻](https://github.com/easeml/datascope) | |
  • Efficient computation and analysis of distributional Shapley values | <details><summary>Bibtex</summary><pre>@inproceedings{kwon2021efficient,<br> title={Efficient computation and analysis of distributional Shapley values},<br> author={Kwon, Yongchan and Rivas, Manuel A and Zou, James},<br> booktitle={International Conference on Artificial Intelligence and Statistics},<br> pages={793--801},<br> year={2021},<br> organization={PMLR}<br>}</pre></details> | [💻](https://github.com/ykwon0407/fast_dist_shapley) | |
  • Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification? | <details><summary>Summary</summary> A theoretical analysis of leave-one-out-based and Shapley value-based methods, together with an empirical study across several ML tasks investigating the two aforementioned methods, exact Shapley value-based methods, and Shapley over KNN surrogates. </details> | <details><summary>Bibtex</summary><pre>@misc{jia2021scalability,<br> title={Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?},<br> author={Ruoxi Jia and Fan Wu and Xuehui Sun and Jiacen Xu and David Dao and Bhavya Kailkhura and Ce Zhang and Bo Li and Dawn Song},<br> year={2021},<br> eprint={1911.07128},<br> archivePrefix={arXiv},<br> primaryClass={cs.LG}<br>}</pre></details> | [💻](https://github.com/AI-secure/Shapley-Study) | |
  • Shapley values for feature selection: The good, the bad, and the axioms
  • Understanding Black-box Predictions via Influence Functions | <details><summary>Bibtex</summary><pre>@inproceedings{koh2017understanding,<br> title={Understanding black-box predictions via influence functions},<br> author={Koh, Pang Wei and Liang, Percy},<br> booktitle={International Conference on Machine Learning},<br> pages={1885--1894},<br> year={2017},<br> organization={PMLR}<br>}</pre></details> | [💻](https://bit.ly/gt-influence) | [🎥](https://drive.google.com/open?id=1ZLY_9Wsk9MA0kXAoJDd6o1gbLvHhyPAn) |
  • On the accuracy of influence functions for measuring group effects | Pang Wei Koh*, Kai-Siang Ang*, Hubert H. K. Teo*, and Percy Liang | 2019 | <details><summary>Summary</summary> Koh et al. (2019) study influence functions to measure effects of large groups of training points instead of individual points. They empirically find a correlation, and often underestimation, between predicted and actual effects, and theoretically show that this need not hold in general, realistic settings. </details> | <details><summary>Bibtex</summary><pre>@article{koh2019accuracy,<br> title={On the accuracy of influence functions for measuring group effects},<br> author={Koh, Pang Wei and Ang, Kai-Siang and Teo, Hubert HK and Liang, Percy},<br> journal={arXiv preprint arXiv:1905.13289},<br> year={2019}<br>}</pre></details> | [💻](https://github.com/kohpangwei/group-influence-release) | [🎥](https://drive.google.com/open?id=1ZLY_9Wsk9MA0kXAoJDd6o1gbLvHhyPAn) |
  • Scaling Up Influence Functions | <details><summary>Summary</summary> They scale influence functions to Transformer models with hundreds of millions of parameters. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{schioppa2022scaling,<br> title={Scaling Up Influence Functions},<br> author={Schioppa, Andrea and Zablotskaia, Polina and Vilar, David and Sokolov, Artem},<br> booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},<br> year={2022}<br>}</pre></details> | [💻](https://github.com/google-research/jax-influence) | |
  • Studying large language model generalization with influence functions | <details><summary>Summary</summary> They use EK-FAC to approximate the Hessian of the loss of large language models. They apply this technique to study influence functions on large language models, up to 50 billion parameters. </details> | <details><summary>Bibtex</summary><pre>@article{grosse2023studying,<br> title={Studying large language model generalization with influence functions},<br> author={Grosse, Roger and Bae, Juhan and Anil, Cem and Elhage, Nelson and Tamkin, Alex and Tajdini, Amirhossein and Steiner, Benoit and Li, Dustin and Durmus, Esin and Perez, Ethan and others},<br> journal={arXiv preprint arXiv:2308.03296},<br> year={2023}<br>}</pre></details> | | |
  • Data Valuation using Reinforcement Learning | [💻](https://github.com/google-research/google-research/tree/master/dvrl) | [🎥](https://icml.cc/media/icml-2020/Slides/6276.pdf) |
  • DAVINZ: Data Valuation using Deep Neural Networks at Initialization | <details><summary>Summary</summary> A training-free method for efficient data valuation with large and complex deep neural networks (DNNs). They derive and exploit a domain-aware generalization bound for DNNs to characterize their performance without training, and use this bound as the scoring function while keeping conventional techniques such as Shapley values as the valuation function. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{wu2022davinz,<br> title={DAVINZ: Data Valuation using Deep Neural Networks at Initialization},<br> author={Wu, Zhaoxuan and Shu, Yao and Low, Bryan Kian Hsiang},<br> booktitle={International Conference on Machine Learning},<br> pages={24150--24176},<br> year={2022},<br> organization={PMLR}<br>}</pre></details> | | [🎥](https://icml.cc/media/icml-2022/Slides/17500.pdf) |
  • Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value | <details><summary>Summary</summary> They propose using the out-of-bag estimate of a bagging estimator for computationally efficient data valuation. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{DBLP:conf/icml/Kwon023,<br> author={Yongchan Kwon and James Zou},<br> editor={Andreas Krause and Emma Brunskill and Kyunghyun Cho and Barbara Engelhardt and Sivan Sabato and Jonathan Scarlett},<br> title={Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value},<br> booktitle={International Conference on Machine Learning, {ICML} 2023, 23-29 July 2023, Honolulu, Hawaii, {USA}},<br> series={Proceedings of Machine Learning Research},<br> volume={202},<br> pages={18135--18152},<br> publisher={{PMLR}},<br> year={2023},<br> url={https://proceedings.mlr.press/v202/kwon23e.html}<br>}</pre></details> | [💻](https://github.com/ykwon0407/STAT5206_Fall_2023) | [🎥](https://icml.cc/virtual/2023/poster/23689) |
  • Fundamentals of Task-Agnostic Data Valuation | <details><summary>Summary</summary> This paper studies task-agnostic data valuation. It discusses valuing a data seller's dataset from a buyer's perspective without validation requirements. The approach involves estimating statistical differences through diversity and relevance measures without needing the raw data, and designing queries that maintain the seller's blindness to the buyer's raw data. The work is significant for practical scenarios where utility metrics like test accuracy on a validation set are not feasible. </details> | <details><summary>Bibtex</summary><pre>@article{Amiri2023FundamentalsOT,<br> title={Fundamentals of Task-Agnostic Data Valuation},<br> author={Mohammad Mohammadi Amiri and Frederic Berdoz and Ramesh Raskar},<br> journal={Proceedings of the AAAI Conference on Artificial Intelligence},<br> volume={37},<br> pages={9226--9234},<br> year={2023},<br> doi={10.1609/aaai.v37i8.26106}<br>}</pre></details> | | |
  • OpenDataVal: a Unified Benchmark for Data Valuation | [💻](https://github.com/opendataval/opendataval) | [🎥](https://neurips.cc/virtual/2023/poster/73521) |
  • influenciae | deel-ai | 2023 | <details><summary>Summary</summary> A stable implementation of influence functions in TensorFlow. </details> | [💻](https://github.com/deel-ai/influenciae) | |
  • pyDVL | [💻](https://github.com/aai-institute/pyDVL) | |
  • Data Valuation in Machine Learning: “Ingredients”, Strategies, and Open Challenges | [🎥](https://xinyi-xu.com/ijcai_slides.pdf) |
  • A demonstration of sterling: a privacy-preserving data marketplace | <details><summary>Bibtex</summary><pre>@article{hynes2018demonstration,<br> title={A Demonstration of Sterling: A Privacy-Preserving Data Marketplace},<br> author={Hynes, Nick and Dao, David and Yan, David and Cheng, Raymond and Song, Dawn},<br> journal={Proceedings of the VLDB Endowment},<br> volume={11},<br> number={12},<br> year={2018}<br>}</pre></details> | | |
  • DataBright: Towards a Global Exchange for Decentralized Data Ownership and Trusted Computation
  • A Marketplace for Data: An Algorithmic Solution
  • Computing a Data Dividend
  • Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards
  • Data Capsule: A New Paradigm for Automatic Compliance with Data Privacy Regulations | [💻](https://github.com/sunblaze-ucb/Data-Capsule) | |
  • A Principled Approach to Data Valuation for Federated Learning
  • Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset | <details><summary>Bibtex</summary><pre>@article{tang2021data,<br> title={Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset},<br> author={Tang, Siyi and Ghorbani, Amirata and Yamashita, Rikiya and Rehman, Sameer and Dunnmon, Jared A and Zou, James and Rubin, Daniel L},<br> journal={Scientific reports},<br> volume={11},<br> number={1},<br> pages={1--9},<br> year={2021},<br> publisher={Nature Publishing Group}<br>}</pre></details> | | |
  • Efficient and Fair Data Valuation for Horizontal Federated Learning
  • Improving Fairness for Data Valuation in Horizontal Federated Learning
  • Data Valuation for Vertical Federated Learning: An Information-Theoretic Approach | <details><summary>Summary</summary> This work introduces "FedValue," the first privacy-preserving, task-specific, model-free data valuation method for vertical FL tasks. It incorporates Shapley-CMI, an information-theoretic metric, for assessing data values from a game-theoretic perspective. The paper also proposes a novel server-aided federated computation mechanism and techniques to accelerate Shapley-CMI computation. Extensive experiments demonstrate the effectiveness and efficiency of FedValue. </details> | <details><summary>Bibtex</summary><pre>@misc{han2021datavaluation,<br> title={Data Valuation for Vertical Federated Learning: An Information-Theoretic Approach},<br> author={Xiao Han and Leye Wang and Junjie Wu},<br> year={2021},<br> eprint={URL or DOI link TBD},<br>}</pre></details> | | |
  • Towards More Efficient Data Valuation in Healthcare Federated Learning Using Ensembling | Sourav Kumar, A. Lakshminarayanan, Ken Chang, Feri Guretno, Ivan Ho Mien, Jayashree Kalpathy-Cramer, Pavitra Krishnaswamy, Praveer Singh | 2021 | <details><summary>Summary</summary> This paper addresses the challenge of data valuation in federated learning within healthcare. The authors propose a method called SaFE (Shapley Value for Federated Learning using Ensembling), which is designed to be efficient in settings where the number of contributing institutions is manageable. SaFE approximates the Shapley value using gradients from training a single model and develops methods for efficient contribution index estimation. This approach is particularly relevant in medical imaging, where data heterogeneity is common and fast, accurate data valuation is necessary for multi-institutional collaborations. </details> | <details><summary>Bibtex</summary><pre>@article{Kumar2021TowardsME,<br> title={Towards More Efficient Data Valuation in Healthcare Federated Learning Using Ensembling},<br> author={Sourav Kumar and A. Lakshminarayanan and Ken Chang and Feri Guretno and Ivan Ho Mien and Jayashree Kalpathy-Cramer and Pavitra Krishnaswamy and Praveer Singh},<br> journal={ArXiv},<br> year={2021},<br> volume={abs/2209.05424}<br>}</pre></details> | | |
  • Nonrivalry and the Economics of Data (https://www.aeaweb.org/articles?id=10.1257/aer.20191330)
  • Chapter 5: Data as Labor, Radical Markets
  • Should We Treat Data as Labor? Moving beyond "Free" | Imanol Arrieta-Ibarra, Leonard Goff, Diego Jiménez-Hernández, Jaron Lanier, E. Glen Weyl | 2018 | | <details><summary>Bibtex</summary><pre>@article{10.1257/pandp.20181003,<br> Author = {Arrieta-Ibarra, Imanol and Goff, Leonard and Jiménez-Hernández, Diego and Lanier, Jaron and Weyl, E. Glen},<br> Title = {Should We Treat Data as Labor? Moving beyond "Free"},<br> Journal = {AEA Papers and Proceedings},<br> Volume = {108},<br> Year = {2018},<br> Month = {May},<br> Pages = {38--42},<br> DOI = {10.1257/pandp.20181003},<br> URL = {https://www.aeaweb.org/articles?id=10.1257/pandp.20181003}<br>}</pre></details> | | |
  • Performative Prediction | Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, Moritz Hardt | 2020 | <details><summary>Summary</summary> Perdomo et al. (2020) introduce the concept of "performative prediction," dealing with predictions that influence the target they aim to predict, e.g. through taking actions based on the predictions, causing a distribution shift. The authors develop a risk minimization framework for performative prediction and introduce the equilibrium notion of performative stability, where predictions are calibrated against the future outcomes that manifest from acting on the prediction. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{perdomo2020performative,<br> title={Performative prediction},<br> author={Perdomo, Juan and Zrnic, Tijana and Mendler-D{\"u}nner, Celestine and Hardt, Moritz},<br> booktitle={International Conference on Machine Learning},<br> pages={7599--7609},<br> year={2020},<br> organization={PMLR}<br>}</pre></details> | | |
  • Stochastic Optimization for Performative Prediction | Celestine Mendler-Dünner, Juan Perdomo, Tijana Zrnic, Moritz Hardt | 2020 | <details><summary>Summary</summary> Mendler-Dünner et al. (2020) look at stochastic optimization for performative prediction and prove convergence rates for greedily deploying models after each stochastic update (which may cause distribution shift affecting convergence to a stability point) or lazily deploying the model after several updates. </details> | <details><summary>Bibtex</summary><pre>@article{mendler2020stochastic,<br> title={Stochastic optimization for performative prediction},<br> author={Mendler-D{\"u}nner, Celestine and Perdomo, Juan and Zrnic, Tijana and Hardt, Moritz},<br> journal={Advances in Neural Information Processing Systems},<br> volume={33},<br> pages={4929--4939},<br> year={2020}<br>}</pre></details> | | |
  • Strategic Classification is Causal Modeling in Disguise | <details><summary>Summary</summary> Miller et al. (2020) argue that strategic classification involves a non-trivial causal inference problem. The authors provide a distinction between gaming and improvement, as well as a causal framework for strategic adaptation. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{miller2020strategic,<br> title={Strategic classification is causal modeling in disguise},<br> author={Miller, John and Milli, Smitha and Hardt, Moritz},<br> booktitle={International Conference on Machine Learning},<br> pages={6917--6926},<br> year={2020},<br> organization={PMLR}<br>}</pre></details> | | |
  • Alternative Microfoundations for Strategic Classification | Meena Jagadeesan, Celestine Mendler-Dünner, Moritz Hardt | 2021 | <details><summary>Summary</summary> Jagadeesan et al. (2021) show that standard microfoundations in strategic classification, which typically use individual-level behaviour to deduce aggregate-level responses, can lead to degenerate behaviour in aggregate: discontinuities in the aggregate response, stable points ceasing to exist, and maximization of social burden. The authors introduce a noisy response model, inspired by performative prediction, that mitigates these limitations for binary classification. </details> | <details><summary>Bibtex</summary><pre>@inproceedings{jagadeesan2021alternative,<br> title={Alternative microfoundations for strategic classification},<br> author={Jagadeesan, Meena and Mendler-D{\"u}nner, Celestine and Hardt, Moritz},<br> booktitle={International Conference on Machine Learning},<br> pages={4687--4697},<br> year={2021},<br> organization={PMLR}<br>}</pre></details> | | |
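Many of the game-theoretic entries above (Data Shapley, Beta Shapley, Data Banzhaf, and friends) boil down to estimating each training point's average marginal contribution to a utility function, typically via Monte Carlo sampling over permutations. A minimal, illustrative sketch of that common estimator, not taken from any of the listed papers; `toy_utility` is an invented stand-in for validation accuracy:

```python
import random
from statistics import mean

def shapley_values(n, utility, num_perms=2000, seed=0):
    """Monte Carlo permutation-sampling estimate of data Shapley values.

    `utility(S)` maps a set of point indices to a performance score
    (e.g. validation accuracy of a model retrained on those points).
    Each point is credited with its marginal change in utility when it
    is added to the points preceding it in a random ordering, averaged
    over many orderings.
    """
    rng = random.Random(seed)
    marginals = [[] for _ in range(n)]
    for _ in range(num_perms):
        perm = rng.sample(range(n), n)  # one random ordering of the data
        coalition = set()
        prev = utility(coalition)
        for i in perm:
            coalition.add(i)
            cur = utility(coalition)
            marginals[i].append(cur - prev)
            prev = cur
    return [mean(m) for m in marginals]

# Hypothetical utility: points 0 and 1 carry the same redundant signal,
# point 2 carries independent signal, point 3 is pure noise.
def toy_utility(S):
    return (0.5 if S & {0, 1} else 0.0) + (0.5 if 2 in S else 0.0)

values = shapley_values(4, toy_utility)
# Redundant points split their credit (~0.25 each), the unique point
# gets 0.5, the noise point gets 0, and the values sum to u(full set).
```

Exact computation averages over all n! orderings, which is why the papers above focus on variance reduction (Beta Shapley), robustness (Data Banzhaf), or model-specific shortcuts (KNN-Shapley, Data-OOB).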