{"id":15030105,"url":"https://github.com/daochenzha/data-centric-ai","last_synced_at":"2026-02-05T20:09:54.337Z","repository":{"id":148943992,"uuid":"610647033","full_name":"daochenzha/data-centric-AI","owner":"daochenzha","description":"A curated, but incomplete, list of data-centric AI resources.","archived":false,"fork":false,"pushed_at":"2024-06-26T23:03:47.000Z","size":2086,"stargazers_count":1113,"open_issues_count":2,"forks_count":78,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-06-30T00:42:03.622Z","etag":null,"topics":["ai","artificial-intelligence","data-centric","data-centric-ai","data-centric-machine-learning","data-curation","data-engineering","data-quality","data-science","machine-learning"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daochenzha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-07T07:40:28.000Z","updated_at":"2025-06-26T03:51:51.000Z","dependencies_parsed_at":"2024-10-30T02:51:41.046Z","dependency_job_id":null,"html_url":"https://github.com/daochenzha/data-centric-AI","commit_stats":{"total_commits":62,"total_committers":8,"mean_commits":7.75,"dds":0.4516129032258065,"last_synced_commit":"2e531048cfebef248108e91e4ab50d8c4cb422db"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/daochenzha/data-centric-AI","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daochenzha%2Fdata-centric-AI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daochenzha%2Fdata-centric-AI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daochenzha%2Fdata-centric-AI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daochenzha%2Fdata-centric-AI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daochenzha","download_url":"https://codeload.github.com/daochenzha/data-centric-AI/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daochenzha%2Fdata-centric-AI/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266263057,"owners_count":23901356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","data-centric","data-centric-ai","data-centric-machine-learning","data-curation","data-engineering","data-quality","data-science","machine-learning"],"created_at":"2024-09-24T20:12:27.176Z","updated_at":"2026-02-05T20:09:54.296Z","avatar_url":"https://github.com/daochenzha.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Awesome-Data-Centric-AI\n[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\n\nA curated, but incomplete, list of data-centric AI resources. It should be noted that it is unfeasible to encompass every paper. Thus, we prefer to selectively choose papers that present a range of distinct ideas. We welcome contributions to further enrich and refine this list.\n\n:loudspeaker: News: Please check out our open-sourced [Large Time Series Model (LTSM)](https://github.com/daochenzha/ltsm)!\n\nIf you want to contribute to this list, please feel free to send a pull request. Also, you can contact [daochen.zha@rice.edu](mailto:daochen.zha@rice.edu).\n\n*   Survey paper: [Data-centric Artificial Intelligence: A Survey](https://arxiv.org/abs/2303.10158)\n*   Perspective paper (SDM 2023): [Data-centric AI: Perspectives and Challenges](https://arxiv.org/abs/2301.04819)\n*   Data-centric AI: Techniques and Future Perspectives (KDD 2023 Tutorial): [[Website]](https://dcaitutorial.github.io/) [[Slides]](https://dcaitutorial.github.io/files/data-centric_AI_KDD2023.pdf) [[Video]](http://youtu.be/6WjHpFeOgQ0) [[Paper]](https://dl.acm.org/doi/10.1145/3580305.3599553)\n*   Blogs: \n    *   [What Are the Data-Centric AI Concepts behind GPT Models?](https://towardsdatascience.com/what-are-the-data-centric-ai-concepts-behind-gpt-models-a590071bb727)\n    *   [Are Prompts Generated by Large Language Models (LLMs) Reliable?](https://towardsdatascience.com/are-prompt-generated-by-large-language-models-llms-reliable-4162fd10c845)\n    *   [The Data-centric AI Concepts in Segment Anything](https://medium.com/towards-data-science/the-data-centric-ai-concepts-in-segment-anything-8eea556ac9d)\n*   中文解读: \n    *   [GPT模型成功的背后用到了哪些以数据为中心的人工智能（Data-centric AI）技术？](https://zhuanlan.zhihu.com/p/617057227)\n    *   [如何评价Meta/FAIR 最新工作Segment Anything？](https://www.zhihu.com/question/593888697/answer/2972047807)\n    *   [进行data-centric的研究时，需要的算力大吗？](https://www.zhihu.com/question/595473790/answer/2985057859)\n*   Graph Structure Learning (GSL) is a data-centric AI direction in graph neural networks (GNNs):\n    *   Paper: [OpenGSL: A Comprehensive Benchmark for Graph Structure Learning](https://arxiv.org/abs/2306.10280)\n    *   Code: https://github.com/OpenGSL/OpenGSL\n    *   中文解读：[GNN中的Data-centric AI —— 图结构学习（GSL）以及基准库OpenGSL介绍](https://zhuanlan.zhihu.com/p/642738341)\n*   Check our latest Knowledge Graphs (KGs) based paper search engine DiscoverPath: https://github.com/ynchuang/DiscoverPath\n\nWant to discuss with others who are also interested in data-centric AI? There are three options:\n*   Join our [Slack channel](https://join.slack.com/t/data-centric-ai-group/shared_invite/zt-1u93fd7uc-DXz5C~5ejrEwwUXui5tqPA)\n*   Join our QQ group (183116457). Password: `datacentric`\n*   Join the WeChat group below (if the QR code is expired, please add WeChat ID: ``zdcwhu`` and add a note indicating that you want to join the Data-centric AI group)!\n\n\n\u003cimg width=\"250\" src=\"./imgs/group.jpeg\" alt=\"group\" /\u003e\n\n## What is Data-centric AI?\n\nData-centric AI is an emerging field that focuses on engineering data to improve AI systems with enhanced data quality and quantity.\n\n## Data-centric AI vs. Model-centric AI\n\n\u003cimg width=\"500\" src=\"./imgs/data-centric.png\" alt=\"data-centric\" /\u003e\n\nIn the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, **data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data**.\n\nIt is important to note that \"data-centric\" differs fundamentally from \"data-driven\", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.\n\n\n## Why Data-centric AI?\n\u003cimg width=\"800\" src=\"./imgs/motivation.png\" alt=\"motivation\" /\u003e\n\nTwo motivating examples of GPT models highlight the central role of data in AI.\n*   On the left, large and high-quality training data are the driving force of recent successes of GPT models, while model architectures remain similar, except for more model weights.\n*   On the right, when the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model being fixed.\n\nAnother example is [Segment Anything](https://arxiv.org/abs/2304.02643), a foundation model for computer vision. The core of training Segment Anything lies in the large amount of annotated data, containing more than 1 billion masks, which is 400 times larger than existing segmentation datasets.\n\n## What is the Data-centric AI Framework?\n\u003cimg width=\"800\" src=\"./imgs/framework.png\" alt=\"framework\" /\u003e\n\nData-centric AI framework consists of three goals: training data development, inference data development, and data maintenance, where each goal is associated with several sub-goals.\n\n*   The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models.\n*   The objective of inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.\n*   The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment.\n\n## Cite this Work\nZha, Daochen, et al. \"Data-centric Artificial Intelligence: A Survey.\" arXiv preprint arXiv:2303.10158, 2023.\n```bibtex\n@article{zha2023data-centric-survey,\n  title={Data-centric Artificial Intelligence: A Survey},\n  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},\n  journal={arXiv preprint arXiv:2303.10158},\n  year={2023}\n}\n```\nZha, Daochen, et al. \"Data-centric AI: Perspectives and Challenges.\" SDM, 2023.\n```bibtex\n@inproceedings{zha2023data-centric-perspectives,\n  title={Data-centric AI: Perspectives and Challenges},\n  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},\n  booktitle={SDM},\n  year={2023}\n}\n```\n\n## Table of Contents\n* [Training Data Development](#training-data-development)\n  * [Data Collection](#data-collection)\n  * [Data Labeling](#data-labeling)\n  * [Data Preparation](#data-preparation)\n  * [Data Reduction](#data-reduction)\n  * [Data Augmentation](#data-augmentation)\n  * [Pipeline Search](#pipeline-search)\n* [Inference Data Development](#inference-data-development)\n  * [In-distribution Evaluation](#in-distribution-evaluation)\n  * [Out-of-distribution Evaluation](#out-of-distribution-evaluation)\n  * [Prompt Engineering](#prompt-engineering)\n* [Data Maintenance](#data-maintenance)\n  * [Data Understanding](#data-understanding)\n  * [Data Quality Assurance](#data-quality-assurance)\n  * [Data Storage and Retrieval](#data-storage-and-retrieval)\n* [Data Benchmark](#data-benchmark)\n  * [Training Data Development Benchmark](#training-data-development-benchmark)\n  * [Inference Data Development Benchmark](#inference-data-development-benchmark)\n  * [Data Maintenance Benchmark](#data-maintenance-benchmark)\n  * [Unified Benchmark](#unified-benchmark)\n\n\n## Training Data Development\n\u003cimg width=\"800\" src=\"./imgs/training-data-development.png\" alt=\"training-data-development\" /\u003e\n\n### Data Collection\n\n* Revisiting time series outlier detection: Definitions and benchmarks, NeurIPS 2021 [[Paper]](https://openreview.net/forum?id=r8IvOsnHchr) [[Code]](https://github.com/datamllab/tods)\n* Dataset discovery in data lakes, ICDE 2020 [[Paper]](https://arxiv.org/pdf/2011.10427.pdf)\n* Aurum: A data discovery system, ICDE 2018 [[Paper]](https://ieeexplore.ieee.org/document/8509315) [[Code]](https://github.com/mitdbg/aurum-datadiscovery)\n* Table union search on open data, VLDB 2018 [[Paper]](https://dl.acm.org/doi/abs/10.14778/3192965.3192973)\n* Data Integration: The Current Status and the Way Forward, IEEE Computer Society Technical Committee on Data Engineering 2018 [[Paper]](https://cs.uwaterloo.ca/~ilyas/papers/StonebrakerIEEE2018.pdf)\n* To join or not to join? thinking twice about joins before feature selection, SIGMOD 2016 [[Paper]](https://pages.cs.wisc.edu/~arun/hamlet/OptFSSIGMOD.pdf) \n* Data curation at scale: the data tamer system, CIDR 2013 [[Paper]](https://cs.uwaterloo.ca/~ilyas/papers/StonebrakerCIDR2013.pdf) \n* Data integration: A theoretical perspective, PODS 2002 [[Paper]](https://www.cs.ubc.ca/~rap/teaching/534a/readings/Lenzerini-pods02.pdf)\n\n### Data Labeling\n* Segment Anything [[Paper]](https://scontent-hou1-1.xx.fbcdn.net/v/t39.2365-6/10000000_900554171201033_1602411987825904100_n.pdf?_nc_cat=100\u0026ccb=1-7\u0026_nc_sid=3c67a6\u0026_nc_ohc=Ald4OYhL6hgAX-h7JxS\u0026_nc_ht=scontent-hou1-1.xx\u0026oh=00_AfCkb5vYTMBeCKRoya_B2sHuwAMcJrYJ08uJQdNkXDscLg\u0026oe=643500E7) [[code]](https://github.com/facebookresearch/segment-anything)\n* Active Ensemble Learning for Knowledge Graph Error Detection, WSDM 2023 [[Paper]](https://www4.comp.polyu.edu.hk/~xiaohuang/docs/Junnan_WSDM2023.pdf)\n* Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI, NeurIPS 2022 Workshop on Human in the Loop Learning [[paper]](https://arxiv.org/abs/2207.09109) [[code]](https://github.com/MLSysOps/Active-Learning-as-a-Service)\n* Training language models to follow instructions with human feedback, NeurIPS 2022 [[Paper]](https://arxiv.org/abs/2203.02155)\n* Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling, ICLR 2021 [[Paper]](https://arxiv.org/abs/2012.06046) [[Code]](https://github.com/benbo/interactive-weak-supervision)\n* A survey of deep active learning, ACM Computing Surveys 2021 [[Paper]](https://arxiv.org/abs/2009.00236)\n* Adaptive rule discovery for labeling text data, SIGMOD 2021 [[Paper]](https://arxiv.org/abs/2005.06133)\n* Cut out the annotator, keep the cutout: better segmentation with weak supervision, ICLR 2021 [[Paper]](https://openreview.net/forum?id=bjkX6Kzb5H)\n* Meta-AAD: Active anomaly detection with deep reinforcement learning, ICDM 2020 [[Paper]](https://arxiv.org/abs/2009.07415) [[Code]](https://github.com/daochenzha/Meta-AAD)\n* Snorkel: Rapid training data creation with weak supervision, VLDB 2020 [[Paper]](https://arxiv.org/abs/1711.10160) [[Code]](https://github.com/snorkel-team/snorkel)\n* Graph-based semi-supervised learning: A review, Neurocomputing 2020 [[Paper]](https://arxiv.org/abs/2102.13303) \n* Annotator rationales for labeling tasks in crowdsourcing, JAIR 2020 [[Paper]](https://www.ischool.utexas.edu/~ml/papers/kutlu_jair20.pdf)\n* Rethinking pre-training and self-training, NeurIPS 2020 [[Paper]](https://arxiv.org/abs/2006.06882)\n* Multi-label dataless text classification with topic modeling, KIS 2019 [[Paper]](https://arxiv.org/abs/1711.01563)\n* Data programming: Creating large training sets, quickly, NeurIPS 2016 [[Paper]](https://arxiv.org/abs/1605.07723)\n* Semi-supervised consensus labeling for crowdsourcing, SIGIR 2011 [[Paper]](http://www.cs.columbia.edu/~prokofieva/CandidacyPapers/Tang_Crowd.pdf)\n* Vox Populi: Collecting High-Quality Labels from a Crowd, COLT 2009 [[Paper]](https://www.wisdom.weizmann.ac.il/~shamiro/publications/2009_COLT_DekSham.pdf)\n* Democratic co-learning, ICTAI 2004 [[Paper]](https://ieeexplore.ieee.org/document/1374241)\n* Active learning with statistical models, JAIR 1996 [[Paper]](https://arxiv.org/abs/cs/9603104)\n \n### Data Preparation\n* DataFix: Adversarial Learning for Feature Shift Detection and Correction, NeurIPS 2023 [[Paper]](https://openreview.net/forum?id=lBhRTO2uWf) [[Code]](https://github.com/AI-sandbox/DataFix/)\n* OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [[Paper]](https://arxiv.org/abs/2306.10280) [[Code]](https://github.com/OpenGSL/OpenGSL)\n* TSFEL: Time series feature extraction library, SoftwareX 2020 [[Paper]](https://www.sciencedirect.com/science/article/pii/S2352711020300017) [[Code]](https://github.com/fraunhoferportugal/tsfel)\n* Alphaclean: Automatic generation of data cleaning pipelines, arXiv 2019 [[Paper]](https://arxiv.org/abs/1904.11827) [[Code]](https://github.com/sjyk/alphaclean)\n* Introduction to Scikit-learn, Book 2019 [[Paper]](https://link.springer.com/chapter/10.1007/978-1-4842-4470-8_18) [[Code]](https://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n* Feature extraction: a survey of the types, techniques, applications, ICSC 2019 [[Paper]](https://ieeexplore.ieee.org/document/8938371) \n* Feature engineering for predictive modeling using reinforcement learning, AAAI 2018 [[Paper]](https://arxiv.org/abs/1709.07150)\n* Time series classification from scratch with deep neural networks: A strong baseline, IIJCNN 2017 [[Paper]](https://arxiv.org/abs/1611.06455)\n* Missing data imputation: focusing on single imputation, ATM 2016 [[Paper]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4716933/) \n* Estimating the number and sizes of fuzzy-duplicate clusters, CIKM 2014 [[Paper]](https://dl.acm.org/doi/10.1145/2661829.2661885)\n* Data normalization and standardization: a technical report, MLTR 2014 [[Paper]](https://www.researchgate.net/publication/340579135_Data_Normalization_and_Standardization_A_Technical_Report)\n* CrowdER: crowdsourcing entity resolution, VLDB 2012 [[Paper]](https://vldb.org/pvldb/vol5/p1483_jiannanwang_vldb2012.pdf)\n* Imputation of Missing Data Using Machine Learning Techniques, KDD 1996 [[Paper]](https://d1wqtxts1xzle7.cloudfront.net/75598281/KDD96-023-libre.pdf?1638499004=\u0026response-content-disposition=inline%3B+filename%3DImputation_of_Missing_Data_Using_Machine.pdf\u0026Expires=1678667994\u0026Signature=BNTzHGy7~TevRwBAyUu4QCeyNWOC7vH9RcG3bx6zGHjO4mTZa1DAv8GznqsJP25EKorca59PX4R2BYrWiFgTXvtXDwgB7lgvWa0B~W6Z3fjosZLWyMRAjAuhDbFdc-jhI1vlaXHIwzvetDG6ldZZJJCNj6fY0JmkgGcFLP52JR0wy02LjxlPwgAaRyrx1m1-4MvKi-4qS9N~J55ddEchqjcezfREIOA-ab2izlrIH~nzh4UTY7D2uiPmEKQiA85wOfkI0KFjImqGiLiIEo82uA071MjdgkWfPBUrrS60EQT89bDrn-PeCHMXKCE0WnaK0MUm5tOF62h~KLU-D5y5Dg__\u0026Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA) \n\n### Data Reduction\n* Active feature selection for the mutual information criterion, AAAI 2021 [[Paper]](https://arxiv.org/abs/2012.06979) [[Code]](https://github.com/ShacharSchnapp/ActiveFeatureSelection)\n* Active incremental feature selection using a fuzzy-rough-set-based information entropy, IEEE Transactions on Fuzzy Systems, 2020 [[Paper]](https://ieeexplore.ieee.org/document/8933450)\n* MESA: boost ensemble imbalanced learning with meta-sampler, NeurIPS 2020 [[Paper]](https://arxiv.org/abs/2010.08830) [[Code]](https://github.com/ZhiningLiu1998/mesa)\n* Autoencoders, arXiv 2020 [[Paper]](https://arxiv.org/abs/2003.05991)\n* Feature selection: A data perspective, ACM COmputer Surveys, 2017 [[Paper]](https://dl.acm.org/doi/abs/10.1145/3136625) [[Code]](http://featureselection.asu.edu/)\n* Intrusion detection model using fusion of chi-square feature selection and multi class SVM, Journal of King Saud University-Computer and Information Sciences 2017 [[Paper]](https://www.researchgate.net/publication/299567135_Intrusion_Detection_Model_Using_fusion_of_Chi-square_feature_selection_and_multi_class_SVM) \n* Feature selection and analysis on correlated gas sensor data with recursive feature elimination, Sensors and Actuators B: Chemical 2015 [[Paper]](https://www.sciencedirect.com/science/article/abs/pii/S0925400515001872) \n* Embedded unsupervised feature selection, AAAI 2015 [[Paper]](https://faculty.ist.psu.edu/szw494/publications/EUFS.pdf)\n* Using random undersampling to alleviate class imbalance on tweet sentiment data, ICIRI 2015 [[Paper]](https://ieeexplore.ieee.org/document/7300975)\n* Feature selection based on information gain, IJITEE 2013 [[Paper]](https://citeseerx.ist.psu.edu/document?repid=rep1\u0026type=pdf\u0026doi=e17df473c25cccd8435839c9b6150ee61bec146a)\n* Linear discriminant analysis, Book 2013 [[Paper]](https://link.springer.com/chapter/10.1007/978-1-4419-9878-1_4) \n* Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction, 2012 [[Paper]](http://www.math.le.ac.uk/people/ag153/homepage/KNN/OliverKNN_Talk.pdf)\n* Principal component analysis, Wiley Interdisciplinary Reviews 2010 [[Paper]](https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wics.101?casa_token=7GCaA_PLs7YAAAAA:X9iYhUUhXB-3kTi-sQfPjVliujsPep6Iin7Cs9ld8wv3ECz9oeFuPEeaA_AfM0QH-cQkaD3kDoTL5sA) [[Code]](https://github.com/erdogant/pca)\n* Finding representative patterns with ordered projections, Pattern Recognition 2003 [[Paper]](https://www.sciencedirect.com/science/article/abs/pii/S003132030200119X)\n\n\n### Data Augmentation\n* Towards automated imbalanced learning with deep hierarchical reinforcement learning, CIKM 2022 [[Paper]](https://arxiv.org/abs/2208.12433) [[Code]](https://github.com/daochenzha/autosmote)\n* G-Mixup: Graph Data Augmentation for Graph Classification, ICML 2022 [[Paper]](https://arxiv.org/abs/2202.07179) [[Code]](https://github.com/ahxt/g-mixup)\n* Cascaded Diffusion Models for High Fidelity Image Generation, JMLR 2022 [[Paper]](https://arxiv.org/abs/2106.15282)\n* Time series data augmentation for deep learning: A survey, IJCAI 2021 [[Paper]](https://arxiv.org/abs/2002.12478)\n* Text data augmentation for deep learning, JBD 2020 [[Paper]](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0)\n* Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification, ACL 2020 [[Paper]](https://arxiv.org/abs/2004.12239) [[Code]](https://github.com/SALT-NLP/MixText)\n* Autoaugment: Learning augmentation policies from data, CVPR 2019 [[Paper]](https://arxiv.org/abs/1805.09501) [[Code]](https://github.com/DeepVoltaire/AutoAugment)\n* Mixup: Beyond empirical risk minimization, ICLR 2018 [[Paper]](https://arxiv.org/abs/1710.09412) [[Code]](https://github.com/facebookresearch/mixup-cifar10)\n* Synthetic data augmentation using GAN for improved liver lesion classification, ISBI 2018 [[Paper]](https://arxiv.org/abs/1801.02385) [[Code]](https://github.com/NicoEssi/GAN_Synthetic_Medical_Image_Augmentation)\n* Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation, ASRU 2017 [[Paper]](https://arxiv.org/abs/1707.06265)\n* Character-level convolutional networks for text classification, NeurIPS 2015 [[Paper]](https://arxiv.org/abs/1509.01626) [[Code]](https://github.com/zhangxiangxiao/Crepe)\n* ADASYN: Adaptive synthetic sampling approach for imbalanced learning, IJCNN 2008 [[Paper]](https://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf) [[Code]](https://github.com/stavskal/ADASYN)\n* SMOTE: synthetic minority over-sampling technique, JAIR 2002 [[Paper]](https://arxiv.org/abs/1106.1813) [[Code]](https://github.com/analyticalmindsltd/smote_variants)\n\n### Pipeline Search\n* Towards Personalized Preprocessing Pipeline Search, arXiv 2023 [[Paper]](https://arxiv.org/pdf/2302.14329v1.pdf) \n* AutoVideo: An Automated Video Action Recognition System, IJCAI 2022 [[Paper]](https://arxiv.org/abs/2108.04212) [[Code]](https://github.com/datamllab/autovideo)\n* Tods: An automated time series outlier detection system, AAAI 2021 [[Paper]](https://arxiv.org/abs/2009.09822) [[Code]](https://github.com/datamllab/tods)\n* Deepline: Automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering, KDD 2020 [[Paper]](https://arxiv.org/abs/1911.00061) \n* On evaluation of automl systems, ICML 2020 [[Paper]](https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_59.pdf)\n* AlphaD3M: Machine learning pipeline synthesis, ICML 2018 [[Paper]](https://arxiv.org/abs/2111.02508)\n* Efficient and robust automated machine learning, NeurIPS 2015 [[Paper]](https://papers.nips.cc/paper/2015/file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf) [[Code]](https://paperswithcode.com/paper/efficient-and-robust-automated-machine)\n* Tiny3D: A Data-Centric AI based 3D Object Detection Service Production System [[Code]](https://github.com/TinyDataML/Tiny3D)\n* Learning From How Humans Correct [[Paper]](https://arxiv.org/abs/2102.00225)\n* The Re-Label Method For Data-Centric Machine Learning [[Paper]](https://arxiv.org/abs/2302.04391)\n* Automatic Label Error Correction [[Paper]](https://www.techrxiv.org/users/679328/articles/731085) [[Code]](https://github.com/guotong1988/Automatic-Label-Error-Correction)\n\n## Inference Data Development\n\u003cimg width=\"800\" src=\"./imgs/inference-data-development.png\" alt=\"inference-data-development\" /\u003e\n\n### In-distribution Evaluation\n* FOCUS: Flexible optimizable counterfactual explanations for tree ensembles, AAAI 2022 [[Paper]](https://arxiv.org/abs/1911.12199) [[Code]](https://github.com/a-lucic/focus)\n* Sliceline: Fast, linear-algebra-based slice finding for ml model debugging, SIGMOD 2021 [[Paper]](https://dl.acm.org/doi/10.1145/3448016.3457323) [[Code]](https://github.com/DataDome/sliceline)\n* Counterfactual explanations for oblique decision trees: Exact, efficient algorithms, AAAI 2021 [[Paper]](https://arxiv.org/abs/2103.01096)\n* A Step Towards Global Counterfactual Explanations: Approximating the Feature Space Through Hierarchical Division and Graph Search, AAIML 2021 [[Paper]](https://www.researchgate.net/publication/357175710_A_Step_Towards_Global_Counterfactual_Explanations_Approximating_the_Feature_Space_Through_Hierarchical_Division_and_Graph_Search) \n* An exact counterfactual-example-based approach to tree-ensemble models interpretability, arXiv 2021 [[Paper]](https://arxiv.org/abs/2105.14820) [[Code]](https://paperswithcode.com/paper/an-exact-counterfactual-example-based)\n* No subclass left behind: Fine-grained robustness in coarse-grained classification problems, NeurIPS 2020 [[Paper]](https://arxiv.org/abs/2011.12945) [[Code]](https://paperswithcode.com/paper/no-subclass-left-behind-fine-grained-1)\n* FACE: feasible and actionable counterfactual explanations, AIES 2020 [[Paper]](https://arxiv.org/abs/1909.09369) [[Code]](https://github.com/sharmapulkit/FACE-Feasible-Actionable-Counterfactual-Explanations)\n* DACE: Distribution-Aware Counterfactual Explanation by Mixed-Integer Linear Optimization, IJCAI 2020 [[Paper]](https://www.ijcai.org/proceedings/2020/0395.pdf)\n* Multi-objective counterfactual explanations, arXiv 2020 [[Paper]](https://arxiv.org/abs/2004.11165) [[Code]](https://github.com/dandls/moc)\n* Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models, AIES 2020 [[Paper]](https://arxiv.org/abs/1905.07857) [[Code]]()\n* Propublica's compas data revisited, arXiv 2019 [[Paper]](https://arxiv.org/abs/1906.04711)\n* Slice finder: Automated data slicing for model validation, ICDE 2019 [[Paper]](https://research.google/pubs/pub47966/) [[Code]](https://github.com/yeounoh/slicefinder)\n* Multiaccuracy: Black-box post-processing for fairness in classification, AIES 2019 [[Paper]](https://arxiv.org/abs/1805.12317) [[Code]](https://github.com/amiratag/MultiAccuracyBoost)\n* Model agnostic contrastive explanations for structured data, arXiv 2019 [[Paper]](https://arxiv.org/abs/1906.00117)\n* Counterfactual explanations without opening the black box: Automated decisions and the GDPR, Harvard Journal of Law \u0026 Technology 2018 [[Paper]](https://arxiv.org/abs/1711.00399)\n* Comparison-based inverse classification for interpretability in machine learning, IPMU 2018 [[Paper]](https://link.springer.com/chapter/10.1007/978-3-319-91473-2_9)\n* Quantitative program slicing: Separating statements by relevance, ICSE 2013 [[Paper]](https://ieeexplore.ieee.org/document/6606695)\n* Stratal slicing, Part II: Real 3-D seismic data, Geophysics 1998 [[Paper]](https://library.seg.org/doi/abs/10.1190/1.1444352)\n\n\n### Out-of-distribution Evaluation\n* A brief review of domain adaptation, Transactions on Computational Science and Computational Intelligenc 2021 [[Paper]](https://arxiv.org/pdf/2010.03978v1.pdf)\n* Domain adaptation for medical image analysis: a survey, IEEE Transactions on Biomedical Engineering 2021 [[Paper]](https://arxiv.org/pdf/2102.09508v1.pdf)\n* Retiring adult: New datasets for fair machine learning, NeurIPS 2021 [[Paper]](https://arxiv.org/pdf/2108.04884v3.pdf) [[Code]](https://github.com/socialfoundations/folktables)\n* Wilds: A benchmark of in-the-wild distribution shifts, ICML 2021 [[Paper]](https://arxiv.org/pdf/2012.07421v3.pdf) [[Code]](https://github.com/p-lambda/wilds)\n* Do image classifiers generalize across time?, ICCV 2021 [[Paper]](https://arxiv.org/pdf/1906.02168v3.pdf)\n* Using videos to evaluate image model robustness, arXiv 2019 [[Paper]](https://arxiv.org/abs/1904.10076) \n* Regularized learning for domain adaptation under label shifts, ICLR 2019 [[Paper]](https://arxiv.org/pdf/1903.09734v1.pdf) [[Code]](https://github.com/Angie-Liu/labelshift)\n* Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [[Paper]](https://arxiv.org/abs/1903.12261) [[Code]](https://github.com/hendrycks/robustness)\n* Towards deep learning models resistant to adversarial attacks, ICLR 2018 [[Paper]](https://arxiv.org/pdf/1706.06083v4.pdf) [[Code]](https://paperswithcode.com/paper/towards-deep-learning-models-resistant-to)\n* Robust physical-world attacks on deep learning visual classification, CVPR 2018 [[Paper]](http://openaccess.thecvf.com/content_cvpr_2018/papers/Eykholt_Robust_Physical-World_Attacks_CVPR_2018_paper.pdf) \n* Detecting and correcting for label shift with black box predictors, ICML 2018 [[Paper]](https://arxiv.org/pdf/1802.03916v3.pdf)\n* Poison frogs! targeted clean-label poisoning attacks on neural networks, NeurIPS 2018 [[Paper]](https://arxiv.org/pdf/1804.00792v2.pdf) [[Code]](https://github.com/ashafahi/inceptionv3-transferLearn-poison)\n* Practical black-box attacks against machine learning, CCS 2017 [[Paper]](https://arxiv.org/pdf/1602.02697v4.pdf)\n* Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, AISec 2017 [[Paper]](https://arxiv.org/pdf/1708.03999v2.pdf) [[Code]](https://github.com/huanzhang12/ZOO-Attack)\n* Deepfool: a simple and accurate method to fool deep neural networks, CVPR 2016 [[Paper]](https://arxiv.org/pdf/1511.04599v3.pdf) [[Code]](https://github.com/LTS4/DeepFool)\n* Evasion attacks against machine learning at test time, ECML PKDD 2013 [[Paper]](https://arxiv.org/pdf/1708.06131v1.pdf) [[Code]](https://github.com/Koukyosyumei/AIJack)\n* Adapting visual category models to new domains, ECCV 2010 [[Paper]](https://link.springer.com/chapter/10.1007/978-3-642-15561-1_16)\n* Covariate shift by kernel mean matching, MIT Press 2009 [[Paper]](https://academic.oup.com/mit-press-scholarship-online/book/13447/chapter-abstract/166932982?redirectedFrom=fulltext)\n* Covariate shift adaptation by importance weighted cross validation, JMLR 2007 [[Paper]](https://www.jmlr.org/papers/volume8/sugiyama07a/sugiyama07a.pdf) \n\n### Prompt Engineering\n* SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization, arXiv 2023 [[Paper]](https://arxiv.org/pdf/2303.13035.pdf)\n* Making Pre-trained Language Models Better Few-shot Learners, arXiv 2021 [[Paper]](https://arxiv.org/pdf/2012.15723v2.pdf) [[Code]](https://github.com/princeton-nlp/LM-BFF)\n* Bartscore: Evaluating generated text as text generation, NeurIPS 2021 [[Paper]](https://arxiv.org/pdf/2106.11520v2.pdf) [[Code]](https://github.com/neulab/BARTScore)\n* BERTese: Learning to Speak to BERT, arXiv 2021 [[Paper]](https://arxiv.org/pdf/2103.05327v2.pdf)\n* Few-shot text generation with pattern-exploiting training, arXiv 2020 [[Paper]](https://arxiv.org/pdf/2012.11926v2.pdf) \n* Exploiting cloze questions for few shot text classification and natural language inference, arXiv 2020 [[Paper]](https://arxiv.org/pdf/2001.07676v3.pdf) [[Code]](https://github.com/timoschick/pet)\n* It's not just size that matters: Small language models are also few-shot learners, arXiv 2020 [[Paper]](https://arxiv.org/pdf/2009.07118v2.pdf)\n* How can we know what language models know?, TACL 2020 [[Paper]](https://arxiv.org/pdf/1911.12543v2.pdf) [[Code]](https://github.com/jzbjyb/LPAQA)\n* Universal adversarial triggers for attacking and analyzing NLP, EMNLP 2019 [[Paper]](https://arxiv.org/pdf/1908.07125v3.pdf) [[Code]](https://github.com/Eric-Wallace/universal-triggers)\n\n\n\n## Data Maintenance\n\u003cimg width=\"800\" src=\"./imgs/data-maintenance.png\" alt=\"data-maintenance\" /\u003e\n\n### Data Understanding\n* The science of visual data communication: What works, Psychological Science in the Public Interest 2021 [[Paper]](https://journals.sagepub.com/doi/full/10.1177/15291006211051956)\n* Towards out-of-distribution generalization: A survey, arXiv 2021 [[Paper]](https://arxiv.org/abs/2108.13624) \n* Snowy: Recommending utterances for conversational visual analysis, UIST 2021 [[Paper]](https://arxiv.org/abs/2110.04323) \n* A distributional framework for data valuation, ICML 2020 [[Paper]](https://arxiv.org/abs/2002.12334)\n* A comparison of radial and linear charts for visualizing daily patterns, TVCG 2020 [[Paper]](https://ieeexplore.ieee.org/abstract/document/8807238)\n* A marketplace for data: An algorithmic solution, EC 2019 [[Paper]](https://arxiv.org/abs/1805.08125)\n* Data shapley: Equitable valuation of data for machine learning, PMLR 2019 [[Paper]](https://arxiv.org/abs/1904.02868) [[Code]](https://github.com/amiratag/DataShapley)\n* Deepeye: Towards automatic data visualization, ICDE 2018 [[Paper]](https://dbgroup.cs.tsinghua.edu.cn/ligl/papers/icde18-deepeye.pdf) [[Code]](https://github.com/Thanksyy/DeepEye-APIs)\n* Voyager: Exploratory analysis via faceted browsing of visualization recommendations, TVCG 2016 [[Paper]](https://idl.cs.washington.edu/files/2015-Voyager-InfoVis.pdf)\n* A survey of clustering algorithms for big data: Taxonomy and empirical analysis, TETC 2014 [[Paper]](https://ieeexplore.ieee.org/document/6832486)\n* On the benefits and drawbacks of radial diagrams, Handbook of Human Centric Visualization 2013 [[Paper]](https://link.springer.com/chapter/10.1007/978-1-4614-7485-2_17)\n* What makes a visualization memorable?, TVCG 2013 [[Paper]](https://ieeexplore.ieee.org/document/6634103) \n* Toward a taxonomy of visuals in science communication, Technical Communication 2011 [[Paper]](https://www.jstor.org/stable/26464332)\n\n### Data Quality Assurance\n* Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving, CIKM Workshop 2022 [[Paper]](https://ceur-ws.org/Vol-3318/short14.pdf) \n* A Human-ML Collaboration Framework for Improving Video Content Reviews, arXiv 2022 [[Paper]](https://arxiv.org/abs/2210.09500) \n* Knowledge graph quality management: a comprehensive survey, TKDE 2022 [[Paper]](https://ieeexplore.ieee.org/document/9709663)\n* A crowdsourcing open platform for literature curation in UniProt, PLoS Biol. 2021 [[Paper]](https://pubmed.ncbi.nlm.nih.gov/34871295/) [[Code]]()\n* Building data curation processes with crowd intelligence, Advanced Information Systems Engineering 2020 [[Paper]](https://link.springer.com/chapter/10.1007/978-3-030-58135-0_3)\n* Data Curation with Deep Learning, EDBT, 2020 [[Paper]](https://openproceedings.org/2020/conf/edbt/paper_142.pdf)\n* Automating large-scale data quality verification, VLDB 2018 [[Paper]](https://dl.acm.org/doi/abs/10.14778/3229863.3229867) \n* Data quality: The role of empiricism, SIGMOD 2017 [[Paper]](https://sigmodrecord.org/publications/sigmodRecord/1712/pdfs/07_forum_Sadiq.pdf)\n* Tfx: A tensorflow-based production-scale machine learning platform, KDD 2017 [[Paper]](https://research.google/pubs/pub46484/) [[Code]](https://github.com/tensorflow/tfx)\n* Discovering denial constraints, VLDB 2013 [[Paper]](http://www.vldb.org/pvldb/vol6/p1498-papotti.pdf) [[Code]](https://github.com/ah89/ApproxDC)\n* Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [[Paper]](https://dl.acm.org/doi/10.1145/1541880.1541883)\n* Conditional functional dependencies for data cleaning, ICDE 2007 [[Paper]](https://homepages.inf.ed.ac.uk/wenfei/papers/icde07-cfd.pdf)\n* Data quality assessment, Communications of the ACM 2002 [[Paper]](https://dl.acm.org/doi/pdf/10.1145/505248.506010) \n\n\n### Data Storage and Retrieval\n* Dbmind: A self-driving platform in opengauss, PVLDB 2021 [[Paper]](http://vldb.org/pvldb/vol14/p2743-zhou.pdf)\n* Online index selection using deep reinforcement learning for a cluster database, ICDEW 2020 [[Paper]](https://ieeexplore.ieee.org/document/9094124) \n* Bridging the semantic gap with SQL query logs in natural language interfaces to databases, ICDE 2019 [[Paper]](https://arxiv.org/abs/1902.00031)\n* An end-to-end learning-based cost estimator, VLDB 2019 [[Paper]](https://arxiv.org/abs/1906.02560) [[Code]](https://github.com/greatji/Learning-based-cost-estimator)\n* An adaptive approach for index tuning with learning classifier systems on hybrid storage environments, Hybrid Artificial Intelligent Systems 2018 [[Paper]](https://link.springer.com/chapter/10.1007/978-3-319-92639-1_60)\n* Automatic database management system tuning through large-scale machine learning, SIGMOD 2017 [[Paper]](https://www.cs.cmu.edu/~ggordon/van-aken-etal-parameters.pdf)\n* Learning to rewrite queries, CIKM 2016 [[Paper]](http://www.yichang-cs.com/yahoo/CIKM2016_rewrite.pdf)\n* DBridge: A program rewrite tool for set-oriented query execution, IEEE ICDE 2011 [[Paper]](https://www.semanticscholar.org/paper/DBridge%3A-A-program-rewrite-tool-for-set-oriented-Chavan-Guravannavar/39b214a86cd6c3fd7fe359ce2127dd17e163452c) \n* Starfish: A Self-tuning System for Big Data Analytics, CIDR 2011 [[Paper]](https://www.researchgate.net/publication/220988112_Starfish_A_Self-tuning_System_for_Big_Data_Analytics) [[Code]](https://github.com/mostafa-ead/Starfish)\n* DB2 advisor: An optimizer smart enough to recommend its own indexes, ICDE 2000 [[Paper]](https://www.researchgate.net/publication/220967569_DB2_Advisor_An_Optimizer_Smart_Enough_to_Recommend_its_own_Indexes) \n* An efficient, cost-driven index selection tool for Microsoft SQL server, VLDB 1997 [[Paper]](https://www.vldb.org/conf/1997/P146.PDF)\n\n## Data Benchmark\n\n### Training Data Development Benchmark\n* OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [[Paper]](https://arxiv.org/abs/2306.10280) [[Code]](https://github.com/OpenGSL/OpenGSL)\n* REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines, EDBT 2023 [[Paper]](https://arxiv.org/abs/2302.04702) [[Code]](https://github.com/mohamedyd/rein-benchmark)\n* Usb: A unified semi-supervised learning benchmark for classification, NeurIPS 2022 [[Paper]](https://arxiv.org/abs/2208.07204) [[Code]](https://github.com/microsoft/Semi-supervised-learning)\n* A feature extraction \u0026 selection benchmark for structural health monitoring, Structural Health Monitoring 2022 [[Paper]](https://journals.sagepub.com/doi/full/10.1177/14759217221111141)\n* Data augmentation for deep graph learning: A survey, KDD 2022 [[Paper]](https://arxiv.org/abs/2202.08235)\n* Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods, Briefings in Bioinformatics 2022 [[Paper]](https://pubmed.ncbi.nlm.nih.gov/35945147/) [[Code]](https://github.com/abhivij/bloodbased-pancancer-diagnosis)\n* Amlb: an automl benchmark, arXiv 2022 [[Paper]](https://arxiv.org/abs/2207.12560)\n* A benchmark for data imputation methods, Front. Big Data 2021 [[Paper]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8297389/) [[Code]](https://github.com/se-jaeger/data-imputation-paper)\n* Benchmark and survey of automated machine learning frameworks, JAIR 2021 [[Paper]](https://arxiv.org/abs/1904.12054) \n* Benchmarking differentially private synthetic data generation algorithms, arXiv 2021 [[Paper]](https://arxiv.org/abs/2112.09238)\n* A comprehensive benchmark framework for active learning methods in entity matching, SIGMOD 2020 [[Paper]](https://arxiv.org/abs/2003.13114)\n* Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy, CVPR 2020 [[Paper]](https://arxiv.org/abs/2004.00448) [[Code]](https://github.com/clovaai/cutblur)\n* Comparison of instance selection and construction methods with various classifiers, Applied Sciences 2020 [[Paper]](https://www.mdpi.com/2076-3417/10/11/3933)\n* An empirical survey of data augmentation for time series classification with neural networks, arXiv 2020 [[Paper]](https://arxiv.org/abs/2007.15951) [[Code]](https://github.com/uchidalab/time_series_augmentation)\n* Toward a quantitative survey of dimension reduction techniques, IEEE Transactions on Visualization and Computer Graphics 2019 [[Paper]](https://pubmed.ncbi.nlm.nih.gov/31567092/) [[Code]](https://mespadoto.github.io/proj-quant-eval/post/datasets/)\n* Cleanml: A benchmark for joint data cleaning and machine learning experiments and analysis, arXiv 2019 [[Paper]](https://arxiv.org/abs/1904.09483) [[Code]](https://github.com/chu-data-lab/CleanML)\n* Comparison of different image data augmentation approaches, Journal of Big Data 2019 [[Paper]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8707550/) [[Code]]()\n* A benchmark and comparison of active learning for logistic regression, Pattern Recognition 2018 [[Paper]](https://arxiv.org/abs/1611.08618)\n* A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge, Database (Oxford). 2017 [[Paper]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737202/) [[Data]](https://ngdc.cncb.ac.cn/databasecommons/database/id/3366)\n* RODI: A benchmark for automatic mapping generation in relational-to-ontology data integration, ESWC 2015 [[Paper]](https://link.springer.com/chapter/10.1007/978-3-319-18818-8_2) [[Code]](https://github.com/chrpin/rodi)\n* TPC-DI: the first industry benchmark for data integration, PVLDB 2014 [[Paper]](http://www.vldb.org/pvldb/vol7/p1367-poess.pdf)\n* Comparison of instance selection algorithms II. Results and comments, ICAISC 2004 [[Paper]](https://link.springer.com/chapter/10.1007/978-3-540-24844-6_87)\n\n### Inference Data Development Benchmark\n* Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv 2023 [[Paper]](https://arxiv.org/abs/2206.04615) [[Code]](https://github.com/google/BIG-bench)\n* Carla: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms, arXiv 2021 [[Paper]](https://arxiv.org/abs/2108.00783) [[Code]](https://github.com/carla-recourse/CARLA)\n* Benchmarking adversarial robustness on image classification, CVPR 2020 [[Paper]](https://openaccess.thecvf.com/content_CVPR_2020/papers/Dong_Benchmarking_Adversarial_Robustness_on_Image_Classification_CVPR_2020_paper.pdf)\n* Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples, ACL Workshop 2020 [[Code]](https://github.com/QData/TextAttack)\n* Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [[Paper]](https://arxiv.org/abs/1903.12261) [[Code]](https://github.com/hendrycks/robustness) [[Code]]()\n\n### Data Maintenance Benchmark\n* Chart-to-text: A large-scale benchmark for chart summarization, ACL 2022 [[Paper]](https://arxiv.org/abs/2203.06486) [[Code]](https://github.com/JasonObeid/Chart2Text)\n* Scalability vs. utility: Do we have to sacrifice one for the other in data importance quantification?, CVPR 2021 [[Paper]](https://arxiv.org/abs/1911.07128) [[Code]](https://github.com/AI-secure/Shapley-Study)\n* An evaluation-focused framework for visualization recommendation algorithms, IEEE Transactions on Visualization and Computer Graphics 2021 [[Paper]](https://arxiv.org/abs/2109.02706) [[Code]](https://github.com/Zehua-Zeng/visualization-recommendation-evaluation)\n* Facilitating database tuning with hyper-parameter optimization: a comprehensive experimental evaluation, VLDB 2021 [[Paper]](https://arxiv.org/abs/2110.12654) [[Code]](https://github.com/PKU-DAIR/KnobsTuningEA)\n* Benchmarking Data Curation Systems, \nIEEE Data Eng. Bull. 2016 [[Paper]](http://sites.computer.org/debull/A16june/p47.pdf) \n* Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [[Paper]](https://dl.acm.org/doi/10.1145/1541880.1541883)\n* Benchmark development for the evaluation of visualization for data mining, Information visualization in data mining and knowledge discovery 2001 [[Paper]](https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir6287.pdf)\n\n### Unified Benchmark\n* Dataperf: Benchmarks for data-centric AI development, arXiv 2022 [[Paper]](https://arxiv.org/abs/2207.10062) \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaochenzha%2Fdata-centric-ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaochenzha%2Fdata-centric-ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaochenzha%2Fdata-centric-ai/lists"}