{"id":13712930,"url":"https://github.com/styfeng/DataAug4NLP","last_synced_at":"2025-05-06T22:31:50.244Z","repository":{"id":37459767,"uuid":"367549861","full_name":"styfeng/DataAug4NLP","owner":"styfeng","description":"Collection of papers and resources for data augmentation for NLP.","archived":false,"fork":false,"pushed_at":"2022-08-12T21:20:02.000Z","size":123,"stargazers_count":828,"open_issues_count":0,"forks_count":78,"subscribers_count":28,"default_branch":"main","last_synced_at":"2024-11-13T23:32:45.135Z","etag":null,"topics":["acl2021","artificial-intelligence","data-augmentation","deep-learning","machine-learning","natural-language-processing","survey","survey-paper","text-classification","transformers"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2105.03075","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/styfeng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-15T05:43:59.000Z","updated_at":"2024-11-04T08:09:02.000Z","dependencies_parsed_at":"2022-08-31T13:11:20.201Z","dependency_job_id":null,"html_url":"https://github.com/styfeng/DataAug4NLP","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/styfeng%2FDataAug4NLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/styfeng%2FDataAug4NLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/styfeng%2FDataAug4NLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/styfeng%2FDataAug4NLP/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/styfeng","download_url":"https://codeload.github.com/styfeng/DataAug4NLP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252779181,"owners_count":21802900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acl2021","artificial-intelligence","data-augmentation","deep-learning","machine-learning","natural-language-processing","survey","survey-paper","text-classification","transformers"],"created_at":"2024-08-02T23:01:24.717Z","updated_at":"2025-05-06T22:31:45.267Z","avatar_url":"https://github.com/styfeng.png","language":null,"readme":"# Data Augmentation Techniques for NLP \n\n\nIf you'd like to add your paper, do not email us. 
Instead, read the protocol for [adding a new entry](https://github.com/styfeng/DataAug4NLP/blob/main/rules.md) and send a pull request.\n\nWe group the papers by [text classification](#text-classification), [translation](#translation), [summarization](#summarization), [question-answering](#question-answering), [sequence tagging](#sequence-tagging), [parsing](#parsing), [grammatical-error-correction](#grammatical-error-correction), [generation](#generation), [dialogue](#dialogue), [multimodal](#multimodal), [mitigating bias](#mitigating-bias), [mitigating class imbalance](#mitigating-class-imbalance), [adversarial examples](#adversarial-examples), [compositionality](#compositionality), and [automated augmentation](#automated-augmentation).\n\nThis repository is based on our paper, [\"A survey of data augmentation approaches in NLP (Findings of ACL '21)\"](https://aclanthology.org/2021.findings-acl.84/). You can cite it as follows:\n```\n@inproceedings{feng-etal-2021-survey,\n    title = \"A Survey of Data Augmentation Approaches for {NLP}\",\n    author = \"Feng, Steven Y.  and\n      Gangal, Varun  and\n      Wei, Jason  and\n      Chandar, Sarath  and\n      Vosoughi, Soroush  and\n      Mitamura, Teruko  and\n      Hovy, Eduard\",\n    booktitle = \"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021\",\n    month = aug,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.findings-acl.84\",\n    doi = \"10.18653/v1/2021.findings-acl.84\",\n    pages = \"968--988\",\n}\n```\nAuthors: \u003ca href=\"https://scholar.google.ca/citations?hl=en\u0026user=zwiszZIAAAAJ\"\u003eSteven Y. 
Feng\u003c/a\u003e,\n\t\t\t  \u003ca href=\"https://scholar.google.com/citations?user=rWZq2nQAAAAJ\u0026hl=en\"\u003eVarun Gangal\u003c/a\u003e,\n\t\t\t  \u003ca href=\"https://scholar.google.com/citations?user=wA5TK_0AAAAJ\u0026hl=en\"\u003eJason Wei\u003c/a\u003e,\n\t\t\t  \u003ca href=\"https://scholar.google.co.in/citations?user=yxWtZLAAAAAJ\u0026hl=en\"\u003eSarath Chandar\u003c/a\u003e,\n\t\t\t  \u003ca href=\"https://scholar.google.ca/citations?user=45DAXkwAAAAJ\u0026hl=en\"\u003eSoroush Vosoughi\u003c/a\u003e,\n\t\t\t  \u003ca href=\"https://scholar.google.com/citations?user=gjsxBCkAAAAJ\u0026hl=en\"\u003eTeruko Mitamura\u003c/a\u003e,\n\t\t\t  \u003ca href=\"https://scholar.google.com/citations?user=PUFxrroAAAAJ\u0026hl=en\"\u003eEduard Hovy\u003c/a\u003e\n\nSpecial thanks to Ryan Shentu, Fiona Feng, Karen Liu, Emily Nie, Tanya Lu, and Bonnie Ma for helping out with this repo.\nNote: WIP. More papers will be added from our survey paper to this repo soon.\nInquiries should be directed to stevenyfeng@gmail.com or by opening an issue here.\n\nAlso, check out our **talk for Google Research** (Steven Feng and Varun Gangal) [here](https://www.youtube.com/watch?v=kNBVesKUZCk\u0026ab_channel=StevenFeng), and our **podcast episode** (Steven Feng and Eduard Hovy) [here](https://www.youtube.com/watch?v=qmqyT_97Poc) and [here](https://thedataexchange.media/data-augmentation-in-natural-language-processing/).\n\n\n### Text Classification\n| Paper | Datasets | \n| -- | --- |\n| Unsupervised Word Sense Disambiguation Rivaling Supervised Methods ([ACL '95](https://www.aclweb.org/anthology/P95-1026.pdf)) | Paper-Specific/Legacy Corpus | \n| Synonym Replacement (Character-Level Convolutional Networks for Text Classification, [NeurIPS '15](https://papers.nips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf)) | AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon | \n| That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to 
Automatic Categorization of Annoying Behaviors using #petpeeve Tweets [(EMNLP '15)](https://www.aclweb.org/anthology/D15-1306.pdf) | twitter| \n| Robust Training under Linguistic Adversity [(EACL '17)](https://www.aclweb.org/anthology/E17-2004/) [code](https://github.com/lrank/Linguistic_adversity) | Movie review, customer review, SUBJ, SST | \n| Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations [(NAACL '18)](https://www.aclweb.org/anthology/N18-2072.pdf) [code](https://github.com/pfnet-research/contextual_augmentation) | SST, SUBJ, MRQA, RT, TREC | \n| Variational Pretraining for Semi-supervised Text Classification [(ACL '19)](https://www.aclweb.org/anthology/P19-1590.pdf) [code](http://github.com/allenai/vampire) | IMDB, AG News, Yahoo, hatespeech | \n| EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [(EMNLP '19)](http://dx.doi.org/10.18653/v1/D19-1670) [code](https://github.com/jasonwei20/eda_nlp) | SST, CR, SUBJ, TREC, PC |\n| A Closer Look At Feature Space Data Augmentation For Few-Shot Intent Classification [(DeepLo @ EMNLP '19)](https://arxiv.org/abs/1910.04176) | SNIPS |\n| Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification [(AAAI '20)](https://doi.org/10.1609/aaai.v34i04.5822) | TREC, SST, Subj, MR |\n| MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.194/) [code](https://github.com/GT-SALT/MixText) | AG News, DBpedia, Yahoo, IMDb | \n| Unsupervised Data Augmentation for Consistency Training [(NeurIPS '20)](https://papers.nips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html) [code](https://papers.nips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html) | Yelp, IMDb, amazon, DBpedia | \n| Not Enough Data? Deep Learning to the Rescue! 
[(AAAI '20)](https://arxiv.org/abs/1911.03118) | ATIS, TREC, WVA | \n| Data Augmentation using Pre-trained Transformer Models [LifeLongNLP @ AACL '20](https://arxiv.org/abs/2003.02245), [code](https://github.com/varunkumar-dev/TransformersDataAugmentation) |SNIPS, TREC, SST2 |\n| SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.97/) [code](https://github.com/nng555/ssmba) | IWSLT'14 | \n| Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.726/) | ICWSM 20’ Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony |\n| Textual Data Augmentation for Efficient Active Learning on Tiny Datasets [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.600/) | SST2, TREC |\n| Text Augmentation in a Multi-Task View [(EACL '21)](https://www.aclweb.org/anthology/2021.eacl-main.252/) | SST2, TREC, SUBJ | \n| GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation [(arXiv '21)](https://arxiv.org/abs/2104.08826) | SST2, CR, TREC, SUBJ, MPQA, CoLA |\n| Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning [(NAACL '21)](https://arxiv.org/abs/2103.07552) [code](https://github.com/jasonwei20/triplet-loss) | HUFF, COV-Q, AMZN, FEWREL | \n| Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification [(EMNLP '21)](https://arxiv.org/abs/2109.00523) [code](https://github.com/lancopku/text-autoaugment) | IMDB, SST2, SST5, TREC, YELP2, YELP5 |\n| AEDA: An Easier Data Augmentation Technique for Text Classification [(EMNLP '21)](https://arxiv.org/abs/2108.13230) [code](https://github.com/akkarimi/aeda_nlp) | SST, CR, SUBJ, TREC, PC |\n\n### Translation\n\n| Paper | Datasets | \n| -- | --- |\n| Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, [ACL 
'16](https://www.aclweb.org/anthology/P16-1009.pdf)) | WMT '15 en-de, IWSLT '15 en-tr |\n| Adapting Neural Machine Translation with Parallel Synthetic Data [(WMT '17)](https://www.aclweb.org/anthology/W17-4714/) | COMMON, 1 Billion Words, dev2013, XRCE, IT, E-Com| \n| Data Augmentation for Low-Resource Neural Machine Translation [(ACL '17)](https://www.aclweb.org/anthology/P17-2090/) [code](https://github.com/marziehf/DataAugmentationNMT) | WMT '14/'15/'16 en-de/de-en| \n| Synthetic Data for Neural Machine Translation of Spoken-Dialects [(arxiv '17)](https://arxiv.org/abs/1707.00079) | LDC2012T09, OpenSubtitles-2013| \n| Multi-Source Neural Machine Translation with Data Augmentation [(IWSLT '18)](https://arxiv.org/abs/1810.06826) | TED Talks| \n| SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation [(EMNLP '18)](https://www.aclweb.org/anthology/D18-1100/) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |\n| Generalizing Back-Translation in Neural Machine Translation [(WMT '19)](https://www.aclweb.org/anthology/W19-5205/) | ed NewsCrawl2, WMT'18 de-en| \n| Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation [(ACL '19)](https://www.aclweb.org/anthology/P19-1175/) | DGT-TM en-ml/en-hu| \n| Augmenting Neural Machine Translation with Knowledge Graphs [(arxiv '19)](https://arxiv.org/abs/1902.08816) | WMT '14 -'18| \n| Generalized Data Augmentation for Low-Resource Translation [(ACL '19)](https://www.aclweb.org/anthology/P19-1579/) [code](https://github.com/xiamengzhou/DataAugForLRL)| ENG-HRL-LRL, HRL-LRL | \n| Improving Robustness of Machine Translation with Synthetic Noise [(NAACL '19)](https://www.aclweb.org/anthology/N19-1190/) [code](https://github.com/MysteryVaibhav/robust_mtnt)| EP, TED, MTNT en-fr en-jpn| \n| Soft Contextual Data Augmentation for Neural Machine Translation [(ACL '19)](https://www.aclweb.org/anthology/P19-1555/) [code](https://github.com/teslacool/SCA) | IWSLT '14 de/es/he-en, WMT '14 en-de 
|\n| Data augmentation using back-translation for context-aware neural machine translation [(DiscoMT @ EMNLP '19)](https://www.aclweb.org/anthology/D19-6504/) [code](https://github.com/sugi-a/discomt2019) | IWSLT'17 en-ja/en-fr, BookCorpus, Europarl v7, National Diet of Japan | \n| Improving Neural Machine Translation Robustness via Data Augmentation: Beyond Back-Translation [(W-NUT @ EMNLP '19)](https://www.aclweb.org/anthology/D19-5543/) | WMT'15/'19 en/fr, MTNT, IWSLT'17, MuST-C | \n| Data augmentation for pipeline-based speech translation [(Baltic HLT '20)](https://hal.inria.fr/hal-02907053) | WMT '17 | \n| Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation [(IJCAI '20)](https://www.ijcai.org/proceedings/2020/496) [code](https://github.com/ghchen18/leca) | WMT '16 de-en, NIST zh-en |\n| A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation [(Information '20)](https://www.mdpi.com/2078-2489/11/5/255) | IWSLT '14 en-de | \n| Syntax-aware Data Augmentation for Neural Machine Translation [(arxiv '20)](https://arxiv.org/abs/2004.14200) | WMT '14 en-de, IWSLT '14 de-en | \n| SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.97/) [code](https://github.com/nng555/ssmba) | IWSLT'14 | \n| Data diversification: A simple strategy for neural machine translation [(NeurIPS '20)](https://proceedings.neurips.cc/paper/2020/file/7221e5c8ec6b08ef6d3f9ff3ce6eb1d1-Paper.pdf) [code](https://github.com/nxphi47/data_diversification) | WMT '14 en-de/en-fr, IWSLT '13/'14/'15 en-de/de-en/en-fr |\n| AdvAug: Robust Adversarial Augmentation for Neural Machine Translation [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.529/) | NIST zh-en, WMT '14 en-de| \n| Dictionary-based Data Augmentation for Cross-Domain Neural Machine Translation [(arxiv '20)](https://arxiv.org/abs/2004.02577) | WMT '14/'19 | \n| Sentence Boundary 
Augmentation For Neural Machine Translation Robustness [(arxiv '20)](https://arxiv.org/abs/2010.11132) | IWSLT '14/'15/'18 en-de, WMT '18 en-de | \n| Valar nmt : Vastly lacking resources neural machine translation [(Stanford CS224N)](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15811193.pdf) | Bible, Misc, Europarl v8, Newstest '18 | \n\n\n### Summarization\n\n| Paper | Datasets | \n| -- | --- |\n| Transforming Wikipedia into Augmented Data for Query-Focused Summarization [(arxiv '19)](https://arxiv.org/abs/1911.03324) | DUC |\n| Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge [(EMNLP '19)](https://www.aclweb.org/anthology/D19-1616/)) | Swisstext, commoncrawl | \n| Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation [(NAACL '21)](https://arxiv.org/abs/2010.12836) | CNN-DailyMail | \n| Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [(AAAI '21)](https://arxiv.org/abs/2103.01863) [code](https://github.com/ramakanth-pasunuru/QmdsCnnIr) | QMDSCNN, QMDSIR, WikiSum, DUC 2006, DUC 2007 |\n\n\n### Question Answering\n\n| Paper | Datasets | \n| -- | --- |\n| QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension [(ICLR '18)](https://openreview.net/forum?id=B14TlG-RW) | SQuAD, TriviaQA |\n| An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering [(EMNLP '19 Workshop)](https://www.aclweb.org/anthology/D19-5829/) | MRQA | \n| Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering [(arxiv '19)](https://arxiv.org/abs/1904.06652) | SQuAD, Trivia-QA, CMRC, DRCD | \n| XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering [(arxiv '19)](https://openreview.net/forum?id=BJgAf6Etwr) | XNLI, SQuAD |\n| Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering [(arxiv 
'20)](https://arxiv.org/abs/2010.12643) | MLQA, XQuAD, SQuAD-it, PIAF | \n| Logic-Guided Data Augmentation and Regularization for Consistent Question Answering [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.499/) [code](https://github.com/AkariAsai/logic_guided_qa) | WIQA, QuaRel, HotpotQA |\n\n\n### Sequence Tagging\n\n| Paper | Datasets | \n| -- | --- |\n| Data Augmentation via Dependency Tree Morphing for Low-Resource Languages [(EMNLP '18)](https://www.aclweb.org/anthology/D18-1545.pdf) [code](https://github.com/gozdesahin/crop-rotate-augment) | universal dependencies project | \n| DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.488/) [code](https://github.com/ntunlp/daga) | CoNLL2002/2003 |\n| An Analysis of Simple Data Augmentation for Named Entity Recognition [(COLING '20)](https://www.aclweb.org/anthology/2020.coling-main.343/) | MaSciP, i2b2- 2010 |\n| SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.691/) [code](https://github.com/rz-zhang/SeqMix) | CoNLL-03, ACE05, Webpage |\n\n\n### Parsing\n| Paper | Datasets | \n| -- | --- |\n| Data Recombination for Neural Semantic Parsing [(ACL '16)](https://www.aclweb.org/anthology/P16-1002/) [code](https://github.com/dongpobeyond/Seq2Act) | GeoQuery, ATIS, Overnight |\n| A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages [(EMNLP '19)](https://www.aclweb.org/anthology/D19-1102/) | Universal Dependencies treebanks version 2.2 |\n| Named Entity Recognition for Social Media Texts with Semantic Augmentation [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.107/)[code](https://github.com/cuhksz-nlp/SANER) | WNUT16, WNUT17, Weibo |\n| Good-Enough Compositional Data Augmentation [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.676/) 
[code](https://github.com/jacobandreas/geca) | SCAN |\n| GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [(ICLR '21)](https://openreview.net/forum?id=kyaIeYj4zZ) | SPIDER, WIKISQL, WIKITABLEQUESTIONS |\n\n\n### Grammatical Error Correction\n| Paper | Datasets | \n| -- | --- |\n| GenERRate: Generating Errors for Use in Grammatical Error Detection [(BEA '09)](https://www.aclweb.org/anthology/W09-2112/) | Ungram-BNC |\n| Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners [(IJCNLP '11)](https://www.aclweb.org/anthology/I11-1017/) [code](https://github.com/google-research-datasets/clang8) | Lang-8 |\n| Artificial error generation for translation-based grammatical error correction [(University of Cambridge Technical Report '16)](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-895.pdf)  | Several Datasets |\n| Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction. [(NAACL'18)](https://www.aclweb.org/anthology/N18-1057/) | Lang-8, CoNLL-2014, CoNLL-2013, JFLEG | \n| Using Wikipedia Edits in Low Resource Grammatical Error Correction. [(WNUT @ EMNLP '18)](https://doi.org/10.18653/v1/W18-6111) | Falko-MERLIN GEC Corpus |\n| Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting [(arxiv '19)](https://arxiv.org/abs/1909.06002) | CoNLL-2014 , JFLEG |\n| Controllable Data Synthesis Method for Grammatical Error Correction [(arxiv '19)](https://arxiv.org/abs/1909.13302) [code](https://github.com/marumalo/survey/issues/21) | NUCLE, Lang-8, One-Billion, CoNLL2013, CoNLL2014|\n| Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data. 
[(BEA @ ACL '19)](https://doi.org/10.18653/v1/W19-4427) | FCE, NUCLE, W\u0026I+LOCNESS, Lang-8 |\n| Corpora Generation for Grammatical Error Correction [(NAACL'19)](https://doi.org/10.18653/v1/N19-1333) | CoNLL-2014, JFLEG, Lang-8 |\n| Erroneous data generation for Grammatical Error Correction [(BEA @ ACL '19)](https://www.aclweb.org/anthology/W19-4415/) | Lang-8, CoNLL, JFLEG, CoNLL-2014, ABCN, FCE |\n| Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting [(arxiv '19)](https://arxiv.org/abs/1909.06002) [code](https://github.com/marumalo/survey/issues/6) | GYAFC, WMT14, WMT18 |\n| A neural grammatical error correction system built on better pre-training and sequential transfer learning. [(BEA @ ACL '19)](https://doi.org/10.18653/v1/W19-4423) | FCE, NUCLE, W\u0026I+LOCNESS, Lang-8, Gutenberg, Tatoeba, WikiText-103 |\n| Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation [(COLING'20)](https://doi.org/10.18653/v1/2020.coling-main.200) | FCE, NUCLE, W\u0026I+LOCNESS, Lang-8 |\n| A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction [(BEA @ ACL '20)](https://www.aclweb.org/anthology/2020.bea-1.21/) | W\u0026I+LOCNESS, FCE, News Crawl 2, W\u0026I+L train, FCE-train, NUCLE, Lang-8, W\u0026I+L dev, FCE-test, Tatoeba, WikiText-103 |\n| A syntactic rule-based framework for parallel data synthesis in Japanese GEC [(MIT Thesis '20)](https://dspace.mit.edu/handle/1721.1/127416) | Lang-8 |\n\n\n### Generation\n\n| Paper | Datasets | \n| -- | --- |\n| TNT-NLG, System 2: Data repetition and meaning representation manipulation to improve neural generation [(E2E NLG Challenge System Descriptions)](http://www.macs.hw.ac.uk/InteractionLab/E2E/final_papers/E2E-TNT_NLG2.pdf) | TODO | \n| Findings of the Third Workshop on Neural Generation and Translation [(WNGT @ EMNLP '19)](https://www.aclweb.org/anthology/D19-5601/) | RotoWire English-German | \n| A Good Sample is Hard 
to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models [(INLG '19)](https://www.aclweb.org/anthology/W19-8672/) [code](https://github.com/kedz/noiseylg) | E2E Challenge Dataset, Laptops, TVs | \n| GenAug: Data Augmentation for Finetuning Text Generators [(DeeLIO @ EMNLP '20)](https://www.aclweb.org/anthology/2020.deelio-1.4/) [code](https://github.com/styfeng/GenAug) | Yelp | \n| Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers [(WebNLG+ @ INLG '20)](https://www.aclweb.org/anthology/2020.webnlg-1.9/) | WebNLG |\n\n\n### Dialogue\n| Paper | Datasets | \n| -- | --- |\n| Sequence-to-Sequence Data Augmentation for Dialogue Language Understanding [(COLING '18)](https://www.aclweb.org/anthology/C18-1105/) [code](https://github.com/AtmaHou/Seq2SeqDataAugmentationForLU) | ATIS, Dec94, Stanford dialogue |\n| Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context [(arxiv '19)](https://arxiv.org/abs/1911.10484) [code](https://github.com/thu-spmi/damd-multiwoz) | MultiWOZ |\n| Data Augmentation by Data Noising for Open-vocabulary Slots in Spoken Language Understanding [(Student Research Workshop @ NAACL '19)](https://www.aclweb.org/anthology/N19-3014/) | ATIS, Snips, MR |\n| Data Augmentation with Atomic Templates for Spoken Language Understanding [(EMNLP '19)](https://www.aclweb.org/anthology/D19-1375/) [code](https://github.com/sz128/DAAT_SLU) | DSTC 2\u00263,  DSTC2 |\n| Data Augmentation for Spoken Language Understanding via Joint Variational Generation [(AAAI '19)](https://ojs.aaai.org/index.php/AAAI/article/view/4729) | ATIS, Snips, MIT |\n| Effective Data Augmentation Approaches to End-to-End Task-Oriented Dialogue [(IALP '19)](https://ieeexplore.ieee.org/document/9037690) | CamRest676, KVRET |\n| Paraphrase Augmented Task-Oriented Dialog Generation [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.60/) 
[code](https://github.com/thu-spmi/PARG) | TCamRest676, MultiWOZ |\n| Dialog State Tracking with Reinforced Data Augmentation [(AAAI '20)](https://ojs.aaai.org/index.php/AAAI/article/view/6491) | WoZ,  MultiWoZ |\n| Data Augmentation for Copy-Mechanism in Dialogue State Tracking [(arxiv '20)](https://arxiv.org/abs/2002.09634) | WoZ, DSTC2, Multi |\n| Simple is Better! Lightweight Data Augmentation for Low Resource Slot Filling and Intent Classification [(PACLIC '20)](https://www.aclweb.org/anthology/2020.paclic-1.20/) [code](https://github.com/slouvan/saug) | ATIS, SNIPS, FB |\n| Conversation Graph: Data Augmentation, Training, and Evaluation for Non-Deterministic Dialogue Management [(TACL '21)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00352/97777/Conversation-Graph-Data-Augmentation-Training-and) | M2M, MultiWOZ |\n| GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation [(EMNLP '21)](https://aclanthology.org/2021.emnlp-main.35/) [code](https://github.com/asappresearch/gold) | SMCalFlow, ROSTD |\n| Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation [(ACL '21 Findings)](https://aclanthology.org/2021.findings-acl.357/) [code](https://github.com/harsh19/Diverse-Reference-Augmentation/)| DailyDialog |\n\n### Multimodal\n| Paper | Datasets | \n| -- | --- |\n| Data Augmentation for Visual Question Answering [(INLG '17)](https://www.aclweb.org/anthology/W17-3529/) | COCO-VQA, COCO-QA |\n| Low Resource Multi-modal Data Augmentation for End-to-end ASR [(CoRR ’18)](https://deepai.org/publication/low-resource-multi-modal-data-augmentation-for-end-to-end-asr) | TODO |\n| Multi-Modal Data Augmentation for End-to-end ASR [(Interspeech '18)](https://www.isca-speech.org/archive/Interspeech_2018/abstracts/2456.html) | Voxforge, HUB4 |\n| Augmenting Image Question Answering Dataset by Exploiting Image Captions [(LREC '18)](https://www.aclweb.org/anthology/L18-1436/) | IQA |\n| Multimodal Continuous Emotion 
Recognition with Data Augmentation Using Recurrent Neural Networks [(AVEC '18)](https://dl.acm.org/doi/10.1145/3266302.3266304) | TODO |\n| Multimodal Dialogue State Tracking By QA Approach with Data Augmentation [(DSTC8 @ AAAI '20)](https://arxiv.org/abs/2007.09903) | DSTC7-AVSD |\n| Data augmentation techniques for the Video Question Answering task [(arxiv '20)](https://arxiv.org/abs/2008.09849) | TGIF-QA,  MSVD-QA |\n| Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors [(NLP for ConvAI @ ACL '20)](https://arxiv.org/abs/2006.05635) | DSTC2 |\n| Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering [(ECCV '20)](https://link.springer.com/chapter/10.1007/978-3-030-58529-7_26) | TODO |\n| Text Augmentation Using BERT for Image Captioning [(Applied Sciences '20)](https://www.mdpi.com/2076-3417/10/17/5978) | MSCOCO |\n| MDA: Multimodal Data Augmentation Framework for Boosting Performance on Image-Text Sentiment/Emotion Classification Tasks [(IEEE Intelligent Systems '20)](https://ieeexplore.ieee.org/document/9206007) | TODO |\n\n### Mitigating Bias\n| Paper | Datasets | \n| -- | --- |\n| Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. 
[(NAACL '18)](https://www.aclweb.org/anthology/N18-2003/) [code](https://github.com/uclanlp/corefBias) | WinoBias, OntoNotes|\n| Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology [(ACL '19)](https://www.aclweb.org/anthology/P19-1161/) [code](https://github.com/rycolab/biasCDA) | TODO |\n| CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech [(ACL '19)](https://aclanthology.org/P19-1271.pdf) [Dataset](https://github.com/marcoguerini/CONAN)| New Dataset Created|\n| It’s All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution [(EMNLP '19)](https://www.aclweb.org/anthology/D19-1530/) [code](https://github.com/rowanhm/counterfactual-data-substitution) | SSA, Stanford Large Movie Review, SimLex-999 |\n| Gender Bias in Neural Natural Language Processing. [(Springer '20)](https://link.springer.com/chapter/10.1007%2F978-3-030-62077-6_14 ) | Wikitext-2, CoNLL-2012 |\n| Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [(arxiv '20)](https://arxiv.org/abs/2010.12510) | SWAG, CoNLL2009, MultiNLI, HANS|\n\n### Mitigating Class Imbalance\n| Paper | Datasets | \n| -- | --- |\n| SMOTE: Synthetic Minority Over-sampling Technique [(Journal of Artificial Intelligence Research '02)](https://www.jair.org/index.php/jair/article/view/10302) | Pima, Phoneme, Adult, E-state, Satimage, Forest Cover, Oil, Mammography, Can |\n| Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem [(EMNLP '07)](https://www.aclweb.org/anthology/D07-1082/) | TODO |\n| MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation [(Knowledge-Based Systems '15)](https://www.sciencedirect.com/science/article/abs/pii/S0950705115002737?via%3Dihub) | bibtex, cal500, corel5k, slashdot, tmc2007, mediamill, medical, scene, enron, emotions |\n| SMOTE for Learning 
from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary [(Journal of Artificial Intelligence Research '18)](https://www.jair.org/index.php/jair/article/view/11192) | TODO |\n\n### Adversarial Examples\n\n| Paper | Datasets | \n| -- | --- |\n| Adversarial Example Generation with Syntactically Controlled Paraphrase Networks [(NAACL '18)](https://www.aclweb.org/anthology/N18-1170/) [code](https://github.com/miyyer/scpn)| SST, SICK | \n| AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples [(ACL '18)](https://www.aclweb.org/anthology/P18-1225/) [code](https://github.com/dykang/adventure)| WordNet, PPDB, SICK, SNLI, SciTail | \n| Breaking NLI Systems with Sentences that Require Simple Lexical Inferences [(ACL '18)](https://www.aclweb.org/anthology/P18-2103/) | SNLI, SciTail, MultiNLI |\n| Certified Robustness to Adversarial Word Substitutions [(EMNLP '19)](https://www.aclweb.org/anthology/D19-1423/) [code](https://github.com/robinjia/certified-word-sub)| IMDB, SNLI | \n| PAWS: Paraphrase Adversaries from Word Scrambling [(NAACL '19)](https://www.aclweb.org/anthology/N19-1131/) [code](https://github.com/google-research-datasets/paws)| PAWS (QQP + Wikipedia) | \n| Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency [(ACL '19)](https://aclanthology.org/P19-1103/) [code](https://github.com/JHL-HUST/PWWS) | IMDB, AG’s News, Yahoo Answers |\n\n\n### Compositionality\n\n| Paper | Datasets | \n| -- | --- |\n| Good-Enough Compositional Data Augmentation [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.676.pdf) [code](https://github.com/jacobandreas/geca) | SCAN |\n| Sequence-Level Mixed Sample Data Augmentation [(EMNLP '20)](https://www.aclweb.org/anthology/2020.emnlp-main.447) [code](https://github.com/dguo98/seqmix) | IWSLT ’14, WMT ’14 | \n\n### Automated Augmentation\n\n| Paper                                                        | Datasets                    |\n| 
------------------------------------------------------------ | --------------------------- |\n| Learning Data Manipulation for Augmentation and Weighting [(NeurIPS '19)](https://papers.nips.cc/paper/2019/file/671f0311e2754fcdd37f70a8550379bc-Paper.pdf) [code](https://github.com/tanyuqian/learning-data-manipulation) | SST, IMDB, TREC, CIFAR-10   |\n| Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight [(ACL '20)](https://www.aclweb.org/anthology/2020.acl-main.564.pdf) | DailyDialog,  OpenSubtitles |\n| Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification [(EMNLP '21)](https://arxiv.org/abs/2109.00523) [code](https://github.com/lancopku/text-autoaugment) | IMDB, SST2, SST5, TREC, YELP2, YELP5 |\n\n\n### Popular Resources\n- [A visual survey of data augmentation in NLP](https://amitness.com/2020/05/data-augmentation-for-nlp/)\n- [nlpaug](https://github.com/makcedward/nlpaug)\n- [TextAttack](https://github.com/QData/TextAttack)\n- [AugLy](https://github.com/facebookresearch/AugLy)\n- [NL-Augmenter 🦎 → 🐍](https://github.com/GEM-benchmark/NL-Augmenter/)\n","funding_links":[],"categories":["Table of Contents"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstyfeng%2FDataAug4NLP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstyfeng%2FDataAug4NLP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstyfeng%2FDataAug4NLP/lists"}