{"id":13514317,"url":"https://github.com/makcedward/nlpaug","last_synced_at":"2025-05-13T18:11:03.754Z","repository":{"id":37396909,"uuid":"176858880","full_name":"makcedward/nlpaug","owner":"makcedward","description":"Data augmentation for NLP ","archived":false,"fork":false,"pushed_at":"2024-06-24T09:15:15.000Z","size":3365,"stargazers_count":4551,"open_issues_count":77,"forks_count":468,"subscribers_count":41,"default_branch":"master","last_synced_at":"2025-04-25T15:48:38.033Z","etag":null,"topics":["adversarial-attacks","adversarial-example","ai","artificial-intelligence","augmentation","data-science","machine-learning","ml","natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"https://makcedward.github.io/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/makcedward.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGE.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["makcedward"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2019-03-21T03:00:17.000Z","updated_at":"2025-04-24T15:13:58.000Z","dependencies_parsed_at":"2024-10-14T08:42:33.079Z","dependency_job_id":"f6229dac-aa4c-4a5f-948d-71c2b88236be","html_url":"https://github.com/makcedward/nlpaug","commit_stats":{"total_commits":603,"total_committers":34,"mean_commits":"17.735294117647058","dds":"0.10613598673300162","last_synced_commit":"23800cbb9632c7fc8c4a88d46f9c4ecf68a96299"},"previous_names":[],"tags_count":26,"template":fal
se,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/makcedward%2Fnlpaug","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/makcedward%2Fnlpaug/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/makcedward%2Fnlpaug/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/makcedward%2Fnlpaug/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/makcedward","download_url":"https://codeload.github.com/makcedward/nlpaug/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254000854,"owners_count":21997442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adversarial-attacks","adversarial-example","ai","artificial-intelligence","augmentation","data-science","machine-learning","ml","natural-language-processing","nlp"],"created_at":"2024-08-01T05:00:52.672Z","updated_at":"2025-05-13T18:11:03.730Z","avatar_url":"https://github.com/makcedward.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/makcedward"],"categories":["Libraries","GitHub","Jupyter Notebook","Data Processing","其他_NLP自然语言处理","Frameworks and libraries","Feature Extraction"],"sub_categories":["Data Transformation and Manipulation","Data Pre-processing \u0026 Loading","其他_文本生成、文本对话",":snake: Python","Text/NLP"],"readme":"\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"https://github.com/makcedward/nlpaug/blob/master/res/logo_small.png\"/\u003e\n    
\u003cbr\u003e\n\u003cp\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://travis-ci.org/makcedward/nlpaug\"\u003e\n        \u003cimg alt=\"Build\" src=\"https://travis-ci.org/makcedward/nlpaug.svg?branch=master\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.codacy.com/app/makcedward/nlpaug?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=makcedward/nlpaug\u0026amp;utm_campaign=Badge_Grade\"\u003e\n        \u003cimg alt=\"Code Quality\" src=\"https://api.codacy.com/project/badge/Grade/2d6d1d08016a4f78818161a89a2dfbfb\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pepy.tech/badge/nlpaug\"\u003e\n        \u003cimg alt=\"Downloads\" src=\"https://pepy.tech/badge/nlpaug\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n# nlpaug\n\nThis Python library helps you augment NLP data for your machine learning projects. See this introduction to learn about [Data Augmentation in NLP](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28). `Augmenter` is the basic element of augmentation, while `Flow` is a pipeline that orchestrates multiple augmenters together.\n\n## Features\n*   Generate synthetic data for improving model performance without manual effort\n*   Simple, easy-to-use and lightweight library. Augment data in 3 lines of code\n*   Plug and play with any machine learning/neural network framework (e.g. 
scikit-learn, PyTorch, TensorFlow)\n*   Support textual and audio input\n\n\u003ch3 align=\"center\"\u003eTextual Data Augmentation Example\u003c/h3\u003e\n\u003cbr\u003e\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/makcedward/nlpaug/blob/master/res/textual_example.png\"/\u003e\u003c/p\u003e\n\u003ch3 align=\"center\"\u003eAcoustic Data Augmentation Example\u003c/h3\u003e\n\u003cbr\u003e\u003cp align=\"center\"\u003e\u003cimg src=\"https://github.com/makcedward/nlpaug/blob/master/res/audio_example.png\"/\u003e\u003c/p\u003e\n\n| Section | Description |\n|:---:|:---:|\n| [Quick Demo](https://github.com/makcedward/nlpaug#quick-demo) | How to use this library |\n| [Augmenter](https://github.com/makcedward/nlpaug#augmenter) | Introduction to all available augmentation methods |\n| [Installation](https://github.com/makcedward/nlpaug#installation) | How to install this library |\n| [Recent Changes](https://github.com/makcedward/nlpaug#recent-changes) | Latest enhancements |\n| [Extension Reading](https://github.com/makcedward/nlpaug#extension-reading) | More real-life examples and research |\n| [Reference](https://github.com/makcedward/nlpaug#reference) | References to external resources such as data and models |\n\n## Quick Demo\n*   [Quick Example](https://github.com/makcedward/nlpaug/blob/master/example/quick_example.ipynb)\n*   [Example of Augmentation for Textual Inputs](https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb)\n*   [Example of Augmentation for Multilingual Textual Inputs](https://github.com/makcedward/nlpaug/blob/master/example/textual_language_augmenter.ipynb)\n*   [Example of Augmentation for Spectrogram Inputs](https://github.com/makcedward/nlpaug/blob/master/example/spectrogram_augmenter.ipynb)\n*   [Example of Augmentation for Audio Inputs](https://github.com/makcedward/nlpaug/blob/master/example/audio_augmenter.ipynb)\n*   [Example of Orchestrating Multiple 
Augmenters](https://github.com/makcedward/nlpaug/blob/master/example/flow.ipynb)\n*   [Example of Showing Augmentation History](https://github.com/makcedward/nlpaug/blob/master/example/change_log.ipynb)\n*   How to train [TF-IDF model](https://github.com/makcedward/nlpaug/blob/master/example/tfidf-train_model.ipynb)\n*   How to train [LAMBADA model](https://github.com/makcedward/nlpaug/blob/master/example/lambada-train_model.ipynb)\n*   How to create [custom augmentation](https://github.com/makcedward/nlpaug/blob/master/example/custom_augmenter.ipynb)\n*   [API Documentation](https://nlpaug.readthedocs.io/en/latest/)\n\n## Augmenter\n| Type | Target | Augmenter | Action | Description |\n|:---:|:---:|:---:|:---:|:---:|\n|Textual| Character | KeyboardAug | substitute | Simulate keyboard distance error |\n|Textual| | OcrAug | substitute | Simulate OCR engine error |\n|Textual| | [RandomAug](https://medium.com/hackernoon/does-your-nlp-model-able-to-prevent-adversarial-attack-45b5ab75129c) | insert, substitute, swap, delete | Apply augmentation randomly |\n|Textual| Word | AntonymAug | substitute | Substitute a word with its opposite according to WordNet antonyms |\n|Textual| | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a [BERT](https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb), DistilBERT, [RoBERTa](https://medium.com/towards-artificial-intelligence/a-robustly-optimized-bert-pretraining-approach-f6b6e537e6a6) or [XLNet](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b) language model to find the most suitable word for augmentation |\n|Textual| | RandomWordAug | swap, crop, delete | Apply augmentation randomly |\n|Textual| | SpellingAug | substitute | Substitute word according to spelling mistake dictionary |\n|Textual| | SplitAug | split | Split one word into two words randomly |\n|Textual| | SynonymAug | substitute | Substitute 
similar word according to WordNet/ PPDB synonym |\n|Textual| | [TfIdfAug](https://medium.com/towards-artificial-intelligence/unsupervised-data-augmentation-6760456db143) | insert, substitute | Use TF-IDF to find out how word should be augmented |\n|Textual| | WordEmbsAug | insert, substitute | Leverage  [word2vec](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a), [GloVe](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a) or [fasttext](https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a) embeddings to apply augmentation|\n|Textual| | [BackTranslationAug](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28) | substitute | Leverage two translation models for augmentation |\n|Textual| | ReservedAug | substitute | Replace reserved words |\n|Textual| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to [XLNet](https://medium.com/dataseries/why-does-xlnet-outperform-bert-da98a8503d5b), [GPT2](https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655) or DistilGPT2 prediction |\n|Textual| | AbstSummAug | substitute | Summarize article by abstractive summarization method |\n|Textual| | LambadaAug | substitute | Using language model to generate text and then using classification model to retain high quality results |\n|Signal| Audio | CropAug | delete | Delete audio's segment |\n|Signal| | LoudnessAug|substitute | Adjust audio's volume |\n|Signal| | MaskAug | substitute | Mask audio's segment |\n|Signal| | NoiseAug | substitute | Inject noise |\n|Signal| | PitchAug | substitute | Adjust audio's pitch |\n|Signal| | ShiftAug | substitute | Shift time dimension forward/ backward |\n|Signal| | SpeedAug | substitute | Adjust audio's speed |\n|Signal| | VtlpAug | substitute | Change vocal tract |\n|Signal| | NormalizeAug | substitute | Normalize audio |\n|Signal| | PolarityInverseAug | 
substitute | Swap positive and negative for audio |\n|Signal| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |\n|Signal| | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |\n|Signal| | LoudnessAug | substitute | Adjust volume |\n\n## Flow\n| Type | Flow | Description |\n|:---:|:---:|:---:|\n|Pipeline| Sequential | Apply list of augmentation functions sequentially |\n|Pipeline| Sometimes | Apply some augmentation functions randomly |\n\n## Installation\nThe library supports Python 3.5+ on Linux and Windows.\n\nTo install the library:\n```bash\npip install numpy requests nlpaug\n```\nor install the latest version (including beta features) directly from GitHub\n```bash\npip install numpy git+https://github.com/makcedward/nlpaug.git\n```\nor install with conda\n```bash\nconda install -c makcedward nlpaug\n```\n\nIf you use BackTranslationAug, ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug or AbstSummAug, install the following dependencies as well (the version specifiers are quoted so the shell does not treat `\u003e` as a redirect)\n```bash\npip install \"torch\u003e=1.6.0\" \"transformers\u003e=4.11.3\" sentencepiece\n```\n\nIf you use LambadaAug, install the following dependencies as well\n```bash\npip install \"simpletransformers\u003e=0.61.10\"\n```\n\nIf you use AntonymAug or SynonymAug, install the following dependencies as well\n```bash\npip install \"nltk\u003e=3.4.5\"\n```\n\nIf you use WordEmbsAug (word2vec, GloVe or fasttext), install gensim (`pip install \"gensim\u003e=4.1.2\"`) and download a pre-trained model first\n```python\nfrom nlpaug.util.file.download import DownloadUtil\nDownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model\nDownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model\nDownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model\n```\n\nIf you use SynonymAug (PPDB), download the file from the 
following URI. You may not be able to run the augmenter if you get the PPDB file from another website\n```bash\nhttp://paraphrase.org/#/download\n```\n\nIf you use PitchAug, SpeedAug or VtlpAug, install the following dependencies as well\n```bash\npip install \"librosa\u003e=0.9.1\" matplotlib\n```\n\n## Recent Changes\n\n### 1.1.11 Jul 6, 2022\n*   [Return list of output](https://github.com/makcedward/nlpaug/issues/302)\n*   [Fix download util](https://github.com/makcedward/nlpaug/issues/301)\n*   [Fix lambda label misalignment](https://github.com/makcedward/nlpaug/issues/295)\n*   [Add language pack reference link for SynonymAug](https://github.com/makcedward/nlpaug/issues/289)\n\n\nSee [changelog](https://github.com/makcedward/nlpaug/blob/master/CHANGE.md) for more details.\n\n## Extension Reading\n*   [Data Augmentation library for Text](https://towardsdatascience.com/data-augmentation-library-for-text-9661736b13ff)\n*   [Does your NLP model able to prevent adversarial attack?](https://medium.com/hackernoon/does-your-nlp-model-able-to-prevent-adversarial-attack-45b5ab75129c)\n*   [How does Data Noising Help to Improve your NLP Model?](https://medium.com/towards-artificial-intelligence/how-does-data-noising-help-to-improve-your-nlp-model-480619f9fb10)\n*   [Data Augmentation library for Speech Recognition](https://towardsdatascience.com/data-augmentation-for-speech-recognition-e7c607482e78)\n*   [Data Augmentation library for Audio](https://towardsdatascience.com/data-augmentation-for-audio-76912b01fdf6)\n*   [Unsupervised Data Augmentation](https://medium.com/towards-artificial-intelligence/unsupervised-data-augmentation-6760456db143)\n*   [A Visual Survey of Data Augmentation in NLP](https://amitness.com/2020/05/data-augmentation-for-nlp/)\n\n## Reference\nThis library uses data (e.g. captured from the internet), research (e.g. following augmenter ideas) and models (e.g. 
using pre-trained models). See [data source](https://github.com/makcedward/nlpaug/blob/master/SOURCE.md) for more details.\n\n## Citation\n\n```latex\n@misc{ma2019nlpaug,\n  title={NLP Augmentation},\n  author={Edward Ma},\n  howpublished={https://github.com/makcedward/nlpaug},\n  year={2019}\n}\n```\n\nThis package is cited by many books, workshops and academic research papers (70+). Here are some examples; visit [here](https://github.com/makcedward/nlpaug/blob/master/CITED.md) for the full list.\n\n### Workshops citing nlpaug\n*   S. Vajjala. [NLP without a readymade labeled dataset](https://rpubs.com/vbsowmya/tmls2021) at [Toronto Machine Learning Summit, 2021](https://www.torontomachinelearning.com/). 2021\n\n### Books citing nlpaug\n*   S. Vajjala, B. Majumder, A. Gupta and H. Surana. [Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems](https://www.amazon.com/Practical-Natural-Language-Processing-Pragmatic/dp/1492054054). 2020\n*   A. Bartoli and A. Fusiello. [Computer Vision–ECCV 2020 Workshops](https://books.google.com/books?hl=en\u0026lr=lang_en\u0026id=0rYREAAAQBAJ\u0026oi=fnd\u0026pg=PR7\u0026dq=nlpaug\u0026ots=88bPp5rhnY\u0026sig=C2ue8Xxbu09l59nAMOcVxWYvvWM#v=onepage\u0026q=nlpaug\u0026f=false). 2020\n*   L. Werra, L. Tunstall, and T. Wolf. [Natural Language Processing with Transformers](https://www.amazon.com/Natural-Language-Processing-Transformers-Applications/dp/1098103246/ref=sr_1_3?crid=2CWBPA8QG0TRU\u0026keywords=Natural+Language+Processing+with+Transformers\u0026qid=1645646312\u0026sprefix=natural+language+processing+with+transformers%2Caps%2C111\u0026sr=8-3). 2022\n\n### Research papers citing nlpaug\n*   Google: M. Raghu and E. Schmidt. [A Survey of Deep Learning for Scientific Discovery](https://arxiv.org/pdf/2003.11755.pdf). 2020\n*   Sirius XM: E. Jing, K. Schneck, D. Egan and S. A. Waterman. 
[Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts](https://arxiv.org/pdf/2110.07096.pdf). 2021\n*   Salesforce Research: B. Newman, P. K. Choubey and N. Rajani. [P-adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts](https://arxiv.org/pdf/2110.07280.pdf). 2021\n*   Salesforce Research: L. Xue, M. Gao, Z. Chen, C. Xiong and R. Xu. [Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks](https://arxiv.org/pdf/2110.04413.pdf). 2021\n\n\n## Contributions\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://github.com/sakares\"\u003e\u003cimg src=\"https://avatars.githubusercontent.com/u/1306031\" width=\"100px;\" alt=\"\"/\u003e\u003cbr /\u003e\u003csub\u003e\u003cb\u003esakares saengkaew\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://github.com/bdalal\"\u003e\u003cimg src=\"https://avatars.githubusercontent.com/u/3478378?s=400\u0026v=4\" width=\"100px;\" alt=\"\"/\u003e\u003cbr /\u003e\u003csub\u003e\u003cb\u003eBinoy Dalal\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003ca href=\"https://github.com/emrecncelik\"\u003e\u003cimg src=\"https://avatars.githubusercontent.com/u/20845117?v=4\" width=\"100px;\" alt=\"\"/\u003e\u003cbr /\u003e\u003csub\u003e\u003cb\u003eEmrecan Çelik\u003c/b\u003e\u003c/sub\u003e\u003c/a\u003e\u003cbr /\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmakcedward%2Fnlpaug","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmakcedward%2Fnlpaug","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmakcedward%2Fnlpaug/lists"}