{"id":16778709,"url":"https://github.com/ritvik19/text-data-augmentation","last_synced_at":"2025-04-10T20:51:42.105Z","repository":{"id":104587608,"uuid":"418042896","full_name":"Ritvik19/Text-Data-Augmentation","owner":"Ritvik19","description":"State of the Art Text Data Augmentation for Natural Language Processing Applications","archived":false,"fork":false,"pushed_at":"2022-02-04T17:11:52.000Z","size":70,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-24T18:21:20.833Z","etag":null,"topics":["natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"https://ritvik19.github.io/text-data-augmentation/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ritvik19.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-17T06:36:23.000Z","updated_at":"2025-02-06T09:30:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"0e35a3ac-caa7-4251-ad96-d2668e617068","html_url":"https://github.com/Ritvik19/Text-Data-Augmentation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ritvik19%2FText-Data-Augmentation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ritvik19%2FText-Data-Augmentation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ritvik19%2FText-Data-Augmentation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/Ritvik19%2FText-Data-Augmentation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ritvik19","download_url":"https://codeload.github.com/Ritvik19/Text-Data-Augmentation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248296753,"owners_count":21080304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp"],"created_at":"2024-10-13T07:28:29.414Z","updated_at":"2025-04-10T20:51:42.097Z","avatar_url":"https://github.com/Ritvik19.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text-Data-Augmentation\n\nState of the Art Text Data Augmentation for Natural Language Processing Applications\n\n## Table of Contents\n\n- [Text-Data-Augmentation](#text-data-augmentation)\n  - [Table of Contents](#table-of-contents)\n  - [Installation](#installation)\n  - [Usage](#usage)\n    - [Abstractive Summarization](#abstractive-summarization)\n    - [Back Translation](#back-translation)\n    - [Character Noise](#character-noise)\n    - [Contextual Word Replacement](#contextual-word-replacement)\n    - [Easy Data Augmentation](#easy-data-augmentation)\n    - [KeyBoard Noise](#keyboard-noise)\n    - [OCR Noise](#ocr-noise)\n    - [Paraphrase](#paraphrase)\n    - [Similar Word Replacement](#similar-word-replacement)\n    - [Synonym Replacement](#synonym-replacement)\n    - [Word Split](#word-split)\n  - [References](#references)\n\n---\n\n## Installation\n\n```bash\npip install 
git+https://github.com/Ritvik19/Text-Data-Augmentation.git\n```\n\n---\n\n## Usage\n\nThis library provides various techniques for augmenting text data:\n\n### Abstractive Summarization\n\nAbstractive Summarization Augmentation summarizes the input text using transformer models. [[17]](#ref-17) [[18]](#ref-18)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import AbstractiveSummarization\n\u003e\u003e\u003e aug = AbstractiveSummarization()\n\u003e\u003e\u003e aug(['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R\u0026D, to financial research and legal documents analysis.'])\n['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R\u0026D, to financial research and legal documents analysis.', 'Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text . 
Unlike extractive summarization, it does not copy important phrases from the source text but also potentially come up with new phrases thatare relevant, which can be seen as paraphrasing .']\n```\n\n### Back Translation\n\nBack Translation Augmentation translates text into another language and then back into the original language. This technique generates text whose wording differs from the original while preserving the original context and meaning. [[1]](#ref-1) [[2]](#ref-2) [[10]](#ref-10)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import BackTranslation\n\u003e\u003e\u003e aug = BackTranslation()\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps on the lazy dog']\n```\n\n### Character Noise\n\nCharacter Noise Augmentation adds character-level noise by randomly inserting, deleting, swapping, or replacing some characters in the input text. [[2]](#ref-2) [[9]](#ref-9)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import CharacterNoise\n\u003e\u003e\u003e aug = CharacterNoise(alpha=0.1, n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps ovr the lazy dog']\n```\n\n### Contextual Word Replacement\n\nContextual Word Replacement Augmentation creates augmented samples by randomly replacing some words with a mask and then using a Masked Language Model to fill it. Sampling of words can be weighted using TFIDF values as well. 
[[2]](#ref-2) [[3]](#ref-3) [[11]](#ref-11) [[19]](#ref-19)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import ContextualWordReplacement\n\u003e\u003e\u003e aug = ContextualWordReplacement(n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over his lazy dog']\n```\n\n### Easy Data Augmentation\n\nEasy Data Augmentation adds word-level noise by randomly inserting, deleting, or swapping some words in the input text, or by shuffling the sentences in the input text. [[4]](#ref-4) [[5]](#ref-5) [[9]](#ref-9) [[12]](#ref-12) [[13]](#ref-13)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import EasyDataAugmentation\n\u003e\u003e\u003e aug = EasyDataAugmentation(n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over the dog']\n```\n\n### KeyBoard Noise\n\nKeyBoard Noise Augmentation adds character-level spelling-mistake noise by mimicking typographical errors made on a QWERTY keyboard in the input text. [[2]](#ref-2) [[9]](#ref-9)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import KeyBoardNoise\n\u003e\u003e\u003e aug = KeyBoardNoise(alpha=0.1, n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick broen fox jumps over the lazy dog']\n```\n\n### OCR Noise\n\nOCR Noise Augmentation adds character-level spelling-mistake noise by mimicking OCR errors in the input text. 
[[6]](#ref-6)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import OCRNoise\n\u003e\u003e\u003e aug = OCRNoise(alpha=0.1, n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick hrown lox jumps over the lazy dog']\n```\n\n### Paraphrase\n\nParaphrase Augmentation rephrases the input sentences using T5 models. [[2]](#ref-2)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import Paraphrase\n\u003e\u003e\u003e aug = Paraphrase(\"\u003cT5 Model\u003e\", n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox has jumped on the lazy dog.']\n```\n\n### Similar Word Replacement\n\nSimilar Word Replacement Augmentation creates augmented samples by randomly replacing some words with the word whose vector is most similar to theirs. Sampling of words can be weighted using TFIDF values as well. [[2]](#ref-2) [[7]](#ref-7) [[15]](#ref-15) [[16]](#ref-16) [[19]](#ref-19)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import SimilarWordReplacement\n\u003e\u003e\u003e aug = SimilarWordReplacement(\"en_core_web_lg\",  alpha=0.1, n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick White Wolf jumps over the lazy Cat.']\n```\n\n### Synonym Replacement\n\nSynonym Replacement Augmentation creates augmented samples by randomly replacing some words with their synonyms from the WordNet database. Sampling of words can be weighted using TFIDF values as well. 
[[2]](#ref-2) [[4]](#ref-4) [[8]](#ref-8) [[13]](#ref-13) [[19]](#ref-19)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import SynonymReplacement\n\u003e\u003e\u003e aug = SynonymReplacement(alpha=0.1, n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over the lethargic dog']\n```\n\n### Word Split\n\nWord Split Augmentation adds word-level spelling-mistake noise by splitting words randomly in the input text. [[2]](#ref-2) [[14]](#ref-14)\n\n```python\n\u003e\u003e\u003e from text_data_augmentation import WordSplit\n\u003e\u003e\u003e aug = WordSplit(alpha=0.1, n_aug=1)\n\u003e\u003e\u003e aug(['A quick brown fox jumps over the lazy dog'])\n['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over th e lazy dog']\n```\n\n---\n\n## References\n\n1. \u003ca href=\"https://arxiv.org/pdf/2106.04681.pdf\" id=\"ref-1\"\u003eData Expansion Using Back Translation and Paraphrasing for Hate Speech Detection\u003c/a\u003e\n2. \u003ca href=\"https://arxiv.org/ftp/arxiv/papers/2107/2107.03158.pdf\" id=\"ref-2\"\u003eA Survey on Data Augmentation for Text Classification\u003c/a\u003e\n3. \u003ca href=\"https://arxiv.org/pdf/1805.06201.pdf\" id=\"ref-3\"\u003eContextual Augmentation: Data Augmentation by Words with Paradigmatic Relations\u003c/a\u003e\n4. \u003ca href=\"https://arxiv.org/pdf/1901.11196.pdf\" id=\"ref-4\"\u003eEDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks\u003c/a\u003e\n5. \u003ca href=\"https://aclanthology.org/2020.coling-main.343.pdf\" id=\"ref-5\"\u003eAn Analysis of Simple Data Augmentation for Named Entity Recognition\u003c/a\u003e\n6. \u003ca href=\"https://zenodo.org/record/3245169/files/JCDL2019_Deep_Analysis.pdf\" id=\"ref-6\"\u003eDeep Statistical Analysis of OCR Errors for Effective Post-OCR Processing\u003c/a\u003e\n7. 
\u003ca href=\"https://www.researchgate.net/publication/331784439_A_Study_of_Various_Text_Augmentation_Techniques_for_Relation_Classification_in_Free_Text\" id=\"ref-7\"\u003eA Study of Various Text Augmentation Techniques for Relation Classification in Free Text\u003c/a\u003e\n8. \u003ca href=\"http://ceur-ws.org/Vol-2268/paper11.pdf\" id=\"ref-8\"\u003eText Augmentation for Neural Networks\u003c/a\u003e\n9. \u003ca href=\"https://arxiv.org/pdf/1711.02173.pdf\" id=\"ref-9\"\u003eSynthetic And Natural Noise Both Break Neural Machine Translation\u003c/a\u003e\n10. \u003ca href=\"https://arxiv.org/pdf/1511.06709.pdf\" id=\"ref-10\"\u003eImproving Neural Machine Translation Models with Monolingual Data\u003c/a\u003e\n11. \u003ca href=\"https://arxiv.org/pdf/2003.02245.pdf\" id=\"ref-11\"\u003eData Augmentation Using Pre-trained Transformer Models\u003c/a\u003e\n12. \u003ca href=\"https://arxiv.org/pdf/1903.09460.pdf\" id=\"ref-12\"\u003eData Augmentation via Dependency Tree Morphing for Low-Resource Languages\u003c/a\u003e\n13. \u003ca href=\"https://arxiv.org/pdf/1809.02079.pdf\" id=\"ref-13\"\u003eAdversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models\u003c/a\u003e\n14. \u003ca href=\"https://arxiv.org/pdf/1812.05271v1.pdf\" id=\"ref-14\"\u003eTextBugger: Generating Adversarial Text Against Real-world Applications\u003c/a\u003e\n15. \u003ca href=\"https://arxiv.org/pdf/1804.07998.pdf\" id=\"ref-15\"\u003eGenerating Natural Language Adversarial Examples\u003c/a\u003e\n16. \u003ca href=\"https://arxiv.org/pdf/1509.01626.pdf\" id=\"ref-16\"\u003eCharacter-level Convolutional Networks for Text Classification\u003c/a\u003e\n17. \u003ca href=\"https://arxiv.org/pdf/1812.02303.pdf\" id=\"ref-17\"\u003eNeural Abstractive Text Summarization with Sequence-to-Sequence Models\u003c/a\u003e\n18. 
\u003ca href=\"https://arxiv.org/pdf/1910.13461v1.pdf\" id=\"ref-18\"\u003eBART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension\u003c/a\u003e\n19. \u003ca href=\"https://arxiv.org/pdf/1904.12848.pdf\" id=\"ref-19\"\u003eUnsupervised Data Augmentation for Consistency Training\u003c/a\u003e\n20. \u003ca href=\"https://arxiv.org/pdf/2007.02033.pdf\" id=\"ref-20\"\u003eText Data Augmentation: Towards better detection of spear-phishing emails\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fritvik19%2Ftext-data-augmentation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fritvik19%2Ftext-data-augmentation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fritvik19%2Ftext-data-augmentation/lists"}