{"id":13709405,"url":"https://github.com/dsfsi/textaugment","last_synced_at":"2025-05-16T12:02:35.425Z","repository":{"id":40260043,"uuid":"185192277","full_name":"dsfsi/textaugment","owner":"dsfsi","description":"TextAugment: Text Augmentation Library","archived":false,"fork":false,"pushed_at":"2024-02-20T11:57:52.000Z","size":168,"stargazers_count":415,"open_issues_count":10,"forks_count":60,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-02T08:07:27.536Z","etag":null,"topics":["augmentation","augmentation-methods","hacktoberfest","low-resouce-language","mixup","natural-language-processing","nlp","nlp-augmentation","synonym","word2vec","wordnet"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dsfsi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-06T12:28:19.000Z","updated_at":"2025-02-25T06:21:15.000Z","dependencies_parsed_at":"2023-01-25T08:00:40.207Z","dependency_job_id":"264c8b2c-2840-4db3-9f5d-efaab81b4634","html_url":"https://github.com/dsfsi/textaugment","commit_stats":{"total_commits":57,"total_committers":8,"mean_commits":7.125,"dds":0.543859649122807,"last_synced_commit":"6946b9126056e6822b7d408604bf9a3cfd91e2f2"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2Ftextaugment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2Ftextaugment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/dsfsi%2Ftextaugment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dsfsi%2Ftextaugment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dsfsi","download_url":"https://codeload.github.com/dsfsi/textaugment/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248018032,"owners_count":21034045,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["augmentation","augmentation-methods","hacktoberfest","low-resouce-language","mixup","natural-language-processing","nlp","nlp-augmentation","synonym","word2vec","wordnet"],"created_at":"2024-08-02T23:00:38.892Z","updated_at":"2025-04-09T10:02:21.370Z","avatar_url":"https://github.com/dsfsi.png","language":"Python","funding_links":[],"categories":["Beyond Vision","A01_文本生成_文本对话","Python"],"sub_categories":["**NLP**","其他_文本生成_文本对话"],"readme":"\n\n# [TextAugment: Improving Short Text Classification through Global Augmentation Methods](https://arxiv.org/abs/1907.03752) \n\n[![licence](https://img.shields.io/github/license/dsfsi/textaugment.svg?maxAge=3600)](https://github.com/dsfsi/textaugment/blob/master/LICENCE) [![GitHub release](https://img.shields.io/github/release/dsfsi/textaugment.svg?maxAge=3600)](https://github.com/dsfsi/textaugment/releases) [![Wheel](https://img.shields.io/pypi/wheel/textaugment.svg?maxAge=3600)](https://pypi.python.org/pypi/textaugment) [![python](https://img.shields.io/pypi/pyversions/textaugment.svg?maxAge=3600)](https://pypi.org/project/textaugment/) 
[![TotalDownloads](https://pepy.tech/badge/textaugment)](https://pypi.org/project/textaugment/) [![Downloads](https://static.pepy.tech/badge/textaugment/month)](https://pypi.org/project/textaugment/) [![LNCS](https://img.shields.io/badge/LNCS-Book%20Chapter-B31B1B.svg)](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) [![arxiv](https://img.shields.io/badge/cs.CL-arXiv%3A1907.03752-B31B1B.svg)](https://arxiv.org/abs/1907.03752)\n\n\n## You have just found TextAugment.\n\nTextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of [NLTK](https://www.nltk.org/), [Gensim v3.x](https://radimrehurek.com/gensim/), and [TextBlob](https://textblob.readthedocs.io/) and plays nicely with them.\n\n## Acknowledgements\nCite this [paper](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) when using this library. [Arxiv Version](https://arxiv.org/abs/1907.03752)\n\n```\n@inproceedings{marivate2020improving,\n  title={Improving short text classification through global augmentation methods},\n  author={Marivate, Vukosi and Sefara, Tshephisho},\n  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},\n  pages={385--399},\n  year={2020},\n  organization={Springer}\n}\n```\n\n# Table of Contents\n\n- [Features](#Features)\n- [Citation Paper](#citation-paper) \n\t- [Requirements](#Requirements)\n\t- [Installation](#Installation)\n\t- [How to use](#How-to-use)\n\t\t- [Word2vec-based augmentation](#Word2vec-based-augmentation)\n\t\t- [WordNet-based augmentation](#WordNet-based-augmentation)\n\t\t- [RTT-based augmentation](#RTT-based-augmentation)\n- [Easy data augmentation (EDA)](#eda-easy-data-augmentation-techniques-for-boosting-performance-on-text-classification-tasks)\n- [An easier data augmentation (AEDA)](#aeda-an-easier-data-augmentation-technique-for-text-classification)\n- [Mixup augmentation](#mixup-augmentation)\n  - 
[Implementation](#Implementation)\n- [Acknowledgements](#Acknowledgements)\n\n## Features\n\n- Generate synthetic data for improving model performance without manual effort\n- Simple, lightweight, easy-to-use library.\n- Plug and play with any machine learning framework (e.g. PyTorch, TensorFlow, Scikit-learn)\n- Supports textual data\n\n## Citation Paper\n\n**[Improving short text classification through global augmentation methods](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21)**.\n\n\n\n![alt text](https://raw.githubusercontent.com/dsfsi/textaugment/master/augment.png \"Augmentation methods\")\n\n### Requirements\n\n* Python 3\n\nThe following software packages are dependencies and will be installed automatically.\n\n```shell\n$ pip install numpy nltk gensim==3.8.3 textblob googletrans \n\n```\nThe following code downloads the NLTK corpus for [wordnet](http://www.nltk.org/howto/wordnet.html).\n```python\nimport nltk\nnltk.download('wordnet')\n```\nThe following code downloads the [NLTK tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html). This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. \n```python\nimport nltk\nnltk.download('punkt')\n```\nThe following code downloads the default [NLTK part-of-speech tagger](https://www.nltk.org/_modules/nltk/tag.html) model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.\n```python\nimport nltk\nnltk.download('averaged_perceptron_tagger')\n```\nUse gensim to load a pre-trained word2vec model. 
One option is the [Google News model hosted on Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).\n```python\nimport gensim\nmodel = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)\n```\nYou can also use gensim to load Facebook's Fasttext [English](https://fasttext.cc/docs/en/english-vectors.html) and [Multilingual models](https://fasttext.cc/docs/en/crawl-vectors.html).\n```python\nimport gensim\nmodel = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')\n```\n\nAlternatively, train one from scratch using your own data or one of the following public datasets:\n\n- [Text8 Wiki](http://mattmahoney.net/dc/enwik9.zip)\n\n- [Dataset from \"One Billion Word Language Modeling Benchmark\"](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)\n\n### Installation\n\nInstall from pip [Recommended] \n```sh\n$ pip install textaugment\n# or install the latest release\n$ pip install git+git@github.com:dsfsi/textaugment.git\n```\n\nInstall from source\n```sh\n$ git clone git@github.com:dsfsi/textaugment.git\n$ cd textaugment\n$ python setup.py install\n```\n\n### How to use\n\nThere are four types of augmentations which can be used:\n\n- word2vec \n\n```python\nfrom textaugment import Word2vec\n```\n- fasttext \n\n```python\nfrom textaugment import Fasttext\n```\n\n- wordnet \n```python\nfrom textaugment import Wordnet\n```\n- translate (This will require internet access)\n```python\nfrom textaugment import Translate\n```\n#### Fasttext/Word2vec-based augmentation\n\n[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/word2vec_example.ipynb)\n\n**Basic example**\n\n```python\n\u003e\u003e\u003e from textaugment import Word2vec, Fasttext\n\u003e\u003e\u003e t = Word2vec(model='path/to/gensim/model'or 'gensim model itself')\n\u003e\u003e\u003e t.augment('The stories are good')\nThe films are good\n\u003e\u003e\u003e t = Fasttext(model='path/to/gensim/model'or 
'gensim model itself')\n\u003e\u003e\u003e t.augment('The stories are good')\nThe films are good\n```\n**Advanced example**\n\n```python\n\u003e\u003e\u003e runs = 1 # By default.\n\u003e\u003e\u003e v = False # verbose mode to replace all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)\n\u003e\u003e\u003e p = 0.5 # The probability of success of an individual trial. (0.1\u003cp\u003c1.0), default is 0.5. Used by the geometric distribution to select words from a sentence.\n\n\u003e\u003e\u003e word = Word2vec(model='path/to/gensim/model'or'gensim model itself', runs=5, v=False, p=0.5)\n\u003e\u003e\u003e word.augment('The stories are good', top_n=10)\nThe movies are excellent\n\u003e\u003e\u003e fast = Fasttext(model='path/to/gensim/model'or'gensim model itself', runs=5, v=False, p=0.5)\n\u003e\u003e\u003e fast.augment('The stories are good', top_n=10)\nThe movies are excellent\n```\n#### WordNet-based augmentation\n**Basic example**\n```python\n\u003e\u003e\u003e import nltk\n\u003e\u003e\u003e nltk.download('punkt')\n\u003e\u003e\u003e nltk.download('wordnet')\n\u003e\u003e\u003e from textaugment import Wordnet\n\u003e\u003e\u003e t = Wordnet()\n\u003e\u003e\u003e t.augment('In the afternoon, John is going to town')\nIn the afternoon, John is walking to town\n```\n**Advanced example**\n\n```python\n\u003e\u003e\u003e v = True # enable verb augmentation. Default is True.\n\u003e\u003e\u003e n = False # enable noun augmentation. Default is False.\n\u003e\u003e\u003e runs = 1 # number of times to augment a sentence. Default is 1.\n\u003e\u003e\u003e p = 0.5 # The probability of success of an individual trial. (0.1\u003cp\u003c1.0), default is 0.5. 
Used by the geometric distribution to select words from a sentence.\n\n\u003e\u003e\u003e t = Wordnet(v=False, n=True, p=0.5)\n\u003e\u003e\u003e t.augment('In the afternoon, John is going to town', top_n=10)\nIn the afternoon, Joseph is going to town.\n```\n#### RTT-based augmentation\n**Example**\n```python\n\u003e\u003e\u003e src = \"en\" # source language of the sentence\n\u003e\u003e\u003e to = \"fr\" # target language\n\u003e\u003e\u003e from textaugment import Translate\n\u003e\u003e\u003e t = Translate(src=\"en\", to=\"fr\")\n\u003e\u003e\u003e t.augment('In the afternoon, John is going to town')\nIn the afternoon John goes to town\n```\n# EDA: Easy data augmentation techniques for boosting performance on text classification tasks \n## This is the implementation of EDA by Jason Wei and Kai Zou. \n\nhttps://www.aclweb.org/anthology/D19-1670.pdf\n\n[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/eda_example.ipynb)\n\n#### Synonym Replacement\nRandomly choose *n* words from the sentence that are not stop words. Replace each of these words with\none of its synonyms chosen at random. \n\n**Basic example**\n```python\n\u003e\u003e\u003e from textaugment import EDA\n\u003e\u003e\u003e t = EDA()\n\u003e\u003e\u003e t.synonym_replacement(\"John is going to town\", top_n=10)\nJohn is give out to town\n```\n\n#### Random Deletion\nRandomly remove each word in the sentence with probability *p*.\n\n**Basic example**\n```python\n\u003e\u003e\u003e from textaugment import EDA\n\u003e\u003e\u003e t = EDA()\n\u003e\u003e\u003e t.random_deletion(\"John is going to town\", p=0.2)\nis going to town\n```\n\n#### Random Swap\nRandomly choose two words in the sentence and swap their positions. 
Do this n times.\n\n**Basic example**\n```python\n\u003e\u003e\u003e from textaugment import EDA\n\u003e\u003e\u003e t = EDA()\n\u003e\u003e\u003e t.random_swap(\"John is going to town\")\nJohn town going to is\n```\n\n#### Random Insertion \nFind a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.\n\n**Basic example**\n```python\n\u003e\u003e\u003e from textaugment import EDA\n\u003e\u003e\u003e t = EDA()\n\u003e\u003e\u003e t.random_insertion(\"John is going to town\")\nJohn is going to make up town\n```\n\n# AEDA: An easier data augmentation technique for text classification\n\nThis is the implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.\n\nhttps://aclanthology.org/2021.findings-emnlp.234.pdf\n\n## Implementation\n[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/eda_example.ipynb)\n\n#### Random Insertion of Punctuation Marks\n\n**Basic example**\n```python\n\u003e\u003e\u003e from textaugment import AEDA\n\u003e\u003e\u003e t = AEDA()\n\u003e\u003e\u003e t.punct_insertion(\"John is going to town\")\n! John is going to town\n```\n\n# Mixup augmentation\n\nThis is the implementation of mixup augmentation by [Hongyi Zhang, Moustapha Cisse, Yann Dauphin, David Lopez-Paz](https://openreview.net/forum?id=r1Ddp1-Rb) adapted to NLP. \n\nUsed in [Augmenting Data with Mixup for Sentence Classification: An Empirical Study](https://arxiv.org/abs/1905.08941). \n\nMixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in-between training examples. 
\n\n## Implementation\n\n[See this notebook for an example](https://github.com/dsfsi/textaugment/blob/master/examples/mixup_example_using_IMDB_sentiment.ipynb)\n\n## Built with ❤ on\n* [Python](http://python.org/)\n\n## Authors\n* [Joseph Sefara](https://za.linkedin.com/in/josephsefara) (http://www.speechtech.co.za)\n* [Vukosi Marivate](http://www.vima.co.za) (http://www.vima.co.za)\n\n## Acknowledgements\nCite this [paper](https://link.springer.com/chapter/10.1007%2F978-3-030-57321-8_21) when using this library. [Arxiv Version](https://arxiv.org/abs/1907.03752)\n\n```\n@inproceedings{marivate2020improving,\n  title={Improving short text classification through global augmentation methods},\n  author={Marivate, Vukosi and Sefara, Tshephisho},\n  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},\n  pages={385--399},\n  year={2020},\n  organization={Springer}\n}\n```\n\n## Licence\nMIT licensed. See the bundled [LICENCE](https://github.com/dsfsi/textaugment/blob/master/LICENCE) file for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsfsi%2Ftextaugment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdsfsi%2Ftextaugment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsfsi%2Ftextaugment/lists"}