https://github.com/ritvik19/text-data-augmentation
State of the Art Text Data Augmentation for Natural Language Processing Applications
- Host: GitHub
- URL: https://github.com/ritvik19/text-data-augmentation
- Owner: Ritvik19
- License: mit
- Created: 2021-10-17T06:36:23.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-02-04T17:11:52.000Z (over 4 years ago)
- Last Synced: 2025-03-24T18:21:20.833Z (about 1 year ago)
- Topics: natural-language-processing, nlp
- Language: Python
- Homepage: https://ritvik19.github.io/text-data-augmentation/
- Size: 68.4 KB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Text-Data-Augmentation
State of the Art Text Data Augmentation for Natural Language Processing Applications
## Table of Contents
- [Text-Data-Augmentation](#text-data-augmentation)
- [Table of Contents](#table-of-contents)
- [Installation](#installation)
- [Usage](#usage)
- [Abstractive Summarization](#abstractive-summarization)
- [Back Translation](#back-translation)
- [Character Noise](#character-noise)
- [Contextual Word Replacement](#contextual-word-replacement)
- [Easy Data Augmentation](#easy-data-augmentation)
- [KeyBoard Noise](#keyboard-noise)
- [OCR Noise](#ocr-noise)
- [Paraphrase](#paraphrase)
- [Similar Word Replacement](#similar-word-replacement)
- [Synonym Replacement](#synonym-replacement)
- [Word Split](#word-split)
- [References](#references)
---
## Installation
```bash
pip install git+https://github.com/Ritvik19/Text-Data-Augmentation.git
```
---
## Usage
This library provides various techniques for augmenting text data:
### Abstractive Summarization
Abstractive Summarization Augmentation summarizes the input text using transformer models. [[17]](#ref-17) [[18]](#ref-18)
```python
>>> from text_data_augmentation import AbstractiveSummarization
>>> aug = AbstractiveSummarization()
>>> aug(['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R&D, to financial research and legal documents analysis.'])
['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R&D, to financial research and legal documents analysis.', 'Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text . Unlike extractive summarization, it does not copy important phrases from the source text but also potentially come up with new phrases thatare relevant, which can be seen as paraphrasing .']
```
### Back Translation
Back Translation Augmentation translates text into another language and then translates it back into the original language. This produces text whose wording differs from the original while preserving its context and meaning. [[1]](#ref-1) [[2]](#ref-2) [[10]](#ref-10)
```python
>>> from text_data_augmentation import BackTranslation
>>> aug = BackTranslation()
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps on the lazy dog']
```
### Character Noise
Character Noise Augmentation adds character-level noise by randomly inserting, deleting, swapping, or replacing some characters in the input text. [[2]](#ref-2) [[9]](#ref-9)
```python
>>> from text_data_augmentation import CharacterNoise
>>> aug = CharacterNoise(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps ovr the lazy dog']
```
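The four edit operations can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `character_noise`, the `seed` parameter, and the reading of `alpha` as the fraction of characters to edit are all hypothetical.

```python
import random
import string


def character_noise(text, alpha=0.1, seed=None):
    """Apply about alpha * len(text) random character edits (at least one)."""
    rng = random.Random(seed)
    chars = list(text)
    n_edits = max(1, int(alpha * len(chars)))
    for _ in range(n_edits):
        if not chars:
            break  # nothing left to edit
        op = rng.choice(["insert", "delete", "swap", "replace"])
        i = rng.randrange(len(chars))
        if op == "insert":
            chars.insert(i, rng.choice(string.ascii_lowercase))
        elif op == "delete":
            del chars[i]
        elif op == "swap" and i + 1 < len(chars):
            # Swap with the right-hand neighbour; a swap drawn at the
            # last position is a no-op.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "replace":
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)
```

Passing a fixed `seed` makes the noise reproducible, which helps when comparing training runs.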
### Contextual Word Replacement
Contextual Word Replacement Augmentation creates augmented samples by randomly replacing some words with a mask token and then using a masked language model to fill it. Word sampling can also be weighted by TF-IDF values. [[2]](#ref-2) [[3]](#ref-3) [[11]](#ref-11) [[19]](#ref-19)
```python
>>> from text_data_augmentation import ContextualWordReplacement
>>> aug = ContextualWordReplacement(n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over his lazy dog']
```
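TF-IDF weighting can be sketched with the standard library alone. The exact weighting the library uses is not stated here, so the code below is an assumption: it gives low-scoring (less informative) words a higher chance of being picked for replacement, in the spirit of reference 19. The names `tfidf_scores`, `replacement_weights`, and `sample_positions` are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_scores(doc_tokens, corpus_tokens):
    """TF-IDF per distinct token of one document, using a smoothed IDF."""
    n_docs = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))  # document frequency: count each doc once
    tf = Counter(doc_tokens)
    return {
        tok: (count / len(doc_tokens))
        * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
        for tok, count in tf.items()
    }


def replacement_weights(doc_tokens, corpus_tokens):
    """Sampling weight per position: larger for tokens with lower TF-IDF."""
    scores = tfidf_scores(doc_tokens, corpus_tokens)
    top = max(scores.values())
    # The small constant keeps every weight positive.
    raw = [top - scores[tok] + 1e-9 for tok in doc_tokens]
    total = sum(raw)
    return [w / total for w in raw]


def sample_positions(doc_tokens, corpus_tokens, k=1, seed=None):
    """Draw k positions to mask (with replacement), weighted as above."""
    if not doc_tokens:
        return []
    rng = random.Random(seed)
    weights = replacement_weights(doc_tokens, corpus_tokens)
    return rng.choices(range(len(doc_tokens)), weights=weights, k=k)
```

The sampled positions would then be masked and filled by a masked language model, which this sketch leaves out.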
### Easy Data Augmentation
Easy Data Augmentation adds word-level noise by randomly inserting, deleting, or swapping words in the input text, or by shuffling its sentences. [[4]](#ref-4) [[5]](#ref-5) [[9]](#ref-9) [[12]](#ref-12) [[13]](#ref-13)
```python
>>> from text_data_augmentation import EasyDataAugmentation
>>> aug = EasyDataAugmentation(n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over the dog']
```
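Two of these word-level operations, random deletion and random swap, can be sketched as follows. This is an illustration only: the function names, the `seed` parameter, and the use of `alpha` as a per-word deletion probability or swap fraction are assumptions. Random insertion and sentence shuffling are omitted.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def eda(text, alpha=0.1, seed=None):
    """Apply one randomly chosen word-level operation to the text."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    if rng.choice(["delete", "swap"]) == "delete":
        words = random_deletion(words, alpha, rng)
    else:
        words = random_swap(words, max(1, int(alpha * len(words))), rng)
    return " ".join(words)
```

Because neither operation introduces new vocabulary, every word in the output also appears in the input.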
### KeyBoard Noise
KeyBoard Noise Augmentation adds character-level spelling noise by mimicking the typographical errors made on a QWERTY keyboard. [[2]](#ref-2) [[9]](#ref-9)
```python
>>> from text_data_augmentation import KeyBoardNoise
>>> aug = KeyBoardNoise(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick broen fox jumps over the lazy dog']
```
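One way to mimic such typos is to replace a letter with a key adjacent to it. The sketch below builds an approximate neighbour map from the three letter rows, ignoring the physical stagger between rows. The names `NEIGHBORS` and `keyboard_noise`, and the per-character reading of `alpha`, are assumptions rather than the library's API.

```python
import random

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each letter to the keys around it on an unstaggered grid."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            adjacent = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(rows[rr]):
                        adjacent.add(rows[rr][cc])
            neighbors[ch] = sorted(adjacent)
    return neighbors


NEIGHBORS = build_neighbors(ROWS)


def keyboard_noise(text, alpha=0.1, seed=None):
    """Replace each letter with a neighbouring key with probability alpha."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = NEIGHBORS.get(ch.lower())
        if options and rng.random() < alpha:
            new = rng.choice(options)
            out.append(new.upper() if ch.isupper() else new)
        else:
            out.append(ch)  # non-letters and unselected letters pass through
    return "".join(out)
```

In this map `w` has `e` among its neighbours, so a substitution such as `brown` to `broen` in the example above is one of the possible outcomes.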
### OCR Noise
OCR Noise Augmentation adds character-level spelling noise by mimicking OCR errors in the input text. [[6]](#ref-6)
```python
>>> from text_data_augmentation import OCRNoise
>>> aug = OCRNoise(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick hrown lox jumps over the lazy dog']
```
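OCR-style noise can be approximated by substituting characters with visually similar ones. The confusion table below is a small illustrative assumption, not the table the library uses, and `ocr_noise` and its `seed` parameter are hypothetical names.

```python
import random

# Illustrative confusion pairs only (assumed, not taken from the library).
OCR_CONFUSIONS = {
    "b": ["h"],
    "f": ["l"],
    "l": ["1", "I"],
    "i": ["l", "1"],
    "o": ["0"],
    "O": ["0"],
    "e": ["c"],
    "s": ["5"],
}


def ocr_noise(text, alpha=0.1, seed=None):
    """Swap each confusable character for a look-alike with probability alpha."""
    rng = random.Random(seed)
    return "".join(
        rng.choice(OCR_CONFUSIONS[ch])
        if ch in OCR_CONFUSIONS and rng.random() < alpha
        else ch
        for ch in text
    )
```

A table derived from real OCR error statistics, as studied in reference 6, would give more realistic noise than these hand-picked pairs.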
### Paraphrase
Paraphrase Augmentation rephrases the input sentences using T5 models. [[2]](#ref-2)
```python
>>> from text_data_augmentation import Paraphrase
>>> aug = Paraphrase("", n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox has jumped on the lazy dog.']
```
### Similar Word Replacement
Similar Word Replacement Augmentation creates augmented samples by randomly replacing some words with the word whose vector is most similar. Word sampling can also be weighted by TF-IDF values. [[2]](#ref-2) [[7]](#ref-7) [[15]](#ref-15) [[16]](#ref-16) [[19]](#ref-19)
```python
>>> from text_data_augmentation import SimilarWordReplacement
>>> aug = SimilarWordReplacement("en_core_web_lg", alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick White Wolf jumps over the lazy Cat.']
```
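The nearest-vector lookup behind this technique can be sketched with cosine similarity over a toy vocabulary. The vectors in `TOY_VECTORS` are made up for illustration, and `most_similar` and `replace_similar` are hypothetical names; a real setup would read vectors from a model such as `en_core_web_lg`.

```python
import math
import random

# Made-up 3-dimensional vectors, for illustration only.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.0],
    "fox": [0.6, 0.4, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Return the other vocabulary word closest to `word`, or None if unknown."""
    if word not in vectors:
        return None
    best, best_sim = None, -2.0
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = cosine(vectors[word], vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best


def replace_similar(text, vectors, alpha=0.1, seed=None):
    """Replace each in-vocabulary word with its nearest neighbour with probability alpha."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        substitute = most_similar(word.lower(), vectors)
        replace = substitute is not None and rng.random() < alpha
        out.append(substitute if replace else word)
    return " ".join(out)
```

With these toy vectors, `most_similar("dog", TOY_VECTORS)` returns `"cat"`. Words outside the vocabulary are left unchanged.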
### Synonym Replacement
Synonym Replacement Augmentation creates augmented samples by randomly replacing some words with synonyms drawn from the WordNet database. Word sampling can also be weighted by TF-IDF values. [[2]](#ref-2) [[4]](#ref-4) [[8]](#ref-8) [[13]](#ref-13) [[19]](#ref-19)
```python
>>> from text_data_augmentation import SynonymReplacement
>>> aug = SynonymReplacement(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over the lethargic dog']
```
### Word Split
Word Split Augmentation adds word-level spelling noise by randomly splitting words in the input text. [[2]](#ref-2) [[14]](#ref-14)
```python
>>> from text_data_augmentation import WordSplit
>>> aug = WordSplit(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over th e lazy dog']
```
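Splitting a word amounts to inserting a space at a random interior position. A minimal sketch follows, assuming `alpha` is the fraction of words to split (at least one); the function name `word_split` and the `seed` parameter are hypothetical.

```python
import random


def word_split(text, alpha=0.1, seed=None):
    """Insert a space inside about alpha * n_words randomly chosen words."""
    rng = random.Random(seed)
    words = text.split()
    # Only words with at least two characters have an interior position.
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return text
    n_splits = min(max(1, int(alpha * len(words))), len(candidates))
    for i in rng.sample(candidates, n_splits):
        cut = rng.randrange(1, len(words[i]))  # never cut at either edge
        words[i] = words[i][:cut] + " " + words[i][cut:]
    return " ".join(words)
```

Removing all spaces from the output gives the same string as removing them from the input, so only word boundaries change.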
---
## References
1. Data Expansion Using Back Translation and Paraphrasing for Hate Speech Detection
2. A Survey on Data Augmentation for Text Classification
3. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
4. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
5. An Analysis of Simple Data Augmentation for Named Entity Recognition
6. Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing
7. A Study of Various Text Augmentation Techniques for Relation Classification in Free Text
8. Text Augmentation for Neural Networks
9. Synthetic And Natural Noise Both Break Neural Machine Translation
10. Improving Neural Machine Translation Models with Monolingual Data
11. Data Augmentation Using Pre-trained Transformer Models
12. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
13. Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models
14. TextBugger: Generating Adversarial Text Against Real-world Applications
15. Generating Natural Language Adversarial Examples
16. Character-level Convolutional Networks for Text Classification
17. Neural Abstractive Text Summarization with Sequence-to-Sequence Models
18. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
19. Unsupervised Data Augmentation for Consistency Training
20. Text Data Augmentation: Towards better detection of spear-phishing emails