https://github.com/ritvik19/text-data-augmentation

State of the Art Text Data Augmentation for Natural Language Processing Applications

natural-language-processing nlp

Host: GitHub
URL: https://github.com/ritvik19/text-data-augmentation
Owner: Ritvik19
License: mit
Created: 2021-10-17T06:36:23.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-02-04T17:11:52.000Z (over 4 years ago)
Last Synced: 2025-03-24T18:21:20.833Z (about 1 year ago)
Topics: natural-language-processing, nlp
Language: Python
Homepage: https://ritvik19.github.io/text-data-augmentation/
Size: 68.4 KB
Stars: 9
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Text-Data-Augmentation

State of the Art Text Data Augmentation for Natural Language Processing Applications

## Table of Contents

- [Text-Data-Augmentation](#text-data-augmentation)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
  - [Usage](#usage)
    - [Abstractive Summarization](#abstractive-summarization)
    - [Back Translation](#back-translation)
    - [Character Noise](#character-noise)
    - [Contextual Word Replacement](#contextual-word-replacement)
    - [Easy Data Augmentation](#easy-data-augmentation)
    - [KeyBoard Noise](#keyboard-noise)
    - [OCR Noise](#ocr-noise)
    - [Paraphrase](#paraphrase)
    - [Similar Word Replacement](#similar-word-replacement)
    - [Synonym Replacement](#synonym-replacement)
    - [Word Split](#word-split)
  - [References](#references)

---

## Installation

```bash
pip install git+https://github.com/Ritvik19/Text-Data-Augmentation.git
```

---

## Usage

This library provides various techniques for augmenting text data:

### Abstractive Summarization

Abstractive Summarization Augmentation summarizes the input text using transformer models. [[17]](#ref-17) [[18]](#ref-18)

```python
>>> from text_data_augmentation import AbstractiveSummarization
>>> aug = AbstractiveSummarization()
>>> aug(['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R&D, to financial research and legal documents analysis.'])
['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R&D, to financial research and legal documents analysis.', 'Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text . Unlike extractive summarization, it does not copy important phrases from the source text but also potentially come up with new phrases thatare relevant, which can be seen as paraphrasing .']
```

### Back Translation

Back Translation Augmentation translates text into another language and then back into the original language. This produces text whose wording differs from the original while preserving its context and meaning. [[1]](#ref-1) [[2]](#ref-2) [[10]](#ref-10)

```python
>>> from text_data_augmentation import BackTranslation
>>> aug = BackTranslation()
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps on the lazy dog']
```

### Character Noise

Character Noise Augmentation adds character-level noise by randomly inserting, deleting, swapping, or replacing some characters in the input text. [[2]](#ref-2) [[9]](#ref-9)

```python
>>> from text_data_augmentation import CharacterNoise
>>> aug = CharacterNoise(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps ovr the lazy dog']
```
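To make the four operations concrete, here is a minimal standalone sketch of character-level noise. It is not the library's implementation: the function name `add_character_noise` and the reading of `alpha` as a per-letter edit probability are assumptions made for illustration.

```python
import random
import string


def add_character_noise(text, alpha=0.1, seed=None):
    """Randomly insert, delete, swap, or replace letters with probability alpha each.

    Hypothetical helper: alpha as a per-letter probability is an assumption.
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < alpha:
            op = rng.choice(["insert", "delete", "swap", "replace"])
            if op == "insert":
                # Keep the letter and add a random one after it.
                out.extend([ch, rng.choice(string.ascii_lowercase)])
            elif op == "delete":
                pass  # Drop the letter.
            elif op == "swap" and i + 1 < len(chars):
                # Exchange the letter with its right-hand neighbor.
                out.extend([chars[i + 1], ch])
                i += 1  # The neighbor has already been emitted.
            else:
                # Replace; also the fallback when a swap is impossible at the end.
                out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(add_character_noise("A quick brown fox jumps over the lazy dog", alpha=0.1, seed=0))
```

Passing a fixed `seed` makes the noise reproducible, which helps when comparing training runs.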

### Contextual Word Replacement

Contextual Word Replacement Augmentation creates augmented samples by randomly replacing some words with a mask token and then using a masked language model to fill them in. Word sampling can also be weighted by TF-IDF values. [[2]](#ref-2) [[3]](#ref-3) [[11]](#ref-11) [[19]](#ref-19)

```python
>>> from text_data_augmentation import ContextualWordReplacement
>>> aug = ContextualWordReplacement(n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over his lazy dog']
```
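The TF-IDF weighting mentioned above can be sketched independently of any language model. The snippet below is an illustration only: `tfidf_scores` and `pick_positions` are hypothetical helpers, the smoothed IDF formula is one common choice, and favoring low-scoring (less informative) words for replacement is an assumption about how the weighting is applied, not something the README states.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: TF-IDF} dict per tokenized document (smoothed IDF)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return scores


def pick_positions(doc, scores, k, rng):
    """Sample up to k distinct word positions, favoring low TF-IDF words (assumption)."""
    top = max(scores.values())
    weights = [top - scores[word] + 1e-6 for word in doc]
    positions = list(range(len(doc)))
    chosen = []
    for _ in range(min(k, len(doc))):
        idx = rng.choices(range(len(positions)), weights=weights)[0]
        chosen.append(positions.pop(idx))  # Sample without replacement.
        weights.pop(idx)
    return sorted(chosen)


docs = [s.split() for s in ["the fox jumps", "the dog sleeps", "the cat sits"]]
scores = tfidf_scores(docs)
print(pick_positions(docs[0], scores[0], k=1, rng=random.Random(0)))
```

The selected positions are the ones that would then be masked and filled, or swapped for a similar word or synonym.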

### Easy Data Augmentation

Easy Data Augmentation adds word-level noise by randomly inserting, deleting, or swapping some words in the input text, or by shuffling its sentences. [[4]](#ref-4) [[5]](#ref-5) [[9]](#ref-9) [[12]](#ref-12) [[13]](#ref-13)

```python
>>> from text_data_augmentation import EasyDataAugmentation
>>> aug = EasyDataAugmentation(n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over the dog']
```
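As a rough illustration of two of these word-level operations, the sketch below implements random deletion and random swap on a token list. The helper names and parameters (`p`, `n_swaps`) are hypothetical and do not reflect the library's internals. Random insertion is left out because it needs a source of candidate words.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)  # Work on a copy so the caller's list is untouched.
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


rng = random.Random(0)
tokens = "A quick brown fox jumps over the lazy dog".split()
print(" ".join(random_deletion(tokens, p=0.1, rng=rng)))
print(" ".join(random_swap(tokens, n_swaps=1, rng=rng)))
```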

### KeyBoard Noise

KeyBoard Noise Augmentation adds character-level spelling noise by mimicking typographical errors made on a QWERTY keyboard. [[2]](#ref-2) [[9]](#ref-9)

```python
>>> from text_data_augmentation import KeyBoardNoise
>>> aug = KeyBoardNoise(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick broen fox jumps over the lazy dog']
```
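One way to picture this is a lookup of physically adjacent keys. The sketch below builds an approximate neighbor map from the three QWERTY letter rows and substitutes letters with a neighbor. The names `build_neighbors` and `add_keyboard_noise` are hypothetical, and aligning the rows by column index ignores the real keyboard's stagger, so the map is only an approximation.

```python
import random

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys in the surrounding 3x3 grid cell (approximate layout)."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            for rr in (r - 1, r, r + 1):
                if 0 <= rr < len(rows):
                    for cc in (c - 1, c, c + 1):
                        if 0 <= cc < len(rows[rr]):
                            adjacent.add(rows[rr][cc])
            adjacent.discard(key)
            neighbors[key] = sorted(adjacent)
    return neighbors


NEIGHBORS = build_neighbors(ROWS)


def add_keyboard_noise(text, alpha=0.1, seed=None):
    """Replace each letter with an adjacent key with probability alpha (assumption)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = NEIGHBORS.get(ch.lower())
        if options and rng.random() < alpha:
            typo = rng.choice(options)
            out.append(typo.upper() if ch.isupper() else typo)  # Preserve case.
        else:
            out.append(ch)
    return "".join(out)


print(add_keyboard_noise("A quick brown fox jumps over the lazy dog", alpha=0.1, seed=0))
```

In this map `e` is a neighbor of `w`, which is the kind of slip behind `brown` becoming `broen` in the example output above.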

### OCR Noise

OCR Noise Augmentation adds character-level spelling noise by mimicking OCR errors in the input text. [[6]](#ref-6)

```python
>>> from text_data_augmentation import OCRNoise
>>> aug = OCRNoise(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick hrown lox jumps over the lazy dog']
```

### Paraphrase

Paraphrase Augmentation rephrases the input sentences using T5 models. [[2]](#ref-2)

```python
>>> from text_data_augmentation import Paraphrase
>>> aug = Paraphrase("", n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox has jumped on the lazy dog.']
```

### Similar Word Replacement

Similar Word Replacement Augmentation creates augmented samples by randomly replacing some words with the word whose vector is most similar. Word sampling can also be weighted by TF-IDF values. [[2]](#ref-2) [[7]](#ref-7) [[15]](#ref-15) [[16]](#ref-16) [[19]](#ref-19)

```python
>>> from text_data_augmentation import SimilarWordReplacement
>>> aug = SimilarWordReplacement("en_core_web_lg", alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick White Wolf jumps over the lazy Cat.']
```
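The nearest-vector lookup can be illustrated without a real embedding model. In the sketch below, `TOY_VECTORS` holds made-up three-dimensional vectors and `most_similar` picks the vocabulary word with the highest cosine similarity. These names and values are purely hypothetical; a real setup would take its vectors from a model such as the `en_core_web_lg` pipeline used above.

```python
import math

# Made-up vectors for illustration only; real embeddings have hundreds of dimensions.
TOY_VECTORS = {
    "dog": (0.9, 0.1, 0.0),
    "cat": (0.8, 0.2, 0.1),
    "fox": (0.7, 0.3, 0.2),
    "wolf": (0.75, 0.3, 0.15),
    "car": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other vocabulary word whose vector is closest to word's vector."""
    target = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def replace_similar(sentence, vectors):
    """Replace every in-vocabulary word with its nearest neighbor; keep the rest."""
    return " ".join(
        most_similar(w, vectors) if w in vectors else w for w in sentence.split()
    )


print(replace_similar("A quick brown fox jumps over the lazy dog", TOY_VECTORS))
```

This sketch replaces every word it knows. Replacing only a random fraction of them, as `alpha` suggests, would add a sampling step in front of the lookup.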

### Synonym Replacement

Synonym Replacement Augmentation creates augmented samples by randomly replacing some words with their synonyms from the WordNet database. Word sampling can also be weighted by TF-IDF values. [[2]](#ref-2) [[4]](#ref-4) [[8]](#ref-8) [[13]](#ref-13) [[19]](#ref-19)

```python
>>> from text_data_augmentation import SynonymReplacement
>>> aug = SynonymReplacement(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over the lethargic dog']
```

### Word Split

Word Split Augmentation adds word-level spelling noise by randomly splitting words in the input text. [[2]](#ref-2) [[14]](#ref-14)

```python
>>> from text_data_augmentation import WordSplit
>>> aug = WordSplit(alpha=0.1, n_aug=1)
>>> aug(['A quick brown fox jumps over the lazy dog'])
['A quick brown fox jumps over the lazy dog', 'A quick brown fox jumps over th e lazy dog']
```
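A word split amounts to inserting a space somewhere inside a word. The following sketch shows the idea under the assumption that `alpha` is a per-word split probability; `split_words` is a hypothetical name, not the library's function.

```python
import random


def split_words(text, alpha=0.1, seed=None):
    """Insert a space at a random inner position of each word with probability alpha."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) > 1 and rng.random() < alpha:
            cut = rng.randint(1, len(word) - 1)  # Never cut at the very start or end.
            out.append(word[:cut] + " " + word[cut:])
        else:
            out.append(word)
    return " ".join(out)


print(split_words("A quick brown fox jumps over the lazy dog", alpha=0.1, seed=0))
```

Because only spaces are added, removing all whitespace from the output recovers the same character sequence as the input.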

---

## References

1. Data Expansion Using Back Translation and Paraphrasing for Hate Speech Detection

2. A Survey on Data Augmentation for Text Classification

3. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

4. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

5. An Analysis of Simple Data Augmentation for Named Entity Recognition

6. Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

7. A Study of Various Text Augmentation Techniques for Relation Classification in Free Text

8. Text Augmentation for Neural Networks

9. Synthetic And Natural Noise Both Break Neural Machine Translation

10. Improving Neural Machine Translation Models with Monolingual Data

11. Data Augmentation Using Pre-trained Transformer Models

12. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages

13. Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models

14. TextBugger: Generating Adversarial Text Against Real-world Applications

15. Generating Natural Language Adversarial Examples

16. Character-level Convolutional Networks for Text Classification

17. Neural Abstractive Text Summarization with Sequence-to-Sequence Models

18. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

19. Unsupervised Data Augmentation for Consistency Training

20. Text Data Augmentation: Towards better detection of spear-phishing emails
