https://github.com/sagorbrur/bnaug
Bangla Text Augmentation
- Host: GitHub
- URL: https://github.com/sagorbrur/bnaug
- Owner: sagorbrur
- License: mit
- Created: 2021-03-09T14:56:42.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-08-30T16:28:16.000Z (almost 3 years ago)
- Last Synced: 2025-08-16T21:35:54.577Z (10 months ago)
- Topics: augmentation-libraries, back-translation, bangla, bangla-text-augmentation, bengali, bengali-nlp, text-augmentation
- Language: Jupyter Notebook
- Homepage:
- Size: 51.8 KB
- Stars: 11
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: license
README
# bnaug (Bangla Text Augmentation)
__bnaug__ is a text augmentation tool for Bangla text.
## Installation
```
pip install bnaug
```
- Dependencies
  - pytorch >=1.7.0
## Demo Notebook
- [bnaug demo](https://github.com/sagorbrur/bnaug/blob/main/notebook/bnaug_demo.ipynb)
## Necessary Model Links
- [word2vec](https://huggingface.co/sagorsarker/bangla_word2vec/resolve/main/bangla_word2vec_gen4.zip)
- [glove vector](https://huggingface.co/sagorsarker/bangla-glove-vectors/resolve/main/bn_glove.300d.zip)
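
The archives above need to be downloaded and unpacked before the Word2Vec and GloVe examples can load them. The following is a minimal sketch using only the Python standard library. The helper name `fetch_and_extract` and the `msc` target directory are assumptions chosen to match the paths used in the examples below, and the file names inside each archive are not documented here, so check the returned list before pointing `model` or `vector_path` at a file.

```py
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url, dest_dir="msc"):
    """Download a zip archive from `url` and unpack it into `dest_dir`.

    Hypothetical helper, not part of bnaug. Returns the archive's file names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local archive after the last path component of the URL.
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Example call (commented out because it downloads a large file):
# names = fetch_and_extract(
#     "https://huggingface.co/sagorsarker/bangla-glove-vectors/resolve/main/bn_glove.300d.zip"
# )
# print(names)
```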
## Sentence Augmentation
### Token Replacement
- Mask generation based augmentation
```py
from bnaug.sentence import TokenReplacement
tokr = TokenReplacement()
text = "আমি ঢাকায় বাস করি।"
output = tokr.masking_based(text, sen_n=5)
print(output)
```
- Word2Vec based augmentation
```py
from bnaug.sentence import TokenReplacement
tokr = TokenReplacement()
text = "আমি ঢাকায় বাস করি।"
model = "msc/bangla_word2vec/bnwiki_word2vec.model"
output = tokr.word2vec_based(text, model=model, sen_n=5, word_n=5)
print(output)
```
- Glove based augmentation
```py
from bnaug.sentence import TokenReplacement
tokr = TokenReplacement()
text = "আমি ঢাকায় বাস করি।"
vector = "msc/bn_glove.300d.txt"
output = tokr.glove_based(text, vector_path=vector, sen_n=5, word_n=5)
print(output)
```
### Back Translation
Back-translation-based augmentation first translates a Bangla sentence into English and then translates the English back into Bangla.
```py
from bnaug.sentence import BackTranslation
bt = BackTranslation()
text = "বাংলা ভাষা আন্দোলন তদানীন্তন পূর্ব পাকিস্তানে সংঘটিত একটি সাংস্কৃতিক ও রাজনৈতিক আন্দোলন। "
output = bt.get_augmented_sentences(text)
print(output)
```
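
The round trip described above can be illustrated without any translation model. This is only a sketch of the general technique, not bnaug's implementation: `back_translate` and `augment` are hypothetical names, and the two dictionaries are toy stand-ins for real Bangla-to-English and English-to-Bangla translators.

```py
def back_translate(text, to_english, to_bangla):
    """Round-trip `text` through English and return the Bangla result."""
    english = to_english(text)
    return to_bangla(english)


def augment(text, to_english, to_bangla):
    """Return the round-tripped sentence only if it differs from the input."""
    candidate = back_translate(text, to_english, to_bangla).strip()
    if candidate and candidate != text.strip():
        return [candidate]
    return []


# Toy stand-ins for real translation models, used only to exercise the flow.
SOURCE = "আমি ঢাকায় বাস করি।"
BN_TO_EN = {SOURCE: "I live in Dhaka."}
EN_TO_BN = {"I live in Dhaka.": "আমি ঢাকায় থাকি।"}

print(augment(SOURCE, BN_TO_EN.get, EN_TO_BN.get))
```

A round trip that returns the input unchanged adds nothing to a training set, which is why the sketch drops identical results.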
### Text Generation
- Paraphrase generation
```py
from bnaug.sentence import TextGeneration
tg = TextGeneration()
text = "বিমানটি যখন মাটিতে নামার জন্য এয়ারপোর্টের কাছাকাছি আসছে, তখন ল্যান্ডিং গিয়ারের খোপের ঢাকনাটি খুলে যায়।"
output = tg.parapharse_generation(text)
print(output)
```
### Random Augmentation
- Randomly remove parts of a sentence to generate new sentences
At present, it removes words, stopwords, punctuation, and numbers to generate new sentences.
```py
from bnaug.sentence import RandomAugmentation
raug = RandomAugmentation()
sentence = "আমি ১০০ বাকি দিলাম"
output = raug.random_remove(sentence)
print(output)
```
Or apply each removal step individually:
```py
from bnaug import randaug

# Remove digits
text = "১০০ বাকি দিলাম"
output = randaug.remove_digits(text)
print(output)

# Remove punctuation
text = "১০০! বাকি দিলাম?"
output = randaug.remove_punctuations(text)
print(output)

# Remove stopwords
text = "আমি ১০০ বাকি দিলাম"
output = randaug.remove_stopwords(text)
print(output)

# Remove a random word
text = "আমি ১০০ বাকি দিলাম"
output = randaug.remove_random_word(text)
print(output)

# Remove a random character
text = "আমি ১০০ বাকি দিলাম"
output = randaug.remove_random_char(text)
print(output)
```
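
For intuition about what these removal steps do to a sentence, here is a standalone sketch that does not use bnaug. The function names `drop_digits`, `drop_punctuation`, and `drop_random_word` are hypothetical, and the logic is an assumption about how such steps could be written, not a copy of the library's code.

```py
import random
import unicodedata


def drop_digits(text):
    """Remove digit characters (Bangla or ASCII) and tidy the spacing."""
    kept = "".join(ch for ch in text if not ch.isdigit())
    return " ".join(kept.split())


def drop_punctuation(text):
    """Remove every character whose Unicode category is punctuation (P*)."""
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


def drop_random_word(text, rng=random):
    """Remove one whitespace-separated word chosen at random."""
    words = text.split()
    if len(words) < 2:
        # Nothing sensible to remove from an empty or single-word sentence.
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


sentence = "আমি ১০০ বাকি দিলাম"
print(drop_digits(sentence))
print(drop_punctuation("১০০! বাকি দিলাম?"))
print(drop_random_word(sentence, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the random step reproducible, which helps when regenerating an augmented dataset.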
## Inspired by
- [nlpaug](https://github.com/makcedward/nlpaug)
- [amitness blog post](https://amitness.com/2020/05/data-augmentation-for-nlp/)