Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MahirMahbub/Contextual-Spell-Checker-For-Bangla
Automatic Context Sensitive Spelling Correction for Bangla Text Using Bert and Levenstein Distance
https://github.com/MahirMahbub/Contextual-Spell-Checker-For-Bangla
bangla-bert bangla-nlp bert fastapi levenshtein-distance ner nlp spellcheck spelling-correction
Last synced: 30 days ago
JSON representation
Automatic Context Sensitive Spelling Correction for Bangla Text Using Bert and Levenstein Distance
- Host: GitHub
- URL: https://github.com/MahirMahbub/Contextual-Spell-Checker-For-Bangla
- Owner: MahirMahbub
- License: mit
- Created: 2022-01-18T16:54:06.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-08-10T15:28:26.000Z (11 months ago)
- Last Synced: 2024-02-18T13:32:09.432Z (4 months ago)
- Topics: bangla-bert, bangla-nlp, bert, fastapi, levenshtein-distance, ner, nlp, spellcheck, spelling-correction
- Language: Python
- Homepage:
- Size: 146 KB
- Stars: 17
- Watchers: 4
- Forks: 3
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Lists
- awesome-bangla - Bangla Contextual Spell Checker
README
# Auto Progressive Contextual Spell Checker For Bangla
## Automatic Progressive Context-Sensitive Spelling Correction for Bangla Text Using BERT and Levenshtein Distance.
- Bert Masked Model (Added), Other model support(For example, LSTM/GRU based Masked Prediction model) will be added.
- Bert NER Model (Added)
- Levenshtein Distance (Added)
- Dictionary Look up (Added), 451742 unique words from Oscar 2019 dataset.
- Progressive spell checking with NER (Added)
- New constraints added while checking the spelling (Added)## Instruction
- Download a Bert Masked Model in "model/bangla-bert-base" (Recommeded https://huggingface.co/sagorsarker/bangla-bert-base)
- Download a Bert NER Model in "model/mbert-bengali-ner" (Recommended https://huggingface.co/sagorsarker/mbert-bengali-ner)
- Specify the Bert Masked Model and Bert NER Model controller class name in "config.json"
- [Download](https://drive.google.com/file/d/1Z98rG7CSvnHFUSOAZ0jtWCCAYf_nBde0/view?usp=sharing) dictionary and place in at /data/output/
- Run app.py for API(Based of Fastapi)**Example**:
```
from source.spell_checker import SpellCheckersentence = "পুলিশ আসা আগে ডাকাত পালিয়ে গোছে".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100)))
>>> ['পুলিশ', 'আসার', 'আগে', 'ডাকাত', 'পালিয়ে', 'গেছে']sentence = "এক এলাকা সোলতা আহমেদের ছে আব্দুর রহমান (৩০)".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100)))
>>>['একই', 'এলাকার', 'সোলতা', 'আহমেদের', 'ছেলে', 'আব্দুর', 'রহমান', '(৩০)']sentence = "২০১৫ সালের নভেম্বরে প্রান্সে জলবায়ূ সসেলনে বিশেবর ২০০ দেশ অংশগ্রহণ করে".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100)))
>>>['২০১৫', 'সালের', 'নভেম্বরে', 'ফ্রান্সের', 'জলবায়ূ', 'সম্মেলনে', 'বিশ্বের', '২০০', 'দেশ', 'অংশগ্রহণ', 'করে']sentence = "পরে তাদসের উিপর হামলা করে এলোপাতাড়ি কুপাতে থাকে"
print(SpellChecker().prediction(sentence=sentence, k=100)))
>>>['পরে', 'তাদের', 'উপর', 'হামলা', 'করে', 'এলোপাতাড়ি', 'কুপাতে', 'থাকে']sentence = "তাূরা দেখেন ঢাকার দূই সিটিতে মশা মারতে যে ওষধূ ছিটানো হয় তা অকার্যকর"
print(SpellChecker().prediction(sentence=sentence, k=100)))
>>>['তারা', 'দেখে', 'ঢাকার', 'দুই', 'সিটিতে', 'মশা', 'মারতে', 'যে', 'ওষুধ', 'ছিটানো', 'হত', 'তা', 'অকার্যকর']```
## Result (Based on 1.0.2-alpha)
Evaluation dataset in created from https://github.com/habibsifat/Algorithm-for-Bengali-Error-Dataset-Generation.
**TP:** Did not change the correct word / total correct word.
**FN:** Change the correct word incorrectly / total correct word.
**FP:** Did not change the incorrect word (Mark incorrect as correct) / total incorrect word.
**TN:** Change the incorrect word correctly / total incorrect word.
**TN_PLUS:** Change the incorrect word incorrectly.
### Result of bangla bert for different language models
| Model | Top N| TP | FN | FP | TN | TN_PLUS |
| :----------- | :----------- | :----------- | :----------- | :----------- | :----------- | :------------ |
| Sagor Sarkar | 50 | 0.9782 | 0.0218 | 0.4150 | 0.5017 | 0.0833 | -->
| NWP(W2V Skipgram)| 50 | 0.9879 | 0.0121 | 0.6612 | 0.2825 | 0.0563 | -->### The result of spell checker based on bangla bert for different conditions
We conducted the experiment on different value of maximum edit distance (ml). The conditions are given below:
• **C1:** ml = Probable misspell word(mw)’s length//2.
• **C2:** ml = mw’s length//2 if mw’s length > 4 else ml = 2.
• **C3:** ml = mw’s length//2 if mw’s length > 6 else ml = 2.
• **C4:** ml = mw’s length//2 if mw’s length > 6 else ml = 3
| Condition | TP | FN | FP | TN | TN_PLUS |
| :----------- | :----------- | :----------- | :----------- | :----------- | :----------- |
| C1 | 0.9837 | 0.0163 | 0.6779 | 0.3209 | 0.0012 |
| C2 | 0.9782 | 0.0218 | 0.4150 | 0.5017 | 0.0833 |
| C3 | 0.9776 | 0.0224 | 0.5534 | 0.4410 | 0.0056 |
| C4 | 0.9623 | 0.0377 | 0.6498 | 0.2010 | 0.1492 |## API
We also added API.