Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/Foysal87/sbnltk

Bangla NLP toolkit. Bangla NER, POStag, Stemmer, Word embedding, sentence embedding, summarization, preprocessor, sentiment analysis, etc.
https://github.com/Foysal87/sbnltk
bangla-ner bangla-postag bangla-stemmer bangla-translator bangla-word-embedding bert
Last synced: about 1 month ago
JSON representation
Bangla NLP toolkit. Bangla NER, POStag, Stemmer, Word embedding, sentence embedding, summarization, preprocessor, sentiment analysis, etc.
Host: GitHub
URL: https://github.com/Foysal87/sbnltk
Owner: Foysal87
License: mit
Created: 2021-03-11T07:16:34.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-02-12T09:46:47.000Z (5 months ago)
Last Synced: 2024-05-19T07:20:33.427Z (about 1 month ago)
Topics: bangla-ner, bangla-postag, bangla-stemmer, bangla-translator, bangla-word-embedding, bert
Language: Python
Homepage:
Size: 91.8 KB
Stars: 23
Watchers: 1
Forks: 7
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists

awesome-bangla - Bangla NLP Toolkit
README

        pypi-download-stats

========================

[![PyPI version shields.io](https://img.shields.io/pypi/v/sbnltk.svg)](https://pypi.python.org/pypi/sbnltk/)

[![PyPI license](https://img.shields.io/pypi/l/sbnltk.svg)](https://pypi.python.org/pypi/sbnltk/)

[![PyPI pyversions](https://img.shields.io/pypi/pyversions/sbnltk.svg)](https://pypi.python.org/pypi/sbnltk/)

[![PyPI download month](https://img.shields.io/pypi/dm/sbnltk.svg)](https://pypi.python.org/pypi/sbnltk/)

[![PyPI download week](https://img.shields.io/pypi/dw/sbnltk.svg)](https://pypi.python.org/pypi/sbnltk/)

## Please use colab for getting no problem. For transformer model, please install simpleTransformer first or use [bn_nlp](https://github.com/Foysal87/bn_nlp) for static models. I uploaded dataset and training details in my github. *There is a problem in sentiment analyzer. I Will fix it soon.*

   

# SBNLTK

SUST-Bangla Natural Language toolkit. A python module for Bangla NLP tasks.\

Demo Version : 2.0.2 \

**NEED python 3.6+ vesrion!! Use virtual Environment for not getting unessessary Issues!!**

## INSTALLATION

### PYPI INSTALLATION

```commandline

pip3 install sbnltk

pip3 install simpletransformers

pip3 install fasttext

pip3 install scikit-learn

```

### MANUAL INSTALLATION FROM GITHUB

* Clone this project

* Install all the requirements

* Call the setup.py from terminal

## What will you get here?

* Bangla Text Preprocessor

* Bangla word dust,punctuation,stop word removal

* Bangla word sorting according to Bangla or English alphabet

* Bangla word normalization

* Bangla word stemmer

* Bangla Sentiment analysis(logisticRegression,LinearSVC,Multilnomial_naive_bayes,Random_Forst)

* Bangla Sentiment analysis with Bert

* Bangla sentence pos tagger (static, sklearn)

* Bangla sentence pos tagger with BERT(Multilingual-cased,Multilingual uncased) 

* Bangla sentence NER(Static,sklearn)

* Bangla sentence NER with BERT(Bert-Cased, Multilingual Cased/Uncased)

* Bangla word word2vec(gensim,glove,fasttext)

* Bangla sentence embedding(Contexual,Transformer/Bert)

* Bangla Document Summarization(Feature based, Contexual, sementic Based)

* Bangla Bi-lingual project(Bangla to english google translator without blocking IP)

* Bangla document information Extraction

**SEE THE CODE DOCS FOR USES!**

## TASKS, MODELS, ACCURACY, 
|            TASK           | 
|:-------------------------:| 
|        Preprocessor 
|      Word tokenizers      | 
|    Sentence tokenizers    | 
|          Stemmer          | 
|     Sentiment Analysis    | 
|                           | 
|                           | 
|                           | 
|                           | 
|         POS tagger        | 
|                           | 
|                           | 
|                           | 
|         NER tagger        | 
|                           | 
|                           | 
|                           | 
|                           | 
|       Word Embedding      | 
|                           | 
|                           | 
|     Sentence Embedding    | 
|                           | 
|                           | 
| Extractive  Summarization | 
|                           | 
|                           | 
|    Bi-lingual projects    | 
|   Information Extraction  | 
|                           | 
|                           |

DATASET AND DOCS MODEL                               |    ACCURACY    |         DATASET         | About | Code DOCS | :-----------------------------------------------------------------:|:--------------:|:-----------------------:|:-----:|:---------:| | Punctuation, Stop Word, DUST removal Word normalization, others.. |     ------     |          -----          |       |[docs](https://github.com/Foysal87/sbnltk/blob/main/docs/preprocessor.md)         | basic tokenizers Customized tokenizers              |      ----      |           ----          |       | [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Tokenizer.md#word-tokenizer)          | Basic tokenizers Customized tokenizers Sentence Cluster      |      -----     |          -----          |       | [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Tokenizer.md#sentence-tokenizer)         | StemmerOP                             |      85.5%     |           ----          |       | [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Stemmer.md)          | logisticRegression                        |      88.5%     |         20,000+         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentiment%20Analyzer.md#logistic-regression)         | LinearSVC                             |      82.3%     |         20,000+         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentiment%20Analyzer.md#linear-svc)         | Multilnomial_naive_bayes                     |      84.1%     |         20,000+         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentiment%20Analyzer.md#multinomial-naive-bayes)         | Random Forest                           |      86.9%     |         20,000+         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentiment%20Analyzer.md#random-forest)         | BERT                               |      93.2%     |         20,000+         |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentiment%20Analyzer.md#bert-sentiment-analysis)        | Static method                           |      55.5%     |     1,40,973  words     |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Postag.md#static-postag)        | SK-LEARN classification                      |      81.2%     |     6,000+ sentences    |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Postag.md#sklearn-postag)        | BERT-Multilingual-Cased                    |      69.2%     |          6,000+         |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Postag.md#bert-multilingual-cased-postag)        | BERT-Multilingual-Uncased                     |      78.7%     |          6,000+         |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Postag.md#bert-multilingual-uncased-postag)        | Static method                           |      65.3%     |     4,08,837 Entity     |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/NER.md#static-ner)       | SK-LEARN classification                      |      81.2%     |         65,000+         |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/NER.md#sklearn-classification)        | BERT-Cased                            |      79.2%     |         65,000+         |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/NER.md#bert-cased-ner)        | BERT-Mutilingual-Cased                      |      75.5%     |         65,000+         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/NER.md#bert-multilingual-cased-ner)         | BERT-Multilingual-Uncased                     |      90.5%     |         65,000+         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/NER.md#bert-multilingual-uncased-ner)         | Gensim-word2vec-100D- 1,00,00,000+ tokens             |        -       | 2,00,00,000+  sentences |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Word%20Embedding.md#gensim-word-embedding)         | Glove-word2vec-100D- 2,30,000+ tokens               |        -       |    5,00,000 sentences   |       | [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Word%20Embedding.md#fasttext-word-embedding)          | fastext-word2vec-200D 3,00,000+                  |        -       |    5,00,000 sentences   |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Word%20Embedding.md#glove-word-embedding)         | Contextual sentence embedding                   |        -       |          -----          |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentence%20Embedding.md#sentence-embedding-from-word2vec)         | Transformer embedding_hd                     |        -       |   3,00,000+ human data  |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentence%20Embedding.md#sentence-embedding-transformer-human-translate-data)        | Transformer embedding_gd                     |        -       |  3,00,000+ google data  |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Sentence%20Embedding.md#sentence-embedding-transformer-google-translate-data)        | Feature-based based                        | 70.0% f1 score |          ------         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Summarization.md#feature-based-model)         | Transformer sentence sentiment Based               |      67.0%     |          ------         |       | [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Summarization.md#word2vec_based_model)          | Word2vec--sentences contextual Based               |      60.0%     |          -----          |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Summarization.md#transformer_based_model)         | google translator with large data detector            |      ----      |           ----          |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/Bangla%20translator.md#using-google-translator)         | Static word features                       |        -       |                         |       |  [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/information%20extraction.md#feature-based-extraction)         | Semantic and contextual                      |        -       |                         |       |   [docs](https://github.com/Foysal87/sbnltk/blob/main/docs/information%20extraction.md#contextual-information-extraction)       | Bangla Coreference Resolution                      |        -       |                         |       |         |

## Next releases after testing this demo

|              Task              |    Version   |

|:------------------------------:|:------------:|

|     Coreference Resolution     |    v1.1      |

|      Language translation      |    V1.1      |

|      Masked Language model     |    V1.1      |

| Information retrieval Projects |    V1.1      |

|       Entity Segmentation      |    v1.3      |

|   Factoid Question Answering   |    v1.2      |

|     Question Classification    |    v1.2      |

|    sentiment Word embedding    |    v1.3      |

|     So many others features    |     ---      |

## Package Installation

You have to install these packages manually, if you get any module error.

* simpletransformers

* fasttext

## Models

Everything is automated here. when you call a model for the first time, it will be downloaded automatically.

## With GPU or Without GPU

* With GPU, you can run any models without getting any warnings.

* Without GPU, You will get some warnings. But this will not affect in result.

## Motivation

With approximately 228 million native speakers and another 37 million as second language speakers,Bengali is the fifth most-spoken native 

language and the seventh most spoken language by total number of speakers in the world. But still it is a low resource language. Why?

## Dataset

For all sbnltk dataset and existing Dataset, see this link [Bangla NLP Dataset](https://github.com/Foysal87/Bangla-NLP-Dataset) 

## Trainer

For training, You can see this [Colab Trainer](https://github.com/Foysal87/NLP-colab-trainer) . In future i will make a Trainer module!

## When will full version come?

Very soon. We are working on paper and improvement our modules. It will be released sequentially.

## About accuracy

Accuracy can be varied for the different datasets. We measure our model with random datasets but small scale. As human resources for this project are not so large.

## Contribute Here

* If you found any issue, please create an issue or contact with me.