https://github.com/sagorbrur/bangla-corpus

A curated list of Bangla NLP corpora

# Bangla Corpus
A curated list of Bangla NLP corpora.

## Bangla Machine Translation
- [Bangla NMT Corpus](https://github.com/csebuetnlp/banglanmt) (size 2.75M)
- [Bengali-English Bilingual Corpus](http://www.manythings.org/anki/) (size 4332)
- [Samanantar](https://indicnlp.ai4bharat.org/samanantar/)
- [OPUS Corpus](https://opus.nlpl.eu/)
- [ALT](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/)
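
Several of the parallel corpora above are distributed as plain tab-separated text. The sketch below shows one way to read such a file into sentence pairs. It assumes each line holds an English sentence, a tab, the Bangla sentence, and optionally a third attribution column; that layout, the function name `parse_pairs`, and the sample lines are illustrative assumptions, so check the format of the file you actually download.

```python
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse tab-separated lines into (english, bangla) pairs.

    Assumed layout (not guaranteed by any corpus listed here):
    english<TAB>bangla[<TAB>attribution]
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Skip lines that do not contain both sides of the pair.
            continue
        english, bangla = fields[0].strip(), fields[1].strip()
        if english and bangla:
            pairs.append((english, bangla))
    return pairs


# Hypothetical sample lines standing in for a downloaded file.
sample = [
    "Go.\tযাও।\tattribution text\n",
    "malformed line without a tab\n",
    "Thank you.\tধন্যবাদ।\n",
]
pairs = parse_pairs(sample)
print(len(pairs), "pairs parsed")
```

To read a real file, pass an open file handle instead of `sample`, for example `parse_pairs(open(path, encoding="utf-8"))`, where `path` points at the extracted text file.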

## Bangla Transliteration
- [Dakshina Datasets](https://github.com/google-research-datasets/dakshina)
- [Bangla NLP Transliteration Datasets](https://github.com/arijitx/BanglaNLP)
- [IndicTrans](https://github.com/AI4Bharat/indicTrans)

## Bangla NER
- [Banner](https://github.com/imranulashrafi/banner)
- [WikiANN](https://metatext.io/datasets/wikiann)

## Bangla POS
- [Bangla POS Tagger](https://github.com/abhishekgupta92/bangla_pos_tagger)

## Bangla Question Answering
- [TyDi QA](https://ai.google.com/research/tydiqa)
- [squad_bn by csebuetnlp (translated)](https://huggingface.co/datasets/csebuetnlp/squad_bn)

## Bangla Text Classification
- [Bangla Hate Speech Dataset by Rezaul Karim](https://github.com/rezacsedu/Bengali-Hate-Speech-Dataset)
- [Sentiment Analysis by Rezaul Karim](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP/tree/master/SentimentAnalysis_Multichannel_CNN_LSTM) (7.5k)
- [Fake News](https://github.com/Rowan1697/FakeNews)
- [Bangla Emotion Classification](https://github.com/omar-sharif03/NAACL-SRW-2021)
- [Bangla News Classification](https://github.com/soham96/Bengali_news_classifier)
- [Socian Sentiment Datasets](https://github.com/socian-ai/socian-bangla-sentiment-dataset-labeled) (4k)
- [xnli_bn by csebuetnlp](https://huggingface.co/datasets/csebuetnlp/xnli_bn)

## Bangla Text Summarization
- [xlsum by csebuetnlp](https://huggingface.co/datasets/csebuetnlp/xlsum)

## Bangla Paraphrase
- [Bangla Paraphrase by csebuetnlp](https://huggingface.co/datasets/csebuetnlp/BanglaParaphrase)
- [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase)

## Bangla Code-Mixed Dataset
- [Bangla Code Mixing by Amitava Das](https://amitavadas.com/Code-Mixing.html)

## Benchmark Datasets
- [XTREME](https://github.com/google-research/xtreme)

## Bangla Raw Datasets
- [OSCAR](https://oscar-corpus.com/)
- [Wiki Dump](https://dumps.wikimedia.org/bnwiki/latest/)
- [Indic Corpus](https://indicnlp.ai4bharat.org/corpora/)
- [Common Crawl](http://data.statmt.org/ngrams/raw/)
- [cc-100](http://data.statmt.org/cc-100/)
- [Sangraha](https://huggingface.co/datasets/ai4bharat/sangraha)
- [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)
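
Web-crawled text like the sources above often mixes Bangla with other scripts, so a script filter is a common first cleaning step. The following is a minimal sketch, not a method prescribed by any of these corpora: it keeps lines whose non-whitespace characters fall mostly inside the Unicode Bengali block (U+0980 to U+09FF). The function names and the `threshold` default are assumptions chosen for illustration.

```python
from typing import Iterable, List


def bangla_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Bengali block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bangla = sum(1 for ch in chars if "\u0980" <= ch <= "\u09FF")
    return bangla / len(chars)


def keep_bangla_lines(lines: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Keep lines whose Bangla character ratio reaches the threshold.

    The 0.5 default is an arbitrary illustrative value; tune it per corpus.
    """
    return [line for line in lines if bangla_ratio(line) >= threshold]


# Hypothetical sample lines. The danda "।" (U+0964) lies outside the
# Bengali block, so it counts against the ratio here.
sample = ["আমি ভাত খাই।", "Hello world", ""]
print(keep_bangla_lines(sample))
```

Digits, Latin punctuation, and the danda lower the ratio, which is why the sketch uses a threshold rather than requiring every character to be Bangla.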

## Bangla Embeddings
- [fastText](https://fasttext.cc/docs/en/crawl-vectors.html)
- [word2vec](https://drive.google.com/file/d/1cQ8AoSdiX5ATYOzcTjCqpLCV1efB9QzT/view?usp=sharing)
- [BPEmb](https://bpemb.h-its.org/)
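
Pretrained word vectors are commonly shipped in the fastText text format (`.vec`): a header line with the vocabulary size and dimension, then one word per line followed by its vector components. The sketch below reads that format with the standard library only and compares two words by cosine similarity. The helper names `read_vec` and `cosine` and the tiny in-memory sample are assumptions for illustration; whether a given download uses this text format should be checked against its own documentation.

```python
import io
import math
from typing import Dict, List, Optional, TextIO


def read_vec(handle: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from a fastText-style text (.vec) stream."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            # Skip malformed rows instead of failing on them.
            continue
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional sample standing in for a real vector file.
sample = io.StringIO("3 2\nবই 1.0 0.0\nকলম 0.0 1.0\nখাতা 1.0 0.0\n")
vectors = read_vec(sample)
print(cosine(vectors["বই"], vectors["খাতা"]))
```

For a real file, open it in text mode with `encoding="utf-8"` (or through `gzip.open(path, "rt", encoding="utf-8")` if it is compressed) and pass a `limit` to avoid loading the full vocabulary into memory.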