indicnlp_catalog
A collaborative catalog of NLP resources for Indic languages
https://github.com/AI4Bharat/indicnlp_catalog
-
<a name='TextCorpora'></a>Text Corpora
-
<a name='MonolingualCorpus'></a>Monolingual Corpus
- v1 - emnlp.445)]
- LDCIL Monolingual Corpus
- Wikipedia Dumps
- OSCAR Corpus - scaled processed CommonCrawl.
- WMT Common Crawl Dumps
- WMT NEWS Crawl
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- Sanskrit Monolingual and Sandhi-split Corpus
- CMU Romanized Hinglish Corpus - 3211.pdf) for details.
- DNLP-Tel Telugu Corpus - gram model trained with word2vec.
- SinMin Corpus
- Nepali National corpus - national-corpus/).
-
<a name='NERCorpora'></a>NER Corpora
- WikiAnn NER Corpus - xdT99SeaCghihGa7nRkcXGwRGUIsKN?usp=sharing) (Old broken [LINK](http://nlp.cs.rpi.edu))
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- IJCNLP 2008 NER Corpus
- AI4Bharat Naamapadam
- L3Cube-MahaNER - conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
- MultiCoNER - mixed subsets. The NER tag set consists of six classes: PER, LOC, CORP, GRP, PROD, and CW. Described in [this paper](https://aclanthology.org/2022.semeval-1.196.pdf).
-
<a name='LexicalResources'></a>Lexical Resources and Semantic Similarity
- IndoWordNet
- Facebook Hindi Analogy Dataset
- AI4Bharat Cross-lingual Semantic Textual Similarity - Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines.
- Toxicity-200
-
<a name='ParallelTranslationCorpus'></a>Parallel Translation Corpus
- Samanantar Parallel Corpus - Indian languages and 82m sentence pairs between Indian languages.
- FLORES-200 - way parallel.
- IIT Bombay English-Hindi Parallel Corpus - hi parallel corpora in public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus - IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus
- PMIndia - Indian languages mined from _Mann ki Baat_ speeches of the PM of India ([paper](https://arxiv.org/abs/2001.09907)).
- OPUS corpus
- WAT 2018 Parallel Corpus
- Charles University English-Hindi Parallel Corpus
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus v1.0
- Charles University English-Odia Parallel Corpus v2.0
- Charles University English-Urdu Religious Parallel Corpus
- JW300 Corpus
- ALT Parallel Corpus
- English-Tamil Wiki Titles
- English-Gujarati Wiki Titles
- EILMT Corpus
- QED Corpus - Hindi corpus of 43k sentences from the educational domain.
- WikiMatrix Corpus
- CCMatrix - matrix)).
- CGNetSwara - Gondi parallel corpus (19k sentence pairs)
- CLE Parallel Corpus
- PHINC - Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf).
- CALCS 2021 Eng-Hinglish dataset - Hinglish parallel corpus containing 10k pairs of sentences. Described in [this paper](https://arxiv.org/pdf/2202.09625.pdf).
- NLLB-Mined
- CCAligned - lingual web-document pairs in 137 languages aligned with English.
- NLLB-MD - translated sentences in News, Unscripted informal speech, and Health domains. Covers Bhojpuri among Indian languages.
- Nepali National corpus - Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described [here](https://www.sketchengine.eu/nepali-national-corpus/).
-
<a name='MTEvaluation'></a>MT Evaluation
- WMT23 QE task - edits are also available, as are word-level annotations; error annotations are also available. 26k training sentences for Marathi, 7k for the others. [report](https://aclanthology.org/2023.wmt-1.52)
-
<a name='ParallelTransliterationCorpus'></a>Parallel Transliteration Corpus
- BrahmiNet Corpus
- Xlit-IITB-Par - English Transliteration Corpus mined from parallel translation corpora.
- FIRE 2013 Track on Transliterated Search
- NEWS 2018 Shared Task dataset
- AI4Bharat StoryWeaver Xlit Dataset - Transliteration datasets for Hindi, Maithili & Konkani
- Hindi WikiData Transliteration Pairs - Hindi dataset (90k pairs)
- NotAI-tech English-Telugu
- AI4Bharat Aksharantar - English transliteration pairs. Described in [this paper](https://arxiv.org/abs/2205.03018).
-
<a name='TextualClassification'></a>Text Classification
-
<a name='TextualEntailment'></a>Textual Entailment/Natural Language Inference
-
<a name='Paraphrase'></a>Paraphrase
-
<a name='SentimentAnalysis'></a>Sentiment, Sarcasm, Emotion Analysis
- IIT Bombay movie review datasets for Hindi and Marathi
- IIT Patna movie review datasets for Hindi
- IIIT-H LTRC Multi-domain dataset for Telugu
- ACTSA corpus for Telugu
- SentiWordNet - SAIL - Hindi, Bangla, Tamil & Telugu
- Dravidian-CodeMix - FIRE 2020 - Tamil & Malayalam
-
<a name='QuestionAnswering'></a>Question Answering
- bAbi 1.2 dataset
- XQA - 1227.pdf)
- Chaii - hindi-and-tamil-question-answering/discussion/264695) is a good collection of papers on multilingual Question Answering.
- csebuetnlp Bangla QA
- IITB HiQuAD - answer pairs. Described in [this paper](https://www.cse.iitb.ac.in/~ganesh/papers/acl2019a.pdf).
-
<a name='HateSpeech'></a>Hate Speech and Offensive Comments
- Hate Speech and Offensive Content Identification in Indo-European Languages - 2020)
- An Indian Language Social Media Collection for Hate and Offensive Speech, 2020
- Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018
- Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018 - 5118))
-
<a name='InformationExtraction'></a>Information Extraction
- EventXtract-IL - ws.org/Vol-2266/T5-1.pdf).
- EDNIL-FIRE2020
- Facebook - MTOP Benchmark - Oriented Semantic Parsing Benchmark with a dataset comprising 100k annotated utterances in 6 languages (including one Indic language, Hindi) across 11 domains. Described in [this paper](https://arxiv.org/pdf/2008.09335.pdf).
-
<a name='POSTaggedcorpus'></a>POS Tagged corpus
-
<a name='DependencyParseCorpus'></a>Dependency Parse Corpus
-
<a name='CoreferenceCorpus'></a>Coreference Corpus
-
<a name='Summarization'></a>Summarization
- TeSum - summary pairs, with the summaries being manually created. [[paper](https://aclanthology.org/2022.lrec-1.614)]
-
<a name='ChunkCorpus'></a>Chunk Corpus
-
-
<a name='MajorIndicLanguageNLPRepositories'></a>Major Indic Language NLP Repositories
- Technology Development for Indian Languages (TDIL)
- Center for Indian Language Technology (CFILT)
- Language Technologies Research Center (LTRC)
- AI4Bharat
- University of Hyderabad - Sanskrit NLP
- National Platform for Language Technology
- BUET CSE NLP Group
- KMI Linguistics
- IIT Patna
- Universal Language Contribution API (ULCA)
-
<a name='Libraries'></a>Libraries and Tools
-
<a name='Benchmarks'></a>Evaluation Benchmarks
-
<a name='Standards'></a>Standards
-
<a name='Models'></a>Models
-
<a name='WordEmbeddings'></a>Word Embeddings
- AI4Bharat IndicFT - text word embeddings for 11 Indian languages.
- FastText CommonCrawl+Wikipedia
- FastText Wikipedia
- Polyglot
-
<a name='PreTrainedLanguageModels'></a>Pre-trained Language Models
- AI4Bharat IndicBERT
- AI4Bharat IndicBART - to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages. Described in [this paper](https://arxiv.org/abs/2109.02903).
- MuRIL
- mBART50 - trained model trained on CommonCrawl of many languages (including major Indic languages).
- BLOOM - decoder language model (includes major Indic languages).
- albert-base-sanskrit - based model trained on Sanskrit Wikipedia.
- RoBERTa-hindi-guj-san
- LaBSE
- EM-ALBERT
-
<a name='MultilingualWordEmbeddings'></a>Multilingual Word Embeddings
-
<a name='TranslationModels'></a>Translation Models
-
<a name='TransliterationModels'></a>Transliteration Models
- AI4Bharat IndicXlit - based multilingual transliteration model with 11M parameters for Roman to native script conversion and vice versa that supports 21 Indic languages. Described in [this paper](https://arxiv.org/abs/2205.03018).
-
<a name='SpeechModels'></a>Speech Models
- AI4Bharat IndicWav2Vec - trained models for 40 Indian languages based on Wav2Vec 2.0.
- arijitx/wav2vec2-large-xlsr-bengali - large-xlsr trained on ~50 hrs (40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.
-
<a name='NER'></a>NER
- AI4Bharat IndicNER
- L3Cube-MahaNER-BERT - conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
- AsNER
-
-
<a name='SpeechCorpora'></a>Speech Corpora
-
<a name='NER'></a>NER
- Microsoft-IITB Marathi Speech Corpus
- AccentDB
- IIT Madras TTS database
- BABEL Speech Corpus
- WikiPron - 1.521)
- CVIT IndicSpeech
- Google Speech Corpus - #66, #78-#79. [(paper)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf)
- SMC Malayalam Speech Corpus - [Download link](https://releases.smc.org.in/msc-reviewed-speech/)
- IISc-MILE Kannada ASR Corpus
- IISc-MILE Tamil ASR Corpus
- MUCS 2021 Dataset - Switching ASR Challenges for Low Resource Indian Languages
- Gramvaani
- Kashmiri Data Corpus
- Hindi-Tamil-English ASR Challenge
- Large Sinhala ASR training data set
- Large Bengali ASR training data set
- Large Nepali ASR training data set
- Crowdsourced high-quality Gujarati multi-speaker speech data set
- Crowdsourced high-quality Kannada multi-speaker speech data set
- Crowdsourced high-quality Malayalam multi-speaker speech data set
- Crowdsourced high-quality Marathi multi-speaker speech data set
- Crowdsourced high-quality Tamil multi-speaker speech data set
- Crowdsourced high-quality Telugu multi-speaker speech data set
- Nepali National corpus - national-corpus/).
- Shrutilipi
-
-
<a name='OCRCorpora'></a>OCR Corpora
-
<a name='NER'></a>NER
-
-
<a name='MultimodalCorpora'></a>Multimodal Corpora
-
<a name='NER'></a>NER
- English-Hindi Visual Genome
- English-Hindi Flickr 8k - Comparable-Data-Collection).
-
-
<a name='LanguageSpecificCatalogs'></a>Language Specific Catalogs
-
<a name='NER'></a>NER
-
Categories
- Text Corpora: 101
- Speech Corpora: 25
- Models: 24
- Major Indic Language NLP Repositories: 11
- Multimodal Corpora: 4
- Evaluation Benchmarks: 3
- Libraries and Tools: 3
- Standards: 2
- OCR Corpora: 2
- Language Specific Catalogs: 1
Sub Categories
- NER: 35
- Parallel Translation Corpus: 29
- Monolingual Corpus: 15
- Pre-trained Language Models: 9
- Parallel Transliteration Corpus: 9
- NER Corpora: 8
- Sentiment, Sarcasm, Emotion Analysis: 8
- Lexical Resources and Semantic Similarity: 5
- Question Answering: 5
- Hate Speech and Offensive Comments: 4
- POS Tagged corpus: 4
- Dependency Parse Corpus: 4
- Word Embeddings: 4
- Translation Models: 4
- Information Extraction: 3
- Speech Models: 2
- Coreference Corpus: 1
- Chunk Corpus: 1
- Textual Entailment/Natural Language Inference: 1
- Transliteration Models: 1
- Multilingual Word Embeddings: 1
- Summarization: 1
- MT Evaluation: 1
- Paraphrase: 1
- Text Classification: 1