Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
tamil-nlp-catalog
Awesome List of Tamil NLP & AI Resources
https://github.com/narVidhai/tamil-nlp-catalog
Last synced: 5 days ago
**Tools, Libraries, Models**
General
- iNLTK
- Indic NLP Library - processing tools
- Awesome-Tamil
Word Embeddings
- Wikipedia-based - {2016}
- CommonCrawl+Wikipedia - {2017}
- AI4Bharat IndicFT - {2020}
- BPEmb: Subword Embeddings - {2017, [Aligned Multilingual](https://nlp.h-its.org/bpemb/multi/)}
- PolyGlot
- Multilingual Aligned - {2017}
- ConceptNet
- Facebook MUSE
- GeoMM
Transformers, BERT
- Multilingual BERT
- XLM-RoBERTa
- ALBERT
- Google ELECTRA - TaMillion - {2020, [Code](https://mapmeld.medium.com/training-bangla-and-tamil-language-bert-models-46d7262b550f)}
- TranKit
- Multilingual Text2Text
- Tamil - for-tanglish
- Google Multilingual T5
- TF-Hub
Translation
- AI4Bharat IndicTrans - {2021, [Paper](https://arxiv.org/abs/2104.05596)}
- IIT-B Śata-Anuva̅dak
- not-AI-Tech Anuvaad - {2020, mT5 model fine-tuned on public datasets}
- IIIT-H IndicMulti
- EasyNMT - Collection of open source multilingual NMT models
Transliteration
- AI4Bharat Xlit
- AksharaMukha - [API](http://aksharamukha.appspot.com/python)
- PolyGlot Transliteration
- EpiTran - IPA Transliteration
OCR
Speech
- Indic Wav2Vec2
- Coqui - [STT](https://github.com/coqui-ai/STT)
Grammar
**Datasets**
Monolingual Corpus
- OSCAR Corpus 2019 - Deduplicated Corpus {226M Tokens, 5.1GB}
- WMT Raw 2017 - CC crawls from 2012-2016
- CC-100 - CC crawls from Jan-Dec 2018
- AI4Bharat IndicCorp - {582M}
- WikiDumps
- WMT News Crawl
- Kaggle Tamil Articles Corpus
- Dinamalar News Corpus - {2009-19, 120k articles}
- Leipzig Corpora
- LDCIL Standard Text Corpus - Free for students/faculties {11M tokens}
- EMILLE Corpus - {20M Tokens, developed [in collaboration with CIIL](http://corpora.ciil.org/)}
- Project Madurai
Translation
- AI4Bharat Samān-Antar
- OPUS Corpus
- MultiCC Aligned, [Tanzil](https://opus.nlpl.eu/Tanzil.php), [bible-corpus](https://github.com/christos-c/bible-corpus), [WikiMatrix](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix), and more...
- CommonCrawl-Matrix
- MultiIndicMT - WAT2021
- PM India Corpus, [MkB](http://preon.iiit.ac.in/~jerin/bhasha/), [NLPC-UoM Corpus](https://github.com/nlpc-uom/English-Tamil-Parallel-Corpus), [Wiki Titles](http://data.statmt.org/wikititles/v2/wikititles-v2.ta-en.tsv.gz), [Charles University EnTam v2.0 Corpus](http://ufal.mff.cuni.cz/~ramasamy/parallel/html/)
- Synthetic Corpus - Translations generated using Google
- Tatoeba Wiki Back-translated data
- VPT-IL-FIRE2018 - 3k verb phrases, available on request
- MTData library
- Indian Language Corpora Initiative - Available only on request
- Tourism - dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1801&lang=en), [Health](http://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1789&lang=en)
- NPLT
- Parallel Chunked Text Corpus ILCI-II - dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1411), [Agriculture & Entertainment Text Corpus-ILCI II](https://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1675), [General Text Corpus](https://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1271), [Health Text Corpus](https://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1394)
- Telugu-Tamil General Text Corpus
- Sinhala-Tamil Parallel Corpus - {[Paper1](https://www.aclweb.org/anthology/U14-1018/), [Paper2](https://ieeexplore.ieee.org/document/7980522), Data available on request?, [Test set](https://github.com/nlpc-uom/Sinhala-Tamil-Aligned-Parallel-Corpus)}
- cEnTam: Creation of a New English-Tamil Corpus, 2020 - Uses OPUS+WMT20 data
- NPLT
Transliteration
- NEWS2018 Dataset
- ICTA English-Sinhala-Tamil Names - {2009, 10k triplets, SQL format}
Speech, Audio
- OpenSLR - {2020, 9 hours, [Paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf)}
- IARPA Babel - {2017, 350 hours}
- Mozilla CommonVoice - {2020, 20 hours}
- Spoken Tutorial - TODO: Scrape from here
- IIT Madras TTS database - {2020, [Competition](http://tdil-dc.in/ttsapi/ttschallenge2020/)}
- LinguaLibre - Wiktionary-based word corpus
- SLR65 - Crowdsourced high-quality Tamil multi-speaker speech dataset
- VoxLingua107 - Language Identification dataset
- A classification dataset for Tamil music - {2020, [Paper](https://arxiv.org/abs/2009.04459)}
Named Entity Recognition
- FIRE2014
- FIRE2015 Social Media Text - Tweets
- WikiAnn - ([Latest Download Link](https://drive.google.com/drive/folders/1Q-xdT99SeaCghihGa7nRkcXGwRGUIsKN))
Text Classification
- iNLTK News Articles Classification
- Indic Tamil NLP 2018
- Offensive Language Identification in Dravidian Languages - {2020, [Dataset](https://github.com/manikandan-ravikiran/DOSA)}
- TamilMurasu News Articles Classification
OCR
- LipiTK Isolated Handwritten Tamil Character Dataset - {156 characters, 500 samples per char}
- Tamil Vowels - Scanned Handwritten - {12 vowels, 18 images each}
- Jaffna University Datasets of printed Tamil characters and documents
- Kalanjiyam: Unconstrained Offline Tamil Handwritten Database - {2016, [Paper](https://link.springer.com/chapter/10.1007/978-3-319-68124-5_24)}
- SynthText - {2019}
- IIIT-H OCR benchmark and synthetic data - {2021, Available on request}
Part-Of-Speech (POS) Tagging
Sentiment and Abuse Analysis
Lexical Resources
Benchmarks
- XTREME-S: Evaluating Cross-lingual Speech Representations - {[Paper](https://arxiv.org/pdf/2203.10752.pdf)}
Miscellaneous NLP Datasets
- XNLI 2019 - Request via email
- AI4Bharat Cross-lingual Semantic Textual Similarity - {2020}
- Multilingual Entity-Linking from WikiNews - {2020}
- EventXtractionIL-FIRE2018
- EDNIL-FIRE2020
- CMEE-FIRE2016
- Paraphrase Identification - Amrita University-DPIL Corpus
- Anaphora Resolution from Social Media Text - FIRE2020
- TamilPaa Song-Lyrics Dataset, 2020
- English to Tamil
- Tamil Glossary Dataset
- AI4Bharat Cross-Lingual Sentence Retrieval
**Other Important Resources**
Miscellaneous NLP Datasets
Uncategorized
- awesome website