awesome-urdu
📖 A curated list of resources dedicated to the Urdu language.
https://github.com/urduhack/awesome-urdu
-
Urdu Datasets
-
General NLP Datasets
- Web News Data - Urdu web news data
- Urdu Paraphrase Plagiarism Corpus, 2016
- COrpus of Urdu News TExt Reuse (CoUNTeR), 2016
- Urdu Short Text Reuse Corpus (USTRC), 2018
- TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages
- Roman Urdu Dataset - Data for sentiment analysis, along with misc compiled data for Roman Urdu
- Collection of Urdu Datasets - Datasets for POS, NER and NLP tasks
- Urdu Universal Dependency Treebank
- UrduSummary Corpus Benchmark, 2016
- Rekhta Ghazals
- Flickr8k Urdu Image-Caption Generation Dataset, 2020
- mLAMA: multilingual LAnguage Model Analysis, 2021
- Urdu Word Segmentation using CRF, 2018
- Apertium linguistic data for Urdu
-
Urdu Text Classification
-
Urdu Named-Entity Recognition
-
Urdu Monolingual Corpora
- UFAL Corpus, 2014 - 5.4M sentences (with POS tags)
- iNLTK Wiki Articles, 2020 - Tatoeba-Challenge Backtranslations (NLP/Tatoeba-Challenge/blob/master/data/Backtranslations.md), [2016 UrduWikiCorpus](http://urdu-corpus.blogspot.com/p/published-packages.html)
- Leipzig Corpora
- UrduWaC-2010 and urTenTen-2018, SketchEngine
- A Gold Standard Urdu Raw Text Corpus, LDCIL
- WiToKit
- Maḵẖzan
-
Urdu Sentiment Datasets
- Urdu IMDb Movie Reviews - IMDb movie review data in Urdu
- 2010 Disaster Response Messages
- Urdu Sentiment Lexicon
- UCI Roman-Urdu Sentiment Classification, 2018 - 20k records
- Did You Offend Me? Classification of Offensive Tweets, 2018 - 3k tweets
- Sentiment Polarity Lexicons, 2017
- Urdu Sentiment Benchmark, 2020
- Hate Speech & Offensive Language Detection, 2020 - 10k tweets
-
Urdu OCR Datasets
- U-HAT - Urdu Hand-Written Text Dataset
- IIIT-Hyderabad: Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 2017
- CLE Pakistan Urdu Image Corpora
- Cursive-Text: A Benchmark for Urdu Text Recognition in Natural Scene Images, 2020 - 2500 images, email for dataset
- Qaida - Synthetic datasets and pre-trained models
- 45K+ Clean-Background-Urdu-Ligatures-Dataset, 2019
-
Urdu Parallel Corpora for Machine Translation
- OPUS Corpora
- CC-Aligned, [OpenSubtitles](https://www.aclweb.org/anthology/L16-1147/), [TED](https://www.ted.com/participate/translate), [QED](https://www.aclweb.org/anthology/L14-1675/), etc.
- IIIT-Hyderabad MT Bhasha
- English-Urdu Religious Parallel Corpus
- Urdu-Nepali-English Parallel Corpus
- Cross-Language English-Urdu (CLEU) Corpus, 2018
- Flickr 8k Benchmark - 2.7k sentences
- Universal Declaration of Human Rights (benchmark)
- EMILLE/CIIL Corpus - Contains monolingual data as well
- Technology Development for Indian Languages
- National Platform for Language Technology
- PM India Parallel Corpus
- Anuvaad Parallel Corpora
- MechanicalTurks 2012 Parallel Corpora
-
Urdu Transliteration Datasets
-
Urdu Lexical Resources
- CLE Urdu WordNet
- Verb List
- MTurks-10k Multilingual Dictionary, 2014
- Microsoft IT Terminology
- Urdu N-grams, 2020 - Uni-Gram, Bi-Gram, Tri-Gram and Tetra-Gram
- CLE Urdu Books N-Grams
- Offline Eng-Urd Dictionary DB
- UrduHack Words-List - Includes N-grams, NER Labels
- IndoWordnet Parallel Corpus - [pyiwn](https://github.com/riteshpanjwani/pyiwn), [Demo](https://www.cfilt.iitb.ac.in/indowordnet/)
- Roman Urdu Lexical Normalization, 2019
-
Urdu Speech Datasets
- Urdu 250 Isolated Words, 2018
- CLE Phonetically Rich Urdu Speech Corpus
- CMU Wilderness Speech Dataset, 2019
- FCBH Recordings
- LibriVox AudioBooks
- CLE Pakistan Urdu Speech Corpus
- LDC UPenn Datasets - Filter search by selecting language
- Urdu Raw Speech Corpus, LDCIL
- LDCIL ASR Corpus
- Urdu-Sindhi Speech Emotion Corpus, 2020 - Introducing_the_Urdu_Sindhi_Speech_Emotion_Corpus.pdf
- Speech Emotion Recognition Benchmark, 2018
-
Cross-lingual Datasets
- Cross-lingual Natural Language Inference (XNLI) Corpus, 2020
- Google XTrEME Benchmark, 2020 - Evaluation of cross-lingual generalization of multilingual models
- Urdu-Punjabi Pairs, Apertium
-
-
Urdu NLP Tools, Libraries and Models
- UrduHack
- Urdu Morphological Analyzer, IIIT Hyderabad
- PronouncUR - Converts Urdu words to their pronunciations
- Indic PoS/NER Tagger
- EasyOCR
-
Language Models
-
Word Embeddings
- UrduHack Word-Vectors, 2019 - Word2Vec and FastText models
- Wiki-2016 - 2017](https://fasttext.cc/docs/en/crawl-vectors.html), [Multilingual Aligned, 2017](https://github.com/babylonhealth/fastText_multilingual)
- Polyglot Embeddings, 2013
- ConceptNet Embeddings, 2017
-
Translation Models
- Facebook M2M-100, 2020
- IL-Multi, 2020
- Python Translators Services - Library to use Google, Bing, etc. translators for free
-
Transliteration Libraries
- PolyGlot
- AksharaMukhi - Devanagari (Hindi) to Urdu script converter
- Google Transliterate API - Roman Urdu to Perso-Arabic
- LibIndicTrans - Transliterate Roman/Hindi to Urdu and vice-versa
-
-
Online Resources/Services
-
Urdu News websites
- JANG Group
- BBC Urdu
- Voice of America Urdu
- Nawa-i-Waqt Group
- Urdu Point Network
- More news websites
-
Dictionaries
- ur.oxforddictionaries.com - Oxford Dictionary
- English Urdu Dictionary
- Urdu English Dictionary 2
-