# Awesome Urdu [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

> A curated list of resources dedicated to the Urdu language.
>
> Maintainers - [Ikram Ali](https://github.com/akkefa)

*Please read the [contribution guidelines](contributing.md) before contributing.*

Please feel free to create [pull requests](https://github.com/urduhack/awesome-urdu/pulls).

## Urdu Datasets

### General NLP Datasets

- [Web news Data](https://github.com/urduhack/) - Urdu Web news Data
- [Roman Urdu Dataset](https://github.com/Smat26/Roman-Urdu-Dataset) - Data for sentiment analysis, along with misc compiled data for Roman Urdu
- [Collection of Urdu Datasets](https://github.com/mirfan899/Urdu) - Datasets for POS, NER and NLP tasks
- [Urdu Universal Dependency Treebank](https://github.com/UniversalDependencies/UD_Urdu-UDTB)
- [UrduSummary Corpus Benchmark, 2016](https://github.com/humsha/USCorpus)
- [Rekhta Ghazals](https://github.com/amir9ume/urdu_ghazals_rekhta)
- [Urdu Paraphrase Plagiarism Corpus, 2016](http://ucrel.lancs.ac.uk/textreuse/uppc.php)
  - Derived from: [COrpus of Urdu News TExt Reuse (CoUNTeR), 2016](http://ucrel.lancs.ac.uk/textreuse/counter.php)
  - Extension: [Urdu Short Text Reuse Corpus (USTRC), 2018](http://ucrel.lancs.ac.uk/textreuse/ustrc.php)
- [TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages](https://zenodo.org/record/3707949)
- [Flickr8k Urdu Image-Caption Generation Dataset, 2020](https://github.com/abdullahzia510/Effecient-Urdu-Caption-Generation-using-Attention-Mechanism)
- [mLAMA: multilingual LAnguage Model Analysis, 2021](https://github.com/norakassner/mlama)
- [Urdu Word Segmentation using CRF, 2018](https://github.com/harisbinzia/Urdu-Word-Segmentation)
- [Apertium linguistic data for Urdu](https://github.com/apertium/apertium-urd)

### Urdu Text Classification

- [Fake News Classification, 2020](https://github.com/MaazAmjad/Datasets-for-Urdu-news) ([Old Version](https://github.com/MaazAmjad/Urdu-News-Augmented-Dataset))
- [iNLTK Urdu News Headlines Classification Benchmark, 2020](https://www.kaggle.com/disisbig/urdu-news-dataset)
- [Express News Headlines+Summary, 2019](https://github.com/mwaseemrandhawa/Urdu-News-Headline-Dataset)

### Urdu Named-Entity Recognition

- [MK-PUCIT-NER, 2019](https://www.kaggle.com/safiakanwal/mkpucit-ner-dataet)
- [WikiAnn: Cross-lingual Name Tagging & Linking, 2017](https://elisa-ie.github.io/wikiann/)

### Urdu Monolingual Corpora

- [UFAL Corpus, 2014](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5) - 5.4M sentences (with POS tags)
- CommonCrawl
  - [OSCAR Corpus, 2020](https://oscar-corpus.com/)
  - [CC-100 Corpus, 2019](http://data.statmt.org/cc-100/) - CC crawls from Jan-Dec 2018
  - [WMT Raw 2017](http://data.statmt.org/ngrams/raw/) - CC crawls from 2012-2016
- [WikiDumps](https://dumps.wikimedia.org/urwiki/)
  - Processed Dumps: [iNLTK Wiki Articles, 2020](https://www.kaggle.com/disisbig/urdu-wikipedia-articles), [Tatoeba Challenge, 2020](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/Backtranslations.md), [2016 UrduWikiCorpus](http://urdu-corpus.blogspot.com/p/published-packages.html)
  - To process the latest dump yourself, use a library like [WiToKit](https://github.com/akb89/witokit)
- [Leipzig Corpora](https://wortschatz.uni-leipzig.de/en/download/Urdu)
- [Maḵẖzan](https://github.com/zeerakahmed/makhzan)
- Commercial-licensed corpora
  - [UrduWaC-2010 and urTenTen-2018, SketchEngine](https://www.sketchengine.eu/corpora-and-languages/urdu-text-corpora/)
  - [A Gold Standard Urdu Raw Text Corpus, LDCIL](https://data.ldcil.org/a-gold-standard-urdu-raw-text-corpus)
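
For readers who prefer to process a Wikipedia dump without an extra dependency, the following is a minimal sketch using only the Python standard library. It assumes the dump is a bz2-compressed MediaWiki XML export; the function name `iter_wiki_pages` and the sample pages are illustrative and not part of any listed project. The output is raw wikitext, so markup still has to be stripped separately.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_wiki_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page to keep memory flat


# Tiny hypothetical sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>اردو</title><revision><text>اردو ایک زبان ہے۔</text></revision></page>
  <page><title>لاہور</title><revision><text>لاہور ایک شہر ہے۔</text></revision></page>
</mediawiki>"""

compressed = bz2.compress(SAMPLE.encode("utf-8"))
with bz2.open(io.BytesIO(compressed), "rb") as fh:
    pages = list(iter_wiki_pages(fh))
```

With a downloaded dump, the in-memory buffer would be replaced by the path of the file, for example `bz2.open("urwiki-latest-pages-articles.xml.bz2", "rb")` (the file name is an assumption; check the dump index for the actual name).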

### Urdu Sentiment Datasets

- [Urdu IMDb Movie Reviews](https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews) - IMDB Movie Reviews data in Urdu
- [Urdu Sentiment Benchmark, 2020](https://github.com/MuhammadYaseenKhan/Urdu-Sentiment-Corpus)
- [2010 Disaster Response Messages](https://huggingface.co/datasets/disaster_response_messages)
- Lexicon
  - [Urdu Sentiment Lexicon](https://chaoticity.com/urdu-sentiment-lexicon/)
  - [Sentiment Polarity Lexicons, 2017](https://www.kaggle.com/rtatman/sentiment-lexicons-for-81-languages)
- Roman Urdu
  - [Hate Speech & Offensive Language Detection, 2020](https://github.com/haroonshakeel/roman_urdu_hate_speech) - 10k tweets
  - [UCI Roman-Urdu Sentiment Classification, 2018](https://archive.ics.uci.edu/ml/datasets/Roman+Urdu+Data+Set) - 20k records
  - [Did You Offend Me? Classification of Offensive Tweets, 2018](https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification/tree/b8433ff1ebb885bd657f5117eab6bd3798f20408) - 3k tweets

### Urdu OCR Datasets

- [Qaida](https://github.com/AtiqueUrRehman/qaida) - Synthetic datasets and pre-trained models
- [U-HAT](https://www.kaggle.com/hazrat/uhat-urdu-handwritten-text-dataset) - Urdu Hand-Written Text Dataset
- [45K+ Clean-Background-Urdu-Ligatures-Dataset, 2019](https://github.com/UltramindSoft/45K-Clean-Background-Urdu-Ligatures-Dataset)
- [IIIT-Hyderabad: Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 2017](https://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-urdu-ocr)
- [CLE Pakistan Urdu Image Corpora](https://www.cle.org.pk/clestore/imagecorpora.htm) (Corresponding [texts](https://www.cle.org.pk/clestore/index.htm))
- [Cursive-Text: A Benchmark for Urdu Text Recognition in Natural Scene Images, 2020](https://www.sciencedirect.com/science/article/pii/S2352340920306430) - 2500 images, email for dataset

### Urdu Parallel Corpora for Machine Translation

- [OPUS Corpora](https://opus.nlpl.eu/) (Select en->ur)
  - Contains: [CC-Aligned](http://www.statmt.org/cc-aligned/), [Tanzil](http://tanzil.net/trans/), [JW300](https://www.aclweb.org/anthology/P19-1310/), [OpenSubtitles](https://www.aclweb.org/anthology/L16-1147/), [TED](https://www.ted.com/participate/translate), [QED](https://www.aclweb.org/anthology/L14-1675/), etc.
- [IIIT-Hyderabad MT Bhasha](http://preon.iiit.ac.in/~jerin/bhasha/)
  - Contains *Mann ki Baat* and *Press Information Bureau* datasets
- [PM India Parallel Corpus](http://data.statmt.org/pmindia/)
- [English-Urdu Religious Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2582)
- [Anuvaad Parallel Corpora](https://github.com/project-anuvaad/anuvaad-parallel-corpus)
- [MechanicalTurks 2012 Parallel Corpora](https://github.com/joshua-decoder/indian-parallel-corpora)
- [Urdu-Nepali-English Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm) ([Test set here](https://www.cle.org.pk/software/ling_resources/testingcorpusmt.htm))
- [Cross-Language English-Urdu (CLEU) Corpus, 2018](http://ucrel.lancs.ac.uk/textreuse/cleu.php)
- [Flickr 8k Benchmark](https://forms.illinois.edu/sec/1713398) - 2.7k sentences
- [Universal Declaration of Human Rights (benchmark)](https://unicode.org/udhr/translations.html)
- Commercial-licensed corpora
  - [EMILLE/CIIL Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0037/) - Contains monolingual data as well
  - [National Platform for Language Technology](https://nplt.in/demo/index.php?route=product/category&path=75_59&limit=100)
  - [Technology Development for Indian Languages](https://tdil-dc.in/index.php?option=com_download&task=fsearch&Itemid=547&lang=en) (Search "Urdu Corpus")

### Urdu Transliteration Datasets

- [Google Dakshina, 2020](https://github.com/google-research-datasets/dakshina)
- [TRANSLIT: A Large-scale Name Transliteration Resource, 2020](https://github.com/fbenites/TRANSLIT)
- [Roman to Urdu Transliteration Sentences, 2020](https://sci-hub.se/10.1142/s0218001421520017) (Drive Link available on request)
- [Roman-Urdu Conversion Data](https://github.com/Smat26/Roman-Urdu-Dataset#conversion)
- [Trilingual Ur-RomUr-Eng Dict, 2019](https://github.com/MoizRauf/Urdu--Roman-Urdu--English--Dictionary)

### Urdu Lexical Resources

- [Offline Eng-Urd Dictionary DB](https://github.com/YESALAM/UrduDictionary)
- [UrduHack Words-List](https://github.com/urduhack/urdu-words) - Includes N-grams, NER Labels
- [CLE Urdu WordNet](https://www.cle.org.pk/clestore/urduwordnet.htm) ([Demo](http://wordnet.cle.org.pk/), [PDF](https://www.cle.org.pk/software/ling_resources/UrduWordNetWordlist.htm))
- CLE Urdu [Verb List](https://www.cle.org.pk/software/ling_resources/urduverblist.htm), [Words List](https://www.cle.org.pk/software/ling_resources/wordlist.htm), [Most Frequent Words](https://www.cle.org.pk/software/ling_resources/UrduHighFreqWords.htm)
- [IndoWordnet Parallel Corpus](https://github.com/anoopkunchukuttan/indowordnet_parallel) (API - [pyiwn](https://github.com/riteshpanjwani/pyiwn), [Demo](https://www.cfilt.iitb.ac.in/indowordnet/))
- [MTurks-10k Multilingual Dictionary, 2014](https://github.com/AI4Bharat/indicnlp_catalog/issues/21)
- [Microsoft IT Terminology](https://www.microsoft.com/en-us/language/Terminology)
- [Urdu N-grams, 2020](https://www.kaggle.com/tafseerahmed/urdu-ngrams) - Uni-Gram, Bi-Gram, Tri-Gram and Tetra-Gram
- [CLE Urdu Books N-Grams](https://www.cle.org.pk/clestore/cleurdungrams.htm)
- [Roman Urdu Lexical Normalization, 2019](https://github.com/abdulrafae/normalization)
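
As a rough illustration of how n-gram counts like those in the resources above can be derived from raw text, here is a minimal sketch. It assumes plain whitespace tokenization, which is a simplification for Urdu, where word boundaries do not always coincide with spaces. The function name `ngram_counts` and the two sample sentences are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count word n-grams over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: spaces separate words
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Two short sample sentences ("this is a book", "this is a pen").
corpus = ["یہ ایک کتاب ہے", "یہ ایک قلم ہے"]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
```

For real corpora, a dedicated word segmenter would normally replace `str.split`.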

### Urdu Speech Datasets

- [Urdu 250 Isolated Words, 2018](https://www.kaggle.com/hazrat/urdu-speech-dataset)
- [CLE Phonetically Rich Urdu Speech Corpus](https://www.cle.org.pk/software/ling_resources/phoneticallyrichurduspeechcorpus.htm)
- [CMU Wilderness Speech Dataset, 2019](http://www.festvox.org/cmu_wilderness/)
- [FCBH Recordings](https://www.faithcomesbyhearing.com/audio-bible-resources/recordings-database)
- [LibriVox AudioBooks](https://librivox.org/search?primary_key=63&search_category=language&search_page=1&search_form=get_results)
- Commercial-licensed corpora
  - [CLE Pakistan Urdu Speech Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-S0403/) ([Main website](https://www.cle.org.pk/clestore/speechcorpus.htm))
  - [LDC UPenn Datasets](https://catalog.ldc.upenn.edu/) - Filter search by selecting language
  - [Urdu Raw Speech Corpus, LDCIL](https://data.ldcil.org/urdu-raw-speech-corpus)
  - [LDCIL ASR Corpus](https://www.ldcil.org/resourcesSpeechCorp.aspx)
- Emotion
  - [Urdu-Sindhi Speech Emotion Corpus, 2020](https://zenodo.org/record/3685274) ([Paper](https://thesai.org/Downloads/Volume11No4/Paper_104-Introducing_the_Urdu_Sindhi_Speech_Emotion_Corpus.pdf))
  - [Speech Emotion Recognition Benchmark, 2018](https://www.kaggle.com/bitlord/urdu-language-speech-dataset)

### Cross-lingual Datasets

- [Cross-lingual Natural Language Inference (XNLI) Corpus, 2020](https://github.com/facebookresearch/XNLI)
- [Google XTrEME Benchmark, 2020](https://github.com/google-research/xtreme) - Evaluation of cross-lingual generalization of multilingual models
- [Urdu-Punjabi Pairs, Apertium](https://github.com/hsumerf/urdu-panjabi-pair)

---

## Urdu NLP Tools, Libraries and Models

- [UrduHack](https://pypi.org/project/urduhack/)
- [PronouncUR](https://github.com/harisbinzia/PronouncUR) - Converts Urdu words to a pronunciation format
- [iNLTK](https://inltk.readthedocs.io/)
- [Indic PoS/NER Tagger](https://github.com/avineshpvs/indic_tagger)
- [Urdu Morphological Analyzer, IIIT Hyderabad](http://ltrc.iiit.ac.in/showfile.php?filename=downloads/UrduResources/UrduMorphAnalyser.php)
- [EasyOCR](https://github.com/JaidedAI/EasyOCR)

### Language Models

- [HuggingFace Models](https://huggingface.co/models?filter=ur&pipeline_tag=fill-mask)
- [Google Multilingual-T5, 2020](https://github.com/google-research/multilingual-t5)
- [Google MuRIL, 2020](https://tfhub.dev/google/MuRIL/)
- [iNLTK Models, 2019](https://github.com/anuragshas/nlp-for-urdu)
- [XLM-RoBERTa, 2019](https://github.com/facebookresearch/XLM)
- [Multilingual BERT, 2019](https://github.com/google-research/bert/blob/master/multilingual.md)

### Word Embeddings

- [UrduHack Word-Vectors, 2019](https://github.com/urduhack/urdu-word-vectors) - Word2Vec and FastText models
- Facebook FastText models: [Wiki-2016](https://fasttext.cc/docs/en/pretrained-vectors.html), [CC+Wiki-2017](https://fasttext.cc/docs/en/crawl-vectors.html), [Multilingual Aligned, 2017](https://github.com/babylonhealth/fastText_multilingual)
- [BPEmb: Subword Embeddings, 2017](https://nlp.h-its.org/bpemb/) ([Multilingual Aligned](https://nlp.h-its.org/bpemb/multi/))
- [ConceptNet Embeddings, 2017](https://github.com/commonsense/conceptnet-numberbatch)
- [Polyglot Embeddings, 2013](https://sites.google.com/site/rmyeid/projects/polyglot)

### Translation Models

- [IL-Multi, 2020](https://github.com/jerinphilip/ilmulti)
- [Facebook M2M-100, 2020](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100)
- [Python Translators Services](https://github.com/UlionTse/translators) - Library to use Google, Bing, etc. translators for free

### Transliteration Libraries

- [PolyGlot](https://polyglot.readthedocs.io/en/latest/Transliteration.html)
- [LibIndicTrans](https://github.com/libindic/indic-trans) - Transliterate Roman/Hindi to Urdu and vice-versa
- [Aksharamukha](http://aksharamukha.appspot.com/python) - Devanagari (Hindi) to Urdu script converter
- [Google Transliterate API](https://pypi.org/project/google-transliteration-api/) - Roman Urdu to Perso-Arabic

---

## Online Resources/Services

- [Pakistani Center for Language Engineering - Online Services](https://tech.cle.org.pk/)

### Urdu News websites

- [JANG Group](http://www.jang.com.pk/)
- [BBC Urdu](http://www.bbcurdu.com/)
- [Voice of America Urdu](http://www.voanews.com/urdu)
- [Nawa-i-Waqt Group](http://www.nawaiwaqt.com.pk)
- [Urdu Point Network](http://www.urdupoint.com/)

[More news websites](https://github.com/divkakwani/awesome-newspapers/blob/main/newspapers/ur.csv)...

### Dictionaries

- [ur.oxforddictionaries.com](https://ur.oxforddictionaries.com/) - Oxford Dictionary
- [English Urdu Dictionary](http://www.urduword.com) - English Urdu Dictionary
- [Urdu English Dictionary 2](http://www.urduenglishdictionary.org) - Urdu English Dictionary 2