{"id":6223,"url":"https://github.com/urduhack/awesome-urdu","name":"awesome-urdu","description":"📖 A curated list of resources dedicated to Urdu language.","projects_count":120,"last_synced_at":"2026-05-31T16:00:26.436Z","repository":{"id":41579807,"uuid":"142857278","full_name":"urduhack/awesome-urdu","owner":"urduhack","description":"📖 A curated list of resources dedicated to Urdu language.","archived":false,"fork":false,"pushed_at":"2021-05-11T07:42:31.000Z","size":22,"stargazers_count":78,"open_issues_count":1,"forks_count":15,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-05-15T03:03:15.701Z","etag":null,"topics":["awsome","awsome-list","awsome-urdu","dictionaries","english-urdu-dictionary","urdu","urdu-datasets","urdu-language"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/urduhack.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-30T09:53:28.000Z","updated_at":"2026-05-07T06:42:07.000Z","dependencies_parsed_at":"2022-08-25T18:52:04.544Z","dependency_job_id":null,"html_url":"https://github.com/urduhack/awesome-urdu","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/urduhack/awesome-urdu","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/urduhack%2Fawesome-urdu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/urduhack%2Fawesome-urdu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/urduhack%2Fawesome-urdu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/urduhack%2Fawesome-urdu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/urduhack","download_url":"https://codeload.github.com/urduhack/awesome-urdu/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/urduhack%2Fawesome-urdu/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33737692,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"created_at":"2024-01-07T05:05:13.432Z","updated_at":"2026-05-31T16:00:26.437Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["Urdu NLP Tools, Libraries and Models","Urdu Datasets","Online Resources/Services"],"sub_categories":["Word Embeddings","General NLP Datasets","Urdu Parallel Corpora for Machine Translation","Language Models","Transliteration Libraries","Urdu Named-Entity Recognition","Urdu Lexical Resources","Urdu Monolingual Corpora","Urdu Sentiment Datasets","Urdu Text Classification","Urdu OCR Datasets","Urdu Transliteration Datasets","Urdu Speech Datasets","Cross-lingual Datasets","Translation Models","Urdu News websites","Dictionaries"],"readme":"# Awesome Urdu 
[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\n\n\u003e A curated list of resources dedicated to Urdu language.\n\u003e\n\u003e Maintainers - [Ikram Ali](https://github.com/akkefa)\n\n*Please read the [contribution guidelines](contributing.md) before contributing.*\n\nPlease feel free to create [pull requests](https://github.com/urduhack/awesome-urdu/pulls)\n\n## Urdu Datasets\n\n### General NLP Datasets\n\n- [Web news Data](https://github.com/urduhack/) - Urdu Web news Data\n- [Roman Urdu Dataset](https://github.com/Smat26/Roman-Urdu-Dataset) - Data for sentiment analysis, along with misc compiled data for Roman Urdu\n- [Collection of Urdu Datasets](https://github.com/mirfan899/Urdu) - Datasets for POS, NER and NLP tasks\n- [Urdu Universal Dependency Treebank](https://github.com/UniversalDependencies/UD_Urdu-UDTB)\n- [UrduSummary Corpus Benchmark, 2016](https://github.com/humsha/USCorpus)\n- [Rekhta Ghazals](https://github.com/amir9ume/urdu_ghazals_rekhta)\n- [Urdu Paraphrase Plagiarism Corpus, 2016](http://ucrel.lancs.ac.uk/textreuse/uppc.php)\n  - Derived from: [COrpus of Urdu News TExt Reuse (CoUNTeR), 2016](http://ucrel.lancs.ac.uk/textreuse/counter.php)\n  - Extension: [Urdu Short Text Reuse Corpus (USTRC), 2018](http://ucrel.lancs.ac.uk/textreuse/ustrc.php)\n- [TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages](https://zenodo.org/record/3707949)\n- [Flickr8k Urdu Image-Caption Generation Dataset, 2020](https://github.com/abdullahzia510/Effecient-Urdu-Caption-Generation-using-Attention-Mechanism)\n- [mLAMA: multilingual LAnguage Model Analysis, 2021](https://github.com/norakassner/mlama)\n- [Urdu Word Segmentation using CRF, 2018](https://github.com/harisbinzia/Urdu-Word-Segmentation)\n- [Apertium linguistic data for Urdu](https://github.com/apertium/apertium-urd)\n\n### Urdu Text Classification\n\n- [Fake News Classification, 
2020](https://github.com/MaazAmjad/Datasets-for-Urdu-news) ([Old Version](https://github.com/MaazAmjad/Urdu-News-Augmented-Dataset))\n- [iNLTK Urdu News Headlines Classification Benchmark, 2020](https://www.kaggle.com/disisbig/urdu-news-dataset)\n- [Express News Headlines+Summary, 2019](https://github.com/mwaseemrandhawa/Urdu-News-Headline-Dataset)\n\n### Urdu Named-Entity Recognition\n\n- [MK-PUCIT-NER, 2019](https://www.kaggle.com/safiakanwal/mkpucit-ner-dataet)\n- [WikiAnn: Cross-lingual Name Tagging \u0026 Linking, 2017](https://elisa-ie.github.io/wikiann/)\n\n### Urdu Monolingual Corpora\n\n- [UFAL Corpus, 2014](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5) - 5.4M sentences (with POS tags)\n- CommonCrawl\n  - [OSCAR Corpus, 2020](https://oscar-corpus.com/)\n  - [CC-100 Corpus, 2019](http://data.statmt.org/cc-100/) - CC crawls from Jan-Dec 2018\n  - [WMT Raw 2017](http://data.statmt.org/ngrams/raw/) - CC crawls from 2012-2016\n- [WikiDumps](https://dumps.wikimedia.org/urwiki/)\n  - Processed Dumps: [iNLTK Wiki Articles, 2020](https://www.kaggle.com/disisbig/urdu-wikipedia-articles), [Tatoeba Challenge, 2020](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/Backtranslations.md), [2016 UrduWikiCorpus](http://urdu-corpus.blogspot.com/p/published-packages.html)\n  - To process the latest dump yourself, use a library like [WiToKit](https://github.com/akb89/witokit)\n- [Leipzig Corpora](https://wortschatz.uni-leipzig.de/en/download/Urdu)\n- [Maḵẖzan](https://github.com/zeerakahmed/makhzan)\n- Commercial-licensed corpora\n  - [UrduWaC-2010 and urTenTen-2018, SketchEngine](https://www.sketchengine.eu/corpora-and-languages/urdu-text-corpora/)\n  - [A Gold Standard Urdu Raw Text Corpus, LDCIL](https://data.ldcil.org/a-gold-standard-urdu-raw-text-corpus)\n\n### Urdu Sentiment Datasets\n\n- [Urdu IMDb Movie Reviews](https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews) - IMDB Movie Reviews 
data in Urdu\n- [Urdu Sentiment Benchmark, 2020](https://github.com/MuhammadYaseenKhan/Urdu-Sentiment-Corpus)\n- [2010 Disaster Response Messages](https://huggingface.co/datasets/disaster_response_messages)\n- Lexicon\n  - [Urdu Sentiment Lexicon](https://chaoticity.com/urdu-sentiment-lexicon/)\n  - [Sentiment Polarity Lexicons, 2017](https://www.kaggle.com/rtatman/sentiment-lexicons-for-81-languages)\n- Roman Urdu\n  - [Hate Speech \u0026 Offensive Language Detection, 2020](https://github.com/haroonshakeel/roman_urdu_hate_speech) - 10k tweets\n  - [UCI Roman-Urdu Sentiment Classification, 2018](https://archive.ics.uci.edu/ml/datasets/Roman+Urdu+Data+Set) - 20k records\n  - [Did You Offend Me? Classification of Offensive Tweets, 2018](https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification/tree/b8433ff1ebb885bd657f5117eab6bd3798f20408) - 3k tweets\n\n### Urdu OCR Datasets\n\n- [Qaida](https://github.com/AtiqueUrRehman/qaida) - Synthetic datasets and pre-trained models\n- [U-HAT](https://www.kaggle.com/hazrat/uhat-urdu-handwritten-text-dataset) - Urdu Hand-Written Text Dataset\n- [45K+ Clean-Background-Urdu-Ligatures-Dataset, 2019](https://github.com/UltramindSoft/45K-Clean-Background-Urdu-Ligatures-Dataset)\n- [IIIT-Hyderabad: Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 2017](https://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-urdu-ocr)\n- [CLE Pakistan Urdu Image Corpora](https://www.cle.org.pk/clestore/imagecorpora.htm) (Corresponding [texts](https://www.cle.org.pk/clestore/index.htm))\n- [Cursive-Text: A Benchmark for Urdu Text Recognition in Natural Scene Images, 2020](https://www.sciencedirect.com/science/article/pii/S2352340920306430) - 2500 images, email for dataset\n\n### Urdu Parallel Corpora for Machine Translation\n\n- [OPUS Corpora](https://opus.nlpl.eu/) (Select en-\u003eur)\n  - Contains: [CC-Aligned](http://www.statmt.org/cc-aligned/), [Tanzil](http://tanzil.net/trans/), 
[JW300](https://www.aclweb.org/anthology/P19-1310/), [OpenSubtitles](https://www.aclweb.org/anthology/L16-1147/), [TED](https://www.ted.com/participate/translate), [QED](https://www.aclweb.org/anthology/L14-1675/), etc.\n- [IIIT-Hyderabad MT Bhasha](http://preon.iiit.ac.in/~jerin/bhasha/)\n  - Contains *Mann ki Baat* and *Press Information Bureau* datasets\n- [PM India Parallel Corpus](http://data.statmt.org/pmindia/)\n- [English-Urdu Religious Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2582)\n- [Anuvaad Parallel Corpora](https://github.com/project-anuvaad/anuvaad-parallel-corpus)\n- [MechanicalTurks 2012 Parallel Corpora](https://github.com/joshua-decoder/indian-parallel-corpora)\n- [Urdu-Nepali-English Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm) ([Test set here](https://www.cle.org.pk/software/ling_resources/testingcorpusmt.htm))\n- [Cross-Language English-Urdu (CLEU) Corpus, 2018](http://ucrel.lancs.ac.uk/textreuse/cleu.php)\n- [Flickr 8k Benchmark](https://forms.illinois.edu/sec/1713398) - 2.7k sentences\n- [Universal Declaration of Human Rights (benchmark)](https://unicode.org/udhr/translations.html)\n- Commercial-licensed corpora\n  - [EMILLE/CIIL Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0037/) - Contains monolingual data as well\n  - [National Platform for Language Technology](https://nplt.in/demo/index.php?route=product/category\u0026path=75_59\u0026limit=100)\n  - [Technology Development for Indian Languages](https://tdil-dc.in/index.php?option=com_download\u0026task=fsearch\u0026Itemid=547\u0026lang=en) (Search \"Urdu Corpus\")\n\n### Urdu Transliteration Datasets\n\n- [Google Dakshina, 2020](https://github.com/google-research-datasets/dakshina)\n- [TRANSLIT: A Large-scale Name Transliteration Resource, 2020](https://github.com/fbenites/TRANSLIT)\n- [Roman to Urdu Transliteration Sentences, 2020](https://sci-hub.se/10.1142/s0218001421520017) 
(Drive Link available on request)\n- [Roman-Urdu Conversion Data](https://github.com/Smat26/Roman-Urdu-Dataset#conversion)\n- [Trilingual Ur-RomUr-Eng Dict, 2019](https://github.com/MoizRauf/Urdu--Roman-Urdu--English--Dictionary)\n\n### Urdu Lexical Resources\n\n- [Offline Eng-Urd Dictionary DB](https://github.com/YESALAM/UrduDictionary)\n- [UrduHack Words-List](https://github.com/urduhack/urdu-words) - Includes N-grams, NER Labels\n- [CLE Urdu WordNet](https://www.cle.org.pk/clestore/urduwordnet.htm) ([Demo](http://wordnet.cle.org.pk/), [PDF](https://www.cle.org.pk/software/ling_resources/UrduWordNetWordlist.htm))\n- CLE Urdu [Verb List](https://www.cle.org.pk/software/ling_resources/urduverblist.htm), [Words List](https://www.cle.org.pk/software/ling_resources/wordlist.htm), [Most Frequent Words](https://www.cle.org.pk/software/ling_resources/UrduHighFreqWords.htm)\n- [IndoWordnet Parallel Corpus](https://github.com/anoopkunchukuttan/indowordnet_parallel) (API - [pyiwn](https://github.com/riteshpanjwani/pyiwn), [Demo](https://www.cfilt.iitb.ac.in/indowordnet/))\n- [MTurks-10k Multilingual Dictionary, 2014](https://github.com/AI4Bharat/indicnlp_catalog/issues/21)\n- [Microsoft IT Terminology](https://www.microsoft.com/en-us/language/Terminology)\n- [Urdu N-grams, 2020](https://www.kaggle.com/tafseerahmed/urdu-ngrams) - Uni-Gram, Bi-Gram, Tri-Gram and Tetra-Gram\n- [CLE Urdu Books N-Grams](https://www.cle.org.pk/clestore/cleurdungrams.htm)\n- [Roman Urdu Lexical Normalization, 2019](https://github.com/abdulrafae/normalization)\n\n### Urdu Speech Datasets\n\n- [Urdu 250 Isolated Words, 2018](https://www.kaggle.com/hazrat/urdu-speech-dataset)\n- [CLE Phonetically Rich Urdu Speech Corpus](https://www.cle.org.pk/software/ling_resources/phoneticallyrichurduspeechcorpus.htm)\n- [CMU Wilderness Speech Dataset, 2019](http://www.festvox.org/cmu_wilderness/)\n- [FCBH Recordings](https://www.faithcomesbyhearing.com/audio-bible-resources/recordings-database)\n- [LibriVox 
AudioBooks](https://librivox.org/search?primary_key=63\u0026search_category=language\u0026search_page=1\u0026search_form=get_results)\n- Commercial-licensed corpora\n  - [CLE Pakistan Urdu Speech Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-S0403/) ([Main website](https://www.cle.org.pk/clestore/speechcorpus.htm))\n  - [LDC UPenn Datasets](https://catalog.ldc.upenn.edu/) - Filter search by selecting language\n  - [Urdu Raw Speech Corpus, LDCIL](https://data.ldcil.org/urdu-raw-speech-corpus)\n  - [LDCIL ASR Corpus](https://www.ldcil.org/resourcesSpeechCorp.aspx)\n- Emotion\n  - [Urdu-Sindhi Speech Emotion Corpus, 2020](https://zenodo.org/record/3685274) ([Paper](https://thesai.org/Downloads/Volume11No4/Paper_104-Introducing_the_Urdu_Sindhi_Speech_Emotion_Corpus.pdf))\n  - [Speech Emotion Recognition Benchmark, 2018](https://www.kaggle.com/bitlord/urdu-language-speech-dataset)\n\n### Cross-lingual Datasets\n\n- [Cross-lingual Natural Language Inference (XNLI) Corpus, 2020](https://github.com/facebookresearch/XNLI)\n- [Google XTREME Benchmark, 2020](https://github.com/google-research/xtreme) - Evaluation of cross-lingual generalization of multilingual models\n- [Urdu-Punjabi Pairs, Apertium](https://github.com/hsumerf/urdu-panjabi-pair)\n\n---\n\n## Urdu NLP Tools, Libraries and Models\n\n- [UrduHack](https://pypi.org/project/urduhack/)\n- [PronouncUR](https://github.com/harisbinzia/PronouncUR) - Urdu words to pronunciation format\n- [iNLTK](https://inltk.readthedocs.io/)\n- [Indic PoS/NER Tagger](https://github.com/avineshpvs/indic_tagger)\n- [Urdu Morphological Analyzer, IIIT Hyderabad](http://ltrc.iiit.ac.in/showfile.php?filename=downloads/UrduResources/UrduMorphAnalyser.php)\n- [EasyOCR](https://github.com/JaidedAI/EasyOCR)\n\n### Language Models\n\n- [HuggingFace Models](https://huggingface.co/models?filter=ur\u0026pipeline_tag=fill-mask)\n- [Google Multilingual-T5, 2020](https://github.com/google-research/multilingual-t5)\n- [Google MuRIL, 
2020](https://tfhub.dev/google/MuRIL/)\n- [iNLTK Models, 2019](https://github.com/anuragshas/nlp-for-urdu)\n- [XLM-RoBERTa, 2019](https://github.com/facebookresearch/XLM)\n- [Multilingual BERT, 2019](https://github.com/google-research/bert/blob/master/multilingual.md)\n\n### Word Embeddings\n\n- [UrduHack Word-Vectors, 2019](https://github.com/urduhack/urdu-word-vectors) - Word2Vec and FastText models\n- Facebook FastText models: [Wiki-2016](https://fasttext.cc/docs/en/pretrained-vectors.html), [CC+Wiki-2017](https://fasttext.cc/docs/en/crawl-vectors.html), [Multilingual Aligned, 2017](https://github.com/babylonhealth/fastText_multilingual)\n- [BPEmb: Subword Embeddings, 2017](https://nlp.h-its.org/bpemb/) ([Multilingual Aligned](https://nlp.h-its.org/bpemb/multi/))\n- [ConceptNet Embeddings, 2017](https://github.com/commonsense/conceptnet-numberbatch)\n- [Polyglot Embeddings, 2013](https://sites.google.com/site/rmyeid/projects/polyglot)\n\n### Translation Models\n\n- [IL-Multi, 2020](https://github.com/jerinphilip/ilmulti)\n- [Facebook M2M-100, 2020](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100)\n- [Python Translators Services](https://github.com/UlionTse/translators) - Library to use Google, Bing, etc. 
translators for free\n\n### Transliteration Libraries\n\n- [PolyGlot](https://polyglot.readthedocs.io/en/latest/Transliteration.html)\n- [LibIndicTrans](https://github.com/libindic/indic-trans) - Transliterate Roman/Hindi to Urdu and vice-versa\n- [Aksharamukha](http://aksharamukha.appspot.com/python) - Devanagari (Hindi) to Urdu script converter\n- [Google Transliterate API](https://pypi.org/project/google-transliteration-api/) - Roman Urdu to Perso-Arabic\n\n---\n\n## Online Resources/Services\n\n- [Pakistani Center for Language Engineering - Online Services](https://tech.cle.org.pk/)\n\n### Urdu News websites\n\n- [JANG Group](http://www.jang.com.pk/)\n- [BBC Urdu](http://www.bbcurdu.com/)\n- [Voice of America Urdu](http://www.voanews.com/urdu)\n- [Nawa-i-Waqt Group](http://www.nawaiwaqt.com.pk)\n- [Urdu Point Network](http://www.urdupoint.com/)\n\n[More news websites](https://github.com/divkakwani/awesome-newspapers/blob/main/newspapers/ur.csv)...\n\n### Dictionaries\n\n- [ur.oxforddictionaries.com](https://ur.oxforddictionaries.com/) - Oxford Dictionary\n- [English Urdu Dictionary](http://www.urduword.com) - English Urdu Dictionary\n- [Urdu English Dictionary 2](http://www.urduenglishdictionary.org) - Urdu English Dictionary 2\n","projects_url":"https://awesome.ecosyste.ms/api/v1/lists/urduhack%2Fawesome-urdu/projects"}