{"id":13722669,"url":"https://github.com/KurdishBLARK/InterdialectCorpus","last_synced_at":"2025-05-07T16:30:50.562Z","repository":{"id":113708516,"uuid":"286140690","full_name":"KurdishBLARK/InterdialectCorpus","owner":"KurdishBLARK","description":"A parallel corpus of Sorani, Kurmanji and English","archived":false,"fork":false,"pushed_at":"2020-10-06T01:17:58.000Z","size":22796,"stargazers_count":5,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-01-28T23:08:49.251Z","etag":null,"topics":["corpus","kurdish","kurdish-language-processing","machine-translation","natural-language-processing","parallel-corpus"],"latest_commit_sha":null,"homepage":"https://kurdishblark.github.io/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KurdishBLARK.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-08-09T00:30:35.000Z","updated_at":"2022-12-31T03:39:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"37fbe6e1-9342-4c21-9f8b-4ec3b9fff466","html_url":"https://github.com/KurdishBLARK/InterdialectCorpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KurdishBLARK%2FInterdialectCorpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KurdishBLARK%2FInterdialectCorpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KurdishBLARK%2FInterdialectCorpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/KurdishBLARK%2FInterdialectCorpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KurdishBLARK","download_url":"https://codeload.github.com/KurdishBLARK/InterdialectCorpus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252915205,"owners_count":21824520,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","kurdish","kurdish-language-processing","machine-translation","natural-language-processing","parallel-corpus"],"created_at":"2024-08-03T01:01:31.492Z","updated_at":"2025-05-07T16:30:45.524Z","avatar_url":"https://github.com/KurdishBLARK.png","language":null,"funding_links":[],"categories":["Development"],"sub_categories":["Resources"],"readme":"# Kurdish Parallel Corpus\n## A parallel corpus of Kurdish (Sorani and Kurmanji) and English\n\nThis repository contains a parallel corpus of the Sorani (`ckb`) and Kurmanji (`kmr`) dialects of Kurdish, along with English (`eng`). The development of the corpus is described in [our paper](https://arxiv.org/abs/2010.01554). Our approach consists of retrieving potentially alignable news articles from multilingual websites, semi-automatically aligning sentences across dialects and languages based on lexical similarity and transliteration of scripts, and manually annotating correct translation pairs. 
For further information, please see the **[annotation guidelines](https://github.com/KurdishBLARK/InterdialectCorpus/tree/master/X_Guidelines)**.\n\nOur parallel corpus contains three manually aligned corpora (Sorani-Kurmanji, Sorani-English and Kurmanji-English) in various formats, namely the [Translation Memory eXchange](https://en.wikipedia.org/wiki/Translation_Memory_eXchange) file format (`.tmx`), parallel annotated text for use with [ParaConc](https://paraconc.com/), and raw parallel texts (`.txt`). In the raw parallel texts, each line corresponds to the same line in the other aligned file.\n\nThis corpus contains **12,327** translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji. We also provide **1,797** and **650** translation pairs in English-Kurmanji and English-Sorani, respectively. \n\n## Download\n\nYou can clone this repository or download individual directories as follows:\n\n- **[Sorani-English](https://github.com/KurdishBLARK/InterdialectCorpus/tree/master/CKB-ENG)**\n- **[Kurmanji-English](https://github.com/KurdishBLARK/InterdialectCorpus/tree/master/KMR-ENG)**\n- **[Sorani-Kurmanji](https://github.com/KurdishBLARK/InterdialectCorpus/tree/master/CKB-KMR)**\n\n\nIn addition, the statistical models reported as the baseline in our paper are provided in the [Moses-results](https://github.com/KurdishBLARK/InterdialectCorpus/tree/master/Moses-results) directory.\n\n## Cite this paper\n\nIf you use any part of the data, please consider citing **[this paper](https://arxiv.org/abs/2010.01554)** as follows:\n\n\t@misc{ahmadi2020leveraging,\n\t      title={Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus}, \n\t      author={Sina Ahmadi and Hossein Hassani and Daban Q. 
Jaff},\n\t      year={2020},\n\t      eprint={2010.01554},\n\t      archivePrefix={arXiv},\n\t      primaryClass={cs.CL}\n\t}\n\nA link to the published version of the paper will also be provided later.\n\n## License\n\nThis corpus is available under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://github.com/KurdishBLARK/InterdialectCorpus/blob/master/LICENSE) license. For further information about commercial use, please get in touch with the [contributors](https://github.com/KurdishBLARK/InterdialectCorpus/graphs/contributors).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FKurdishBLARK%2FInterdialectCorpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FKurdishBLARK%2FInterdialectCorpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FKurdishBLARK%2FInterdialectCorpus/lists"}