{"id":41848072,"url":"https://github.com/camel-lab/wild_diacritics","last_synced_at":"2026-01-25T10:05:57.928Z","repository":{"id":243822540,"uuid":"810298634","full_name":"CAMeL-Lab/wild_diacritics","owner":"CAMeL-Lab","description":"Wild Diacritics paper repo.","archived":false,"fork":false,"pushed_at":"2024-08-09T13:37:20.000Z","size":88440,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-09T22:06:18.700Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CAMeL-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE_CC_BY_SA","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-06-04T12:27:47.000Z","updated_at":"2025-02-03T04:01:12.000Z","dependencies_parsed_at":"2024-06-11T11:16:32.328Z","dependency_job_id":"827291fa-1fef-410c-b61c-db4a1db46a8a","html_url":"https://github.com/CAMeL-Lab/wild_diacritics","commit_stats":null,"previous_names":["camel-lab/wild_diacritics"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/CAMeL-Lab/wild_diacritics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fwild_diacritics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fwild_diacritics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fwild_diacritics/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fwild_diacritics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CAMeL-Lab","download_url":"https://codeload.github.com/CAMeL-Lab/wild_diacritics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fwild_diacritics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28751116,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T09:58:17.166Z","status":"ssl_error","status_checked_at":"2026-01-25T09:55:56.104Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-25T10:05:57.259Z","updated_at":"2026-01-25T10:05:57.919Z","avatar_url":"https://github.com/CAMeL-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Wild Diacritics\n\n## About\n\nThis repo contains code and data relating to the\n['Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization'](https://arxiv.org/abs/2406.05760#)\npaper published in the proceedings of\n[ACL 2024](https://2024.aclweb.org/).\n\n## Data\n\nThe files for the Wild2Max and WikiNewsMax datasets can all be found in the\n[data](./data) directory.\n\nIf you just need the datasets, you can find zipped 
versions\nof the datasets in the\n[releases page](https://github.com/CAMeL-Lab/wild_diacritics/releases).\n\n## Code\n\nYou can find the helper scripts used to generate all the numbers in the\npaper in the [wilddiacs_utils](./code/wilddiacs_utils) directory.\n\nYou can find all the evaluation scripts relating to the\n'Exploiting Diacritics in the Wild' section of the paper in the\n[exploiting_wilddiacs](./code/exploiting_wilddiacs/) directory.\n\nA fork of [CAMeL Tools](https://github.com/CAMeL-Lab/camel_tools)\nwith the Wild Diacritics edits outlined in the paper can be found in the\n[ct_wilddiac repo](https://github.com/CAMeL-Lab/ct_wilddiac).\n\n## License\n\nThe Wild2Max and WikiNewsMax datasets are available under the\n[Creative Commons Attribution-ShareAlike License](https://creativecommons.org/licenses/by-sa/4.0/).\nSee [LICENSE_CC_BY_SA](./LICENSE_CC_BY_SA) for more info.\n\nAll scripts and code in this repo are available under the MIT license.\nSee [LICENSE_MIT](./LICENSE_MIT) for more info.\n\n## Citing\n\nIf you find any of our work useful or publish work using the Wild2Max or\nWikiNewsMax datasets, please cite [our paper](https://arxiv.org/abs/2406.05760):\n\n```bibtex\n@misc{elgamal2024arabicdiacriticswildexploiting,\n      title={Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization}, \n      author={Salman Elgamal and Ossama Obeid and Tameem Kabbani and Go Inoue and Nizar Habash},\n      year={2024},\n      eprint={2406.05760},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2406.05760}, \n}\n```\n\nIf you publish work using the WikiNewsMax dataset, please additionally cite\n[the paper](https://aclanthology.org/W17-1302/) describing the original\nWikiNews dataset:\n\n```bibtex\n@inproceedings{darwish-etal-2017-arabic,\n    title = \"{A}rabic Diacritization: Stats, Rules, and Hacks\",\n    author = \"Darwish, Kareem  and\n      Mubarak, Hamdy  and\n      Abdelali, Ahmed\",\n   
 editor = \"Habash, Nizar  and\n      Diab, Mona  and\n      Darwish, Kareem  and\n      El-Hajj, Wassim  and\n      Al-Khalifa, Hend  and\n      Bouamor, Houda  and\n      Tomeh, Nadi  and\n      El-Haj, Mahmoud  and\n      Zaghouani, Wajdi\",\n    booktitle = \"Proceedings of the Third {A}rabic Natural Language Processing Workshop\",\n    month = apr,\n    year = \"2017\",\n    address = \"Valencia, Spain\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/W17-1302\",\n    doi = \"10.18653/v1/W17-1302\",\n    pages = \"9--17\",\n    abstract = \"In this paper, we present a new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a Viterbi decoder at word-level with back-off to stem, morphological patterns, and transliteration and sequence labeling based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules to properly guess case endings. We achieve a low word level diacritization error of 3.29{\\%} and 12.77{\\%} without and with case endings respectively on a new multi-genre free of copyright test set. We are making the diacritizer available for free for research purposes.\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fwild_diacritics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamel-lab%2Fwild_diacritics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fwild_diacritics/lists"}