https://github.com/camel-lab/camelprop
A dataset of Arabic words with English glosses, sourced from Wikimedia and annotated with maximal diacritization.
- Host: GitHub
- URL: https://github.com/camel-lab/camelprop
- Owner: CAMeL-Lab
- Created: 2025-06-13T10:03:08.000Z
- Default Branch: main
- Last Pushed: 2025-08-14T16:22:55.000Z
- Last Synced: 2025-09-09T22:06:14.627Z
- Language: Python
- Size: 73.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CamelProp
This repository contains CP-WIKI-D3K, a dataset of 3,362 Arabic proper nouns from Wikipedia, each annotated with gold-standard lemma diacritizations and aligned with their English equivalents.
It includes:
1. The full dataset
2. The postprocessing pipeline used to convert ChatGPT-4o outputs into final annotations, as described in [^1]
3. Markdown tables listing the examples used for few-shot and one-shot prompting
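A quick sanity check when working with diacritized lemmas is to confirm that removing the diacritics gives back the undiacritized word. The sketch below is illustrative only: the field names (`word`, `lemma_diac`, `english`) and the two sample entries are hypothetical, not taken from the repository, and the actual file layout and postprocessing pipeline may differ.

```python
# Arabic diacritic code points: tanween, short vowels, shadda, sukun
# (U+064B through U+0652), plus the dagger alif (U+0670).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return text with Arabic diacritic marks removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def is_consistent(word: str, lemma_diac: str) -> bool:
    """True if the diacritized lemma reduces to the undiacritized word."""
    return strip_diacritics(lemma_diac) == word


# Hypothetical entries for illustration; field names are assumptions.
entries = [
    {"word": "محمد", "lemma_diac": "مُحَمَّد", "english": "Muhammad"},
    {"word": "لندن", "lemma_diac": "لَنْدَن", "english": "London"},
]

# Collect entries whose diacritized form does not match the base word.
mismatches = [e for e in entries if not is_consistent(e["word"], e["lemma_diac"])]
print(f"{len(mismatches)} inconsistent entries out of {len(entries)}")
```

A check like this only catches changes to the base letters; it says nothing about whether the diacritics themselves are correct.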
[^1]: Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash (2025). Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset. arXiv:2505.02656.