https://github.com/camel-lab/camelprop

A dataset of Arabic words with English glosses, sourced from Wikimedia and annotated with maximal diacritization.

# CamelProp
This repository contains CP-WIKI-D3K, a dataset of 3,362 Arabic proper nouns from Wikipedia, each annotated with gold-standard lemma diacritizations and aligned with their English equivalents.
It includes:

1. The full dataset
2. The postprocessing pipeline used to convert ChatGPT-4o outputs into final annotations, as described in [^1]
3. Markdown tables listing the examples used for few-shot and one-shot prompting
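
To illustrate what a diacritized annotation adds to an undiacritized name, the following minimal sketch strips Arabic diacritics from a fully diacritized form and checks that the result matches the plain form. The helper names `strip_diacritics` and `is_consistent`, the chosen character range, and the example pair are assumptions made for illustration; they are not taken from the dataset or from the repository's postprocessing pipeline.

```python
import re

# Assumption: "diacritics" here means the Arabic short vowels, tanween, shadda,
# and sukun (U+064B to U+0652) plus the dagger alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks listed above removed."""
    return DIACRITICS.sub("", text)


def is_consistent(undiacritized: str, diacritized: str) -> bool:
    """Check that removing diacritics from the annotation recovers the plain form."""
    return strip_diacritics(diacritized) == undiacritized


# Illustrative pair (hypothetical, not a dataset entry):
plain = "\u0645\u062D\u0645\u062F"  # محمد
full = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"  # مُحَمَّد

print(is_consistent(plain, full))
```

A check of this kind only confirms that an annotation adds marks without changing the base letters; it says nothing about whether the diacritization itself is correct.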

[^1]: Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash (2025). Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset. arXiv:2505.02656.