https://github.com/shaltielshmid/hebrew-disambig-challenge-set
- Host: GitHub
- URL: https://github.com/shaltielshmid/hebrew-disambig-challenge-set
- Owner: shaltielshmid
- Created: 2020-10-01T06:07:16.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-10-01T06:27:45.000Z (over 4 years ago)
- Last Synced: 2023-07-11T00:22:54.813Z (over 1 year ago)
- Size: 3.37 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
# A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration
This dataset covers 21 Hebrew homographs, each with two possible analyses. Each analysis of each homograph has roughly 500 to 2,000 example sentences.
This dataset is provided for research purposes (see License section below).
## Files
The dataset currently contains sample sentences for 21 homographs. Each homograph has its own directory containing two files, one for each possible analysis. The directory is named after the undiacritized homograph, and each file is named after the diacritized form it represents.
For example, for the homograph `עמנו`:
```
----------------------------------------------
.
..
עמנו
עִמָּנוּ.txt
עַמֵּנוּ.txt
...
----------------------------------------------
```
## Format
Each file is a plain-text document containing the tagged examples for one diacritized form.
Each example is a full sentence on its own line. The relevant homograph appears between double daggers (`‡`, `U+2021`) in its correctly diacritized form, which identifies the class. For example:
ידענו שוויתור על המנהג הזה יביא את הקץ על ‡עַמֵּנוּ‡ ועל היהדות כולה.
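The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the names `Example`, `parse_line`, and `iter_dataset` are hypothetical, and it assumes that each line holds one marked homograph, that files are UTF-8 encoded, and that the homograph directories sit directly under the dataset root.

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple, Optional, Tuple

DOUBLE_DAGGER = "\u2021"  # the marker character described in the Format section
_MARKED = re.compile(f"{DOUBLE_DAGGER}(.+?){DOUBLE_DAGGER}")


class Example(NamedTuple):
    """One tagged sentence (hypothetical container, not defined by the dataset)."""
    sentence: str  # the sentence with the two markers removed
    target: str    # the diacritized form that was between the markers
    start: int     # character offset of `target` inside `sentence`


def parse_line(line: str) -> Optional[Example]:
    """Parse one line; return None if no marked homograph is found."""
    line = line.strip()
    match = _MARKED.search(line)  # assumption: only the first marked span matters
    if match is None:
        return None
    target = match.group(1)
    # Rebuild the sentence without the markers, keeping the diacritized form.
    sentence = line[:match.start()] + target + line[match.end():]
    return Example(sentence, target, match.start())


def iter_dataset(root: Path) -> Iterator[Tuple[str, str, Example]]:
    """Yield (homograph, label, example) for every <homograph>/<form>.txt file.

    Assumes the directory layout shown in the Files section, with the
    homograph directories directly under `root`.
    """
    for path in sorted(root.glob("*/*.txt")):
        homograph = path.parent.name  # undiacritized form (directory name)
        label = path.stem             # diacritized form (file name without .txt)
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                example = parse_line(line)
                if example is not None:
                    yield homograph, label, example
```

Assuming the repository has been cloned into a local directory, `iter_dataset(Path("hebrew-disambig-challenge-set"))` would then yield one labeled example per sentence. The file name serves as the class label, and `target` carries the same diacritized form, so the two can be compared as a sanity check.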
## License
We provide this dataset for research purposes and make no ownership claim on any part of it.
## Citing
If you use this dataset in academic work, please cite as follows:
```bibtex
@inproceedings{shmidman-etal-2020-challenge,
author = {Avi Shmidman and
Joshua Guedalia and
Shaltiel Shmidman and
Moshe Koppel and
Reut Tsarfaty},
title = {A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration},
booktitle = "EMNLP 2020 Findings, forthcoming",
publisher = "Association for Computational Linguistics",
year = {2020}
}
```