https://github.com/aliosm/arabic-text-diacritization
Benchmark Arabic text diacritization dataset
- Host: GitHub
- URL: https://github.com/aliosm/arabic-text-diacritization
- Owner: AliOsm
- License: mit
- Created: 2019-02-22T16:04:04.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-07-26T10:55:53.000Z (over 5 years ago)
- Last Synced: 2024-10-10T19:27:47.908Z (about 1 month ago)
- Topics: arabic-language, comparison, dataset, diacritization, iccais, nlp, sequence-labeling
- Language: Python
- Homepage:
- Size: 26.7 MB
- Stars: 71
- Watchers: 5
- Forks: 18
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Arabic Text Diacritization
This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:
"[Arabic Text Diacritization Using Deep Neural Networks](https://arxiv.org/abs/1905.01965)", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, [ICCAIS 2019](http://www.iccais.tech).
## Files
### [dataset](/dataset)
- train.txt - Contains 50,000 lines of diacritized Arabic text for use as the training set
- val.txt - Contains 2,500 lines of diacritized Arabic text for use as the validation set
- test.txt - Contains 2,500 lines of diacritized Arabic text for use as the test set
### [helpers](/helpers)
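The dataset files above are plain text with one diacritized line each. As a rough illustration of the kind of processing the helpers in this folder perform on them, here is a minimal sketch that strips diacritics and computes a diacritics-to-letters rate. It is not the repository's code: the function names are hypothetical, the diacritic set is assumed to be the eight standard Unicode marks (U+064B to U+0652), and the letter ranges are an assumption as well. The repository keeps its own lists in the `constants` pickles.

```python
from pathlib import Path

# Assumption: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))

# Assumption: basic Arabic letters, U+0621..U+063A and U+0641..U+064A
# (tatweel, U+0640, is deliberately left out).
ARABIC_LETTERS = frozenset(
    chr(cp) for cp in list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
)


def load_lines(path):
    """Read a UTF-8 text file and return its non-empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line.strip()]


def remove_diacritics(text):
    """Return the text with every diacritic mark dropped."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def diacritics_rate(text):
    """Return diacritics divided by Arabic letters (0.0 if there are no letters)."""
    letters = sum(ch in ARABIC_LETTERS for ch in text)
    marks = sum(ch in DIACRITICS for ch in text)
    return marks / letters if letters else 0.0
```

For example, `remove_diacritics` applied to each line returned by `load_lines("dataset/train.txt")` would produce undiacritized model input, and filtering on `diacritics_rate(line) >= p / 100` mirrors the idea behind keeping only well-diacritized lines.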
- constants
  - ARABIC_LETTERS_LIST.pickle - Contains a list of Arabic letters
  - CLASSES_LIST.pickle - Contains a list of all possible classes
  - DIACRITICS_LIST.pickle - Contains a list of all diacritics
- count_characters.py - Counts the number of Arabic letters and diacritics in a file
- count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files in a folder
- diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
- diacritics_rate_extractor.py - Keeps only the lines whose diacritics-to-Arabic-characters rate is at least p%, across all files in a folder
- file_lookup.py - Searches for a line in all files in a folder
- fix_fathatan.py - Changes after-Alif fathatan to before-Alif fathatan in a file
- remove_diacritics.py - Removes diacritics from a file
- transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
- pre_process_tashkeela_corpus.ipynb - Pre-processes the Tashkeela Corpus data
### [existing_systems](/existing_systems)
- ali-soft - Contains examples of bugs found in the [Ali-Soft](http://www.ali-soft.com) system
- farasa - Contains [Farasa](http://alt.qcri.org/farasa) system output, fixed output, and DER/WER statistics
- harakat - Contains [Harakat](https://harakat.ae) system testing script, output, fixed output, and DER/WER statistics
- madamira - Contains [MADAMIRA](https://camel.abudhabi.nyu.edu/madamira) system output, fixed output, and DER/WER statistics
- mishkal - Contains [Mishkal](https://tahadz.com/mishkal) system output, fixed output, and DER/WER statistics
- shakkala - Contains [Shakkala](https://ahmadai.com/shakkala) system data splitting script, output, fixed output, and DER/WER statistics
- tashkeela_model - Contains [Tashkeela-Model](https://github.com/Anwarvic/Tashkeela-Model) system output, fixed output, and DER/WER statistics for each n-gram model it provides
#### Note: All code in this repository was tested on [Ubuntu 18.04](http://releases.ubuntu.com/18.04)
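
The DER (diacritic error rate) and WER (word error rate) statistics referenced above can be illustrated with a simplified sketch. This is not `diacritization_stat.py` itself: the function names are hypothetical, every non-diacritic character inside a word is counted, and the variants reported in the paper (for example, with and without case endings) are not reproduced. It assumes the gold and predicted texts contain the same words and base letters and differ only in their diacritics.

```python
# Assumption: the eight standard Arabic diacritics (U+064B..U+0652).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_diacritics(word):
    """Return a list of (base_char, set_of_diacritics) pairs for a word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, set()))
        elif pairs:
            # Attach the mark to the preceding base character.
            pairs[-1][1].add(ch)
    return pairs


def der_wer(gold, predicted):
    """Return (DER, WER) as percentages for two whitespace-tokenized strings."""
    gold_words, pred_words = gold.split(), predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted text differ in word count")

    char_total = char_errors = word_errors = 0
    for gold_word, pred_word in zip(gold_words, pred_words):
        gold_pairs = split_diacritics(gold_word)
        pred_pairs = split_diacritics(pred_word)
        if [b for b, _ in gold_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base letters differ between gold and predicted")
        # Marks are compared as sets, so shadda+fatha equals fatha+shadda.
        errors = sum(g != p for (_, g), (_, p) in zip(gold_pairs, pred_pairs))
        char_total += len(gold_pairs)
        char_errors += errors
        word_errors += errors > 0  # a word is wrong if any character is wrong

    der = 100 * char_errors / char_total if char_total else 0.0
    wer = 100 * word_errors / len(gold_words) if gold_words else 0.0
    return der, wer
```

With two three-letter words where only the last letter of the second word is mis-diacritized, this yields one error out of six characters and one wrong word out of two.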
## Contributors
1. [Ali Hamdi Ali Fadel](https://github.com/AliOsm).
2. [Ibraheem Tuffaha](https://github.com/IbraheemTuffaha).
3. [Bara' Al-Jawarneh](https://github.com/baraajaw).
4. [Mahmoud Al-Ayyoub](https://github.com/malayyoub).
## License
The project is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).