https://github.com/aliosm/arabic-text-diacritization

Benchmark Arabic text diacritization dataset

arabic-language comparison dataset diacritization iccais nlp sequence-labeling

Host: GitHub
URL: https://github.com/aliosm/arabic-text-diacritization
Owner: AliOsm
License: mit
Created: 2019-02-22T16:04:04.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-07-26T10:55:53.000Z (about 6 years ago)
Last Synced: 2025-08-31T06:48:36.732Z (about 2 months ago)
Topics: arabic-language, comparison, dataset, diacritization, iccais, nlp, sequence-labeling
Language: Python
Homepage:
Size: 26.7 MB
Stars: 75
Watchers: 4
Forks: 18
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# Arabic Text Diacritization

This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:

"[Arabic Text Diacritization Using Deep Neural Networks](https://arxiv.org/abs/1905.01965)", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, [ICCAIS 2019](http://www.iccais.tech).

## Files

### [dataset](/dataset)

- train.txt - Contains 50,000 lines of diacritized Arabic text, intended for use as the training set
- val.txt - Contains 2,500 lines of diacritized Arabic text, intended for use as the validation set
- test.txt - Contains 2,500 lines of diacritized Arabic text, intended for use as the test set
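
Because each line of these files is fully diacritized text, it can be split into an undiacritized input and one diacritic label per character, which is the shape a sequence-labeling model expects. The following is a minimal sketch of that split, not code from this repository. The names `split_diacritics` and `load_split` are hypothetical, and the diacritic set is assumed to be the eight marks U+064B to U+0652; the authoritative list is the one stored in `DIACRITICS_LIST.pickle`.

```python
# Assumption: the diacritics are the eight marks U+064B..U+0652
# (tanween forms, fatha, damma, kasra, shadda, sukun).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_diacritics(line):
    """Split a diacritized line into (bare_text, labels).

    labels[i] holds the diacritics that followed bare_text[i] in the
    original line; it is the empty string for characters without any.
    """
    chars, labels = [], []
    for ch in line:
        if ch not in DIACRITICS:
            chars.append(ch)
            labels.append("")
        elif chars:
            # Attach the mark to the most recent character.
            labels[-1] += ch
        # A diacritic with no preceding character is dropped.
    return "".join(chars), labels


def load_split(path):
    """Read one dataset file and split every line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [split_diacritics(line.rstrip("\n")) for line in f]
```

For example, `load_split("dataset/train.txt")` would return one `(bare_text, labels)` pair per line of the training file.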

### [helpers](/helpers)
- constants
  - ARABIC_LETTERS_LIST.pickle - Contains the list of Arabic letters
  - CLASSES_LIST.pickle - Contains the list of all possible classes
  - DIACRITICS_LIST.pickle - Contains the list of all diacritics
- count_characters.py - Counts the number of Arabic letters and diacritics in a file
- count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
- diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
- diacritics_rate_extractor.py - Keeps only the lines whose diacritics-to-Arabic-characters rate is p% or more, in all files from a folder
- file_lookup.py - Searches for a line in all files from a folder
- fix_fathatan.py - Changes after-Alif fathatan to before-Alif fathatan in a file
- remove_diacritics.py - Removes diacritics from a file
- transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
- pre_process_tashkeela_corpus.ipynb - Pre-processes the Tashkeela Corpus data
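
To make the DER (Diacritic Error Rate) and WER (Word Error Rate) metrics mentioned above concrete, here is a simplified sketch of how they can be computed for one gold/predicted line pair. It is not the logic of `diacritization_stat.py`. The names `diacritic_labels` and `der_wer` are hypothetical, and the sketch assumes that every Arabic letter counts (including letters with no diacritic), that letters fall in U+0621 to U+064A excluding tatweel, and that both lines have the same words and letters.

```python
# Assumption: the diacritics are the eight marks U+064B..U+0652.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def is_arabic_letter(ch):
    # Assumption: approximate letter range, with tatweel (U+0640) excluded.
    return "\u0621" <= ch <= "\u064A" and ch != "\u0640"


def diacritic_labels(word):
    """Return one diacritic string per Arabic letter in `word`."""
    labels = []
    attach = False  # True while the previous base character is a letter
    for ch in word:
        if ch in DIACRITICS:
            if attach:
                labels[-1] += ch
        elif is_arabic_letter(ch):
            labels.append("")
            attach = True
        else:
            attach = False  # marks after non-letters are ignored
    return labels


def der_wer(gold_line, pred_line):
    """Return (DER, WER) as fractions for one pair of lines."""
    gold_words, pred_words = gold_line.split(), pred_line.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("lines differ in word count")
    letters = letter_errors = words = word_errors = 0
    for gold, pred in zip(gold_words, pred_words):
        g, p = diacritic_labels(gold), diacritic_labels(pred)
        if len(g) != len(p):
            raise ValueError("words differ in letter count")
        if not g:
            continue  # skip tokens without Arabic letters
        # Compare sorted marks so shadda+fatha equals fatha+shadda.
        errors = sum(sorted(a) != sorted(b) for a, b in zip(g, p))
        letters += len(g)
        letter_errors += errors
        words += 1
        word_errors += errors > 0
    der = letter_errors / letters if letters else 0.0
    wer = word_errors / words if words else 0.0
    return der, wer
```

A word counts as wrong for WER as soon as one of its letters carries a wrong diacritic, so WER is always at least as large as DER on the same data.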

### [existing_systems](/existing_systems)
- ali-soft - Contains examples of bugs found in the [Ali-Soft](http://www.ali-soft.com) system
- farasa - Contains [Farasa](http://alt.qcri.org/farasa) system output, fixed output, and DER/WER statistics
- harakat - Contains [Harakat](https://harakat.ae) system testing script, output, fixed output, and DER/WER statistics
- madamira - Contains [MADAMIRA](https://camel.abudhabi.nyu.edu/madamira) system output, fixed output, and DER/WER statistics
- mishkal - Contains [Mishkal](https://tahadz.com/mishkal) system output, fixed output, and DER/WER statistics
- shakkala - Contains [Shakkala](https://ahmadai.com/shakkala) system data splitting script, output, fixed output, and DER/WER statistics
- tashkeela_model - Contains [Tashkeela-Model](https://github.com/Anwarvic/Tashkeela-Model) system output, fixed output, and DER/WER statistics for each n-gram model provided by them

#### Note: All code in this repository was tested on [Ubuntu 18.04](http://releases.ubuntu.com/18.04)

## Contributors
1. [Ali Hamdi Ali Fadel](https://github.com/AliOsm).

2. [Ibraheem Tuffaha](https://github.com/IbraheemTuffaha).

3. [Bara' Al-Jawarneh](https://github.com/baraajaw).

4. [Mahmoud Al-Ayyoub](https://github.com/malayyoub).

## License
The project is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
