Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thomas-chauvet/names_transliteration
Neural Machine Translation (NMT) applied to transliterate names in arabic characters to latin characters (romanization).
- Host: GitHub
- URL: https://github.com/thomas-chauvet/names_transliteration
- Owner: thomas-chauvet
- Created: 2020-08-19T13:41:01.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2021-06-03T09:38:20.000Z (over 3 years ago)
- Last Synced: 2024-07-16T08:42:59.792Z (4 months ago)
- Topics: arabic, characters, cli, data, dataset, deep-learning, latin, neural-network, nlp, nmt, romanization, seq2seq, translation, transliteration, typer-cli
- Language: Jupyter Notebook
- Homepage: https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb
- Size: 4.81 MB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb)
# Names transliteration
In this [repository](https://github.com/thomas-chauvet/names_transliteration) you will find:
- a [dataset](https://raw.githubusercontent.com/thomas-chauvet/names_transliteration/master/data/clean/arabic_english.csv)
(and the code to build it) containing
names in Arabic characters and the corresponding names in Latin
characters (English),
- a (Google Colab) notebook to train a
[Neural Machine Translation](https://en.wikipedia.org/wiki/Neural_machine_translation) (NMT) model
based on [seq2seq](https://en.wikipedia.org/wiki/Seq2seq). The objective
of this model is to [transliterate](https://en.wikipedia.org/wiki/Transliteration) names
from the Arabic alphabet to the Latin alphabet. This task is also called
[romanization](https://en.wikipedia.org/wiki/Romanization).

The model is trained on Google Colab, which provides a (free) GPU.
The model is based on the TensorFlow tutorial
[NMT with attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention).

## Data
We use 3 datasets:
* [Google transliteration data](https://github.com/google/transliteration/blob/master/ar2en.txt).
Example: *عادل; adel*
* [ANETAC dataset](https://github.com/MohamedHadjAmeur/ANETAC/blob/master/EN-AR%20NE/EN-AR%20Named-entities.txt).
Example: *PERSON; Adel; اديل*. For this file we keep only the *PERSON* entries,
* [NETransliteration COLING 2018](https://github.com/steveash/NETransliteration-COLING2018/blob/master/data/wd_arabic.normalized.aligned.tokens).

Together, these 3 datasets give us a clean dataset containing names in Arabic and the
corresponding names in the Latin alphabet (English).

## Pre-trained models
A pre-trained model (Arabic to Latin characters) is stored on
[Dropbox](https://www.dropbox.com/s/leqc4k9c4hzfvi3/names-translation-model-2020-10-02.zip?dl=1).

## Colab notebook
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb)
A Jupyter notebook is provided to train the model used for transliteration.
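
To give a rough idea of the kind of input a seq2seq model for names consumes, the sketch below shows one common preparation step, assuming character-level tokenization: each name is wrapped in start/end markers, mapped to integer IDs, and right-padded to a fixed length. The marker characters, the padding ID `0`, and the helper names `build_vocab` and `encode` are hypothetical and chosen for illustration; they are not taken from the notebook.

```python
from typing import Dict, List

# Hypothetical start/end markers and padding ID (assumptions, not from the notebook).
START, END = "<", ">"
PAD_ID = 0


def build_vocab(names: List[str]) -> Dict[str, int]:
    """Map every character seen in `names`, plus the markers, to an ID >= 1."""
    chars = sorted(set("".join(names)) | {START, END})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(name: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Wrap a name in markers, map characters to IDs, and right-pad with PAD_ID."""
    ids = [vocab[ch] for ch in START + name + END]
    if len(ids) > max_len:
        raise ValueError(f"name needs {len(ids)} positions, max_len is {max_len}")
    return ids + [PAD_ID] * (max_len - len(ids))


# One (Arabic, Latin) pair, taken from the example in the Data section.
pairs = [("عادل", "adel")]
src_vocab = build_vocab([src for src, _ in pairs])
tgt_vocab = build_vocab([tgt for _, tgt in pairs])

src_ids = encode("عادل", src_vocab, max_len=8)
tgt_ids = encode("adel", tgt_vocab, max_len=8)
print(src_ids)
print(tgt_ids)
```

Separate vocabularies are kept for the source and target sides because the two alphabets do not overlap; in a real training run the vocabularies would be built from the full dataset rather than a single pair.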
## Web application - Streamlit
A Streamlit app is provided. You can find a deployed version [here](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py).
[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py)
## Library
Install the library:
```bash
python setup.py install
```

## CLI
- `get-data`: Fetch data from the 3 sources to build a training dataset.
- `get-pretrained-model`: Download the pre-trained model for the task.
- `train-nmt-model`: Train an NMT model.
- `transliterate-name`: Transliterate a name from Arabic to Latin characters.

## Python environment
Please refer to the `environment.yml` file for the conda environment.
To create the environment with conda:
```bash
conda env create -f environment.yml
```