https://github.com/thomas-chauvet/names_transliteration

Neural Machine Translation (NMT) applied to transliterating names from Arabic characters to Latin characters (romanization).

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb)

# Names transliteration

In this [repository](https://github.com/thomas-chauvet/names_transliteration) you will find:
- a [dataset](https://raw.githubusercontent.com/thomas-chauvet/names_transliteration/master/data/clean/arabic_english.csv)
(and the code to build it) containing
names in Arabic characters and the corresponding names in Latin
characters (English),
- a Google Colab notebook to train a
[Neural Machine Translation](https://en.wikipedia.org/wiki/Neural_machine_translation) (NMT) model
based on [seq2seq](https://en.wikipedia.org/wiki/Seq2seq). The objective
of this model is to [transliterate](https://en.wikipedia.org/wiki/Transliteration) names
from the Arabic alphabet to the Latin alphabet. This task is also called
[romanization](https://en.wikipedia.org/wiki/Romanization).

The model is trained on Google Colab, which provides a free GPU.

The model is based on the TensorFlow tutorial
[NMT with attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention).
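To illustrate how names can be fed to a seq2seq model, here is a minimal sketch of character-level encoding. It assumes each name is split into characters, wrapped in start and end markers, and padded to a fixed length. The token names (`<pad>`, `<start>`, `<end>`) and the helpers `build_vocab` and `encode` are hypothetical and may differ from what the notebook does.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the notebook may use different markers.
PAD, START, END = "<pad>", "<start>", "<end>"


def build_vocab(names: Iterable[str]) -> Dict[str, int]:
    """Map the special tokens and every character seen in `names` to an integer id."""
    chars = sorted({ch for name in names for ch in name})
    tokens = [PAD, START, END] + chars
    return {token: index for index, token in enumerate(tokens)}


def encode(name: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn a name into a fixed-length list of ids: <start>, characters, <end>, padding."""
    ids = [vocab[START]] + [vocab[ch] for ch in name] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError(f"name needs {len(ids)} positions, but max_len is {max_len}")
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Usage on the Latin side; the Arabic side would get its own vocabulary.
latin_vocab = build_vocab(["adel", "ali"])
encoded = encode("adel", latin_vocab, max_len=8)
print(encoded)  # [1, 3, 4, 5, 7, 2, 0, 0]
```

One vocabulary is built per alphabet, so the encoder and decoder each work with their own set of character ids.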

## Data

We use three datasets:
* [Google transliteration data](https://github.com/google/transliteration/blob/master/ar2en.txt).
Example: *عادل; adel*.
* [ANETAC dataset](https://github.com/MohamedHadjAmeur/ANETAC/blob/master/EN-AR%20NE/EN-AR%20Named-entities.txt).
Example: *PERSON; Adel; اديل*. From this file we keep only the *PERSON* entries.
* [NETransliteration COLING 2018](https://github.com/steveash/NETransliteration-COLING2018/blob/master/data/wd_arabic.normalized.aligned.tokens).

Combined and cleaned, these three datasets yield a single dataset of names in Arabic
characters paired with the corresponding names in Latin characters (English).
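The merging step can be sketched as follows, using the two example lines shown above. This is only an illustration: the `;` delimiter is taken from how the examples are displayed here (the raw files may use another separator), the column positions are inferred from those examples, and lowercasing the Latin side is an assumption. `parse_pairs` is a hypothetical helper, not a function from this repository.

```python
from typing import Iterable, List, Optional, Tuple


def parse_pairs(
    lines: Iterable[str],
    delimiter: str,
    arabic_col: int,
    latin_col: int,
    label_col: Optional[int] = None,
    keep_label: str = "PERSON",
) -> List[Tuple[str, str]]:
    """Extract (arabic, latin) pairs, optionally keeping only rows with a given label."""
    pairs = []
    needed = max(arabic_col, latin_col, label_col if label_col is not None else 0)
    for line in lines:
        fields = [field.strip() for field in line.split(delimiter)]
        if len(fields) <= needed:
            continue  # skip malformed or empty lines
        if label_col is not None and fields[label_col] != keep_label:
            continue  # e.g. drop LOCATION or ORGANIZATION rows
        arabic = fields[arabic_col]
        latin = fields[latin_col].lower()  # assumption: normalize case on the Latin side
        if arabic and latin:
            pairs.append((arabic, latin))
    return pairs


# Sample lines in the layout displayed in the list above.
google_lines = ["عادل; adel"]
anetac_lines = ["PERSON; Adel; اديل", "LOCATION; Paris; باريس"]

pairs = parse_pairs(google_lines, ";", arabic_col=0, latin_col=1)
pairs += parse_pairs(anetac_lines, ";", arabic_col=2, latin_col=1, label_col=0)

# Remove exact duplicates while preserving order.
clean_pairs = list(dict.fromkeys(pairs))
print(clean_pairs)
```

The same Latin name can legitimately map to several Arabic spellings, so only exact duplicate pairs are dropped.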

## Pre-trained models

A pre-trained model (Arabic to Latin characters) is stored on
[Dropbox](https://www.dropbox.com/s/leqc4k9c4hzfvi3/names-translation-model-2020-10-02.zip?dl=1).
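If you prefer to fetch the archive from Python rather than through the CLI, a minimal sketch using only the standard library could look like this. The helper name `download_and_extract` and the destination directory are hypothetical, and the layout of the files inside the archive is not documented here, so inspect the returned file names before loading anything.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path
from typing import List

MODEL_URL = (
    "https://www.dropbox.com/s/leqc4k9c4hzfvi3/"
    "names-translation-model-2020-10-02.zip?dl=1"
)


def download_and_extract(url: str, dest_dir: str) -> List[str]:
    """Download a zip archive from `url`, extract it into `dest_dir`, return member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Stream the download to a temporary file so the archive is not held in memory.
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp_path = Path(tmp.name)

    try:
        with zipfile.ZipFile(tmp_path) as archive:
            archive.extractall(dest)
            return archive.namelist()
    finally:
        tmp_path.unlink()  # always remove the temporary archive


# Example call (performs a network request, so it is left commented out):
# files = download_and_extract(MODEL_URL, "pretrained_model")
```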

## Colab notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb)

A Jupyter notebook is provided to train the model used for transliteration.

## Web application - Streamlit

A Streamlit app is provided. You can find a deployed version [here](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py).

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py)

## Library

Install the library:
```bash
python setup.py install
```

## CLI

- `get-data`: Fetch data from the three sources and build the training dataset.
- `get-pretrained-model`: Download the pre-trained model for the task.
- `train-nmt-model`: Train an NMT model.
- `transliterate-name`: Transliterate a name from Arabic characters to Latin characters.

## Python environment

Please refer to the `environment.yml` file for the conda environment definition.

To create the environment with conda:
```bash
conda env create -f environment.yml
```