https://github.com/thomas-chauvet/names_transliteration

Neural Machine Translation (NMT) applied to transliterate names in arabic characters to latin characters (romanization).
arabic characters cli data dataset deep-learning latin neural-network nlp nmt romanization seq2seq translation transliteration typer-cli

Host: GitHub
URL: https://github.com/thomas-chauvet/names_transliteration
Owner: thomas-chauvet
Created: 2020-08-19T13:41:01.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2021-06-03T09:38:20.000Z (over 3 years ago)
Last Synced: 2024-07-16T08:42:59.792Z (4 months ago)
Topics: arabic, characters, cli, data, dataset, deep-learning, latin, neural-network, nlp, nmt, romanization, seq2seq, translation, transliteration, typer-cli
Language: Jupyter Notebook
Homepage: https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb
Size: 4.81 MB
Stars: 6
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb)

# Names transliteration

In this [repository](https://github.com/thomas-chauvet/names_transliteration) you will find:

- a [dataset](https://raw.githubusercontent.com/thomas-chauvet/names_transliteration/master/data/clean/arabic_english.csv) (and the code to build it) pairing names in Arabic characters with the corresponding names in Latin characters (English),
- a Google Colab notebook to train a [Neural Machine Translation](https://en.wikipedia.org/wiki/Neural_machine_translation) (NMT) model based on [seq2seq](https://en.wikipedia.org/wiki/Seq2seq).

The objective of this model is to [transliterate](https://en.wikipedia.org/wiki/Transliteration) names from the Arabic alphabet to the Latin alphabet. This task is also called [romanization](https://en.wikipedia.org/wiki/Romanization).

The model is trained on Google Colab, which provides a free GPU.

The model is based on the TensorFlow tutorial [NMT with attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention).
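To make the seq2seq setup more concrete, here is a minimal sketch of how a name could be turned into a character-level integer sequence before being fed to an encoder or decoder. This is an illustration only and not the notebook's actual preprocessing: the marker tokens `<s>`, `</s>`, `<pad>` and the helper names `build_vocab` and `encode` are assumptions introduced for this example.

```python
from typing import Dict, List

# Hypothetical boundary and padding markers; the notebook may use different tokens.
START, END, PAD = "<s>", "</s>", "<pad>"


def build_vocab(names: List[str]) -> Dict[str, int]:
    """Map every character seen in `names` to an integer id (0 is padding)."""
    vocab = {PAD: 0, START: 1, END: 2}
    for name in names:
        for char in name:
            # Assign the next free id the first time a character is seen.
            vocab.setdefault(char, len(vocab))
    return vocab


def encode(name: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Wrap a name in start/end markers, map it to ids, and right-pad to max_len."""
    ids = [vocab[START]] + [vocab[char] for char in name] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError(f"{name!r} needs {len(ids)} positions, max_len is {max_len}")
    return ids + [vocab[PAD]] * (max_len - len(ids))


# One vocabulary per alphabet: source (Arabic) and target (Latin).
arabic_vocab = build_vocab(["عادل"])
latin_vocab = build_vocab(["adel"])

source_ids = encode("عادل", arabic_vocab, max_len=8)
target_ids = encode("adel", latin_vocab, max_len=8)
```

Working at the character level keeps both vocabularies small, which suits names: the model has to learn letter-to-letter correspondences rather than word meanings.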

## Data

We use 3 datasets:

*   [Google transliteration data](https://github.com/google/transliteration/blob/master/ar2en.txt). Example: *عادل; adel*
*   [ANETAC dataset](https://github.com/MohamedHadjAmeur/ANETAC/blob/master/EN-AR%20NE/EN-AR%20Named-entities.txt). Example: *PERSON; Adel; اديل*. For this file, we keep only the *PERSON* entries.
*   [NETransliteration COLING 2018](https://github.com/steveash/NETransliteration-COLING2018/blob/master/data/wd_arabic.normalized.aligned.tokens).

Together, these 3 datasets give us a clean dataset containing names in Arabic and the corresponding names in the Latin alphabet (English).
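The filtering and merging described above can be sketched as follows. This is a rough illustration, not the repository's own code: it assumes the `;` separator shown in the examples (the raw files may use a different delimiter), the function names are hypothetical, and lowercasing the Latin side is an extra normalization choice made for this example. The third dataset is left out because no example line is shown for it.

```python
from typing import Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (arabic_name, latin_name)


def parse_google_line(line: str) -> Optional[Pair]:
    """Parse an 'arabic; latin' line, as in the first example above."""
    parts = [part.strip() for part in line.split(";")]
    if len(parts) != 2 or not all(parts):
        return None
    arabic, latin = parts
    return arabic, latin.lower()


def parse_anetac_line(line: str) -> Optional[Pair]:
    """Parse a 'TYPE; latin; arabic' line and keep PERSON entries only."""
    parts = [part.strip() for part in line.split(";")]
    if len(parts) != 3 or parts[0] != "PERSON" or not all(parts):
        return None
    _, latin, arabic = parts
    return arabic, latin.lower()


def merge(*sources: Iterable[Optional[Pair]]) -> List[Pair]:
    """Concatenate sources, drop unparsable rows, and remove exact duplicates."""
    seen = set()
    merged: List[Pair] = []
    for source in sources:
        for pair in source:
            if pair is not None and pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


google = [parse_google_line("عادل; adel")]
anetac = [
    parse_anetac_line(line)
    for line in ["PERSON; Adel; اديل", "LOCATION; Cairo; القاهرة"]
]
pairs = merge(google, anetac)
```

Note that the same Latin spelling can legitimately map to several Arabic spellings (and vice versa), so only exact duplicate pairs are dropped here.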

## Pre-trained models

A pre-trained model (Arabic to Latin characters) is stored on [Dropbox](https://www.dropbox.com/s/leqc4k9c4hzfvi3/names-translation-model-2020-10-02.zip?dl=1).
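If you prefer to fetch the archive from Python rather than through the browser, something along these lines should work using only the standard library. The helper names `extract_archive` and `download_model` are hypothetical, and the layout of the files inside the zip is not documented here, so the sketch simply returns whatever member names the archive contains.

```python
import io
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# URL taken from the link above.
MODEL_URL = (
    "https://www.dropbox.com/s/leqc4k9c4hzfvi3/"
    "names-translation-model-2020-10-02.zip?dl=1"
)


def extract_archive(archive_bytes: bytes, target_dir: Path) -> List[str]:
    """Unpack an in-memory zip archive into target_dir and list its members."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


def download_model(target_dir: Path, url: str = MODEL_URL) -> List[str]:
    """Download the zip at `url` into memory and extract it into target_dir."""
    with urllib.request.urlopen(url) as response:
        return extract_archive(response.read(), target_dir)


# Example call (performs a network request, so it is left commented out):
# members = download_model(Path("pretrained_model"))
```

The `get-pretrained-model` command listed in the CLI section covers the same need from the command line.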

## Colab notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb)

A Jupyter notebook is provided to train the model used for transliteration.

## Web application - Streamlit

A Streamlit app is provided. You can find a deployed version [here](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py).

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/thomas-chauvet/names_transliteration/app.py)

## Library

Install the library:

```bash
python setup.py install
```

## CLI

- `get-data`: Get data from the 3 sources to build a training dataset.
- `get-pretrained-model`: Download the pre-trained model for the task.
- `train-nmt-model`: Train an NMT model.
- `transliterate-name`: Transliterate a name from Arabic to Latin characters.

## Python environment

Please refer to the `environment.yml` file for the conda environment.

To create the environment with conda:

```bash
conda env create -f environment.yml
```