https://github.com/stefan-it/nmt-mk-en
Neural Machine Translation system for Macedonian to English
https://github.com/stefan-it/nmt-mk-en
fairseq macedonian neural-machine-translation transformer
Last synced: 10 months ago
JSON representation
Neural Machine Translation system for Macedonian to English
- Host: GitHub
- URL: https://github.com/stefan-it/nmt-mk-en
- Owner: stefan-it
- Created: 2017-07-15T21:28:03.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2018-04-21T11:37:30.000Z (about 8 years ago)
- Last Synced: 2024-10-28T07:56:15.639Z (over 1 year ago)
- Topics: fairseq, macedonian, neural-machine-translation, transformer
- Language: Shell
- Size: 21.8 MB
- Stars: 4
- Watchers: 2
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Neural Machine Translation system for Macedonian to English
This repository contains all data and documentation for building a neural
machine translation system for Macedonian to English. This work was done during
the the M.Sc. course (summer term) [Machine Translation](http://cis.lmu.de/~fraser/mt_2017/)
held by [Prof. Dr. Alex Fraser](http://cis.lmu.de/~fraser/).
# Dataset
The [*SETimes corpus*](http://nlp.ffzg.hr/resources/corpora/setimes/) contains
of 207,777 parallel sentences for the Macedonian and English language pair.
For all experiments the corpus was split into training, development and
test set:
| Data set | Sentences | Download
| ----------- | --------- | -----------------------------------------------------------------------------------------------------------------------------------------
| Training | 205,777 | via [GitHub](https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.train.tgz) or located in `data/setimes.mk-en.train.tgz`
| Development | 1,000 | via [GitHub](https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.dev.tgz) or located in `data/setimes.mk-en.dev.tgz`
| Test | 1,000 | via [GitHub](https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.test.tgz) or located in `data/setimes.mk-en.test.tgz`
# *fairseq* - Facebook AI Research Sequence-to-Sequence Toolkit
The first NMT system for Macedonian to English is built with [*fairseq*](https://github.com/facebookresearch/fairseq).
We trained three systems with different architectures:
* Standard Bi-LSTM
* CNN as encoder, LSTM as decoder
* Fully convolutional
## Preprocessing
All necessary scripts can be found in the `scripts` folder of this repository.
In the first step, we need to download and extract the parallel *SETimes* corpus
for Macedonian to English:
```bash
wget http://nlp.ffzg.hr/data/corpora/setimes/setimes.en-mk.txt.tgz
tar -xf setimes.en-mk.txt.tgz
```
The `data_preparation.sh` scripts performs the following steps on the corpus:
* download of the *MOSES* tokenizer script; tokenization of the whole corpus
* download of the *BPE* scripts; learning and applying *BPE* on the corpus
```bash
./data_preparation setimes.en-mk.mk.txt setimes.en-mk.en.txt
```
After that the corpus is split into training, development and test set:
```bash
./split_dataset corpus.clean.bpe.32000.mk corpus.clean.bpe.32000.en
```
The following folder structure needs to be created:
```bash
mkdir {train,dev,test}
mv dev.* dev
mv train.* train
mv test.* test
mkdir model-data
```
After that the `fairseq` tool can be invoked to preprocess the corpus:
```bash
fairseq preprocess -sourcelang mk -targetlang en -trainpref train/train \
-validpref dev/dev -testpref test/test -thresholdsrc 3 \
-thresholdtgt 3 -destdir model-data
```
## Training
After the preprossing steps the three models can be trained.
### Standard Bi-LSTM
With the following command the bi-lstm model can be trained:
```bash
fairseq train -sourcelang mk -targetlang en -datadir model-data -model blstm \
-nhid 512 -dropout 0.2 -dropout_hid 0 -optim adam -lr 0.0003125 \
-savedir model-blstm
```
### CNN as encoder, LSTM as decoder
With the following command the CNN as encoder, LSTM as decoder model can be
trained:
```bash
fairseq train -sourcelang mk -targetlang en -datadir model-data -model conv \
-nenclayer 6 -dropout 0.2 -dropout_hid 0 -savedir model-conv
```
### Fully convolutional
With the following command the fully convolutional model can be trained:
```bash
fairseq train -sourcelang mk -targetlang en -datadir model-data -model fconv \
-nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 \
-clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir model-fconv
```
## Decoding
### Standard Bi-LSTM
With the following command the bi-lstm model can decode the test set:
```bash
fairseq generate -sourcelang mk -targetlang en \
-path model-blstm/model_best.th7 -datadir model-data -beam 10 \
-nbest 1 -dataset test > model-blstm/system.output
```
### CNN as encoder, LSTM as decoder
With the following command the CNN as encoder, LSTM as decoder model can
decode the test set:
```bash
fairseq generate -sourcelang mk -targetlang en -path model-conv/model_best.th7 \
-datadir model-data -beam 10 -nbest 1 \
-dataset test > model-conv/system.output
```
### Fully convolutional
With the following command the fully convolutional model can decode the test set:
```bash
fairseq generate -sourcelang mk -targetlang en -path model-fconv/model_best.th7 \
-datadir model-data -beam 10 -nbest 1 \
-dataset test > model-fconv/system.output
```
## Calculating the BLEU-score
With the helper script `fairseq_bleu.sh` the BLEU-score of all models can be
calculated very easy. The script expects the system output file as command
line argument:
```bash
./fairseq_bleu.sh model-blstm/system.output
```
## Results
We use different *BPE* merge operations: 16.000 and 32.000. Here are
the results on the final test set:
| Model | *BPE* merge operations | BLEU-Score
| ---------------------------- | ----------------------- | ----------
| Bi-LSTM | 32.000 | 46,84
| Bi-LSTM | 16.000 | 47,57
| CNN encoder, LSTM decoder | 32.000 | 19,83
| CNN encoder, LSTM decoder | 16.000 | 9,59
| Fully convolutional | 32.000 | 48,81
| Fully convolutional | 16.000 | **49,03**
The best bleu-score was obtained with the fully convolutional model with
16.000 merge operations.
# *tensor2tensor* - Transformer
The second NMT system for Macedonian to English is built with the [*tensor2tensor*](https://github.com/tensorflow/tensor2tensor)
library. We trained two systems: one subword-based system and one
character-based NMT system.
**Notice**: The problem description for this task is found in `translate_enmk.py`
in the root repository. This problem was once directly included and available
in *tensor2tensor*. But I decided to replace the integrated *tensor2tensor*
problem for Macedonian to English with a more challenging one. To replicate
all experiments in this repository, the `translate_enmk.py` problem is now a
user-defined problem and must be included in the following way:
```bash
cp translate_enmk.py /tmp
echo "from . import my_submodule" > /tmp/__init__.py
```
To use this problem, the `--t2t_usr_dir` commandline option must point to the
appropriate folder (in this example `/tmp`). For more information about
user-defined problems, see offical
[documentation](https://github.com/tensorflow/tensor2tensor#adding-your-own-components).
## Training (Transformer base)
The following training steps are tested with *tensor2tensor* in version *1.5.1*.
First, we create the initial directory structure:
```bash
mkdir -p t2t_data t2t_datagen t2t_train t2t_output
```
In the next step, the training and development datasets are downloaded and
prepared:
```bash
t2t-datagen --data_dir=t2t_data --tmp_dir=t2t_datagen/ \
--problem=translate_enmk_setimes32k --t2t_usr_dir /tmp
```
Then the training step can be started:
```bash
t2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
--model=transformer --hparams_set=transformer_base --output_dir=t2t_output \
--t2t_usr_dir /tmp
```
The number of GPUs used for training can be specified with the `--worker_gpu`
option.
## Decoding
In the next step, the test dataset is downloaded and extracted:
```bash
wget "https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.test.tgz"
tar -xzf setimes.mk-en.test.tgz
```
Then the decoding step for the test dataset can be started:
```bash
t2t-decoder --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
--model=transformer --decode_hparams="beam_size=4,alpha=0.6" \
--decode_from_file=test.mk --decode_to_file=system.output \
--hparams_set=transformer_big --output_dir=t2t_output/ \
--t2t_usr_dir /tmp
```
## Calculating the BLEU-score
The BLEU-score can be calculated with the built-in `t2t-bleu` tool:
```bash
t2t-bleu --translation=system.output --reference=test.en
```
## Results
The following results can be achieved using the Transformer model. A
character-based model was also trained and measured. A big transformer model
was also trained using *tensor2tensor* in version *1.2.9* (latest version has
a bug, see [this](https://github.com/tensorflow/tensor2tensor/issues/529) issue).
| Model | BLEU-Score
| ---------------------------- | ----------
| Transformer | **54,00** (uncased)
| Transformer (big) | 43,74 (uncased)
| Transformer (char-based) | 37.43 (uncased)
# Further work
We want to train a char-based NMT system with the [dl4mt-c2c](https://github.com/nyu-dl/dl4mt-c2c)
library in near future.
# Acknowledgments
We would like to thank the *Leibniz-Rechenzentrum der Bayerischen Akademie der
Wissenschaften* ([LRZ](https://www.lrz.de/english/)) for giving us access to the
NVIDIA *DGX-1* supercomputer.
# Presentations
* Short-presentation at [Deep Learning Workshop @ LRZ](https://www.lrz.de/services/compute/courses/2017-09-14_hdlw1s17/),
can be found [here](short-presentation/stefan_schweter_dlw17.pdf).