# Neural Machine Translation system for English to Estonian

This repository contains all data and documentation for building a neural
machine translation system for English to Estonian.

# Dataset

The *WMT18 English-Estonian* data comes from the [WMT18 translation task website](http://www.statmt.org/wmt18/translation-task.html).

## Training

The original training data from WMT18 is used. The English-Estonian training corpora are the following:

| Data set | Sentences | Download
| ---------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------
| Europarl v8 | 652,944 | [WMT18](http://data.statmt.org/wmt18/translation-task/training-parallel-ep-v8.tgz)
| ParaCrawl corpus | 1,298,103 | [WMT18](https://s3.amazonaws.com/web-language-models/paracrawl/release1/paracrawl-release1.en-et.zipporah0-dedup-clean.tgz)
| Rapid corpus | 226,978 | [WMT18](http://data.statmt.org/wmt18/translation-task/rapid2016.tgz)

## Development

For the development data, the SGML tags need to be stripped. For this purpose, the
`input-from-sgm.perl` script from [Moses](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/input-from-sgm.perl)
was used, as sketched below. The resulting development set was uploaded to this GitHub repository.
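
The conversion looks roughly like this (the `.sgm` input file names follow the usual WMT naming convention and are an assumption here; the script reads SGML from stdin and writes one sentence per line to stdout):

```bash
./input-from-sgm.perl < newsdev2018-enet-src.en.sgm > newsdev2018-enet-src.en
./input-from-sgm.perl < newsdev2018-enet-ref.et.sgm > newsdev2018-enet-ref.et
```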

| Data set | Sentences | Download
| ---------------- | --------- | --------------------------------------------------------------------------------------------
| Development | 2,000 | via [GitHub](https://github.com/stefan-it/nmt-en-et/raw/master/data/newsdev2018-enet.tar.gz)

## Summary

The following table summarizes the training and development sets:

| Data set | Sentences
| ---------------- | ---------
| Training (all) | 2,178,025
| Development | 2,000

# *fairseq*

The first NMT system for English to Estonian is built with [*fairseq*](https://github.com/facebookresearch/fairseq).
We trained one system with a fully convolutional encoder & decoder.

## Preparation

### Training data

Download the training corpora:

```bash
wget "http://data.statmt.org/wmt18/translation-task/training-parallel-ep-v8.tgz"
wget "https://s3.amazonaws.com/web-language-models/paracrawl/release1/paracrawl-release1.en-et.zipporah0-dedup-clean.tgz"
wget "http://data.statmt.org/wmt18/translation-task/rapid2016.tgz"
```

Extract them:

```bash
tar -xzf paracrawl-release1.en-et.zipporah0-dedup-clean.tgz
tar -xzf rapid2016.tgz
tar -xzf training-parallel-ep-v8.tgz
```

Combine the training corpora into a single corpus per language (source and target):

```bash
cat paracrawl-release1.en-et.zipporah0-dedup-clean.en rapid2016.en-et.en \
training/europarl-v8.et-en.en > training.en

cat paracrawl-release1.en-et.zipporah0-dedup-clean.et rapid2016.en-et.et \
training/europarl-v8.et-en.et > training.et
```
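
Optionally, verify that both combined files are parallel and contain the expected 2,178,025 lines each:

```bash
wc -l training.en training.et
```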

Then the `data_preparation.sh` script tokenizes the training corpus and applies
BPE to it:

```bash
cd scripts
./data_preparation.sh ../training.en ../training.et
cd ..
```
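
Judging from the output file names, the script performs tokenization, corpus cleaning and BPE with 32,000 merge operations. A rough sketch of such a pipeline is shown below; the concrete tools and options are assumptions, the actual commands live in `scripts/data_preparation.sh`:

```bash
# Sketch only: tool choice and options are assumptions, not taken from the repository.

# 1. Tokenize both sides with the Moses tokenizer.
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < corpus.en > corpus.tok.en
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l et < corpus.et > corpus.tok.et

# 2. Drop empty and overly long sentence pairs.
perl mosesdecoder/scripts/training/clean-corpus-n.perl corpus.tok en et corpus.clean 1 80

# 3. Learn a joint BPE model with 32,000 merge operations and apply it to both sides.
cat corpus.clean.en corpus.clean.et | subword-nmt learn-bpe -s 32000 > bpe.32000
subword-nmt apply-bpe -c bpe.32000 < corpus.clean.en > corpus.clean.bpe.32000.en
subword-nmt apply-bpe -c bpe.32000 < corpus.clean.et > corpus.clean.bpe.32000.et
```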

Finally, the folder structure for `fairseq` for the training corpus needs to
be created:

```bash
mkdir train
cp scripts/corpus.clean.bpe.32000.en train/train.en
cp scripts/corpus.clean.bpe.32000.et train/train.et
```

### Development data

The development data needs to be extracted first:

```bash
cp data/newsdev2018-enet.tar.gz .
tar -xzf newsdev2018-enet.tar.gz
```

Then the development data can also be tokenized and preprocessed:

```bash
cd scripts
rm -f bpe.32000 corpus.* vocab.*
./data_preparation.sh ../newsdev2018-enet-src.en ../newsdev2018-enet-ref.et
cd ..
```

Finally, the folder structure for `fairseq` for the development data needs
to be created:

```bash
mkdir dev
cp scripts/corpus.bpe.32000.en dev/dev.en
cp scripts/corpus.bpe.32000.et dev/dev.et
```

## Training

The following `fairseq` command executes all necessary pre-training steps
(like generating the vocabulary):

```bash
mkdir model-data
fairseq preprocess -sourcelang en -targetlang et -trainpref train/train \
-validpref dev/dev -testpref dev/dev -thresholdsrc 3 \
-thresholdtgt 3 -destdir model-data
```

After that, a new model can be trained. In this example we use a fully
convolutional encoder & decoder model:

```bash
mkdir model-fconv
fairseq train -sourcelang en -targetlang et -datadir model-data -model fconv \
-nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 \
-clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir model-fconv
```

## Decoding
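
A decoding run with the Lua *fairseq* tooling could look roughly like the following. This is a sketch only: the checkpoint name, beam size and output handling are assumptions:

```bash
# Optional: optimize the fully convolutional model for faster generation.
fairseq optimize-fconv -input_model model-fconv/model_best.th7 \
                       -output_model model-fconv/model_best_opt.th7

# Translate the BPE-encoded development set line by line.
fairseq generate-lines -sourcedict model-data/dict.en.th7 \
                       -targetdict model-data/dict.et.th7 \
                       -path model-fconv/model_best_opt.th7 -beam 5 \
                       < dev/dev.en > generation.out

# Keep only the hypothesis lines (prefixed with "H", assuming the usual
# S/O/H/A output format) and undo the BPE segmentation.
grep '^H' generation.out | cut -f3 | sed 's/@@ //g' > dev.hyp.et
```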

## Calculating the BLEU-score
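
One way to score the decoded development set is the `multi-bleu.perl` script from Moses; a sketch, assuming the hypotheses from the decoding step above were written to `dev.hyp.et`:

```bash
# Undo the BPE segmentation on the tokenized reference side.
sed 's/@@ //g' dev/dev.et > dev.ref.tok.et

# Compute case-sensitive BLEU on tokenized text.
perl mosesdecoder/scripts/generic/multi-bleu.perl dev.ref.tok.et < dev.hyp.et
```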

## Results

# *tensor2tensor* - Transformer

Another NMT system for English-Estonian is built with the [*tensor2tensor*](https://github.com/tensorflow/tensor2tensor)
library.

## Results

### Model A

Model A is trained on a *DGX-1* with 8 NVIDIA P100 GPUs for 30,000 steps.
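
A training run of this kind could be launched roughly as follows; the registered problem name is a placeholder for a repository-specific problem definition (there is no built-in English-Estonian problem in *tensor2tensor*), and the hparams set is an assumption:

```bash
t2t-trainer \
  --data_dir=t2t_data \
  --output_dir=t2t_train/model-a \
  --problem=translate_enet_wmt18 \
  --model=transformer \
  --hparams_set=transformer_base \
  --train_steps=30000 \
  --worker_gpu=8
```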

The *TensorBoard* figure below shows the training statistics:

![TensorBoard English-Estonian](figures/t2t-model-a.png)

The BLEU-scores obtained on the development set for this model are:

| Model | BLEU-score
| ----- | ----------
| A | 16.32 (uncased)
| A | 15.98 (cased)
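
Uncased and cased BLEU for a decoded file can be computed with the `t2t-bleu` tool; a sketch, where the file names and the problem name are placeholders:

```bash
t2t-decoder \
  --data_dir=t2t_data \
  --output_dir=t2t_train/model-a \
  --problem=translate_enet_wmt18 \
  --model=transformer \
  --hparams_set=transformer_base \
  --decode_from_file=newsdev2018-enet-src.en \
  --decode_to_file=newsdev2018-enet.model-a.et

t2t-bleu --translation=newsdev2018-enet.model-a.et \
         --reference=newsdev2018-enet-ref.et
```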

# Acknowledgments

We would like to thank the *Leibniz-Rechenzentrum der Bayerischen Akademie der
Wissenschaften* ([LRZ](https://www.lrz.de/english/)) for giving us access to the
NVIDIA *DGX-1* supercomputer.