## Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

This is the implementation of our [paper](https://arxiv.org/abs/2203.08394):

```
Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
Zhiwei He*, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu
ACL 2022 (long paper, main conference)
```

This code is heavily based on the original code of [XLM](https://github.com/facebookresearch/XLM) and [MASS](https://github.com/microsoft/MASS).

## Dependencies

* Python 3

* PyTorch 1.7.1

```shell
# the +cu110 wheels are hosted on the PyTorch index, not on PyPI
pip3 install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
```

* fastBPE

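The original instructions do not pin an install method for fastBPE; a common route (an assumption here, not from the original README) is to build the command-line `fast` binary from the standard [fastBPE](https://github.com/glample/fastBPE) repository, which is what XLM's tooling uses:

```shell
git clone https://github.com/glample/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
```
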
* Apex

```shell
git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard 0c2c6ee
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
```
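
As an optional sanity check (not part of the original instructions), you can confirm that the CUDA build of PyTorch is active and that apex imports correctly:

```shell
# should print the torch version and True if a GPU is visible
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# apex's mixed-precision module should import without errors
python3 -c "from apex import amp; print('apex OK')"
```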

## Data preparation

We prepared the data following the instructions from [XLM (Section III)](https://github.com/facebookresearch/XLM/blob/main/README.md#iii-applications-supervised--unsupervised-mt) and used their released scripts, BPE codes and vocabularies. However, our setup differs from theirs in a few ways:

* All available data is used, not just 5,000,000 sentences per language

* For Romanian, we augment it with the monolingual data from WMT16.

* Noisy sentences are removed:

```shell
python3 filter_noisy_data.py --input all.en --lang en --output clean.en
```
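
The same filtering can be run once per language; for example, for an English-German setup (file names follow the example above):

```shell
# filter both sides of the monolingual data
for lg in en de; do
    python3 filter_noisy_data.py --input all.$lg --lang $lg --output clean.$lg
done
```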

* For English-German, we used the processed data provided by [KaiTao Song](https://github.com/StillKeepTry).

Considering that it can take a very long time to prepare the data, we provide the processed data for download:

* [English-French](https://drive.google.com/file/d/15OBlFMjuwkbaY47xWdPysMyfpB-CqVoC/view?usp=sharing)

* [English-German](https://drive.google.com/file/d/1W-ngJpUvfRwSmWAUR2GZejMHBlRCMjfS/view?usp=sharing)

* [English-Romanian](https://drive.google.com/file/d/1fTP7PIbebewoLZD1rShFManED9cMysrV/view?usp=sharing)
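
These are Google Drive links; one way to fetch them from a shell (an assumption, not part of the original instructions) is the `gdown` tool, passing the file ID embedded in each URL:

```shell
pip3 install gdown
# file ID taken from the English-French link above
gdown 15OBlFMjuwkbaY47xWdPysMyfpB-CqVoC
```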

## Pre-trained models

We adopted the released [XLM](https://github.com/facebookresearch/XLM) and [MASS](https://github.com/microsoft/MASS) models for all language pairs. In order to better reproduce the results for MASS on En-De, we used monolingual data to continue pre-training the MASS pre-trained model for 300 epochs and selected the best model (epoch@270) by perplexity (PPL) on the validation set.

Here are the pre-trained models we used:

| Languages | XLM | MASS |
| :--------------- | :----------------------------------------------------------: | :----------------------------------------------------------: |
| English-French | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth) | [Model](https://drive.google.com/file/d/1St5fFGnjv74Ikj_5GoVJEA8jPOjtavws/view?usp=sharing) |
| English-German | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_ende_1024.pth) | [Model](https://drive.google.com/file/d/13feylC1qFvG8kcNi-9JXVnzEYo0OouRK/view?usp=sharing) |
| English-Romanian | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_enro_1024.pth) | [Model](https://drive.google.com/file/d/1itUQvBgogjWE9P6H8yXfUSDsCBTK349W/view?usp=sharing) |
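
The XLM checkpoints above are plain HTTP links and can be downloaded directly with `wget` (the MASS checkpoints are Google Drive links and can be fetched with `gdown` as in the data section), for example:

```shell
wget https://dl.fbaipublicfiles.com/XLM/mlm_ende_1024.pth
```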

## Model training

We provide training scripts and trained models for the UNMT baseline and for our approach with online self-training.

**Training scripts**

Train UNMT model with online self-training and XLM initialization:

```shell
cd scripts
sh run-xlm-unmt-st-ende.sh
```

**Note:** remember to modify the path variables in the header of the shell script.

**Trained models**

We selected the best model by BLEU score on the validation set for both directions. Therefore, we release En-X and X-En models for each experiment.


Approch
XLM
MASS


UNMT
En-Fr
Fr-En
En-Fr
Fr-En


En-De
De-En
En-De
De-En


En-Ro
Ro-En
En-Ro
Ro-En


UNMT-ST
En-Fr
Fr-En
En-Fr
Fr-En


En-De
De-En
En-De
De-En


En-Ro
Ro-En
En-Ro
Ro-En

## Evaluation

#### Generate translations

Input sentences must have the same tokenization and BPE codes as the ones used in the model.
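
For raw text, that means applying the same preprocessing as the XLM pipeline, i.e. Moses tokenization followed by fastBPE with the released codes and vocabulary. A minimal sketch, where the Moses path and the `codes_ende`/`vocab_ende` file names are assumptions to adapt to your setup:

```shell
# Sketch only: adjust paths and file names to your local setup
MOSES=mosesdecoder/scripts   # assumed location of the Moses scripts
FASTBPE=fastBPE/fast         # fastBPE binary built in the Dependencies step

# tokenize raw English input with the Moses tokenizer
cat input.en.raw | perl $MOSES/tokenizer/tokenizer.perl -l en -no-escape -threads 8 > input.en.tok

# apply the BPE codes/vocabulary used by the model
$FASTBPE applybpe input.en.bpe input.en.tok codes_ende vocab_ende
```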

```shell
cat input.en.bpe | \
python3 translate.py \
--exp_name translate \
--src_lang en --tgt_lang de \
--model_path trained_model.pth \
--output_path output.de.bpe \
--batch_size 8
```

#### Remove BPE

```shell
sed -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok
```

#### Evaluate

```shell
BLEU_SCRIPT_PATH=src/evaluation/multi-bleu.perl
$BLEU_SCRIPT_PATH ref.de.tok < output.de.tok
```

## Citation

```bibtex
@inproceedings{he-etal-2022-bridging,
    title = "Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation",
    author = "He, Zhiwei and
      Wang, Xing and
      Wang, Rui and
      Shi, Shuming and
      Tu, Zhaopeng",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    year = "2022"
}
```