Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zwhe99/selftraining4unmt
Implementation of our ACL 2022 paper "Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation"
machine-translation nlp unsupervised-learning
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/zwhe99/selftraining4unmt
- Owner: zwhe99
- Created: 2022-03-07T03:41:52.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-10-06T02:51:29.000Z (over 1 year ago)
- Last Synced: 2023-10-06T03:29:48.243Z (over 1 year ago)
- Topics: machine-translation, nlp, unsupervised-learning
- Language: Python
- Homepage: https://arxiv.org/abs/2203.08394
- Size: 186 KB
- Stars: 29
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
This is the implementation of our [paper](https://arxiv.org/abs/2203.08394):
```
Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
Zhiwei He*, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu
ACL 2022 (long paper, main conference)
```

This code is heavily based on the original implementations of [XLM](https://github.com/facebookresearch/XLM) and [MASS](https://github.com/microsoft/MASS).
## Dependencies
* Python 3
* PyTorch 1.7.1
```shell
# the -f index is needed to resolve the +cu110 (CUDA 11.0) wheel
pip3 install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
```
* fastBPE (a build sketch follows this list)
* Apex
```shell
git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard 0c2c6ee
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
```
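fastBPE is not installed by the commands above; a minimal way to build it, assuming the standard layout of the [fastBPE](https://github.com/glample/fastBPE) repository:

```shell
# Build the fastBPE binary (the `fast` executable used to apply BPE codes)
git clone https://github.com/glample/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
```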
## Data preparation

We prepared the data following the instructions from [XLM (Section III)](https://github.com/facebookresearch/XLM/blob/main/README.md#iii-applications-supervised--unsupervised-mt), using their released scripts, BPE codes, and vocabularies. However, our setup differs from theirs in a few ways (a rough pipeline sketch follows the list):
* All available data is used, not just 5,000,000 sentences per language
* For Romanian, we augment it with the monolingual data from WMT16.
* Noisy sentences are removed:
```shell
python3 filter_noisy_data.py --input all.en --lang en --output clean.en
```
* For English-German, we used the processed data provided by [KaiTao Song](https://github.com/StillKeepTry).
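Concretely, the preparation looks roughly like the sketch below. It assumes XLM's `get-data-nmt.sh` script together with their released BPE codes and vocabulary files (shown here for En-Fr); exact file names depend on your checkout, and the noisy-sentence filtering above is applied to the raw monolingual text before binarization.

```shell
# Rough sketch (En-Fr as an example): download, tokenize and BPE-encode the data
# with XLM's script, reusing the released BPE codes and vocabulary
cd XLM
./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
```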
Considering that it can take a very long time to prepare the data, we provide the processed data for download (a command-line download example follows the list):
* [English-French](https://drive.google.com/file/d/15OBlFMjuwkbaY47xWdPysMyfpB-CqVoC/view?usp=sharing)
* [English-German](https://drive.google.com/file/d/1W-ngJpUvfRwSmWAUR2GZejMHBlRCMjfS/view?usp=sharing)
* [English-Romanian](https://drive.google.com/file/d/1fTP7PIbebewoLZD1rShFManED9cMysrV/view?usp=sharing)
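One convenient way to fetch these archives from the command line is the third-party `gdown` tool (not part of this repository); the file ID below is taken from the English-French link above:

```shell
pip3 install gdown

# Download the processed English-French data from Google Drive by its file ID
gdown "https://drive.google.com/uc?id=15OBlFMjuwkbaY47xWdPysMyfpB-CqVoC"
```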
## Pre-trained models

We adopted the released [XLM](https://github.com/facebookresearch/XLM) and [MASS](https://github.com/microsoft/MASS) models for all language pairs. To better reproduce the MASS results on En-De, we continued pre-training the released MASS model on monolingual data for 300 epochs and selected the best checkpoint (epoch 270) by perplexity (PPL) on the validation set.

Here are the pre-trained models we used (a download example follows the table):
| Languages | XLM | MASS |
| :--------------- | :----------------------------------------------------------: | :----------------------------------------------------------: |
| English-French | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth) | [Model](https://drive.google.com/file/d/1St5fFGnjv74Ikj_5GoVJEA8jPOjtavws/view?usp=sharing) |
| English-German | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_ende_1024.pth) | [Model](https://drive.google.com/file/d/13feylC1qFvG8kcNi-9JXVnzEYo0OouRK/view?usp=sharing) |
| English-Romanian | [Model](https://dl.fbaipublicfiles.com/XLM/mlm_enro_1024.pth) | [Model](https://drive.google.com/file/d/1itUQvBgogjWE9P6H8yXfUSDsCBTK349W/view?usp=sharing) |
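The XLM checkpoints can be fetched directly with `wget`; the MASS checkpoints are hosted on Google Drive and can be downloaded in the same way as the processed data above (e.g. with `gdown`):

```shell
# Example: download the English-French XLM checkpoint
wget https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth
```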
## Model training

We provide training scripts and trained models for the UNMT baseline and for our approach with online self-training (UNMT-ST).
**Training scripts**
Train UNMT model with online self-training and XLM initialization:
```shell
cd scripts
sh run-xlm-unmt-st-ende.sh
```

**Note:** remember to modify the path variables in the header of the shell script (a hypothetical illustration follows).
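The exact variable names are defined in `run-xlm-unmt-st-ende.sh`; the snippet below is only a hypothetical illustration of what such a header typically sets, not the script's actual contents.

```shell
# Hypothetical placeholders -- check run-xlm-unmt-st-ende.sh for the real variable names
DATA_PATH=/path/to/processed/de-en            # binarized En-De data from the preparation step
PRETRAINED_MODEL=/path/to/mlm_ende_1024.pth   # XLM checkpoint used for initialization
DUMP_PATH=/path/to/dumped                     # where checkpoints and logs will be written
```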
**Trained models**

For each experiment, we selected the best model for each direction by BLEU score on the validation set, so we release both En-X and X-En models.
| Approach | XLM (En→X) | XLM (X→En) | MASS (En→X) | MASS (X→En) |
| :------- | :--------: | :--------: | :---------: | :---------: |
| UNMT     | En-Fr | Fr-En | En-Fr | Fr-En |
|          | En-De | De-En | En-De | De-En |
|          | En-Ro | Ro-En | En-Ro | Ro-En |
| UNMT-ST  | En-Fr | Fr-En | En-Fr | Fr-En |
|          | En-De | De-En | En-De | De-En |
|          | En-Ro | Ro-En | En-Ro | Ro-En |
## Evaluation
#### Generate translations
Input sentences must have the same tokenization and BPE codes as those used to train the model.
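For instance, raw English input could be prepared roughly as follows, assuming the Moses `tokenizer.perl` script and the fastBPE `fast` binary built above, with `codes_ende` standing in for the model's BPE codes (adjust names and paths to your setup):

```shell
# Tokenize the raw input with the Moses tokenizer
cat input.en.raw | perl tokenizer.perl -l en > input.en.tok

# Apply the model's BPE codes with fastBPE: fast applybpe <output> <input> <codes>
./fast applybpe input.en.bpe input.en.tok codes_ende
```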
```shell
cat input.en.bpe | \
python3 translate.py \
--exp_name translate \
--src_lang en --tgt_lang de \
--model_path trained_model.pth \
--output_path output.de.bpe \
--batch_size 8
```

#### Remove BPE
```shell
sed -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok
```

#### Evaluate
```shell
BLEU_SCRIPT_PATH=src/evaluation/multi-bleu.perl
$BLEU_SCRIPT_PATH ref.de.tok < output.de.tok
```

## Citation
```bibtex
@inproceedings{he-etal-2022-bridging,
title = "Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation",
author = "He, Zhiwei and
Wang, Xing and
Wang, Rui and
Shi, Shuming and
Tu, Zhaopeng",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    year = "2022"
}
```