https://github.com/thunlp-mt/plm4mt

Code for our work "MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators" in ACL 2022

# PLM4MT

This is the code for our ACL 2022 work [MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators](http://arxiv.org/abs/2110.06609). The implementation is on top of the open-source NMT toolkit [THUMT](https://github.com/THUNLP-MT/THUMT).

## Contents

* [Prerequisites](#prerequisites)
* [mGPT](#mgpt)
* [Format](#format)
* [Training](#training)
* [Decoding](#decoding)
* [Postprocessing](#postprocessing)
* [License](#license)
* [Citation](#citation)

## Prerequisites

* Python >= 3.7
* tensorflow-cpu >= 2.0
* torch >= 1.7
* transformers

Please read the [THUMT documentation](https://github.com/THUNLP-MT/THUMT/blob/master/docs/index.md) before using this repository.

## mGPT
You can download the mGPT checkpoint at [this url](https://huggingface.co/THUMT/mGPT).

## Format

Each training line contains a source sentence and its target sentence, separated by a special separator token. For the WMT14 En-De dataset, the training file contains lines in the following format:

```
Graphical artwork, corporate identity and corporate design. Grafische Gestaltung, Layout, Corporate Identity und Corporate Design.
```

A tag indicating the source language may be placed at the start of the line; it can be omitted.

For inference, the test set contains lines like:

```
Gutach: Increased safety for pedestrians
```
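A training file in this format can be assembled from two line-aligned files. The following is a minimal sketch and is not part of this repository: `make_training_line` and `make_training_lines` are hypothetical helpers, the separator and the optional source-language tag are passed in as arguments (their literal values must match what the model expects), and joining the pieces with single spaces is an assumption.

```python
from typing import Iterable, Iterator, Optional


def make_training_line(source: str, target: str, separator: str,
                       source_tag: Optional[str] = None) -> str:
    """Join one sentence pair into a single training line.

    The optional source-language tag goes first, then the source sentence,
    the separator, and the target sentence.
    """
    parts = []
    if source_tag:
        parts.append(source_tag)
    parts.extend([source.strip(), separator, target.strip()])
    return " ".join(parts)


def make_training_lines(sources: Iterable[str], targets: Iterable[str],
                        separator: str,
                        source_tag: Optional[str] = None) -> Iterator[str]:
    """Yield one training line per aligned (source, target) pair."""
    sources, targets = list(sources), list(targets)
    if len(sources) != len(targets):
        raise ValueError("source and target files must have the same number of lines")
    for src, tgt in zip(sources, targets):
        yield make_training_line(src, tgt, separator, source_tag)
```

For inference, only the source side is written, so the test file can be produced by writing each source sentence (with the same optional tag) on its own line.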

## Training

Use the following command to train a prompt for translation:

```bash
CODES=<path-to-code>
CKPT=<path-to-mgpt-checkpoint>
export PYTHONPATH=$CODES:$PYTHONPATH

export USE_TF=0
export USE_TORCH=1

python $CODES/thumt/bin/trainer.py \
--half \
--input <path-to-training-file> \
--model <model_name> \
--ptm $CKPT \
--parameters=device_list=[0,1,2,3,4,5,6,7],\
train_steps=40000,update_cycle=16,batch_size=256,\
save_checkpoint_steps=2000,max_length=256 \
--hparam_set base
```

Here `model_name` has the following three options:

* `mgpt_prompt`: mGPT with Prompt tuning
* `mgpt_prefix`: mGPT with Prefix-tuning
* `mgpt_msp`: mGPT with multi-stage prompting
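When launching several runs that differ only in the model name or hyperparameters, it can help to assemble the command programmatically. The sketch below is a hypothetical wrapper, not part of this repository: `format_parameters` and `build_trainer_args` are illustrative names, and the only flags used are the ones shown in the training command above.

```python
import shlex
from typing import Dict, List

# The three model names listed above.
MODEL_NAMES = ("mgpt_prompt", "mgpt_prefix", "mgpt_msp")


def format_parameters(params: Dict[str, object]) -> str:
    """Render overrides as the comma-separated key=value string for --parameters."""
    def render(value: object) -> str:
        # Lists such as device_list are written as [0,1,2] without spaces.
        if isinstance(value, (list, tuple)):
            return "[" + ",".join(str(v) for v in value) + "]"
        return str(value)

    return ",".join(f"{key}={render(value)}" for key, value in params.items())


def build_trainer_args(codes_dir: str, input_path: str, model_name: str,
                       checkpoint: str, params: Dict[str, object]) -> List[str]:
    """Build the argument list for thumt/bin/trainer.py."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"model_name must be one of {MODEL_NAMES}, got {model_name!r}")
    return [
        "python", f"{codes_dir}/thumt/bin/trainer.py",
        "--half",
        "--input", input_path,
        "--model", model_name,
        "--ptm", checkpoint,
        f"--parameters={format_parameters(params)}",
        "--hparam_set", "base",
    ]


if __name__ == "__main__":
    # The paths below are placeholders for illustration only.
    args = build_trainer_args(
        "/path/to/code", "train.txt", "mgpt_msp", "/path/to/mgpt",
        {"device_list": [0, 1], "train_steps": 40000, "batch_size": 256},
    )
    print(" ".join(shlex.quote(a) for a in args))
```

The resulting list can be passed to `subprocess.run` after setting `PYTHONPATH`, `USE_TF`, and `USE_TORCH` in the environment as in the shell command.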

## Decoding

The following command decodes an input file:
```bash
CODES=<path-to-code>
export PYTHONPATH=$CODES:$PYTHONPATH

python $CODES/thumt/bin/translator.py \
--input <path-to-input-file> \
--ptm <path-to-mgpt-checkpoint> \
--output <path-to-output-file> \
--model <model_name> \
--half --prefix \
--parameters=device_list=[0,1,2,3],\
decode_alpha=0.0,\
decode_batch_size=4,\
prompt_length=128
```
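The command sets `decode_alpha=0.0`. Assuming `decode_alpha` is the exponent of a GNMT-style length penalty, as is common in beam-search decoders (an assumption here, not something this README states), a value of `0.0` turns length normalization off. The sketch below illustrates that formula; `length_penalty` and `normalized_score` are illustrative names, not functions from this repository.

```python
def length_penalty(length: int, alpha: float) -> float:
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob: float, length: int, alpha: float) -> float:
    """Hypothesis score after dividing the log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# With alpha = 0.0 the penalty is 1 for every length, so hypotheses are
# ranked by their raw log-probability.
print(length_penalty(10, 0.0))
print(normalized_score(-4.0, 20, 0.0))
```

Under this assumption, larger values of the exponent favor longer hypotheses, because the negative log-probability is divided by a larger number.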

## Postprocessing

We use `tools/punc.cpp` to replace punctuation marks in Chinese text. Use the following command to compile the code:

```bash
g++ -std=c++11 -o punc tools/punc.cpp
```

Then use the following command to replace the punctuation:

```bash
cat <input-file> | ./punc > <output-file>
```

## License

This project is released under the [BSD-3-Clause](https://opensource.org/licenses/BSD-3-Clause) license, which allows free use for research purposes.

## Citation

```
@article{tan2021msp,
title={{MSP}: Multi-stage prompting for making pre-trained language models better translators},
author={Tan, Zhixing and Zhang, Xiangwen and Wang, Shuo and Liu, Yang},
journal={arXiv preprint arXiv:2110.06609},
year={2021}
}
```