https://github.com/thunlp-mt/uce4bt



# Improving Back-Translation with Uncertainty-based Confidence Estimation
## Contents
* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [Usage](#usage)
* [Contact](#contact)

## Introduction

This repository contains the implementation of our paper, "Improving Back-Translation with Uncertainty-based Confidence Estimation":

@inproceedings{Wang:2019:EMNLP,
title = "Improving Back-Translation with Uncertainty-based Confidence Estimation",
author = "Wang, Shuo and Liu, Yang and Wang, Chao and Luan, Huanbo and Sun, Maosong",
booktitle = "EMNLP",
year = "2019"
}

The implementation is on top of [THUMT](https://github.com/thumt/THUMT).

## Prerequisites
This repository runs in the same environment as THUMT. Please refer to the THUMT user manual to configure the environment.

## Usage
Note: The usage is not yet user-friendly and may be improved later.


Suppose the local path to this repository is CODE_DIR.

1. Standard training:

python [CODE_DIR]/thumt/bin/trainer.py \
--input [source corpus] [target corpus] \
--side none \
--vocabulary [source vocabulary] [target vocabulary] \
--model transformer \
--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]

To train a target-to-source translation model, swap the source and target corpora and the source and target vocabularies.
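
As a rough illustration of the swap, the sketch below assembles the `trainer.py` argument list for either direction. The helper name `build_train_command` and the example file names are hypothetical and not part of this repository; only the flags shown in step 1 are used.

```python
def build_train_command(code_dir, src_corpus, tgt_corpus, src_vocab, tgt_vocab,
                        reverse=False):
    """Return the trainer.py argument list; reverse=True swaps the direction.

    Hypothetical helper for illustration; it only rearranges the arguments
    documented in step 1.
    """
    if reverse:
        # Target-to-source model: exchange corpora and vocabularies.
        src_corpus, tgt_corpus = tgt_corpus, src_corpus
        src_vocab, tgt_vocab = tgt_vocab, src_vocab
    return [
        "python", f"{code_dir}/thumt/bin/trainer.py",
        "--input", src_corpus, tgt_corpus,
        "--side", "none",
        "--vocabulary", src_vocab, tgt_vocab,
        "--model", "transformer",
        "--parameters=train_steps=60000,constant_batch_size=false,"
        "batch_size=6250,device_list=[0,1,2,3]",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_train_command("CODE_DIR", "train.src", "train.tgt",
                              "vocab.src", "vocab.tgt", reverse=True)
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` if you prefer to launch training from a script.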

2. Translate target-side monolingual corpus:

python [CODE_DIR]/thumt/bin/translator.py \
--input [monolingual corpus] \
--output [translated corpus] \
--vocabulary [target vocabulary] [source vocabulary] \
--model transformer \
--checkpoint [path to the target-source model] \
--parameters=device_list=[0]

We recommend splitting the monolingual corpus into smaller files before translation if it is too large.
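
One possible way to do the split is sketched below, assuming the corpus holds one sentence per line. The function name `split_corpus` and the `shard.NNNN` naming scheme are illustrative choices, not part of this repository.

```python
import os


def split_corpus(path, out_dir, lines_per_shard):
    """Split a one-sentence-per-line file into numbered shards.

    Returns the shard paths in order. Hypothetical helper for illustration.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_shard == 0:
                # Start a new shard every lines_per_shard lines.
                if out is not None:
                    out.close()
                shard_path = os.path.join(
                    out_dir, f"shard.{len(shard_paths):04d}")
                shard_paths.append(shard_path)
                out = open(shard_path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return shard_paths
```

Each shard can then be translated with the command from step 2. Concatenate the translated shards in the same order, so that the translated corpus stays line-aligned with the monolingual corpus for step 3.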

3. Uncertainty estimation for the translated corpus:

python [CODE_DIR]/thumt/bin/scorer.py \
--input [monolingual corpus] [translated corpus] \
--vocabulary [target vocabulary] [source vocabulary] \
--mean_file [word-level mean] \
--var_file [word-level var] \
--rv_file [word-level var/mean] \
--sen_mean [sentence-level mean] \
--sen_var [sentence-level var] \
--sen_rv [sentence-level var/mean] \
--model transformer \
--checkpoint [path to the target-source model] \
--parameters=model_uncertainty=true,device_list=[0]
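
To make the three kinds of output (mean, var, var/mean) concrete, here is a minimal sketch of how such statistics could be computed from several stochastic forward passes, for example with Monte Carlo dropout. It is not the code in `scorer.py`. The input layout (`samples[k][j]` is the probability of word `j` in pass `k`), the use of population variance, and the product of word probabilities as the sentence probability are all assumptions made for illustration.

```python
import math
from statistics import fmean, pvariance


def _ratio(var, mean):
    # var/mean, guarding against a zero mean.
    return var / mean if mean > 0 else float("inf")


def word_level_stats(samples):
    """Per-word mean, variance, and var/mean over stochastic passes.

    samples[k][j] is the (assumed) probability of word j in pass k.
    """
    means, variances, ratios = [], [], []
    for probs in zip(*samples):  # iterate over word positions
        m = fmean(probs)
        v = pvariance(probs)  # population variance (an assumption)
        means.append(m)
        variances.append(v)
        ratios.append(_ratio(v, m))
    return means, variances, ratios


def sentence_level_stats(samples):
    """Mean, variance, and var/mean of the sentence probability.

    The sentence probability of a pass is taken as the product of its
    word probabilities (an assumption).
    """
    sent_probs = [math.prod(probs) for probs in samples]
    m = fmean(sent_probs)
    v = pvariance(sent_probs)
    return m, v, _ratio(v, m)
```

Under these assumptions, a word whose probability changes a lot between passes gets a high variance, which signals a less reliable translation.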

4. Confidence-aware training:

python [CODE_DIR]/thumt/bin/trainer.py \
--input [source corpus] [target corpus] \
--word_confidence [word-level uncertainty file] \
--sen_confidence [sentence-level uncertainty file] \
--side source_sentence_source_word \
--vocabulary [source vocabulary] [target vocabulary] \
--model transformer \
--checkpoint [path to the source-target checkpoint] \
--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]
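
Before launching confidence-aware training, it can help to check that the word-level uncertainty file lines up with the source corpus. The sketch below assumes a format that this README does not specify: one line per source sentence, holding one whitespace-separated number per source token. The function name `find_misaligned_lines` is hypothetical.

```python
def find_misaligned_lines(corpus_lines, confidence_lines):
    """Return 0-based indices where the value count differs from the token count.

    Assumes one whitespace-separated confidence value per source token
    (an assumption about the file format, not a documented guarantee).
    """
    if len(corpus_lines) != len(confidence_lines):
        raise ValueError(
            f"line count mismatch: {len(corpus_lines)} sentences vs "
            f"{len(confidence_lines)} confidence lines")
    bad = []
    for i, (sent, conf) in enumerate(zip(corpus_lines, confidence_lines)):
        # float() raises ValueError if a field is not numeric.
        values = [float(x) for x in conf.split()]
        if len(values) != len(sent.split()):
            bad.append(i)
    return bad
```

If the real files carry an extra value per line (for instance, for an end-of-sentence token), adjust the comparison accordingly.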

## Contact

If you have questions, suggestions, or bug reports, please email [wangshuo18@mails.tsinghua.edu.cn](mailto:wangshuo18@mails.tsinghua.edu.cn).