https://github.com/thunlp-mt/uce4bt

Host: GitHub
URL: https://github.com/thunlp-mt/uce4bt
Owner: THUNLP-MT
License: bsd-3-clause
Created: 2019-11-06T03:03:09.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-04-25T14:02:14.000Z (about 6 years ago)
Last Synced: 2025-03-28T10:54:19.565Z (over 1 year ago)
Language: Python
Size: 298 KB
Stars: 19
Watchers: 3
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Improving Back-Translation with Uncertainty-based Confidence Estimation

## Contents

* [Introduction](#introduction)

* [Prerequisites](#prerequisites)

* [Usage](#usage)

* [Contact](#contact)

## Introduction

This repository contains the implementation of our paper, Improving Back-Translation with Uncertainty-based Confidence Estimation (EMNLP 2019).

@inproceedings{Wang:2019:EMNLP,
    title = "Improving Back-Translation with Uncertainty-based Confidence Estimation",
    author = "Wang, Shuo and Liu, Yang and Wang, Chao and Luan, Huanbo and Sun, Maosong",
    booktitle = "EMNLP",
    year = "2019"
}

The implementation is on top of [THUMT](https://github.com/thumt/THUMT).

## Prerequisites

This repository runs in the same environment as THUMT. Please refer to the THUMT user manual to configure the environment.

## Usage

Note: The usage is not yet user-friendly and may be improved later.

Suppose the local path to this repository is CODE_DIR.

1. Standard training:

python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--side none \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]


To train a target-to-source translation model, swap the source and target corpora and the source and target vocabularies.

2. Translate target-side monolingual corpus:

python [CODE_DIR]/thumt/bin/translator.py \
	--input [monolingual corpus] \
	--output [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=device_list=[0]


If the monolingual corpus is too large, we recommend splitting it into smaller corpora before translation.
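
The splitting itself is not part of this repository. A minimal sketch in Python, assuming the corpus holds one sentence per line, could look like the following. The function name `split_corpus` and the `.partNNNN` naming scheme are hypothetical choices for illustration.

```python
from pathlib import Path


def split_corpus(corpus_path, lines_per_shard, out_dir):
    """Split a one-sentence-per-line corpus into numbered shard files.

    Shards are named <corpus name>.part0000, .part0001, ... (a hypothetical
    naming scheme) so that sorting the names restores the original order.
    Returns the list of shard paths in order.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")
    corpus_path = Path(corpus_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shard_paths = []
    shard = None
    with corpus_path.open(encoding="utf-8") as corpus:
        for index, line in enumerate(corpus):
            # Start a new shard every `lines_per_shard` lines.
            if index % lines_per_shard == 0:
                if shard is not None:
                    shard.close()
                name = f"{corpus_path.name}.part{len(shard_paths):04d}"
                path = out_dir / name
                shard = path.open("w", encoding="utf-8")
                shard_paths.append(path)
            shard.write(line)
    if shard is not None:
        shard.close()
    return shard_paths
```

Each shard can then be passed as `--input` to `translator.py`, and the translated outputs concatenated in shard order to recover a corpus aligned with the original.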

3. Uncertainty estimation for the translated corpus:

python [CODE_DIR]/thumt/bin/scorer.py \
	--input [monolingual corpus] [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--mean_file [word-level mean] \
	--var_file [word-level var] \
	--rv_file [word-level var/mean] \
	--sen_mean [sentence-level mean] \
	--sen_var [sentence-level var] \
	--sen_rv [sentence-level var/mean] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=model_uncertainty=true,device_list=[0]
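
The output flags above name three statistics per word: mean, variance, and variance divided by mean. How `scorer.py` computes them is not described here. The sketch below only illustrates the arithmetic, under the assumption that each word's probability is sampled several times from stochastic forward passes (for example with dropout left on). The function names and the use of population variance are illustrative assumptions, not the repository's API.

```python
from statistics import fmean, pvariance


def word_uncertainty(samples):
    """Summarize K sampled probabilities of one word.

    `samples` holds the probability assigned to the word in each of K
    stochastic forward passes (an assumed setup, not taken from scorer.py).
    Returns (mean, variance, variance / mean).
    """
    if not samples:
        raise ValueError("need at least one sample")
    mean = fmean(samples)
    # Population variance over the K samples (an assumption; the sample
    # variance would divide by K - 1 instead).
    var = pvariance(samples, mu=mean)
    ratio = var / mean if mean > 0 else float("inf")
    return mean, var, ratio


def sentence_word_uncertainty(samples_per_word):
    """Apply word_uncertainty to every word of one sentence."""
    return [word_uncertainty(samples) for samples in samples_per_word]
```

A word whose sampled probabilities agree across passes gets a variance near zero, while a word the model is unsure about gets a larger variance relative to its mean.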



4. Confidence-aware training:

python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--word_confidence [word-level uncertainty file] \
	--sen_confidence [sentence-level uncertainty file] \
	--side source_sentence_source_word \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--checkpoint [path to the source-target checkpoint] \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]
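
Before starting a long training run, it may be worth checking that the word-level uncertainty file lines up with the corpus it describes. The sketch below assumes that the file stores one whitespace-separated value per token, one sentence per line. That format is an assumption and should be verified against the actual output of `scorer.py`. The helper name `find_misaligned_lines` is hypothetical.

```python
from itertools import zip_longest


def find_misaligned_lines(corpus_path, score_path):
    """Return 1-based line numbers where token and score counts differ.

    Assumes (not confirmed by the repository) that both files are
    whitespace-tokenized with one sentence per line.
    Raises ValueError if the files have different numbers of lines.
    """
    mismatches = []
    with open(corpus_path, encoding="utf-8") as corpus, \
            open(score_path, encoding="utf-8") as scores:
        pairs = zip_longest(corpus, scores)
        for number, (sentence, values) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None.
            if sentence is None or values is None:
                raise ValueError(f"files differ in length at line {number}")
            if len(sentence.split()) != len(values.split()):
                mismatches.append(number)
    return mismatches
```

An empty result means every line has as many values as tokens under the assumed format.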



## Contact

If you have questions, suggestions, or bug reports, please email [wangshuo18@mails.tsinghua.edu.cn](mailto:wangshuo18@mails.tsinghua.edu.cn).
