https://github.com/thunlp-mt/uce4bt
- Host: GitHub
- URL: https://github.com/thunlp-mt/uce4bt
- Owner: THUNLP-MT
- License: bsd-3-clause
- Created: 2019-11-06T03:03:09.000Z
- Default Branch: master
- Last Pushed: 2020-04-25T14:02:14.000Z
- Last Synced: 2025-03-28T10:54:19.565Z
- Language: Python
- Size: 298 KB
- Stars: 19
- Watchers: 3
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Improving Back-Translation with Uncertainty-based Confidence Estimation
## Contents
* [Introduction](#introduction)
* [Prerequisites](#prerequisites)
* [Usage](#usage)
* [Contact](#contact)
## Introduction
This repository contains the implementation of our paper "Improving Back-Translation with Uncertainty-based Confidence Estimation" (EMNLP 2019):

    @inproceedings{Wang:2019:EMNLP,
        title = "Improving Back-Translation with Uncertainty-based Confidence Estimation",
        author = "Wang, Shuo and Liu, Yang and Wang, Chao and Luan, Huanbo and Sun, Maosong",
        booktitle = "EMNLP",
        year = "2019"
    }

The implementation is built on top of [THUMT](https://github.com/thumt/THUMT).
## Prerequisites
This repository runs in the same environment as THUMT. Please refer to the THUMT user manual to configure the environment.
## Usage
Note: The interface is not yet user-friendly and may be improved later.

In the commands below, `[CODE_DIR]` stands for the local path to this repository.
1. Standard training:

       python [CODE_DIR]/thumt/bin/trainer.py \
         --input [source corpus] [target corpus] \
         --side none \
         --vocabulary [source vocabulary] [target vocabulary] \
         --model transformer \
         --parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]

To train a target-to-source translation model, swap the source and target corpora and the source and target vocabularies.
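The swap described above can be sketched in Python. This is only an illustration: the helper names `build_trainer_cmd` and `build_reverse_trainer_cmd` and the file names in the usage note are hypothetical, while the flags and parameter values are copied from the command in step 1.

```python
def build_trainer_cmd(code_dir, src_corpus, tgt_corpus, src_vocab, tgt_vocab,
                      devices=(0, 1, 2, 3)):
    """Assemble the step-1 trainer command as an argument list.

    Hypothetical helper: only the flags and parameter values come from
    the README; the function itself is not part of the repository.
    """
    device_list = "[" + ",".join(str(d) for d in devices) + "]"
    params = ("train_steps=60000,constant_batch_size=false,"
              "batch_size=6250,device_list=" + device_list)
    return [
        "python", f"{code_dir}/thumt/bin/trainer.py",
        "--input", src_corpus, tgt_corpus,
        "--side", "none",
        "--vocabulary", src_vocab, tgt_vocab,
        "--model", "transformer",
        "--parameters=" + params,
    ]


def build_reverse_trainer_cmd(code_dir, src_corpus, tgt_corpus,
                              src_vocab, tgt_vocab, devices=(0, 1, 2, 3)):
    """Target-to-source direction: swap corpora and vocabularies."""
    return build_trainer_cmd(code_dir, tgt_corpus, src_corpus,
                             tgt_vocab, src_vocab, devices)
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Because no shell is involved, the square brackets in `device_list=[0,1,2,3]` are not subject to shell globbing.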
2. Translate target-side monolingual corpus:

       python [CODE_DIR]/thumt/bin/translator.py \
         --input [monolingual corpus] \
         --output [translated corpus] \
         --vocabulary [target vocabulary] [source vocabulary] \
         --model transformer \
         --checkpoint [path to the target-source model] \
         --parameters=device_list=[0]

If the monolingual corpus is too large, we recommend splitting it into smaller corpora before translation.
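One way to do the splitting is sketched below. It assumes the corpus holds one sentence per line; the function name `split_corpus`, the `.partNNNN` naming scheme, and the default shard size are illustrative choices, not part of this repository.

```python
from pathlib import Path


def split_corpus(corpus_path, out_dir, lines_per_shard=100000):
    """Split a one-sentence-per-line corpus into numbered shard files.

    Returns the shard paths in order, so the translated shards can be
    concatenated in the same order afterwards.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(corpus_path).name
    shards = []
    out = None
    try:
        with open(corpus_path, encoding="utf-8") as src:
            for i, line in enumerate(src):
                # Start a new shard every `lines_per_shard` lines.
                if i % lines_per_shard == 0:
                    if out is not None:
                        out.close()
                    shard_path = out_dir / f"{stem}.part{len(shards):04d}"
                    out = open(shard_path, "w", encoding="utf-8")
                    shards.append(shard_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return shards
```

Each shard can then be passed to `translator.py` as its own `--input`. Concatenating the outputs in shard order keeps the translations aligned line by line with the original monolingual corpus.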
3. Uncertainty estimation for the translated corpus:

       python [CODE_DIR]/thumt/bin/scorer.py \
         --input [monolingual corpus] [translated corpus] \
         --vocabulary [target vocabulary] [source vocabulary] \
         --mean_file [word-level mean] \
         --var_file [word-level var] \
         --rv_file [word-level var/mean] \
         --sen_mean [sentence-level mean] \
         --sen_var [sentence-level var] \
         --sen_rv [sentence-level var/mean] \
         --model transformer \
         --checkpoint [path to the target-source model] \
         --parameters=model_uncertainty=true,device_list=[0]
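The three output files per level hold a mean, a variance, and the ratio var/mean. The sketch below shows how these statistics relate to one another, assuming they are computed from K stochastic forward passes that each assign a probability to the same word or sentence. The sampling setup, the use of population variance, and the function names are assumptions for illustration only. This is not the code in `scorer.py`.

```python
from statistics import fmean, pvariance


def uncertainty_stats(samples):
    """Return (mean, variance, variance / mean) of sampled probabilities.

    `samples` holds K probabilities for the same word or sentence.
    Population variance is an assumption made for this sketch.
    """
    mean = fmean(samples)
    var = pvariance(samples, mu=mean)
    ratio = var / mean if mean > 0 else float("inf")
    return mean, var, ratio


def word_level_stats(sampled_probs):
    """Compute per-token statistics.

    `sampled_probs[k][t]` is the probability of token t in pass k.
    Returns three lists (means, variances, ratios), one value per token.
    """
    per_token = list(zip(*sampled_probs))  # regroup by token position
    stats = [uncertainty_stats(column) for column in per_token]
    means = [s[0] for s in stats]
    variances = [s[1] for s in stats]
    ratios = [s[2] for s in stats]
    return means, variances, ratios
```

Under these assumptions, a token whose probability is stable across passes gets a variance near zero, and so a var/mean ratio near zero, while a token whose probability fluctuates gets a larger ratio.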
4. Confidence-aware training:

       python [CODE_DIR]/thumt/bin/trainer.py \
         --input [source corpus] [target corpus] \
         --word_confidence [word-level uncertainty file] \
         --sen_confidence [sentence-level uncertainty file] \
         --side source_sentence_source_word \
         --vocabulary [source vocabulary] [target vocabulary] \
         --model transformer \
         --checkpoint [path to the source-target checkpoint] \
         --parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]
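Before starting confidence-aware training, it may help to check that the uncertainty files line up with the corpus they describe. The sketch below assumes a file format that this README does not specify: one line per sentence, whitespace-separated scores in the word-level file, and a single score per line in the sentence-level file. The function name and the `extra_tokens` parameter (for example, 1 if a score is also stored for an end-of-sentence symbol) are hypothetical.

```python
from itertools import zip_longest


def find_misaligned_lines(corpus_path, word_conf_path, sen_conf_path,
                          extra_tokens=0):
    """Return (line_number, message) pairs for lines that do not align.

    Assumed format (not confirmed by the README): whitespace-separated
    word-level scores, one sentence-level score per line.
    """
    problems = []
    with open(corpus_path, encoding="utf-8") as corpus, \
            open(word_conf_path, encoding="utf-8") as word_conf, \
            open(sen_conf_path, encoding="utf-8") as sen_conf:
        rows = zip_longest(corpus, word_conf, sen_conf)
        for line_no, (sent, words, sen) in enumerate(rows, start=1):
            # A missing entry means one file ended before the others.
            if sent is None or words is None or sen is None:
                problems.append((line_no, "files differ in length"))
                break
            n_tokens = len(sent.split()) + extra_tokens
            n_scores = len(words.split())
            if n_scores != n_tokens:
                problems.append(
                    (line_no, f"{n_scores} word scores for {n_tokens} tokens"))
            if len(sen.split()) != 1:
                problems.append(
                    (line_no, "expected one sentence-level score"))
    return problems
```

An empty result means every line passed the check under the assumed format. If the actual files use a different layout, the token and score counting would need to be adapted.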
## Contact
If you have questions, suggestions, or bug reports, please email [wangshuo18@mails.tsinghua.edu.cn](mailto:wangshuo18@mails.tsinghua.edu.cn).