https://github.com/idiap/han_nmt
Document-Level Neural Machine Translation with Hierarchical Attention Networks
https://github.com/idiap/han_nmt
Last synced: about 1 year ago
JSON representation
Document-Level Neural Machine Translation with Hierarchical Attention Networks
- Host: GitHub
- URL: https://github.com/idiap/han_nmt
- Owner: idiap
- License: gpl-3.0
- Created: 2018-09-21T07:36:50.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-05-09T01:08:02.000Z (about 4 years ago)
- Last Synced: 2025-03-23T01:03:29.588Z (about 1 year ago)
- Language: JavaScript
- Size: 3.59 MB
- Stars: 68
- Watchers: 7
- Forks: 19
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: COPYING
Awesome Lists containing this project
README
## Description
Implementation of the paper ["Document-Level Neural Machine Translation with Hierarchical Attention Networks"](http://www.aclweb.org/anthology/D18-1325). It is based on OpenNMT (v.2.1) https://github.com/OpenNMT/OpenNMT-py
This is a restricted version. It DOES NOT work for shards, and multimodal translation.
## Preprocess
The data, similary for any NMT baseline, consists of a source file and a target file which are aligned at sentence-level. However, the sentences should be in order for each document (i.e. not shuffled). Additionally, the model requires a file (doc_file) indicating the beginning of each document in the source file. Each line of the doc_file indicates the number of lines at the source file where a new document starts.
Example:
> 0
> 10
> 25
There are 3 documents. The first one from line 0 to line 9, the second from line 10 to 24, the third from line 25 to the end.
Command:
```
python preprocess.py -train_src [source_file] -train_tgt [target_file] -train_doc [doc_file]
-valid_src [source_dev_file] -valid_tgt [target_dev_file] -valid_doc [doc_dev_file] -save_data [out_file]
```
The folder preprocess_TED_zh-en contains the files to preprocess the TED Talks zh-en dataset from https://wit3.fbk.eu/mt.php?release=2015-01.
## Training
Training the sentence-level NMT baseline:
```
python train.py -data [data_set] -save_model [sentence_level_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 4096 -start_decay_at 20 -report_every 500 -epochs 20 -gpuid 0 -max_generator_batches 16 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot
-train_part sentences
```
Training HAN-encoder using the sentence-level NMT model:
```
python train.py -data [data_set] -save_model [HAN_enc_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot
-train_part all -context_type HAN_enc -context_size 3 -train_from [sentence_level_model]
```
Training HAN-decoder using the sentence-level NMT model:
```
python train.py -data [data_set] -save_model [HAN_dec_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot
-train_part all -context_type HAN_dec -context_size 3 -train_from [sentence_level_model]
```
Training HAN-joint using the HAN-encoder model:
```
python train.py -data [data_set] -save_model [HAN_joint_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot
-train_part all -context_type HAN_join -context_size 3 -train_from [HAN_enc_model]
```
Input options:
- train_part: [sentences, context, all]
- context_type: [HAN_enc, HAN_dec, HAN_join, HAN_dec_source, HAN_dec_context]
- context_size: number of previous sentences
NOTE: The transformer model is sensitive to variation on hyperparameters. The HAN is also sensitive to the batch size.
## Translation
The translation is done sentence by sentence despite not being necesary for HAN_enc or baseline (this could be improved).
```
python translate.py -model [model] -src [test_source_file] -doc [test_doc_file]
-output [out_file] -translate_part all -batch_size 1000 -gpu 0
```
Input options:
- translate_part: [sentences, all]
- batch_size: maximun number of sentences to keep in memory at once.
## Test files reported in the paper
The output files of the 3 reported systems: transformer NMT, cache NMT, HAN-decoder NMT, HAN-encoder NMT, HAN-encoder-decoder NMT.
> - sub_es-en: Opensubtitles
> - sub_zh-en: TV subtitles
> - TED_es-en: TED Talks WIT 2015
> - TED_zh-en: TED Talks WIT 2014
## Reference:
>Miculicich, L., Ram, D., Pappas, N. & Henderson, J. Document-Level Neural Machine Translation with Hierarchical Attention Networks. EMNLP 2018.
https://www.aclweb.org/anthology/D18-1325/
## Contact:
lmiculicich@idiap.ch