# NJUNMT-tf

NJUNMT-tf is a general-purpose sequence modeling tool built on TensorFlow, with neural machine translation as its main target task.

## Key features

**NJUNMT-tf builds NMT models almost from scratch, without the high-level TensorFlow APIs that often hide the details of network components and lead to obscure code that is hard to understand and manipulate. It depends only on basic TensorFlow modules, like array_ops, math_ops and nn_ops, so every operation in the code is under explicit control.**

NJUNMT-tf focuses on modularity and extensibility, using standard TensorFlow modules and practices to support advanced modeling capabilities:

- arbitrarily complex encoder architectures, e.g. bidirectional RNN, unidirectional RNN and self-attention encoders.
- arbitrarily complex decoder architectures, e.g. conditional GRU/LSTM, attention and self-attention decoders.
- hybrid encoder-decoder models, e.g. a self-attention encoder with an RNN decoder, or vice versa.

All of the above can be combined freely to train novel and complex architectures.
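As a purely illustrative sketch of how such a hybrid might be selected: the architecture is chosen through the `model`/`model_params` options described under Configuration below, but the class names used here are placeholders, not the actual identifiers used by NJUNMT-tf (those are documented in [sample.yml](https://github.com/zhaocq-nlp/NJUNMT-tf/blob/master/njunmt/example_configs/sample.yml)).

``` bash
# Illustrative only: the encoder/decoder class names below are placeholders;
# see njunmt/example_configs/sample.yml for the real parameter names.
python -m bin.train --model_dir hybrid_model \
--config_paths "
./njunmt/example_configs/toy_training_options.yml,
./default_configs/default_optimizer.yml" \
--model_params "
encoder.class: SelfAttentionEncoder   # placeholder
decoder.class: CondAttentionDecoder   # placeholder"
```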

The code also supports:

- model ensembling.
- learning rate decay according to the loss on evaluation data.
- model validation on evaluation data with BLEU score and an early-stopping strategy.
- monitoring with [TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard).
- support for [BPE](https://github.com/rsennrich/subword-nmt).

## Requirements

- `tensorflow` (`>=1.6`)
- `pyyaml`
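The dependencies can be installed with pip (a minimal sketch; any TensorFlow 1.x release at or above 1.6 works):

``` bash
# CPU-only TensorFlow; install the tensorflow-gpu package instead for GPU support.
pip install "tensorflow>=1.6,<2.0" pyyaml
```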

## Quickstart

Here is a minimal workflow to get started with NJUNMT-tf. This example trains a machine translation model on a toy Chinese-English dataset with a toy-sized configuration.

1\. Build the word vocabularies:

``` bash
python -m bin.generate_vocab testdata/toy.zh --max_vocab_size 100 > testdata/vocab.zh
python -m bin.generate_vocab testdata/toy.en0 --max_vocab_size 100 > testdata/vocab.en
```
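`bin.generate_vocab` writes the vocabulary to standard output, which the commands above redirect into `testdata/vocab.zh` and `testdata/vocab.en`. A quick preview of the resulting files (plain shell, nothing NJUNMT-specific):

``` bash
head -n 5 testdata/vocab.zh testdata/vocab.en
```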

2\. Train with preset sequence-to-sequence parameters:
``` bash
export CUDA_VISIBLE_DEVICES=
python -m bin.train --model_dir test_model \
--config_paths "
./njunmt/example_configs/toy_seq2seq.yml,
./njunmt/example_configs/toy_training_options.yml,
./default_configs/default_optimizer.yml"
```
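Training can be monitored with [TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) (see the feature list above); a minimal sketch, assuming summaries are written under the model directory:

``` bash
tensorboard --logdir test_model
```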

3\. Translate a test file with the latest checkpoint:
``` bash
export CUDA_VISIBLE_DEVICES=
python -m bin.infer --model_dir test_model \
--infer "
beam_size: 4
source_words_vocabulary: testdata/vocab.zh
target_words_vocabulary: testdata/vocab.en" \
--infer_data "
- features_file: testdata/toy.zh
labels_file: testdata/toy.en
output_file: toy.trans
output_attention: false"
```

**Note:** do not expect any good translation results with this toy example. Consider training on [larger parallel datasets](http://www.statmt.org/wmt16/translation-task.html) instead.

## Configuration

As you can see, there are two ways to set the hyperparameters of the process:

- tf FLAGS
- yaml-style config files

For example, here is a config file specifying the datasets for the training procedure.
``` yaml
# datasets.yml
data:
train_features_file: testdata/toy.zh
train_labels_file: testdata/toy.en0
eval_features_file: testdata/toy.zh
eval_labels_file: testdata/toy.en
source_words_vocabulary: testdata/vocab.zh
target_words_vocabulary: testdata/vocab.en
```

You can either use the command:
``` bash
python -m bin.train --config_paths "datasets.yml" ...
```
or
``` bash
python -m bin.train --data "
train_features_file: testdata/toy.zh
train_labels_file: testdata/toy.en0
eval_features_file: testdata/toy.zh
eval_labels_file: testdata/toy.en
source_words_vocabulary: testdata/vocab.zh
target_words_vocabulary: testdata/vocab.en" ...
```
The two commands have the same effect.

The available FLAGS (or the top levels of yaml configs) for bin.train are as follows:
- **config_paths**: the paths for config files
- **model_dir**: the directory for saving checkpoints
- **problem_name**: the top name scope, "seq2seq" by default
- **train**: training options, e.g. batch size, maximum length
- **data**: training data, evaluation data, vocabulary and (optional) BPE codes
- **hooks**: a list of training hooks (not provided in the current version)
- **metrics**: a list of evaluation metrics on evaluation data
- **model**: the class name of the model
- **model_params**: parameters for the model
- **optimizer_params**: parameters for optimizer
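For example, config files and FLAG overrides can be mixed in a single training invocation. The keys under `train` below are illustrative placeholders; see [sample.yml](https://github.com/zhaocq-nlp/NJUNMT-tf/blob/master/njunmt/example_configs/sample.yml) for the exact option names.

``` bash
python -m bin.train --model_dir test_model \
--config_paths "datasets.yml,./default_configs/default_optimizer.yml" \
--problem_name "seq2seq" \
--train "
batch_size: 80        # illustrative key name
maximum_length: 50    # illustrative key name"
```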

The available FLAGS (or the top levels of yaml configs) for bin.infer are as follows:
- **config_paths**: the paths for config files
- **model_dir**: the checkpoint directory or directories separated by commas for model ensemble
- **infer**: inference options, e.g. beam size, length penalty rate
- **infer_data**: a list of data files to be translated
- **weight_scheme**: the weight scheme for model ensemble (only "average" available now)
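For instance, decoding with an ensemble of two trained models amounts to listing both checkpoint directories (the directory and output file names here are placeholders; the remaining options mirror the Quickstart example):

``` bash
python -m bin.infer --model_dir model_run1,model_run2 \
--weight_scheme "average" \
--infer "
beam_size: 4
source_words_vocabulary: testdata/vocab.zh
target_words_vocabulary: testdata/vocab.en" \
--infer_data "
- features_file: testdata/toy.zh
labels_file: testdata/toy.en
output_file: toy.ensemble.trans
output_attention: false"
```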

**Note that:**
- each FLAG should be a yaml-style string
- hyperparameters provided via FLAGS overwrite those present in the config files
- illegal parameters will interrupt the program; see [sample.yml](https://github.com/zhaocq-nlp/NJUNMT-tf/blob/master/njunmt/example_configs/sample.yml) for a more detailed description of each parameter.

## Benchmarks

The RNN benchmarks are performed on a single GTX 1080 Ti GPU with the following predefined configurations:

- `default_configs/adam_loss_decay.yml`
- `default_configs/default_metrics.yml`
- `default_configs/default_training_options.yml`
- `default_configs/seq2seq_cgru.yml`
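A sketch of how these presets might be combined into a benchmark training run; `wmt17_ende_data.yml` is a placeholder for a dataset config pointing at your own preprocessed WMT files.

``` bash
python -m bin.train --model_dir wmt17_ende_rnn \
--config_paths "
./default_configs/seq2seq_cgru.yml,
./default_configs/adam_loss_decay.yml,
./default_configs/default_training_options.yml,
./default_configs/default_metrics.yml,
wmt17_ende_data.yml"
```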

The Transformer benchmarks are performed on a single GTX 1080 Ti GPU with the following predefined configurations:

- `default_configs/transformer_base.yml`
- `default_configs/transformer_training_options.yml`

Note that for the Transformer model we set `batch_tokens_size=2500` with `update_cycle=10` to realize pseudo-parallel training, i.e. an effective batch of roughly 25,000 tokens per update on a single GPU.

The beam sizes for RNN and Transformer are 10 and 4 respectively.

The datasets are preprocessed using [fetch_wmt2017_ende.sh](https://github.com/zhaocq-nlp/MT-data-processing/blob/master/fetch_wmt2017_ende.sh) and [fetch_wmt2018_zhen.sh](https://github.com/zhaocq-nlp/MT-data-processing/blob/master/fetch_wmt2018_zhen.sh), following [Edinburgh’s Report](http://statmt.org/wmt17/pdf/WMT39.pdf).

The BLEU scores are evaluated by the wrapper script [run_mteval.sh](https://github.com/zhaocq-nlp/NJUNMT-tf/blob/master/njunmt/tools/mteval/run_mteval.sh). For the EN-ZH experiments, BLEU is evaluated at the character level, while the other language pairs are evaluated at the word level.


| Dataset | Model | BLEU (newstest2016, dev) | BLEU (newstest2017) |
| --- | --- | --- | --- |
| WMT17 EN-DE | RNN | 29.6 | 23.6 |
| WMT17 EN-DE | Transformer | 33.5 | 27.0 |
| WMT17 DE-EN | RNN | 34.0 | 29.6 |
| WMT17 DE-EN | Transformer | 37.6 | 33.1 |

| Dataset | Model | BLEU (newsdev2017, dev) | BLEU (newstest2017) |
| --- | --- | --- | --- |
| WMT17 ZH-EN | RNN | 19.7 | 21.2 |
| WMT17 ZH-EN | Transformer | 22.7 | 25.0 |
| WMT17 EN-ZH | RNN | 30.0 | 30.2 |
| WMT17 EN-ZH | Transformer | 34.9 | 35.0 |

## TODO

The following features remain unimplemented:

- multi-gpu training
- scheduled sampling
- minimum risk training

## Acknowledgments

The implementation is inspired by the following:
- *[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)*
- [dl4mt-tutorial](https://github.com/nyu-dl/dl4mt-tutorial)
- [OpenNMT-tf](https://github.com/OpenNMT/OpenNMT-tf)
- [Google's seq2seq](https://github.com/google/seq2seq): *[Massive Exploration of Neural Machine Translation Architectures](https://arxiv.org/abs/1703.03906)*
- [THUMT](https://github.com/thumt/THUMT)
- [Google's tensor2tensor](https://github.com/tensorflow/tensor2tensor): *[Attention is All You Need](https://arxiv.org/abs/1706.03762)*
- *[Stronger Baselines for Trustable Results in Neural Machine Translation](http://www.aclweb.org/anthology/W17-3203.pdf)*

## Contact

Any comments or suggestions are welcome.

Please email [[email protected]](mailto:[email protected]).