{"id":13545716,"url":"https://github.com/tensorflow/nmt","last_synced_at":"2025-09-30T08:31:46.637Z","repository":{"id":41300900,"uuid":"95723115","full_name":"tensorflow/nmt","owner":"tensorflow","description":"TensorFlow Neural Machine Translation Tutorial","archived":true,"fork":false,"pushed_at":"2022-10-09T08:07:34.000Z","size":1262,"stargazers_count":6374,"open_issues_count":274,"forks_count":1959,"subscribers_count":251,"default_branch":"master","last_synced_at":"2024-09-27T21:01:19.522Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tensorflow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-29T00:35:52.000Z","updated_at":"2024-09-27T03:35:11.000Z","dependencies_parsed_at":"2023-01-19T16:45:50.669Z","dependency_job_id":null,"html_url":"https://github.com/tensorflow/nmt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fnmt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fnmt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fnmt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorflow%2Fnmt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tensorflow","download_url":"https://codeload.github.com/tensorflow/nmt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":23472
2054,"owners_count":18876896,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T11:01:11.368Z","updated_at":"2025-09-30T08:31:46.048Z","avatar_url":"https://github.com/tensorflow.png","language":"Python","funding_links":[],"categories":["Python","Tutorials and Blogs 🎒","Blog Articles, Papers, Case Studies","Machine Learning ##"],"sub_categories":["Machine Translation","Design Interview ###"],"readme":"# Neural Machine Translation (seq2seq) Tutorial\n\n*Authors: Thang Luong, Eugene Brevdo, Rui Zhao ([Google Research Blogpost](https://research.googleblog.com/2017/07/building-your-own-neural-machine.html), [Github](https://github.com/tensorflow/nmt))*\n\n*This version of the tutorial requires [TensorFlow Nightly](https://github.com/tensorflow/tensorflow/#installation).\nTo use the stable TensorFlow versions, please consider other branches such as\n[tf-1.4](https://github.com/tensorflow/nmt/tree/tf-1.4).*\n\n*If you make use of this codebase for your research, please cite\n[this](#bibtex).*\n\n- [Introduction](#introduction)\n- [Basic](#basic)\n   - [Background on Neural Machine Translation](#background-on-neural-machine-translation)\n   - [Installing the Tutorial](#installing-the-tutorial)\n   - [Training – *How to build our first NMT system*](#training--how-to-build-our-first-nmt-system)\n      - [Embedding](#embedding)\n      - [Encoder](#encoder)\n      - [Decoder](#decoder)\n      - [Loss](#loss)\n      - [Gradient computation \u0026 optimization](#gradient-computation--optimization)\n   - [Hands-on – *Let's train an NMT 
model*](#hands-on--lets-train-an-nmt-model)\n   - [Inference – *How to generate translations*](#inference--how-to-generate-translations)\n- [Intermediate](#intermediate)\n   - [Background on the Attention Mechanism](#background-on-the-attention-mechanism)\n   - [Attention Wrapper API](#attention-wrapper-api)\n   - [Hands-on – *Building an attention-based NMT model*](#hands-on--building-an-attention-based-nmt-model)\n- [Tips \u0026 Tricks](#tips--tricks)\n   - [Building Training, Eval, and Inference Graphs](#building-training-eval-and-inference-graphs)\n   - [Data Input Pipeline](#data-input-pipeline)\n   - [Other details for better NMT models](#other-details-for-better-nmt-models)\n      - [Bidirectional RNNs](#bidirectional-rnns)\n      - [Beam search](#beam-search)\n      - [Hyperparameters](#hyperparameters)\n      - [Multi-GPU training](#multi-gpu-training)\n- [Benchmarks](#benchmarks)\n   - [IWSLT English-Vietnamese](#iwslt-english-vietnamese)\n   - [WMT German-English](#wmt-german-english)\n   - [WMT English-German \u0026mdash; *Full Comparison*](#wmt-english-german--full-comparison)\n   - [Standard HParams](#standard-hparams)\n- [Other resources](#other-resources)\n- [Acknowledgment](#acknowledgment)\n- [References](#references)\n- [BibTex](#bibtex)\n\n\n# Introduction\n\nSequence-to-sequence (seq2seq) models\n([Sutskever et al., 2014](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf),\n[Cho et al., 2014](http://emnlp2014.org/papers/pdf/EMNLP2014179.pdf)) have\nenjoyed great success in a variety of tasks such as machine translation, speech\nrecognition, and text summarization. This tutorial gives readers a full\nunderstanding of seq2seq models and shows how to build a competitive seq2seq\nmodel from scratch. We focus on the task of Neural Machine Translation (NMT)\nwhich was the very first testbed for seq2seq models with\nwild\n[success](https://research.googleblog.com/2016/09/a-neural-network-for-machine.html). 
The\nincluded code is lightweight, high-quality, production-ready, and incorporated\nwith the latest research ideas. We achieve this goal by:\n\n1. Using the recent decoder / attention\n   wrapper\n   [API](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/seq2seq/python/ops),\n   TensorFlow 1.2 data iterator\n1. Incorporating our strong expertise in building recurrent and seq2seq models\n1. Providing tips and tricks for building the very best NMT models and replicating\n   [Google’s NMT (GNMT) system](https://research.google.com/pubs/pub45610.html).\n\nWe believe that it is important to provide benchmarks that people can easily\nreplicate. As a result, we have provided full experimental results and\npretrained our models on the following publicly available datasets:\n\n1. *Small-scale*: English-Vietnamese parallel corpus of TED talks (133K sentence\n   pairs) provided by\n   the\n   [IWSLT Evaluation Campaign](https://sites.google.com/site/iwsltevaluation2015/).\n1. *Large-scale*: German-English parallel corpus (4.5M sentence pairs) provided\n   by the [WMT Evaluation Campaign](http://www.statmt.org/wmt16/translation-task.html).\n\nWe first build up some basic knowledge about seq2seq models for NMT, explaining\nhow to build and train a vanilla NMT model. The second part will go into details\nof building a competitive NMT model with attention mechanism. We then discuss\ntips and tricks to build the best possible NMT models (both in speed and\ntranslation quality) such as TensorFlow best practices (batching, bucketing),\nbidirectional RNNs, beam search, as well as scaling up to multiple GPUs using GNMT attention.\n\n# Basic\n\n## Background on Neural Machine Translation\n\nBack in the old days, traditional phrase-based translation systems performed\ntheir task by breaking up source sentences into multiple chunks and then\ntranslated them phrase-by-phrase. 
This led to disfluency in the translation\noutputs and was not quite like how we, humans, translate. We read the entire\nsource sentence, understand its meaning, and then produce a translation. Neural\nMachine Translation (NMT) mimics that!\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"80%\" src=\"nmt/g3doc/img/encdec.jpg\" /\u003e\n\u003cbr\u003e\nFigure 1. \u003cb\u003eEncoder-decoder architecture\u003c/b\u003e – example of a general approach for\nNMT. An encoder converts a source sentence into a \"meaning\" vector which is\npassed through a \u003ci\u003edecoder\u003c/i\u003e to produce a translation.\n\u003c/p\u003e\n\nSpecifically, an NMT system first reads the source sentence using an *encoder*\nto build\na\n[\"thought\" vector](https://www.theguardian.com/science/2015/may/21/google-a-step-closer-to-developing-machines-with-human-like-intelligence),\na sequence of numbers that represents the sentence meaning; a *decoder*, then,\nprocesses the sentence vector to emit a translation, as illustrated in\nFigure 1. This is often referred to as the *encoder-decoder architecture*. In\nthis manner, NMT addresses the local translation problem in the traditional\nphrase-based approach: it can capture *long-range dependencies* in languages,\ne.g., gender agreements; syntax structures; etc., and produce much more fluent\ntranslations as demonstrated\nby\n[Google Neural Machine Translation systems](https://research.googleblog.com/2016/09/a-neural-network-for-machine.html).\n\nNMT models vary in terms of their exact architectures. A natural choice for\nsequential data is the recurrent neural network (RNN), used by most NMT models.\nUsually an RNN is used for both the encoder and decoder. The RNN models,\nhowever, differ in terms of: (a) *directionality* – unidirectional or\nbidirectional; (b) *depth* – single- or multi-layer; and (c) *type* – often\neither a vanilla RNN, a Long Short-term Memory (LSTM), or a gated recurrent unit\n(GRU). 
Interested readers can find more information about RNNs and LSTMs in\nthis [blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).\n\nIn this tutorial, we consider as an example a *deep multi-layer RNN* which is\nunidirectional and uses LSTM as a recurrent unit. We show an example of such a\nmodel in Figure 2. In this example, we build a model to translate a source\nsentence \"I am a student\" into a target sentence \"Je suis étudiant\". At a high\nlevel, the NMT model consists of two recurrent neural networks: the *encoder*\nRNN simply consumes the input source words without making any prediction; the\n*decoder*, on the other hand, processes the target sentence while predicting the\nnext words.\n\nFor more information, we refer readers\nto [Luong (2016)](https://github.com/lmthang/thesis), on which this tutorial is\nbased.\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"48%\" src=\"nmt/g3doc/img/seq2seq.jpg\" /\u003e\n\u003cbr\u003e\nFigure 2. \u003cb\u003eNeural machine translation\u003c/b\u003e – example of a deep recurrent\narchitecture for translating a source sentence \"I am a student\" into\na target sentence \"Je suis étudiant\". Here, \"\u0026lt;s\u0026gt;\" marks the start of the\ndecoding process while \"\u0026lt;/s\u0026gt;\" tells the decoder to stop.\n\u003c/p\u003e\n\n## Installing the Tutorial\n\nTo install this tutorial, you need to have TensorFlow installed on your system.\nThis tutorial requires TensorFlow Nightly. To install TensorFlow, follow\nthe [installation instructions here](https://www.tensorflow.org/install/).\n\nOnce TensorFlow is installed, you can download the source code of this tutorial\nby running:\n\n``` shell\ngit clone https://github.com/tensorflow/nmt/\n```\n\n## Training – How to build our first NMT system\n\nLet's first dive into the heart of building an NMT model with concrete code\nsnippets through which we will explain Figure 2 in more detail. We defer data\npreparation and the full code to later. 
This part refers to\nfile\n[**model.py**](nmt/model.py).\n\nAt the bottom layer, the encoder and decoder RNNs receive as input the\nfollowing: first, the source sentence, then a boundary marker \"\\\u003cs\\\u003e\" which\nindicates the transition from the encoding to the decoding mode, and the target\nsentence.  For *training*, we will feed the system with the following tensors,\nwhich are in time-major format and contain word indices:\n\n-  **encoder_inputs** [max_encoder_time, batch_size]: source input words.\n-  **decoder_inputs** [max_decoder_time, batch_size]: target input words.\n-  **decoder_outputs** [max_decoder_time, batch_size]: target output words,\n   these are decoder_inputs shifted to the left by one time step with an\n   end-of-sentence tag appended on the right.\n\nHere for efficiency, we train with multiple sentences (batch_size) at\nonce. Testing is slightly different, so we will discuss it later.\n\n### Embedding\n\nGiven the categorical nature of words, the model must first look up the source\nand target embeddings to retrieve the corresponding word representations. For\nthis *embedding layer* to work, a vocabulary is first chosen for each language.\nUsually, a vocabulary size V is selected, and only the most frequent V words are\ntreated as unique.  All other words are converted to an \"unknown\" token and all\nget the same embedding.  The embedding weights, one set per language, are\nusually learned during training.\n\n``` python\n# Embedding\nembedding_encoder = variable_scope.get_variable(\n    \"embedding_encoder\", [src_vocab_size, embedding_size], ...)\n# Look up embedding:\n#   encoder_inputs: [max_time, batch_size]\n#   encoder_emb_inp: [max_time, batch_size, embedding_size]\nencoder_emb_inp = embedding_ops.embedding_lookup(\n    embedding_encoder, encoder_inputs)\n```\n\nSimilarly, we can build *embedding_decoder* and *decoder_emb_inp*. 
Note that one\ncan choose to initialize embedding weights with pretrained word representations\nsuch as word2vec or GloVe vectors. In general, given a large amount of training\ndata we can learn these embeddings from scratch.\n\n### Encoder\n\nOnce retrieved, the word embeddings are then fed as input into the main network,\nwhich consists of two multi-layer RNNs – an encoder for the source language and\na decoder for the target language. These two RNNs, in principle, can share the\nsame weights; however, in practice, we often use two different sets of RNN parameters\n(such models do a better job when fitting large training datasets). The\n*encoder* RNN uses zero vectors as its starting states and is built as follows:\n\n``` python\n# Build RNN cell\nencoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)\n\n# Run Dynamic RNN\n#   encoder_outputs: [max_time, batch_size, num_units]\n#   encoder_state: [batch_size, num_units]\n# dtype is required because no initial_state is passed (zero state is used).\nencoder_outputs, encoder_state = tf.nn.dynamic_rnn(\n    encoder_cell, encoder_emb_inp,\n    sequence_length=source_sequence_length, time_major=True,\n    dtype=tf.float32)\n```\n\nNote that sentences have different lengths; to avoid wasting computation, we tell\n*dynamic_rnn* the exact source sentence lengths through\n*source_sequence_length*. Since our input is time major, we set\n*time_major=True*. Here, we build only a single-layer LSTM, *encoder_cell*. We\nwill describe how to build multi-layer LSTMs, add dropout, and use attention in\na later section.\n\n### Decoder\n\nThe *decoder* also needs to have access to the source information, and one\nsimple way to achieve that is to initialize it with the last hidden state of the\nencoder, *encoder_state*. 
In Figure 2, we pass the hidden state at the source\nword \"student\" to the decoder side.\n\n``` python\n# Build RNN cell\ndecoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)\n```\n\n``` python\n# Helper\nhelper = tf.contrib.seq2seq.TrainingHelper(\n    decoder_emb_inp, decoder_lengths, time_major=True)\n# Decoder\ndecoder = tf.contrib.seq2seq.BasicDecoder(\n    decoder_cell, helper, encoder_state,\n    output_layer=projection_layer)\n# Dynamic decoding\noutputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)\nlogits = outputs.rnn_output\n```\n\nHere, the core part of this code is the *BasicDecoder* object, *decoder*, which\nreceives *decoder_cell* (similar to encoder_cell), a *helper*, and the previous\n*encoder_state* as inputs. By separating out decoders and helpers, we can reuse\ndifferent codebases, e.g., *TrainingHelper* can be substituted with\n*GreedyEmbeddingHelper* to do greedy decoding. See more\nin\n[helper.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/seq2seq/python/ops/helper.py).\n\nLastly, we haven't mentioned *projection_layer* which is a dense matrix to turn\nthe top hidden states to logit vectors of dimension V. We illustrate this\nprocess at the top of Figure 2.\n\n``` python\nprojection_layer = layers_core.Dense(\n    tgt_vocab_size, use_bias=False)\n```\n\n### Loss\n\nGiven the *logits* above, we are now ready to compute our training loss:\n\n``` python\ncrossent = tf.nn.sparse_softmax_cross_entropy_with_logits(\n    labels=decoder_outputs, logits=logits)\ntrain_loss = (tf.reduce_sum(crossent * target_weights) /\n    batch_size)\n```\n\nHere, *target_weights* is a zero-one matrix of the same size as\n*decoder_outputs*. It masks padding positions outside of the target sequence\nlengths with values 0.\n\n***Important note***: It's worth pointing out that we divide the loss by\n*batch_size*, so our hyperparameters are \"invariant\" to batch_size. 
Some people\ndivide the loss by (*batch_size* * *num_time_steps*), which plays down the\nerrors made on short sentences. More subtly, our hyperparameters (applied to the\nformer way) can't be used for the latter way. For example, if both approaches\nuse SGD with a learning rate of 1.0, the latter approach effectively uses a much\nsmaller learning rate of 1 / *num_time_steps*.\n\n### Gradient computation \u0026 optimization\n\nWe have now defined the forward pass of our NMT model. Computing the\nbackpropagation pass is just a matter of a few lines of code:\n\n``` python\n# Calculate and clip gradients\nparams = tf.trainable_variables()\ngradients = tf.gradients(train_loss, params)\nclipped_gradients, _ = tf.clip_by_global_norm(\n    gradients, max_gradient_norm)\n```\n\nOne of the important steps in training RNNs is gradient clipping. Here, we clip\nby the global norm.  The max value, *max_gradient_norm*, is often set to a value\nlike 5 or 1. The last step is selecting the optimizer.  The Adam optimizer is a\ncommon choice.  We also select a learning rate.  The value of *learning_rate*\nis usually in the range 0.0001 to 0.001 and can be set to decrease as\ntraining progresses.\n\n``` python\n# Optimization\noptimizer = tf.train.AdamOptimizer(learning_rate)\nupdate_step = optimizer.apply_gradients(\n    zip(clipped_gradients, params))\n```\n\nIn our own experiments, we use standard SGD (tf.train.GradientDescentOptimizer)\nwith a decreasing learning rate schedule, which yields better performance. See\nthe [benchmarks](#benchmarks).\n\n## Hands-on – Let's train an NMT model\n\nLet's train our very first NMT model, translating from Vietnamese to English!\nThe entry point of our code\nis\n[**nmt.py**](nmt/nmt.py).\n\nWe will use a *small-scale parallel corpus of TED talks* (133K training\nexamples) for this exercise. All data we used here can be found\nat:\n[https://nlp.stanford.edu/projects/nmt/](https://nlp.stanford.edu/projects/nmt/). 
We\nwill use tst2012 as our dev dataset, and tst2013 as our test dataset.\n\nRun the following command to download the data for training the NMT model:\\\n\t`nmt/scripts/download_iwslt15.sh /tmp/nmt_data`\n\nRun the following command to start the training:\n\n``` shell\nmkdir /tmp/nmt_model\npython -m nmt.nmt \\\n    --src=vi --tgt=en \\\n    --vocab_prefix=/tmp/nmt_data/vocab  \\\n    --train_prefix=/tmp/nmt_data/train \\\n    --dev_prefix=/tmp/nmt_data/tst2012  \\\n    --test_prefix=/tmp/nmt_data/tst2013 \\\n    --out_dir=/tmp/nmt_model \\\n    --num_train_steps=12000 \\\n    --steps_per_stats=100 \\\n    --num_layers=2 \\\n    --num_units=128 \\\n    --dropout=0.2 \\\n    --metrics=bleu\n```\n\nThe above command trains a 2-layer LSTM seq2seq model with 128-dim hidden units\nand embeddings for 12 epochs. We use a dropout value of 0.2 (keep probability\n0.8). If there are no errors, we should see logs similar to those below, with\nperplexity values decreasing as we train.\n\n```\n# First evaluation, global step 0\n  eval dev: perplexity 17193.66\n  eval test: perplexity 17193.27\n# Start epoch 0, step 0, lr 1, Tue Apr 25 23:17:41 2017\n  sample train data:\n    src_reverse: \u003c/s\u003e \u003c/s\u003e Điều đó , dĩ nhiên , là câu chuyện trích ra từ học thuyết của Karl Marx .\n    ref: That , of course , was the \u003cunk\u003e distilled from the theories of Karl Marx . 
\u003c/s\u003e \u003c/s\u003e \u003c/s\u003e\n  epoch 0 step 100 lr 1 step-time 0.89s wps 5.78K ppl 1568.62 bleu 0.00\n  epoch 0 step 200 lr 1 step-time 0.94s wps 5.91K ppl 524.11 bleu 0.00\n  epoch 0 step 300 lr 1 step-time 0.96s wps 5.80K ppl 340.05 bleu 0.00\n  epoch 0 step 400 lr 1 step-time 1.02s wps 6.06K ppl 277.61 bleu 0.00\n  epoch 0 step 500 lr 1 step-time 0.95s wps 5.89K ppl 205.85 bleu 0.00\n```\n\nSee [**train.py**](nmt/train.py) for more details.\n\nWe can start TensorBoard to view the summary of the model during training:\n\n``` shell\ntensorboard --port 22222 --logdir /tmp/nmt_model/\n```\n\nTraining the reverse direction, from English to Vietnamese, can be done simply by changing:\\\n\t`--src=en --tgt=vi`\n\n## Inference – How to generate translations\n\nWhile you're training your NMT models (and once you have trained models), you\ncan obtain translations given previously unseen source sentences. This process\nis called inference. There is a clear distinction between training and inference\n(*testing*): at inference time, we only have access to the source sentence,\ni.e., *encoder_inputs*. There are many ways to perform decoding.  Decoding\nmethods include greedy, sampling, and beam-search decoding. Here, we will\ndiscuss the greedy decoding strategy.\n\nThe idea is simple and we illustrate it in Figure 3:\n\n1. We still encode the source sentence in the same way as during training to\n   obtain an *encoder_state*, and this *encoder_state* is used to initialize the\n   decoder.\n2. The decoding (translation) process is started as soon as the decoder receives\n   a starting symbol \"\\\u003cs\\\u003e\" (referred to as *tgt_sos_id* in our code).\n3. For each timestep on the decoder side, we treat the RNN's output as a set of\n   logits.  We choose the most likely word, the id associated with the maximum\n   logit value, as the emitted word (this is the \"greedy\" behavior).  
For\n   example, in Figure 3, the word \"moi\" has the highest translation probability\n   in the first decoding step.  We then feed this word as input to the next\n   timestep.\n4. The process continues until the end-of-sentence marker \"\\\u003c/s\\\u003e\" is produced as\n   an output symbol (referred to as *tgt_eos_id* in our code).\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"40%\" src=\"nmt/g3doc/img/greedy_dec.jpg\" /\u003e\n\u003cbr\u003e\nFigure 3. \u003cb\u003eGreedy decoding\u003c/b\u003e – example of how a trained NMT model produces a\ntranslation for a source sentence \"Je suis étudiant\" using greedy search.\n\u003c/p\u003e\n\nStep 3 is what makes inference different from training. Instead of always\nfeeding the correct target words as an input, inference uses words predicted by\nthe model. Here's the code to achieve greedy decoding.  It is very similar to\nthe training decoder.\n\n``` python\n# Helper\nhelper = tf.contrib.seq2seq.GreedyEmbeddingHelper(\n    embedding_decoder,\n    tf.fill([batch_size], tgt_sos_id), tgt_eos_id)\n\n# Decoder\ndecoder = tf.contrib.seq2seq.BasicDecoder(\n    decoder_cell, helper, encoder_state,\n    output_layer=projection_layer)\n# Dynamic decoding\noutputs, _ = tf.contrib.seq2seq.dynamic_decode(\n    decoder, maximum_iterations=maximum_iterations)\ntranslations = outputs.sample_id\n```\n\nHere, we use *GreedyEmbeddingHelper* instead of *TrainingHelper*. Since we do\nnot know the target sequence lengths in advance, we use *maximum_iterations* to\nlimit the translation lengths. 
One heuristic is to decode up to two times the\nsource sentence lengths.\n\n``` python\nmaximum_iterations = tf.round(tf.reduce_max(source_sequence_length) * 2)\n```\n\nHaving trained a model, we can now create an inference file and translate some\nsentences:\n\n``` shell\ncat \u003e /tmp/my_infer_file.vi\n# (copy and paste some sentences from /tmp/nmt_data/tst2013.vi)\n\npython -m nmt.nmt \\\n    --out_dir=/tmp/nmt_model \\\n    --inference_input_file=/tmp/my_infer_file.vi \\\n    --inference_output_file=/tmp/nmt_model/output_infer\n\ncat /tmp/nmt_model/output_infer # To view the inference as output\n```\n\nNote the above commands can also be run while the model is still being trained\nas long as there exists a training\ncheckpoint. See [**inference.py**](nmt/inference.py) for more details.\n\n# Intermediate\n\nHaving gone through the most basic seq2seq model, let's get more advanced! To\nbuild state-of-the-art neural machine translation systems, we will need more\n\"secret sauce\": the *attention mechanism*, which was first introduced\nby [Bahdanau et al., 2015](https://arxiv.org/abs/1409.0473), then later refined\nby [Luong et al., 2015](https://arxiv.org/abs/1508.04025) and others. The key\nidea of the attention mechanism is to establish direct short-cut connections\nbetween the target and the source by paying \"attention\" to relevant source\ncontent as we translate. A nice byproduct of the attention mechanism is an\neasy-to-visualize alignment matrix between the source and target sentences (as\nshown in Figure 4).\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"40%\" src=\"nmt/g3doc/img/attention_vis.jpg\" /\u003e\n\u003cbr\u003e\nFigure 4. \u003cb\u003eAttention visualization\u003c/b\u003e – example of the alignments between source\nand target sentences. 
Image is taken from (Bahdanau et al., 2015).\n\u003c/p\u003e\n\nRemember that in the vanilla seq2seq model, we pass the last source state from\nthe encoder to the decoder when starting the decoding process. This works well\nfor short and medium-length sentences; however, for long sentences, the single\nfixed-size hidden state becomes an information bottleneck. Instead of discarding\nall of the hidden states computed in the source RNN, the attention mechanism\nprovides an approach that allows the decoder to peek at them (treating them as a\ndynamic memory of the source information). By doing so, the attention mechanism\nimproves the translation of longer sentences. Nowadays, attention mechanisms are\nthe de facto standard and have been successfully applied to many other tasks\n(including image caption generation, speech recognition, and text\nsummarization).\n\n## Background on the Attention Mechanism\n\nWe now describe an instance of the attention mechanism proposed in (Luong et\nal., 2015), which has been used in several state-of-the-art systems including\nopen-source toolkits such as [OpenNMT](http://opennmt.net/about/) and in the TF\nseq2seq API in this tutorial. We will also provide connections to other variants\nof the attention mechanism.\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"48%\" src=\"nmt/g3doc/img/attention_mechanism.jpg\" /\u003e\n\u003cbr\u003e\nFigure 5. \u003cb\u003eAttention mechanism\u003c/b\u003e – example of an attention-based NMT system\nas described in (Luong et al., 2015). We highlight in detail the first step of\nthe attention computation. For clarity, we don't show the embedding and\nprojection layers in Figure 2.\n\u003c/p\u003e\n\nAs illustrated in Figure 5, the attention computation happens at every decoder\ntime step.  It consists of the following stages:\n\n1. The current target hidden state is compared with all source states to derive\n   *attention weights* (can be visualized as in Figure 4).\n1. 
Based on the attention weights, we compute a *context vector* as the weighted\n   average of the source states.\n1. Combine the context vector with the current target hidden state to yield the\n   final *attention vector*.\n1. The attention vector is fed as an input to the next time step (*input\n   feeding*).\n\nThe first three steps can be summarized by the equations below:\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"80%\" src=\"nmt/g3doc/img/attention_equation_0.jpg\" /\u003e\n\u003cbr\u003e\n\u003c/p\u003e\n\nHere, the function `score` is used to compare the target hidden state $$h_t$$\nwith each of the source hidden states $$\\overline{h}_s$$, and the result is normalized to\nproduce attention weights (a distribution over source positions). There are\nvarious choices of the scoring function; popular scoring functions include the\nmultiplicative and additive forms given in Eq. (4). Once computed, the attention\nvector $$a_t$$ is used to derive the softmax logit and loss.  This is similar to the\ntarget hidden state at the top layer of a vanilla seq2seq model. The function\n`f` can also take other forms.\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"80%\" src=\"nmt/g3doc/img/attention_equation_1.jpg\" /\u003e\n\u003cbr\u003e\n\u003c/p\u003e\n\nVarious implementations of attention mechanisms can be found\nin\n[attention_wrapper.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py).\n\n***What matters in the attention mechanism?***\n\nAs hinted in the above equations, there are many different attention variants.\nThese variants depend on the form of the scoring function and the attention\nfunction, and on whether the previous state $$h_{t-1}$$ is used instead of\n$$h_t$$ in the scoring function as originally suggested in (Bahdanau et al.,\n2015). Empirically, we found that only certain choices matter. 
First, the basic\nform of attention, i.e., direct connections between target and source, needs to\nbe present. Second, it's important to feed the attention vector to the next\ntimestep to inform the network about past attention decisions as demonstrated in\n(Luong et al., 2015). Lastly, choices of the scoring function can often result\nin different performance. See more in the [benchmark results](#benchmarks)\nsection.\n\n## Attention Wrapper API\n\nIn our implementation of\nthe\n[AttentionWrapper](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py),\nwe borrow some terminology\nfrom [(Weston et al., 2015)](https://arxiv.org/abs/1410.3916) in their work on\n*memory networks*. Instead of having readable \u0026 writable memory, the attention\nmechanism presented in this tutorial is a *read-only* memory. Specifically, the\nset of source hidden states (or their transformed versions, e.g.,\n$$W\\overline{h}_s$$ in Luong's scoring style or $$W_2\\overline{h}_s$$ in\nBahdanau's scoring style) is referred to as the *\"memory\"*. At each time step,\nwe use the current target hidden state as a *\"query\"* to decide on which parts\nof the memory to read.  Usually, the query needs to be compared with keys\ncorresponding to individual memory slots. In the above presentation of the\nattention mechanism, we happen to use the set of source hidden states (or their\ntransformed versions, e.g., $$W_1h_t$$ in Bahdanau's scoring style) as\n\"keys\". One can be inspired by this memory-network terminology to derive other\nforms of attention!\n\nThanks to the attention wrapper, extending our vanilla seq2seq code with\nattention is trivial. 
This part refers to\nfile [**attention_model.py**](nmt/attention_model.py)\n\nFirst, we need to define an attention mechanism, e.g., from (Luong et al.,\n2015):\n\n``` python\n# attention_states: [batch_size, max_time, num_units]\nattention_states = tf.transpose(encoder_outputs, [1, 0, 2])\n\n# Create an attention mechanism\nattention_mechanism = tf.contrib.seq2seq.LuongAttention(\n    num_units, attention_states,\n    memory_sequence_length=source_sequence_length)\n```\n\nIn the previous [Encoder](#encoder) section, *encoder_outputs* is the set of all\nsource hidden states at the top layer and has the shape of *[max_time,\nbatch_size, num_units]* (since we use *dynamic_rnn* with *time_major* set to\n*True* for efficiency). For the attention mechanism, we need to make sure the\n\"memory\" passed in is batch major, so we need to transpose\n*attention_states*. We pass *source_sequence_length* to the attention mechanism\nto ensure that the attention weights are properly normalized (over non-padding\npositions only).\n\nHaving defined an attention mechanism, we use *AttentionWrapper* to wrap the\ndecoding cell:\n\n``` python\ndecoder_cell = tf.contrib.seq2seq.AttentionWrapper(\n    decoder_cell, attention_mechanism,\n    attention_layer_size=num_units)\n```\n\nThe rest of the code is almost the same as in the Section [Decoder](#decoder)!\n\n## Hands-on – building an attention-based NMT model\n\nTo enable attention, we need to use one of `luong`, `scaled_luong`, `bahdanau`\nor `normed_bahdanau` as the value of the `attention` flag during training. The\nflag specifies which attention mechanism we are going to use. 
In addition, we\nneed to create a new directory for the attention model, so we don't reuse the\npreviously trained basic NMT model.\n\nRun the following command to start the training:\n\n``` shell\nmkdir /tmp/nmt_attention_model\n\npython -m nmt.nmt \\\n    --attention=scaled_luong \\\n    --src=vi --tgt=en \\\n    --vocab_prefix=/tmp/nmt_data/vocab  \\\n    --train_prefix=/tmp/nmt_data/train \\\n    --dev_prefix=/tmp/nmt_data/tst2012  \\\n    --test_prefix=/tmp/nmt_data/tst2013 \\\n    --out_dir=/tmp/nmt_attention_model \\\n    --num_train_steps=12000 \\\n    --steps_per_stats=100 \\\n    --num_layers=2 \\\n    --num_units=128 \\\n    --dropout=0.2 \\\n    --metrics=bleu\n```\n\nAfter training, we can use the same inference command with the new out_dir for\ninference:\n\n``` shell\npython -m nmt.nmt \\\n    --out_dir=/tmp/nmt_attention_model \\\n    --inference_input_file=/tmp/my_infer_file.vi \\\n    --inference_output_file=/tmp/nmt_attention_model/output_infer\n```\n\n# Tips \u0026 Tricks\n\n## Building Training, Eval, and Inference Graphs\n\nWhen building a machine learning model in TensorFlow, it's often best to build\nthree separate graphs:\n\n-  The Training graph, which:\n    -  Batches, buckets, and possibly subsamples input data from a set of\n       files/external inputs.\n    -  Includes the forward and backprop ops.\n    -  Constructs the optimizer, and adds the training op.\n\n-  The Eval graph, which:\n    -  Batches and buckets input data from a set of files/external inputs.\n    -  Includes the training forward ops, and additional evaluation ops that\n       aren't used for training.\n\n-  The Inference graph, which:\n    -  May not batch input data.\n    -  Does not subsample or bucket input data.\n    -  Reads input data from placeholders (data can be fed directly to the graph\n       via *feed_dict* or from a C++ TensorFlow serving binary).\n    -  Includes a subset of the model forward ops, and possibly additional\n       special inputs/outputs 
for storing state between session.run calls.\n\nBuilding separate graphs has several benefits:\n\n-  The inference graph is usually very different from the other two, so it makes\n   sense to build it separately.\n-  The eval graph becomes simpler since it no longer has all the additional\n   backprop ops.\n-  Data feeding can be implemented separately for each graph.\n-  Variable reuse is much simpler.  For example, in the eval graph there's no\n   need to reopen variable scopes with *reuse=True* just because the Training\n   model created these variables already.  So the same code can be reused\n   without sprinkling *reuse=* arguments everywhere.\n-  In distributed training, it is commonplace to have separate workers perform\n   training, eval, and inference.  These need to build their own graphs anyway.\n   So building the system this way prepares you for distributed training.\n\nThe primary source of complexity becomes how to share Variables across the three\ngraphs in a single machine setting. This is solved by using a separate session\nfor each graph. The training session periodically saves checkpoints, and the\neval session and the infer session restore parameters from checkpoints. 
The\nexample below shows the main differences between the two approaches.\n\n**Before: Three models in a single graph and sharing a single Session**\n\n``` python\nwith tf.variable_scope('root'):\n  train_inputs = tf.placeholder()\n  train_op, loss = BuildTrainModel(train_inputs)\n  initializer = tf.global_variables_initializer()\n\nwith tf.variable_scope('root', reuse=True):\n  eval_inputs = tf.placeholder()\n  eval_loss = BuildEvalModel(eval_inputs)\n\nwith tf.variable_scope('root', reuse=True):\n  infer_inputs = tf.placeholder()\n  inference_output = BuildInferenceModel(infer_inputs)\n\nsess = tf.Session()\n\nsess.run(initializer)\n\nfor i in itertools.count():\n  train_input_data = ...\n  sess.run([loss, train_op], feed_dict={train_inputs: train_input_data})\n\n  if i % EVAL_STEPS == 0:\n    while data_to_eval:\n      eval_input_data = ...\n      sess.run([eval_loss], feed_dict={eval_inputs: eval_input_data})\n\n  if i % INFER_STEPS == 0:\n    sess.run(inference_output, feed_dict={infer_inputs: infer_input_data})\n```\n\n**After: Three models in three graphs, with three Sessions sharing the same Variables**\n\n``` python\ntrain_graph = tf.Graph()\neval_graph = tf.Graph()\ninfer_graph = tf.Graph()\n\nwith train_graph.as_default():\n  train_iterator = ...\n  train_model = BuildTrainModel(train_iterator)\n  initializer = tf.global_variables_initializer()\n\nwith eval_graph.as_default():\n  eval_iterator = ...\n  eval_model = BuildEvalModel(eval_iterator)\n\nwith infer_graph.as_default():\n  infer_iterator, infer_inputs = ...\n  infer_model = BuildInferenceModel(infer_iterator)\n\ncheckpoints_path = \"/tmp/model/checkpoints\"\n\ntrain_sess = tf.Session(graph=train_graph)\neval_sess = tf.Session(graph=eval_graph)\ninfer_sess = tf.Session(graph=infer_graph)\n\ntrain_sess.run(initializer)\ntrain_sess.run(train_iterator.initializer)\n\nfor i in itertools.count():\n\n  train_model.train(train_sess)\n\n  if i % EVAL_STEPS == 0:\n    checkpoint_path = 
train_model.saver.save(train_sess, checkpoints_path, global_step=i)\n    eval_model.saver.restore(eval_sess, checkpoint_path)\n    eval_sess.run(eval_iterator.initializer)\n    while data_to_eval:\n      eval_model.eval(eval_sess)\n\n  if i % INFER_STEPS == 0:\n    checkpoint_path = train_model.saver.save(train_sess, checkpoints_path, global_step=i)\n    infer_model.saver.restore(infer_sess, checkpoint_path)\n    infer_sess.run(infer_iterator.initializer, feed_dict={infer_inputs: infer_input_data})\n    while data_to_infer:\n      infer_model.infer(infer_sess)\n```\n\nNotice how the latter approach is \"ready\" to be converted to a distributed\nversion.\n\nOne other difference in the new approach is that instead of using *feed_dicts*\nto feed data at each *session.run* call (and thereby performing our own\nbatching, bucketing, and manipulating of data), we use stateful iterator\nobjects.  These iterators make the input pipeline much easier in both the\nsingle-machine and distributed setting. We will cover the new input data\npipeline (as introduced in TensorFlow 1.2) in the next section.\n\n## Data Input Pipeline\n\nPrior to TensorFlow 1.2, users had three options for feeding data to the\nTensorFlow training and eval pipelines:\n\n1. Feed data directly via *feed_dict* at each training *session.run* call.\n1. Use the queueing mechanisms in *tf.train* (e.g. *tf.train.batch*) and\n   *tf.contrib.train*.\n1. Use helpers from a higher level framework like *tf.contrib.learn* or\n   *tf.contrib.slim* (which effectively use #2).\n\nThe first approach is easier for users who aren't familiar with TensorFlow or\nneed to do exotic input modification (e.g., their own minibatch queueing) that\ncan only be done in Python.  The second and third approaches are more standard\nbut a little less flexible; they also require starting multiple Python threads\n(queue runners).  Furthermore, if used incorrectly, queues can lead to deadlocks\nor opaque error messages.  
Nevertheless, queues are significantly more efficient\nthan using *feed_dict* and are the standard for both single-machine and\ndistributed training.\n\nStarting in TensorFlow 1.2, there is a new system available for reading data\ninto TensorFlow models: dataset iterators, as found in the **tf.data**\nmodule. Data iterators are flexible, easy to reason about and to manipulate, and\nprovide efficiency and multithreading by leveraging the TensorFlow C++ runtime.\n\nA **dataset** can be created from a batch data Tensor, a filename, or a Tensor\ncontaining multiple filenames.  Some examples:\n\n``` python\n# Training dataset consists of multiple files.\ntrain_dataset = tf.data.TextLineDataset(train_files)\n\n# Evaluation dataset uses a single file, but we may\n# point to a different file for each evaluation round.\neval_file = tf.placeholder(tf.string, shape=())\neval_dataset = tf.data.TextLineDataset(eval_file)\n\n# For inference, feed input data to the dataset directly via feed_dict.\ninfer_batch = tf.placeholder(tf.string, shape=(num_infer_examples,))\ninfer_dataset = tf.data.Dataset.from_tensor_slices(infer_batch)\n```\n\nAll datasets can be treated similarly via input processing.  This includes\nreading and cleaning the data, bucketing (in the case of training and eval),\nfiltering, and batching.\n\nTo convert each sentence into vectors of word strings, for example, we use the\ndataset map transformation:\n\n``` python\ndataset = dataset.map(lambda string: tf.string_split([string]).values)\n```\n\nWe can then switch each sentence vector into a tuple containing both the vector\nand its dynamic length:\n\n``` python\ndataset = dataset.map(lambda words: (words, tf.size(words)))\n```\n\nFinally, we can perform a vocabulary lookup on each sentence.  
Given a lookup\ntable object table, this map converts the first tuple elements from a vector of\nstrings to a vector of integers.\n\n\n``` python\ndataset = dataset.map(lambda words, size: (table.lookup(words), size))\n```\n\nJoining two datasets is also easy.  If two files contain line-by-line\ntranslations of each other and each one is read into its own dataset, then a new\ndataset containing the tuples of the zipped lines can be created via:\n\n``` python\nsource_target_dataset = tf.data.Dataset.zip((source_dataset, target_dataset))\n```\n\nBatching of variable-length sentences is straightforward. The following\ntransformation batches *batch_size* elements from *source_target_dataset*, and\nrespectively pads the source and target vectors to the length of the longest\nsource and target vector in each batch.\n\n``` python\nbatched_dataset = source_target_dataset.padded_batch(\n        batch_size,\n        padded_shapes=((tf.TensorShape([None]),  # source vectors of unknown size\n                        tf.TensorShape([])),     # size(source)\n                       (tf.TensorShape([None]),  # target vectors of unknown size\n                        tf.TensorShape([]))),    # size(target)\n        padding_values=((src_eos_id,  # source vectors padded on the right with src_eos_id\n                         0),          # size(source) -- unused\n                        (tgt_eos_id,  # target vectors padded on the right with tgt_eos_id\n                         0)))         # size(target) -- unused\n```\n\nValues emitted from this dataset will be nested tuples whose tensors have a\nleftmost dimension of size *batch_size*.  
The structure will be:\n\n-  iterator[0][0] has the batched and padded source sentence matrices.\n-  iterator[0][1] has the batched source size vectors.\n-  iterator[1][0] has the batched and padded target sentence matrices.\n-  iterator[1][1] has the batched target size vectors.\n\nFinally, bucketing that batches similarly-sized source sentences together is\nalso possible.  Please see the\nfile\n[utils/iterator_utils.py](nmt/utils/iterator_utils.py) for\nmore details and the full implementation.\n\nReading data from a Dataset requires three lines of code: create the iterator,\nget its values, and initialize it.\n\n``` python\nbatched_iterator = batched_dataset.make_initializable_iterator()\n\n((source, source_lengths), (target, target_lengths)) = batched_iterator.get_next()\n\n# At initialization time.\nsession.run(batched_iterator.initializer, feed_dict={...})\n```\n\nOnce the iterator is initialized, every *session.run* call that accesses source\nor target tensors will request the next minibatch from the underlying dataset.\n\n## Other details for better NMT models\n\n### Bidirectional RNNs\n\nBidirectionality on the encoder side generally gives better performance (with\nsome degradation in speed as more layers are used). Here, we give a simplified\nexample of how to build an encoder with a single bidirectional layer:\n\n``` python\n# Construct forward and backward cells\nforward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)\nbackward_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)\n\nbi_outputs, encoder_state = tf.nn.bidirectional_dynamic_rnn(\n    forward_cell, backward_cell, encoder_emb_inp,\n    sequence_length=source_sequence_length, time_major=True)\nencoder_outputs = tf.concat(bi_outputs, -1)\n```\n\nThe variables *encoder_outputs* and *encoder_state* can be used in the same way\nas in Section Encoder. 
Note that, for multiple bidirectional layers, we need to\nmanipulate the encoder_state a bit; see [model.py](nmt/model.py), method\n*_build_bidirectional_rnn()* for more details.\n\n### Beam search\n\nWhile greedy decoding can give us quite reasonable translation quality, a beam\nsearch decoder can further boost performance. The idea of beam search is to\nbetter explore the search space of all possible translations by keeping around a\nsmall set of top candidates as we translate. The size of the beam is called\n*beam width*; a small beam width, say 10, is generally sufficient. For\nmore information, we refer readers to Section 7.2.3\nof [Neubig, (2017)](https://arxiv.org/abs/1703.01619). Here's an example of how\nbeam search can be done:\n\n``` python\n# Replicate encoder infos beam_width times\ndecoder_initial_state = tf.contrib.seq2seq.tile_batch(\n    encoder_state, multiplier=beam_width)\n\n# Define a beam-search decoder\ndecoder = tf.contrib.seq2seq.BeamSearchDecoder(\n        cell=decoder_cell,\n        embedding=embedding_decoder,\n        start_tokens=start_tokens,\n        end_token=end_token,\n        initial_state=decoder_initial_state,\n        beam_width=beam_width,\n        output_layer=projection_layer,\n        length_penalty_weight=0.0,\n        coverage_penalty_weight=0.0)\n\n# Dynamic decoding\noutputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)\n```\n\nNote that the same *dynamic_decode()* API call is used, as in the\n[Decoder](#decoder) section. Once decoded, we can access the translations as\nfollows:\n\n``` python\ntranslations = outputs.predicted_ids\n# Make sure translations shape is [batch_size, beam_width, time]\nif self.time_major:\n   translations = tf.transpose(translations, perm=[1, 2, 0])\n```\n\nSee [model.py](nmt/model.py), method *_build_decoder()* for more details.\n\n### Hyperparameters\n\nThere are several hyperparameters that can lead to additional\nperformance gains. 
Here, we list some based on our own experience [ Disclaimer:\nothers might not agree with what we wrote! ].\n\n***Optimizer***: while Adam can lead to reasonable results for \"unfamiliar\"\narchitectures, SGD with scheduling will generally lead to better performance if\nyou can train with SGD.\n\n***Attention***: Bahdanau-style attention often requires bidirectionality on the\nencoder side to work well; whereas Luong-style attention tends to work well for\ndifferent settings. For this tutorial code, we recommend using the two improved\nvariants of Luong \u0026 Bahdanau-style attentions: *scaled_luong* \u0026\n*normed_bahdanau*.\n\n### Multi-GPU training\n\nTraining an NMT model may take several days. Placing different RNN layers on\ndifferent GPUs can improve the training speed. Here’s an example of creating\nRNN layers on multiple GPUs.\n\n``` python\ncells = []\nfor i in range(num_layers):\n  # Assign layer i to a GPU in round-robin order.\n  cells.append(tf.contrib.rnn.DeviceWrapper(\n      tf.contrib.rnn.LSTMCell(num_units),\n      \"/gpu:%d\" % (i % num_gpus)))\ncell = tf.contrib.rnn.MultiRNNCell(cells)\n```\n\nIn addition, we need to enable the `colocate_gradients_with_ops` option in\n`tf.gradients` to parallelize the gradients computation.\n\nYou may notice the speed improvement of the attention based NMT model is very\nsmall as the number of GPUs increases. One major drawback of the standard\nattention architecture is using the top (final) layer’s output to query\nattention at each time step. That means each decoding step must wait for its\nprevious step to finish completely; hence, we can’t parallelize the decoding\nprocess by simply placing RNN layers on multiple GPUs.\n\nThe [GNMT attention architecture](https://arxiv.org/pdf/1609.08144.pdf)\nparallelizes the decoder's computation by using the bottom (first) layer’s\noutput to query attention. Therefore, each decoding step can start as soon as\nits previous step's first layer and attention computation have finished. 
We\nimplemented the architecture in\n[GNMTAttentionMultiCell](nmt/gnmt_model.py),\na subclass of *tf.contrib.rnn.MultiRNNCell*. Here’s an example of how to create\na decoder cell with the *GNMTAttentionMultiCell*.\n\n``` python\ncells = []\nfor i in range(num_layers):\n  # Assign layer i to a GPU in round-robin order.\n  cells.append(tf.contrib.rnn.DeviceWrapper(\n      tf.contrib.rnn.LSTMCell(num_units),\n      \"/gpu:%d\" % (i % num_gpus)))\nattention_cell = cells.pop(0)\nattention_cell = tf.contrib.seq2seq.AttentionWrapper(\n    attention_cell,\n    attention_mechanism,\n    attention_layer_size=None,  # don't add an additional dense layer.\n    output_attention=False)\ncell = GNMTAttentionMultiCell(attention_cell, cells)\n```\n\n# Benchmarks\n\n## IWSLT English-Vietnamese\n\nTrain: 133K examples, vocab=vocab.(vi|en), train=train.(vi|en),\ndev=tst2012.(vi|en),\ntest=tst2013.(vi|en), [download script](nmt/scripts/download_iwslt15.sh).\n\n***Training details***. We train 2-layer LSTMs of 512 units with bidirectional\nencoder (i.e., 1 bidirectional layer for the encoder), embedding dim\nis 512. LuongAttention (scale=True) is used together with dropout keep_prob of\n0.8. All parameters are uniformly initialized. 
We use SGD with learning rate 1.0 as follows:\ntrain for 12K steps (~ 12 epochs); after 8K steps, we start halving the learning\nrate every 1K steps.\n\n***Results***.\n\nBelow are the averaged results of 2 models\n([model 1](http://download.tensorflow.org/models/nmt/envi_model_1.zip),\n[model 2](http://download.tensorflow.org/models/nmt/envi_model_2.zip)).\\\nWe measure the translation quality in terms of BLEU scores [(Papineni et al., 2002)](http://www.aclweb.org/anthology/P02-1040.pdf).\n\nSystems | tst2012 (dev) | tst2013 (test)\n--- | :---: | :---:\nNMT (greedy) | 23.2 | 25.5\nNMT (beam=10) | 23.8 | **26.1**\n[(Luong \u0026 Manning, 2015)](https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf) | - | 23.3\n\n**Training Speed**: (0.37s step-time, 15.3K wps) on *K40m* \u0026 (0.17s step-time, 32.2K wps) on *TitanX*.\\\nHere, step-time means the time taken to run one mini-batch (of size 128). For wps, we count words on both the source and target.\n\n## WMT German-English\n\nTrain: 4.5M examples, vocab=vocab.bpe.32000.(de|en),\ntrain=train.tok.clean.bpe.32000.(de|en), dev=newstest2013.tok.bpe.32000.(de|en),\ntest=newstest2015.tok.bpe.32000.(de|en),\n[download script](nmt/scripts/wmt16_en_de.sh)\n\n***Training details***. Our training hyperparameters are similar to the\nEnglish-Vietnamese experiments except for the following details. The data is\nsplit into subword units using [BPE](https://github.com/rsennrich/subword-nmt)\n(32K operations). We train 4-layer LSTMs of 1024 units with bidirectional\nencoder (i.e., 2 bidirectional layers for the encoder), embedding dim\nis 1024. 
We train for 350K steps (~ 10 epochs); after 170K steps, we start\nhalving the learning rate every 17K steps.\n\n***Results***.\n\nThe first 2 rows are the averaged results of 2 models\n([model 1](http://download.tensorflow.org/models/nmt/deen_model_1.zip),\n[model 2](http://download.tensorflow.org/models/nmt/deen_model_2.zip)).\nThe third row shows results with GNMT attention\n([model](http://download.tensorflow.org/models/nmt/10122017/deen_gnmt_model_4_layer.zip)),\ntrained with 4 GPUs.\n\nSystems | newstest2013 (dev) | newstest2015\n--- | :---: | :---:\nNMT (greedy) | 27.1 | 27.6\nNMT (beam=10) | 28.0 | 28.9\nNMT + GNMT attention (beam=10) | 29.0 | **29.9**\n[WMT SOTA](http://matrix.statmt.org/) | - | 29.3\n\nThese results show that our code builds strong baseline systems for NMT.\\\n(Note that WMT systems generally utilize a huge amount of monolingual data, which we currently do not.)\n\n**Training Speed**: (2.1s step-time, 3.4K wps) on *Nvidia K40m* \u0026 (0.7s step-time, 8.7K wps) on *Nvidia TitanX* for standard models.\\\nTo see the speed-ups with GNMT attention, we benchmark on *K40m* only:\n\nSystems | 1 gpu | 4 gpus | 8 gpus\n--- | :---: | :---: | :---:\nNMT (4 layers) | 2.2s, 3.4K | 1.9s, 3.9K | -\nNMT (8 layers) | 3.5s, 2.0K | - | 2.9s, 2.4K\nNMT + GNMT attention (4 layers) | 2.6s, 2.8K | 1.7s, 4.3K | -\nNMT + GNMT attention (8 layers) | 4.2s, 1.7K | - | 1.9s, 3.8K\n\nThese results show that without GNMT attention, the gains from using multiple GPUs are minimal.\\\nWith GNMT attention, we obtain 50%-100% speed-ups with multiple GPUs.\n\n## WMT English-German \u0026mdash; Full Comparison\nThe first 2 rows are our models with GNMT\nattention:\n[model 1 (4 layers)](http://download.tensorflow.org/models/nmt/10122017/ende_gnmt_model_4_layer.zip),\n[model 2 (8 layers)](http://download.tensorflow.org/models/nmt/10122017/ende_gnmt_model_8_layer.zip).\n\nSystems | newstest2014 | newstest2015\n--- | :---: | :---:\n*Ours* \u0026mdash; NMT + GNMT attention (4 layers) 
| 23.7 | 26.5\n*Ours* \u0026mdash; NMT + GNMT attention (8 layers) | 24.4 | **27.6**\n[WMT SOTA](http://matrix.statmt.org/) | 20.6 | 24.9\nOpenNMT [(Klein et al., 2017)](https://arxiv.org/abs/1701.02810) | 19.3 | -\ntf-seq2seq [(Britz et al., 2017)](https://arxiv.org/abs/1703.03906) | 22.2 | 25.2\nGNMT [(Wu et al., 2016)](https://research.google.com/pubs/pub45610.html) | **24.6** | -\n\nThe above results show our models are very competitive among models of similar architectures.\\\n[Note that OpenNMT uses smaller models and the current best result (as of this writing) is 28.4 obtained by the Transformer network [(Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) which has a significantly different architecture.]\n\n\n## Standard HParams\n\nWe have provided\n[a set of standard hparams](nmt/standard_hparams/)\nfor running inference with a pre-trained checkpoint or training the NMT architectures\nused in the benchmarks.\n\nWe will use the WMT16 German-English data, which you can download with the\nfollowing command.\n\n```\nnmt/scripts/wmt16_en_de.sh /tmp/wmt16\n```\n\nHere is an example command for loading the pre-trained GNMT WMT German-English\ncheckpoint for inference.\n\n```\npython -m nmt.nmt \\\n    --src=de --tgt=en \\\n    --ckpt=/path/to/checkpoint/translate.ckpt \\\n    --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json \\\n    --out_dir=/tmp/deen_gnmt \\\n    --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \\\n    --inference_input_file=/tmp/wmt16/newstest2014.tok.bpe.32000.de \\\n    --inference_output_file=/tmp/deen_gnmt/output_infer \\\n    --inference_ref_file=/tmp/wmt16/newstest2014.tok.bpe.32000.en\n```\n\nHere is an example command for training the GNMT WMT German-English model.\n\n```\npython -m nmt.nmt \\\n    --src=de --tgt=en \\\n    --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json \\\n    --out_dir=/tmp/deen_gnmt \\\n    --vocab_prefix=/tmp/wmt16/vocab.bpe.32000 \\\n    --train_prefix=/tmp/wmt16/train.tok.clean.bpe.32000 \\\n    
--dev_prefix=/tmp/wmt16/newstest2013.tok.bpe.32000 \\\n    --test_prefix=/tmp/wmt16/newstest2015.tok.bpe.32000\n```\n\n\n# Other resources\n\nFor deeper reading on Neural Machine Translation and sequence-to-sequence\nmodels, we highly recommend the following materials\nby\n[Luong, Cho, Manning, (2016)](https://sites.google.com/site/acl16nmt/);\n[Luong, (2016)](https://github.com/lmthang/thesis);\nand [Neubig, (2017)](https://arxiv.org/abs/1703.01619).\n\nThere's a wide variety of tools for building seq2seq models, so we pick one per\nlanguage:\\\nStanford NMT\n[https://nlp.stanford.edu/projects/nmt/](https://nlp.stanford.edu/projects/nmt/)\n*[Matlab]* \\\ntf-seq2seq\n[https://github.com/google/seq2seq](https://github.com/google/seq2seq)\n*[TensorFlow]* \\\nNematus\n[https://github.com/rsennrich/nematus](https://github.com/rsennrich/nematus)\n*[Theano]* \\\nOpenNMT [http://opennmt.net/](http://opennmt.net/) *[Torch]*\\\nOpenNMT-py [https://github.com/OpenNMT/OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) *[PyTorch]*\n\n\n\n# Acknowledgment\nWe would like to thank Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library. Additional thanks go to Lukasz Kaiser for the initial help on the seq2seq codebase; Quoc Le for the suggestion to replicate GNMT; Yonghui Wu and Zhifeng Chen for details on the GNMT systems; as well as the Google Brain team for their support and feedback!\n\n# References\n\n-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua\n   Bengio. 2015.[ Neural machine translation by jointly learning to align and translate](https://arxiv.org/pdf/1409.0473.pdf). ICLR.\n-  Minh-Thang Luong, Hieu Pham, and Christopher D\n   Manning. 2015.[ Effective approaches to attention-based neural machine translation](https://arxiv.org/pdf/1508.04025.pdf). EMNLP.\n-  Ilya Sutskever, Oriol Vinyals, and Quoc\n   V. Le. 
2014.[ Sequence to sequence learning with neural networks](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf). NIPS.\n\n# BibTex\n\n```\n@article{luong17,\n  author  = {Minh{-}Thang Luong and Eugene Brevdo and Rui Zhao},\n  title   = {Neural Machine Translation (seq2seq) Tutorial},\n  journal = {https://github.com/tensorflow/nmt},\n  year    = {2017},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorflow%2Fnmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftensorflow%2Fnmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorflow%2Fnmt/lists"}