{"id":13639194,"url":"https://github.com/Element-Research/rnn","last_synced_at":"2025-04-19T22:31:48.651Z","repository":{"id":28972129,"uuid":"32498520","full_name":"Element-Research/rnn","owner":"Element-Research","description":"Recurrent Neural Network library for Torch7's nn","archived":false,"fork":false,"pushed_at":"2017-12-21T06:29:48.000Z","size":2283,"stargazers_count":938,"open_issues_count":78,"forks_count":314,"subscribers_count":99,"default_branch":"master","last_synced_at":"2024-08-03T01:14:09.643Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Element-Research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.2nd.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-19T03:23:26.000Z","updated_at":"2024-07-01T14:11:06.000Z","dependencies_parsed_at":"2022-07-17T20:47:10.060Z","dependency_job_id":null,"html_url":"https://github.com/Element-Research/rnn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Element-Research%2Frnn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Element-Research%2Frnn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Element-Research%2Frnn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Element-Research%2Frnn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Element-Research","download_url":"https://codeload.github.com/Element-Research/rnn/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"htt
ps://github.com","kind":"github","repositories_count":223810297,"owners_count":17206730,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T01:00:58.556Z","updated_at":"2024-11-09T09:30:42.599Z","avatar_url":"https://github.com/Element-Research.png","language":"Lua","funding_links":[],"categories":["Lua","Software Package","Codes","Libraries"],"sub_categories":["Tools","[Tools](#tools-1)","Speech Recognition","Model related"],"readme":"# rnn: recurrent neural networks #\n\nNote: this repository is deprecated in favor of https://github.com/torch/rnn.\n\nThis is a Recurrent Neural Network library that extends Torch's nn. 
\nYou can use it to build RNNs, LSTMs, GRUs, BRNNs, BLSTMs, and so forth and so on.\nThis library includes documentation for the following objects:\n\nModules that consider successive calls to `forward` as different time-steps in a sequence :\n * [AbstractRecurrent](#rnn.AbstractRecurrent) : an abstract class inherited by Recurrent and LSTM;\n * [Recurrent](#rnn.Recurrent) : a generalized recurrent neural network container;\n * [LSTM](#rnn.LSTM) : a vanilla Long-Short Term Memory module;\n  * [FastLSTM](#rnn.FastLSTM) : a faster [LSTM](#rnn.LSTM) with optional support for batch normalization;\n * [GRU](#rnn.GRU) : Gated Recurrent Units module;\n * [MuFuRu](#rnn.MuFuRu) : [Multi-function Recurrent Unit](https://arxiv.org/abs/1606.03002) module;\n * [Recursor](#rnn.Recursor) : decorates a module to make it conform to the [AbstractRecurrent](#rnn.AbstractRecurrent) interface;\n * [Recurrence](#rnn.Recurrence) : decorates a module that outputs `output(t)` given `{input(t), output(t-1)}`;\n * [NormStabilizer](#rnn.NormStabilizer) : implements [norm-stabilization](http://arxiv.org/abs/1511.08400) criterion (add this module between RNNs);\n\nModules that `forward` entire sequences through a decorated `AbstractRecurrent` instance :\n * [AbstractSequencer](#rnn.AbstractSequencer) : an abstract class inherited by Sequencer, Repeater, RecurrentAttention, etc.;\n * [Sequencer](#rnn.Sequencer) : applies an encapsulated module to all elements in an input sequence  (Tensor or Table);\n * [SeqLSTM](#rnn.SeqLSTM) : a very fast version of `nn.Sequencer(nn.FastLSTM)` where the `input` and `output` are tensors;\n  * [SeqLSTMP](#rnn.SeqLSTMP) : `SeqLSTM` with a projection layer;\n * [SeqGRU](#rnn.SeqGRU) : a very fast version of `nn.Sequencer(nn.GRU)` where the `input` and `output` are tensors;\n * [SeqBRNN](#rnn.SeqBRNN) : Bidirectional RNN based on SeqLSTM;\n * [BiSequencer](#rnn.BiSequencer) : used for implementing Bidirectional RNNs and LSTMs;\n * 
[BiSequencerLM](#rnn.BiSequencerLM) : used for implementing Bidirectional RNNs and LSTMs for language models;\n * [Repeater](#rnn.Repeater) : repeatedly applies the same input to an AbstractRecurrent instance;\n * [RecurrentAttention](#rnn.RecurrentAttention) : a generalized attention model for [REINFORCE modules](https://github.com/nicholas-leonard/dpnn#nn.Reinforce);\n\nMiscellaneous modules and criterions :\n * [MaskZero](#rnn.MaskZero) : zeroes the `output` and `gradOutput` rows of the decorated module for commensurate `input` rows which are tensors of zeros;\n * [TrimZero](#rnn.TrimZero) : same behavior as `MaskZero`, but more efficient when `input` contains lots of zero-masked rows;\n * [LookupTableMaskZero](#rnn.LookupTableMaskZero) : extends `nn.LookupTable` to support zero indexes for padding. Zero indexes are forwarded as tensors of zeros;\n * [MaskZeroCriterion](#rnn.MaskZeroCriterion) : zeroes the `gradInput` and `err` rows of the decorated criterion for commensurate `input` rows which are tensors of zeros;\n * [SeqReverseSequence](#rnn.SeqReverseSequence) : reverses an input sequence on a specific dimension;\n\nCriterions used for handling sequential inputs and targets :\n * [SequencerCriterion](#rnn.SequencerCriterion) : sequentially applies the same criterion to a sequence of inputs and targets (Tensor or Table).\n * [RepeaterCriterion](#rnn.RepeaterCriterion) : repeatedly applies the same criterion with the same target on a sequence.\n\nTo install this repository:\n```\ngit clone git@github.com:Element-Research/rnn.git\ncd rnn\nluarocks make rocks/rnn-scm-1.rockspec\n```\nNote that `luarocks install rnn` now installs https://github.com/torch/rnn instead.\n\n\u003ca name='rnn.examples'\u003e\u003c/a\u003e\n## Examples ##\n\nThe following are example training scripts using this package :\n\n  * [RNN/LSTM/GRU](examples/recurrent-language-model.lua) for Penn Tree Bank dataset;\n  * [Noise Contrastive Estimate](examples/noise-contrastive-estimate.lua) for 
training multi-layer [SeqLSTM](#rnn.SeqLSTM) language models on the [Google Billion Words dataset](https://github.com/Element-Research/dataload#dl.loadGBW). The example uses [MaskZero](#rnn.MaskZero) to train independent variable length sequences using the [NCEModule](https://github.com/Element-Research/dpnn#nn.NCEModule) and [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion). This script is our fastest yet, boasting speeds of 20,000 words/second (on NVIDIA Titan X) with a 2-layer LSTM having 250 hidden units, a batch size of 128 and a sequence length of 100. Note that you will need to have [Torch installed with Lua instead of LuaJIT](http://torch.ch/docs/getting-started.html#_);\n  * [Recurrent Model for Visual Attention](examples/recurrent-visual-attention.lua) for the MNIST dataset;\n  * [Encoder-Decoder LSTM](examples/encoder-decoder-coupling.lua) shows you how to couple encoder and decoder `LSTMs` for sequence-to-sequence networks;\n  * [Simple Recurrent Network](examples/simple-recurrent-network.lua) shows a simple example for building and training a simple recurrent neural network;\n  * [Simple Sequencer Network](examples/simple-sequencer-network.lua) is a version of the above script that uses the Sequencer to decorate the `rnn` instead;\n  * [Sequence to One](examples/sequence-to-one.lua) demonstrates how to do many to one sequence learning as is the case for sentiment analysis;\n  * [Multivariate Time Series](examples/recurrent-time-series.lua) demonstrates how to train a simple RNN to do multi-variate time-series prediction.\n\n### External Resources\n\n  * [rnn-benchmarks](https://github.com/glample/rnn-benchmarks) : benchmarks comparing Torch (using this library), Theano and TensorFlow.\n  * [Harvard Jupyter Notebook Tutorial](http://nbviewer.jupyter.org/github/CS287/Lectures/blob/gh-pages/notebooks/ElementRNNTutorial.ipynb) : an in-depth tutorial for how to use the Element-Research rnn package by Harvard University;\n  * 
[dpnn](https://github.com/Element-Research/dpnn) : this is a dependency of the __rnn__ package. It contains useful nn extensions, modules and criterions;\n  * [dataload](https://github.com/Element-Research/dataload) : a collection of torch dataset loaders;\n  * [RNN/LSTM/BRNN/BLSTM training script](https://github.com/nicholas-leonard/dp/blob/master/examples/recurrentlanguagemodel.lua) for Penn Tree Bank or Google Billion Words datasets;\n  * A brief (1 hour) overview of Torch7, which includes some details about the __rnn__ package (at the end), is available via this [NVIDIA GTC Webinar video](http://on-demand.gputechconf.com/gtc/2015/webinar/torch7-applied-deep-learning-for-vision-natural-language.mp4). In any case, this presentation gives a nice overview of Logistic Regression, Multi-Layer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks using Torch7;\n  * [Sequence to Sequence mapping using encoder-decoder RNNs](https://github.com/rahul-iisc/seq2seq-mapping) : a complete training example using synthetic data.\n  * [ConvLSTM](https://github.com/viorik/ConvLSTM) is a repository for training a [Spatio-temporal video autoencoder with differentiable memory](http://arxiv.org/abs/1511.06309).\n  * A [time series example](https://github.com/rracinskij/rnntest01/blob/master/rnntest01.lua) for univariate time-series prediction.\n  \n## Citation ##\n\nIf you use __rnn__ in your work, we'd really appreciate it if you could cite the following paper:\n\nLéonard, Nicholas, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. 
[rnn: Recurrent Library for Torch.](http://arxiv.org/abs/1511.07889) arXiv preprint arXiv:1511.07889 (2015).\n\nAny significant contributor to the library will also get added as an author to the paper.\nA [significant contributor](https://github.com/Element-Research/rnn/graphs/contributors) \nis anyone who added at least 300 lines of code to the library.\n\n## Troubleshooting ##\n\nMost issues can be resolved by updating the various dependencies:\n```bash\nluarocks install torch\nluarocks install nn\nluarocks install dpnn\nluarocks install torchx\n```\n\nIf you are using CUDA :\n```bash\nluarocks install cutorch\nluarocks install cunn\nluarocks install cunnx\n```\n\nAnd don't forget to update this package :\n```bash\ngit clone git@github.com:Element-Research/rnn.git\ncd rnn\nluarocks make rocks/rnn-scm-1.rockspec\n```\n\nIf that doesn't fix it, open an issue on GitHub.\n\n\u003ca name='rnn.AbstractRecurrent'\u003e\u003c/a\u003e\n## AbstractRecurrent ##\nAn abstract class inherited by [Recurrent](#rnn.Recurrent), [LSTM](#rnn.LSTM) and [GRU](#rnn.GRU).\nThe constructor takes a single argument :\n```lua\nrnn = nn.AbstractRecurrent([rho])\n```\nArgument `rho` is the maximum number of steps to backpropagate through time (BPTT).\nSub-classes can set this to a large number like 99999 (the default) if they want to backpropagate through \nthe entire sequence whatever its length. Setting a lower value of `rho` is \nuseful when long sequences are forward propagated, but we only wish to \nbackpropagate through the last `rho` steps, which means that the remainder \nof the sequence doesn't need to be stored (so no additional cost). \n\n### [recurrentModule] getStepModule(step) ###\nReturns a module for time-step `step`. This is used internally by sub-classes \nto obtain copies of the internal `recurrentModule`. These copies share \n`parameters` and `gradParameters` but each have their own `output`, `gradInput` \nand any other intermediate states. 
\n\n### setOutputStep(step) ###\nThis is a method reserved for internal use by [Recursor](#rnn.Recursor) \nwhen doing backward propagation. It sets the object's `output` attribute\nto point to the output at time-step `step`. \nThis method was introduced to solve a very annoying bug.\n\n\u003ca name='rnn.AbstractRecurrent.maskZero'\u003e\u003c/a\u003e\n### maskZero(nInputDim) ###\nDecorates the internal `recurrentModule` with [MaskZero](#rnn.MaskZero). \nThe `output` Tensor (or table thereof) of the `recurrentModule`\nwill have each row (i.e. samples) zeroed when the commensurate row of the `input` \nis a tensor of zeros. \n\nThe `nInputDim` argument must specify the number of non-batch dims \nin the first Tensor of the `input`. In the case of an `input` table,\nthe first Tensor is the first one encountered when doing a depth-first search.\n\nCalling this method makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\nWhen a sample time-step is masked (i.e. `input` is a row of zeros), then \nthe hidden state is effectively reset (i.e. forgotten) for the next non-mask time-step.\nIn other words, it is possible to separate unrelated sequences with a masked element.\n\n### trimZero(nInputDim) ###\nDecorates the internal `recurrentModule` with [TrimZero](#rnn.TrimZero). \n\n### [output] updateOutput(input) ###\nForward propagates the input for the current step. The outputs or intermediate \nstates of the previous steps are used recurrently. This is transparent to the \ncaller as the previous outputs and intermediate states are memorized. This \nmethod also increments the `step` attribute by 1.\n\n\u003ca name='rnn.AbstractRecurrent.updateGradInput'\u003e\u003c/a\u003e\n### updateGradInput(input, gradOutput) ###\nLike `backward`, this method should be called in the reverse order of \n`forward` calls used to propagate a sequence. 
So for example :\n\n```lua\nrnn = nn.LSTM(10, 10) -- AbstractRecurrent instance\nlocal outputs = {}\nfor i=1,nStep do -- forward propagate sequence\n   outputs[i] = rnn:forward(inputs[i])\nend\n\nfor i=nStep,1,-1 do -- backward propagate sequence in reverse order\n   gradInputs[i] = rnn:backward(inputs[i], gradOutputs[i])\nend\n\nrnn:forget()\n``` \n\nThe reverse order implements backpropagation through time (BPTT).\n\n### accGradParameters(input, gradOutput, scale) ###\nLike `updateGradInput`, but for accumulating gradients w.r.t. parameters.\n\n### recycle(offset) ###\nThis method goes hand in hand with `forget`. It is useful when the current\ntime-step is greater than `rho`, at which point it starts recycling \nthe oldest `recurrentModule` `sharedClones`, \nsuch that they can be reused for storing the next step. This `offset` \nis used for modules like `nn.Recurrent` that use a different module \nfor the first step. Default offset is 0.\n\n\u003ca name='rnn.AbstractRecurrent.forget'\u003e\u003c/a\u003e\n### forget(offset) ###\nThis method brings back all states to the start of the sequence buffers, \ni.e. it forgets the current sequence. It also resets the `step` attribute to 1.\nIt is highly recommended to call `forget` after each parameter update. \nOtherwise, the previous state will be used to activate the next, which \nwill often lead to instability. This is caused by the previous state being\nthe result of now changed parameters. It is also good practice to call \n`forget` at the start of each new sequence.\n\n\u003ca name='rnn.AbstractRecurrent.maxBPTTstep'\u003e\u003c/a\u003e\n###  maxBPTTstep(rho) ###\nThis method sets the maximum number of time-steps for which to perform \nbackpropagation through time (BPTT). So say you set this to `rho = 3` time-steps,\nfeed-forward for 4 steps, and then backpropgate, only the last 3 steps will be \nused for the backpropagation. 
If your AbstractRecurrent instance is wrapped \nby a [Sequencer](#rnn.Sequencer), this will be handled auto-magically by the Sequencer.\nOtherwise, setting this to a large value (e.g. 9999999) is good for most, if not all, cases.\n\n\u003ca name='rnn.AbstractRecurrent.backwardOnline'\u003e\u003c/a\u003e\n### backwardOnline() ###\nThis method was deprecated Jan 6, 2016. \nSince then, by default, `AbstractRecurrent` instances use the \nbackwardOnline behaviour. \nSee [updateGradInput](#rnn.AbstractRecurrent.updateGradInput) for details.\n\n### training() ###\nIn training mode, the network remembers all previous `rho` (number of time-steps)\nstates. This is necessary for BPTT. \n\n### evaluate() ###\nDuring evaluation, since there is no need to perform BPTT at a later time, \nonly the previous step is remembered. This is very efficient memory-wise, \nsuch that evaluation can be performed using potentially infinite-length \nsequences.\n \n\u003ca name='rnn.Recurrent'\u003e\u003c/a\u003e\n## Recurrent ##\nReferences :\n * A. [Sutskever Thesis Sec. 2.5 and 2.8](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf)\n * B. [Mikolov Thesis Sec. 3.2 and 3.3](http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf)\n * C. [RNN and Backpropagation Guide](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.9311\u0026rep=rep1\u0026type=pdf)\n\nA [composite Module](https://github.com/torch/nn/blob/master/doc/containers.md#containers) for implementing Recurrent Neural Networks (RNN), excluding the output layer. \n\nThe `nn.Recurrent(start, input, feedback, [transfer, rho, merge])` constructor takes 6 arguments:\n * `start` : the size of the output (excluding the batch dimension), or a Module that will be inserted between the `input` Module and `transfer` module during the first step of the propagation. When `start` is a size (a number or `torch.LongTensor`), then this *start* Module will be initialized as `nn.Add(start)` (see Ref. 
A).\n * `input` : a Module that processes input Tensors (or Tables). Output must be of same size as `start` (or its output in the case of a `start` Module), and same size as the output of the `feedback` Module.\n * `feedback` : a Module that feedbacks the previous output Tensor (or Tables) up to the `merge` module.\n * `merge` : a [table Module](https://github.com/torch/nn/blob/master/doc/table.md#table-layers) that merges the outputs of the `input` and `feedback` Module before being forwarded through the `transfer` Module.\n * `transfer` : a non-linear Module used to process the output of the `merge` module, or in the case of the first step, the output of the `start` Module.\n * `rho` : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Due to the vanishing gradients effect, references A and B recommend `rho = 5` (or lower). Defaults to 99999.\n \nAn RNN is used to process a sequence of inputs. \nEach step in the sequence should be propagated by its own `forward` (and `backward`), \none `input` (and `gradOutput`) at a time. \nEach call to `forward` keeps a log of the intermediate states (the `input` and many `Module.outputs`) \nand increments the `step` attribute by 1. \nMethod `backward` must be called in reverse order of the sequence of calls to `forward` in \norder to backpropgate through time (BPTT). This reverse order is necessary \nto return a `gradInput` for each call to `forward`. \n\nThe `step` attribute is only reset to 1 when a call to the `forget` method is made. \nIn which case, the Module is ready to process the next sequence (or batch thereof).\nNote that the longer the sequence, the more memory that will be required to store all the \n`output` and `gradInput` states (one for each time step). 
\n\nTo use this module with batches, we suggest using different \nsequences of the same size within a batch and calling `updateParameters` \nevery `rho` steps and `forget` at the end of the sequence. \n\nNote that calling the `evaluate` method turns off long-term memory; \nthe RNN will only remember the previous output. This allows the RNN \nto handle long sequences without allocating any additional memory.\n\n\nFor a simple concise example of how to make use of this module, please consult the \n[simple-recurrent-network.lua](examples/simple-recurrent-network.lua)\ntraining script.\n\n\u003ca name='rnn.Recurrent.Sequencer'\u003e\u003c/a\u003e\n### Decorate it with a Sequencer ###\n\nNote that any `AbstractRecurrent` instance can be decorated with a [Sequencer](#rnn.Sequencer) \nsuch that an entire sequence (a table) can be presented with a single `forward/backward` call.\nThis is actually the recommended approach as it allows RNNs to be stacked and makes the \nrnn conform to the Module interface, i.e. each call to `forward` can be \nfollowed by its own immediate call to `backward` as each `input` to the \nmodel is an entire sequence, i.e. a table of tensors where each tensor represents\na time-step.\n\n```lua\nseq = nn.Sequencer(module)\n```\n\nThe [simple-sequencer-network.lua](examples/simple-sequencer-network.lua) training script\nis equivalent to the above-mentioned [simple-recurrent-network.lua](examples/simple-recurrent-network.lua)\nscript, except that it decorates the `rnn` with a `Sequencer` which takes \na table of `inputs` and `gradOutputs` (the sequence for that batch).\nThis lets the `Sequencer` handle the looping over the sequence.\n\nYou should only think about using the `AbstractRecurrent` modules without \na `Sequencer` if you intend to use it for real-time prediction. 
\nActually, you can even use an `AbstractRecurrent` instance decorated by a `Sequencer`\nfor real time prediction by calling `Sequencer:remember()` and presenting each \ntime-step `input` as `{input}`.\n\nOther decorators can be used such as the [Repeater](#rnn.Repeater) or [RecurrentAttention](#rnn.RecurrentAttention).\nThe `Sequencer` is only the most common one. \n\n\u003ca name='rnn.LSTM'\u003e\u003c/a\u003e\n## LSTM ##\nReferences :\n * A. [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778v1.pdf)\n * B. [Long-Short Term Memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)\n * C. [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069v1.pdf)\n * D. [nngraph LSTM implementation on github](https://github.com/wojzaremba/lstm)\n\nThis is an implementation of a vanilla Long-Short Term Memory module. \nWe used Ref. A's LSTM as a blueprint for this module as it was the most concise.\nYet it is also the vanilla LSTM described in Ref. C. \n\nThe `nn.LSTM(inputSize, outputSize, [rho])` constructor takes 3 arguments:\n * `inputSize` : a number specifying the size of the input;\n * `outputSize` : a number specifying the size of the output;\n * `rho` : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. 
Defaults to 9999.\n\n![LSTM](doc/image/LSTM.png) \n\nThe actual implementation corresponds to the following algorithm:\n```lua\ni[t] = σ(W[x-\u003ei]x[t] + W[h-\u003ei]h[t−1] + W[c-\u003ei]c[t−1] + b[1-\u003ei])      (1)\nf[t] = σ(W[x-\u003ef]x[t] + W[h-\u003ef]h[t−1] + W[c-\u003ef]c[t−1] + b[1-\u003ef])      (2)\nz[t] = tanh(W[x-\u003ec]x[t] + W[h-\u003ec]h[t−1] + b[1-\u003ec])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x-\u003eo]x[t] + W[h-\u003eo]h[t−1] + W[c-\u003eo]c[t] + b[1-\u003eo])        (5)\nh[t] = o[t]tanh(c[t])                                                (6)\n```\nwhere `W[s-\u003eq]` is the weight matrix from `s` to `q`, `t` indexes the time-step,\n`b[1-\u003eq]` are the biases leading into `q`, `σ()` is `Sigmoid`, `x[t]` is the input,\n`i[t]` is the input gate (eq. 1), `f[t]` is the forget gate (eq. 2), \n`z[t]` is the input to the cell (which we call the hidden) (eq. 3), \n`c[t]` is the cell (eq. 4), `o[t]` is the output gate (eq. 5), \nand `h[t]` is the output of this module (eq. 6). Also note that the \nweight matrices from cell to gate vectors are diagonal `W[c-\u003es]`, where `s` \nis `i`,`f`, or `o`.\n\nAs you can see, unlike [Recurrent](#rnn.Recurrent), this \nimplementation isn't generic enough that it can take arbitrary component Module\ndefinitions at construction. However, the LSTM module can easily be adapted \nthrough inheritance by overriding the different factory methods :\n  * `buildGate` : builds generic gate that is used to implement the input, forget and output gates;\n  * `buildInputGate` : builds the input gate (eq. 1). Currently calls `buildGate`;\n  * `buildForgetGate` : builds the forget gate (eq. 2). Currently calls `buildGate`;\n  * `buildHidden` : builds the hidden (eq. 3);\n  * `buildCell` : builds the cell (eq. 4);\n  * `buildOutputGate` : builds the output gate (eq. 5). 
Currently calls `buildGate`;\n  * `buildModel` : builds the actual LSTM model which is used internally (eq. 6).\n  \nNote that we recommend decorating the `LSTM` with a `Sequencer` \n(refer to [this](#rnn.Recurrent.Sequencer) for details).\n  \n\u003ca name='rnn.FastLSTM'\u003e\u003c/a\u003e\n## FastLSTM ##\n\nA faster version of the [LSTM](#rnn.LSTM). \nBasically, the input, forget and output gates, as well as the hidden state are computed in one fell swoop.\n\nNote that `FastLSTM` does not use peephole connections between cell and gates. The algorithm from `LSTM` changes as follows:\n```lua\ni[t] = σ(W[x-\u003ei]x[t] + W[h-\u003ei]h[t−1] + b[1-\u003ei])                      (1)\nf[t] = σ(W[x-\u003ef]x[t] + W[h-\u003ef]h[t−1] + b[1-\u003ef])                      (2)\nz[t] = tanh(W[x-\u003ec]x[t] + W[h-\u003ec]h[t−1] + b[1-\u003ec])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x-\u003eo]x[t] + W[h-\u003eo]h[t−1] + b[1-\u003eo])                      (5)\nh[t] = o[t]tanh(c[t])                                                (6)\n```\ni.e. omitting the summands `W[c-\u003ei]c[t−1]` (eq. 1), `W[c-\u003ef]c[t−1]` (eq. 2), and `W[c-\u003eo]c[t]` (eq. 5).\n\n### usenngraph ###\nThis is a static attribute of the `FastLSTM` class. The default value is `false`.\nSetting `usenngraph = true` will force all newly instantiated instances of `FastLSTM` \nto use `nngraph`'s `nn.gModule` to build the internal `recurrentModule` which is \ncloned for each time-step.\n\n\u003ca name='rnn.FastLSTM.bn'\u003e\u003c/a\u003e\n#### Recurrent Batch Normalization ####\n\nThis extends the `FastLSTM` class to enable faster convergence during training by zero-centering the input-to-hidden and hidden-to-hidden transformations. \nIt reduces the [internal covariate shift](https://arXiv.org/1502.03167v3) between time steps. It is an implementation of Cooijmans et al.'s [Recurrent Batch Normalization](https://arxiv.org/1603.09025). 
The hidden-to-hidden transition of each LSTM cell is normalized according to \n```lua\ni[t] = σ(BN(W[x-\u003ei]x[t]) + BN(W[h-\u003ei]h[t−1]) + b[1-\u003ei])                      (1)\nf[t] = σ(BN(W[x-\u003ef]x[t]) + BN(W[h-\u003ef]h[t−1]) + b[1-\u003ef])                      (2)\nz[t] = tanh(BN(W[x-\u003ec]x[t]) + BN(W[h-\u003ec]h[t−1]) + b[1-\u003ec])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                                 (4)\no[t] = σ(BN(W[x-\u003eo]x[t]) + BN(W[h-\u003eo]h[t−1]) + b[1-\u003eo])                      (5)\nh[t] = o[t]tanh(c[t])                                                        (6)\n``` \nwhere the batch normalizing transform is:                                   \n```lua\n  BN(h; gamma, beta) = beta + gamma *      hd - E(hd)\n                                      ------------------\n                                       sqrt(E(σ(hd)) + eps)                       \n```\nwhere `hd` is a vector of (pre)activations to be normalized, and `gamma` and `beta` are model parameters that determine the standard deviation and mean of the normalized activation, respectively. `eps` is a regularization hyperparameter that keeps the division numerically stable, and `E(hd)` and `E(σ(hd))` are the estimates of the mean and variance in the mini-batch respectively. The authors recommend initializing `gamma` to a small value and found 0.1 to be the value that did not cause vanishing gradients. `beta`, the shift parameter, is `null` by default.\n\nTo turn on batch normalization during training, do:\n```lua\nnn.FastLSTM.bn = true\nlstm = nn.FastLSTM(inputsize, outputsize, [rho, eps, momentum, affine])\n``` \n\nwhere `momentum` is the same as `gamma` in the equation above (defaults to 0.1), `eps` is defined above and `affine` is a boolean whose state determines if the learnable affine transform is turned off (`false`) or on (`true`, the default).\n\n\u003ca name='rnn.GRU'\u003e\u003c/a\u003e\n## GRU ##\n\nReferences :\n * A. 
[Learning Phrase Representations Using RNN Encoder-Decoder For Statistical Machine Translation.](http://arxiv.org/pdf/1406.1078.pdf)\n * B. [Implementing a GRU/LSTM RNN with Python and Theano](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/)\n * C. [An Empirical Exploration of Recurrent Network Architectures](http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf)\n * D. [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling](http://arxiv.org/abs/1412.3555)\n * E. [RnnDrop: A Novel Dropout for RNNs in ASR](http://www.stat.berkeley.edu/~tsmoon/files/Conference/asru2015.pdf)\n * F. [A Theoretically Grounded Application of Dropout in Recurrent Neural Networks](http://arxiv.org/abs/1512.05287)\n\nThis is an implementation of Gated Recurrent Units module. \n\nThe `nn.GRU(inputSize, outputSize [,rho [,p [, mono]]])` constructor takes 3 arguments likewise `nn.LSTM` or 4 arguments for dropout:\n * `inputSize` : a number specifying the size of the input;\n * `outputSize` : a number specifying the size of the output;\n * `rho` : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999;\n * `p` : dropout probability for inner connections of GRUs.\n * `mono` : Monotonic sample for dropouts inside GRUs. 
Only needed in a `TrimZero` + `BGRU`(p\u003e0) situation.\n\n![GRU](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/10/Screen-Shot-2015-10-23-at-10.36.51-AM.png) \n\nThe actual implementation corresponds to the following algorithm:\n```lua\nz[t] = σ(W[x-\u003ez]x[t] + W[s-\u003ez]s[t−1] + b[1-\u003ez])            (1)\nr[t] = σ(W[x-\u003er]x[t] + W[s-\u003er]s[t−1] + b[1-\u003er])            (2)\nh[t] = tanh(W[x-\u003eh]x[t] + W[sr-\u003eh](s[t−1]r[t]) + b[1-\u003eh])  (3)\ns[t] = (1-z[t])h[t] + z[t]s[t-1]                           (4)\n```\nwhere `W[s-\u003eq]` is the weight matrix from `s` to `q`, `t` indexes the time-step, `b[1-\u003eq]` are the biases leading into `q`, `σ()` is `Sigmoid`, `x[t]` is the input and `s[t]` is the output of the module (eq. 4). Note that unlike the [LSTM](#rnn.LSTM), the GRU has no cells.\n\nThe GRU was benchmarked on the `PennTreeBank` dataset using the [recurrent-language-model.lua](examples/recurrent-language-model.lua) script. \nIt slightly outperformed `FastLSTM`. However, since LSTMs have more parameters than GRUs, \na dataset larger than `PennTreeBank` might change the performance result. \nDon't be too hasty to judge which one is the better of the two (see Ref. C and D).\n\n```\n                Memory   examples/s\n    FastLSTM      176M        16.5K \n    GRU            92M        15.8K\n```\n\n__Memory__ is measured by the size of the `dp.Experiment` save file. __examples/s__ is measured by the training speed at 1 epoch, so it may have a disk IO bias.\n\n![GRU-BENCHMARK](doc/image/gru-benchmark.png)\n\nRNN dropout (see Ref. E and F) was also benchmarked on the `PennTreeBank` dataset using the `recurrent-language-model.lua` script. The details can be found in the script. In the benchmark, `GRU` utilizes a dropout after `LookupTable`, while `BGRU`, which stands for Bayesian GRU, uses dropouts on inner connections (named as in Ref. F), but not after `LookupTable`.\n\nAs Yarin Gal (Ref. 
F) mentioned, `p = 0.25` is a recommended value for a first attempt.\n\n![GRU-BENCHMARK](doc/image/bgru-benchmark.png)\n\n### SAdd\n\nTo implement `GRU`, a simple module is added whose behavior cannot be built from existing `nn` modules alone.\n\n```lua\nmodule = nn.SAdd(addend, negate)\n```\nApplies a single scalar addition to the incoming data, i.e. `y_i = x_i + b`, then negates all components if `negate` is true. This is used to implement `s[t] = (1-z[t])h[t] + z[t]s[t-1]` of `GRU` (see Equation (4) above).\n\n```lua\nnn.SAdd(-1, true)\n```\nHere, if the incoming data is `z[t]`, then the output becomes `-(z[t]-1)=1-z[t]`. Note that `nn.Mul()` is not suitable here because it multiplies by a scalar that is a learnable parameter.\n\n\u003ca name='rnn.MuFuRu'\u003e\u003c/a\u003e\n## MuFuRu ##\n\nReferences :\n * A. [MuFuRU: The Multi-Function Recurrent Unit.](https://arxiv.org/abs/1606.03002)\n * B. [Tensorflow Implementation of the Multi-Function Recurrent Unit](https://github.com/dirkweissenborn/mufuru)\n\nThis is an implementation of the Multi-Function Recurrent Unit module. \n\nThe `nn.MuFuRu(inputSize, outputSize [,ops [,rho]])` constructor takes 2 required arguments, plus optional arguments:\n * `inputSize` : a number specifying the dimension of the input;\n * `outputSize` : a number specifying the dimension of the output;\n * `ops`: a table of strings, representing which composition operations should be used. The table can be any subset of `{'keep', 'replace', 'mul', 'diff', 'forget', 'sqrt_diff', 'max', 'min'}`. By default, all composition operations are enabled.\n * `rho` : the maximum number of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999.\n\nThe Multi-Function Recurrent Unit generalizes the GRU by allowing weightings of arbitrary composition operators to be learned. 
As in the GRU, the reset gate is computed based on the current input and previous hidden state, and used to compute a new feature vector:\n\n```lua\nr[t] = σ(W[x-\u003er]x[t] + W[s-\u003er]s[t−1] + b[1-\u003er])            (1)\nv[t] = tanh(W[x-\u003ev]x[t] + W[sr-\u003ev](s[t−1]r[t]) + b[1-\u003ev])  (2)\n```\n\nwhere `W[a-\u003eb]` denotes the weight matrix from activation `a` to `b`, `t` denotes the time step, `b[1-\u003ea]` is the bias for activation `a`, and `s[t-1]r[t]` is the element-wise multiplication of the two vectors.\n\nUnlike in the GRU, rather than computing a single update gate (`z[t]` in [GRU](#rnn.GRU)), MuFuRu computes a weighting over an arbitrary number of composition operators.\n\nA composition operator is any differentiable operator which takes two vectors of the same size (the previous hidden state and a new feature vector) and returns a new vector representing the new hidden state. The GRU implicitly defines two such operations, `keep` and `replace`, defined as `keep(s[t-1], v[t]) = s[t-1]` and `replace(s[t-1], v[t]) = v[t]`.\n\n[Ref. A](https://arxiv.org/abs/1606.03002) proposes 6 additional operators, which all operate element-wise:\n\n* `mul(x,y) = x * y`\n* `diff(x,y) = x - y`\n* `forget(x,y) = 0`\n* `sqrt_diff(x,y) = 0.25 * sqrt(|x - y|)`\n* `max(x,y)`\n* `min(x,y)`\n\nThe weightings of each operation are computed via a softmax from the current input and previous hidden state, similar to the update gate in the GRU. The produced hidden state is then the element-wise weighted sum of the output of each operation.\n```lua\np^[t][j] = W[x-\u003epj]x[t] + W[s-\u003epj]s[t−1] + b[1-\u003epj]          (3)\n(p[t][1], ... 
p[t][J])  = softmax (p^[t][1], ..., p^[t][J])  (4)\ns[t] = sum(p[t][j] * op[j](s[t-1], v[t]))                    (5)\n```\n\nwhere `p[t][j]` is the weighting for operation `j` at time step `t`, and `sum` in equation 5 is over all `J` operators.\n\n\u003ca name='rnn.Recursor'\u003e\u003c/a\u003e\n## Recursor ##\n\nThis module decorates a `module` to be used within an `AbstractSequencer` instance.\nIt does this by making the decorated module conform to the `AbstractRecurrent` interface,\nwhich this class inherits, like the `LSTM` and `Recurrent` classes. \n\n```lua\nrec = nn.Recursor(module[, rho])\n```\n\nFor each successive call to `updateOutput` (i.e. `forward`), this \ndecorator will create a `stepClone()` of the decorated `module`. \nSo for each time-step, it clones the `module`. Both the clone and \noriginal share parameters and gradients w.r.t. parameters. However, for \nmodules that already conform to the `AbstractRecurrent` interface, \nthe clone and original module are one and the same (i.e. no clone).\n\nExamples :\n\nLet's assume I want to stack two LSTMs. I could use two sequencers :\n\n```lua\nlstm = nn.Sequential()\n   :add(nn.Sequencer(nn.LSTM(100,100)))\n   :add(nn.Sequencer(nn.LSTM(100,100)))\n```\n\nUsing a `Recursor`, I make the same model with a single `Sequencer` :\n\n```lua\nlstm = nn.Sequencer(\n   nn.Recursor(\n      nn.Sequential()\n         :add(nn.LSTM(100,100))\n         :add(nn.LSTM(100,100))\n      )\n   )\n```\n\nActually, the `Sequencer` will wrap any non-`AbstractRecurrent` module automatically, \nso I could simplify this further to :\n\n```lua\nlstm = nn.Sequencer(\n   nn.Sequential()\n      :add(nn.LSTM(100,100))\n      :add(nn.LSTM(100,100))\n   )\n```\n\nI can also add a `Linear` between the two `LSTM`s. 
In this case,\nthe `Linear` will be cloned (and have its parameters shared) for each time-step,\nwhile the `LSTM`s will handle any cloning internally :\n\n```lua\nlstm = nn.Sequencer(\n   nn.Sequential()\n      :add(nn.LSTM(100,100))\n      :add(nn.Linear(100,100))\n      :add(nn.LSTM(100,100))\n   )\n``` \n\n`AbstractRecurrent` instances like `Recursor`, `Recurrent` and `LSTM` are \nexpected to manage time-steps internally. Non-`AbstractRecurrent` instances\ncan be wrapped by a `Recursor` to have the same behavior. \n\nEvery call to `forward` on an `AbstractRecurrent` instance like `Recursor` \nwill increment the `self.step` attribute by 1, using a shared parameter clone\nfor each successive time-step (for a maximum of `rho` time-steps, which defaults to 9999999).\nIn this way, `backward` can be called in reverse order of the `forward` calls \nto perform backpropagation through time (BPTT), which is exactly what \n[AbstractSequencer](#rnn.AbstractSequencer) instances do internally.\nThe `backward` call, which is actually divided into calls to `updateGradInput` and \n`accGradParameters`, decrements by 1 the `self.updateGradInputStep` and `self.accGradParametersStep`\nrespectively, starting at `self.step`.\nSuccessive calls to `backward` will decrement these counters and use them to \nbackpropagate through the appropriate internal step-wise shared-parameter clones.\n\nAnyway, in most cases, you will not have to deal with the `Recursor` object directly as\n`AbstractSequencer` instances automatically decorate non-`AbstractRecurrent` instances\nwith a `Recursor` in their constructors.\n\nFor a concrete example of its use, please consult the [simple-recurrent-network.lua](examples/simple-recurrent-network.lua)\ntraining script.\n\n\u003ca name='rnn.Recurrence'\u003e\u003c/a\u003e\n## Recurrence ##\n\nAn extremely general container for implementing pretty much any type of recurrence.\n\n```lua\nrnn = nn.Recurrence(recurrentModule, outputSize, 
nInputDim, [rho])\n```\n\nUnlike [Recurrent](#rnn.Recurrent), this module doesn't manage separate \nmodules like `inputModule`, `startModule`, `mergeModule` and the like.\nInstead, it only manages a single `recurrentModule`, which should \noutput a Tensor or table : `output(t)` \ngiven an input table : `{input(t), output(t-1)}`.\nUsing a mix of `Recursor` (say, via `Sequencer`) with `Recurrence`, one can implement \npretty much any type of recurrent neural network, including LSTMs and RNNs.\n\nFor the first step, the `Recurrence` forwards a Tensor (or table thereof)\nof zeros through the recurrent layer (like LSTM, unlike Recurrent).\nSo it needs to know the `outputSize`, which is either a number or \n`torch.LongStorage`, or table thereof. The batch dimension should be \nexcluded from the `outputSize`. Instead, the size of the batch dimension \n(i.e. number of samples) will be inferred from the `input` using \nthe `nInputDim` argument. For example, say that our input is a Tensor of size \n`4 x 3` where `4` is the number of samples, then `nInputDim` should be `1`.\nAs another example, if our input is a table of tables [...] 
of tensors \nwhere the first tensor (depth first) is the same as in the previous example,\nthen our `nInputDim` is also `1`.\n\n\nAs an example, let's use `Sequencer` and `Recurrence` \nto build a Simple RNN for language modeling :\n\n```lua\nrho = 5\nhiddenSize = 10\noutputSize = 5 -- num classes\nnIndex = 10000\n\n-- recurrent module\nrm = nn.Sequential()\n   :add(nn.ParallelTable()\n      :add(nn.LookupTable(nIndex, hiddenSize))\n      :add(nn.Linear(hiddenSize, hiddenSize)))\n   :add(nn.CAddTable())\n   :add(nn.Sigmoid())\n\nrnn = nn.Sequencer(\n   nn.Sequential()\n      :add(nn.Recurrence(rm, hiddenSize, 1))\n      :add(nn.Linear(hiddenSize, outputSize))\n      :add(nn.LogSoftMax())\n)\n```\n\nNote : We could very well reimplement the `LSTM` module using the\nnewer `Recursor` and `Recurrent` modules, but that would mean \nbreaking backwards compatibility for existing models saved on disk.\n\n\u003ca name='rnn.NormStabilizer'\u003e\u003c/a\u003e\n## NormStabilizer ##\n\nRef. A : [Regularizing RNNs by Stabilizing Activations](http://arxiv.org/abs/1511.08400)\n\nThis module implements the [norm-stabilization](http://arxiv.org/abs/1511.08400) criterion:\n\n```lua\nns = nn.NormStabilizer([beta])\n``` \n\nThis module regularizes the hidden states of RNNs by minimizing the difference between the\nL2-norms of consecutive steps. The cost function is defined as :\n```\nloss = beta * 1/T sum_t( ||h[t]|| - ||h[t-1]|| )^2\n``` \nwhere `T` is the number of time-steps. Note that we do not divide the gradient by `T`\nsuch that the chosen `beta` can scale to different sequence sizes without being changed.\n\nThe sole argument `beta` is defined in ref. A. Since we don't divide the gradients by\nthe number of time-steps, the default value of `beta=1` should be valid for most cases. \n\nThis module should be added between RNNs (or LSTMs or GRUs) to provide better regularization of the hidden states. 
\nFor example :\n```lua\nlocal stepmodule = nn.Sequential()\n   :add(nn.FastLSTM(10,10))\n   :add(nn.NormStabilizer())\n   :add(nn.FastLSTM(10,10))\n   :add(nn.NormStabilizer())\nlocal rnn = nn.Sequencer(stepmodule)\n``` \n\nTo use it with `SeqLSTM` you can do something like this :\n```lua\nlocal rnn = nn.Sequential()\n   :add(nn.SeqLSTM(10,10))\n   :add(nn.Sequencer(nn.NormStabilizer()))\n   :add(nn.SeqLSTM(10,10))\n   :add(nn.Sequencer(nn.NormStabilizer()))\n``` \n\n\u003ca name='rnn.AbstractSequencer'\u003e\u003c/a\u003e\n## AbstractSequencer ##\nThis abstract class implements a light interface shared by \nsubclasses like : `Sequencer`, `Repeater`, `RecurrentAttention`, `BiSequencer` and so on.\n  \n\u003ca name='rnn.Sequencer'\u003e\u003c/a\u003e\n## Sequencer ##\n\nThe `nn.Sequencer(module)` constructor takes a single argument, `module`, which is the module \nto be applied from left to right, on each element of the input sequence.\n\n```lua\nseq = nn.Sequencer(module)\n```\n\nThis Module is a kind of [decorator](http://en.wikipedia.org/wiki/Decorator_pattern) \nused to abstract away the intricacies of `AbstractRecurrent` modules. While an `AbstractRecurrent` instance \nrequires that a sequence be presented one input at a time, each with its own call to `forward` (and `backward`),\nthe `Sequencer` forwards an `input` sequence (a table) into an `output` sequence (a table of the same length).\nIt also takes care of calling `forget` on `AbstractRecurrent` instances.\n\n### Input/Output Format\n\nThe `Sequencer` requires inputs and outputs to be of shape `seqlen x batchsize x featsize` :\n\n * `seqlen` is the number of time-steps that will be fed into the `Sequencer`.\n * `batchsize` is the number of examples in the batch. Each example is its own independent sequence.\n * `featsize` is the size of the remaining non-batch dimensions. 
So this could be `1` for language models, or `c x h x w` for convolutional models, etc.\n \n![Hello Fuzzy](doc/image/hellofuzzy.png)\n\nAbove is an example input sequence for a character level language model.\nIt has a `seqlen` of 5, which means that it contains sequences of 5 time-steps. \nThe opening `{` and closing `}` illustrate that the time-steps are elements of a Lua table, although \nthe `Sequencer` also accepts full Tensors of shape `seqlen x batchsize x featsize`. \nThe `batchsize` is 2 as there are two independent sequences : `{ H, E, L, L, O }` and `{ F, U, Z, Z, Y }`.\nThe `featsize` is 1 as there is only one feature dimension per character and each such character is of size 1.\nSo the input in this case is a table of `seqlen` time-steps where each time-step is represented by a `batchsize x featsize` Tensor.\n\n![Sequence](doc/image/sequence.png)\n\nAbove is another example of a sequence (input or output). \nIt has a `seqlen` of 4 time-steps. \nThe `batchsize` is again 2 which means there are two sequences.\nThe `featsize` is 3 as each time-step of each sequence has 3 variables.\nSo each time-step (element of the table) is represented again as a tensor\nof size `batchsize x featsize`. \nNote that while in both examples the `featsize` encodes one dimension, \nit could encode more. \n\n\n### Example\n\nFor example, `rnn`, an instance of `nn.AbstractRecurrent`, can forward an `input` sequence one step at a time:\n```lua\ninput = {torch.randn(3,4), torch.randn(3,4), torch.randn(3,4)}\nrnn:forward(input[1])\nrnn:forward(input[2])\nrnn:forward(input[3])\n``` \n\nEquivalently, we can use a Sequencer to forward the entire `input` sequence at once:\n\n```lua\nseq = nn.Sequencer(rnn)\nseq:forward(input)\n``` \n\nWe can also forward Tensors instead of Tables :\n\n```lua\n-- seqlen x batchsize x featsize\ninput = torch.randn(3,3,4)\nseq:forward(input)\n``` \n\n### Details\n\nThe `Sequencer` can also take non-recurrent Modules (i.e. 
non-AbstractRecurrent instances) and apply them to each \ninput to produce an output table of the same length. \nThis is especially useful for processing variable length sequences (tables).\n\nInternally, the `Sequencer` expects the decorated `module` to be an \n`AbstractRecurrent` instance. When this is not the case, the `module` \nis automatically decorated with a [Recursor](#rnn.Recursor) module, which makes it \nconform to the `AbstractRecurrent` interface. \n\nNote : this is due to an update (27 Oct 2015). Before it, \n`AbstractRecurrent` and non-`AbstractRecurrent` instances needed to \nbe decorated by their own `Sequencer`. The update, which introduced the \n`Recursor` decorator, allows a single `Sequencer` to wrap any type of module, \n`AbstractRecurrent`, non-`AbstractRecurrent` or a composite structure of both types.\nNevertheless, existing code shouldn't be affected by the change.\n\nFor a concise example of its use, please consult the [simple-sequencer-network.lua](examples/simple-sequencer-network.lua)\ntraining script.\n\n\u003ca name='rnn.Sequencer.remember'\u003e\u003c/a\u003e\n### remember([mode]) ###\nWhen `mode='neither'` (the default behavior of the class), the Sequencer will additionally call [forget](#nn.AbstractRecurrent.forget) before each call to `forward`. 
\nWhen `mode='both'` (the default when calling this function), the Sequencer will never call [forget](#nn.AbstractRecurrent.forget).\nIn which case, it is up to the user to call `forget` between independent sequences.\nThis behavior is only applicable to decorated AbstractRecurrent `modules`.\nAccepted values for argument `mode` are as follows :\n\n * 'eval' only affects evaluation (recommended for RNNs)\n * 'train' only affects training\n * 'neither' affects neither training nor evaluation (default behavior of the class)\n * 'both' affects both training and evaluation (recommended for LSTMs)\n\n### forget() ###\nCalls the decorated AbstractRecurrent module's `forget` method.\n\n\u003ca name='rnn.SeqLSTM'\u003e\u003c/a\u003e\n## SeqLSTM ##\n\nThis module is a faster version of `nn.Sequencer(nn.FastLSTM(inputsize, outputsize))` :\n\n```lua\nseqlstm = nn.SeqLSTM(inputsize, outputsize)\n``` \n\nEach time-step is computed as follows (same as [FastLSTM](#rnn.FastLSTM)):\n\n```lua\ni[t] = σ(W[x-\u003ei]x[t] + W[h-\u003ei]h[t−1] + b[1-\u003ei])                      (1)\nf[t] = σ(W[x-\u003ef]x[t] + W[h-\u003ef]h[t−1] + b[1-\u003ef])                      (2)\nz[t] = tanh(W[x-\u003ec]x[t] + W[h-\u003ec]h[t−1] + b[1-\u003ec])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x-\u003eo]x[t] + W[h-\u003eo]h[t−1] + b[1-\u003eo])                      (5)\nh[t] = o[t]tanh(c[t])                                                (6)\n``` \n\nA notable difference is that this module expects the `input` and `gradOutput` to \nbe tensors instead of tables. 
The default shape is `seqlen x batchsize x inputsize` for\nthe `input` and `seqlen x batchsize x outputsize` for the `output` :\n\n```lua\ninput = torch.randn(seqlen, batchsize, inputsize)\ngradOutput = torch.randn(seqlen, batchsize, outputsize)\n\noutput = seqlstm:forward(input)\ngradInput = seqlstm:backward(input, gradOutput)\n``` \n\nNote that if you prefer to transpose the first two dimensions (i.e. `batchsize x seqlen` instead of the default `seqlen x batchsize`)\nyou can set `seqlstm.batchfirst = true` following initialization.\n\nFor variable length sequences, set `seqlstm.maskzero = true`. \nThis is equivalent to calling `maskZero(1)` on a `FastLSTM` wrapped by a `Sequencer`:\n```lua\nfastlstm = nn.FastLSTM(inputsize, outputsize)\nfastlstm:maskZero(1)\nseqfastlstm = nn.Sequencer(fastlstm)\n``` \n\nFor `maskzero = true`, input sequences are expected to be separated by a tensor of zeros for a time step.\n\nThe `seqlstm:toFastLSTM()` method generates a [FastLSTM](#rnn.FastLSTM) instance initialized with the parameters \nof the `seqlstm` instance. Note however that the resulting parameters will not be shared (nor can they ever be).\n\nLike the `FastLSTM`, the `SeqLSTM` does not use peephole connections between cell and gates (see [FastLSTM](#rnn.FastLSTM) for details).\n\nLike the `Sequencer`, the `SeqLSTM` provides a [remember](#rnn.Sequencer.remember) method.\n\nNote that a `SeqLSTM` cannot replace `FastLSTM` in code that decorates it with an\n`AbstractSequencer` or `Recursor` as this would be equivalent to `Sequencer(Sequencer(FastLSTM))`.\nYou have been warned.\n\n\u003ca name='rnn.SeqLSTMP'\u003e\u003c/a\u003e\n## SeqLSTMP ##\nReferences:\n * A. [LSTM RNN Architectures for Large Scale Acoustic Modeling](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf)\n * B. 
[Exploring the Limits of Language Modeling](https://arxiv.org/pdf/1602.02410v2.pdf)\n \n```lua\nlstmp = nn.SeqLSTMP(inputsize, hiddensize, outputsize)\n``` \n\nThe `SeqLSTMP` is a subclass of [SeqLSTM](#rnn.SeqLSTM). \nIt differs in that after computing the hidden state `h[t]` (eq. 6), it is \nprojected onto `r[t]` using a simple linear transform (eq. 7). \nThe computation of the gates also uses the previous such projection `r[t-1]` (eq. 1, 2, 3, 5).\nThis differs from `SeqLSTM` which uses `h[t-1]` instead of `r[t-1]`.\n \nThe computation of a time-step outlined in `SeqLSTM` is replaced with the following:\n```lua\ni[t] = σ(W[x-\u003ei]x[t] + W[r-\u003ei]r[t−1] + b[1-\u003ei])                      (1)\nf[t] = σ(W[x-\u003ef]x[t] + W[r-\u003ef]r[t−1] + b[1-\u003ef])                      (2)\nz[t] = tanh(W[x-\u003ec]x[t] + W[r-\u003ec]r[t−1] + b[1-\u003ec])                   (3)\nc[t] = f[t]c[t−1] + i[t]z[t]                                         (4)\no[t] = σ(W[x-\u003eo]x[t] + W[r-\u003eo]r[t−1] + b[1-\u003eo])                      (5)\nh[t] = o[t]tanh(c[t])                                                (6)\nr[t] = W[h-\u003er]h[t]                                                   (7)\n``` \n\nThe algorithm is outlined in ref. A and benchmarked with state of the art results on the Google billion words dataset in ref. B.\n`SeqLSTMP` can be used with a `hiddensize \u003e\u003e outputsize` such that the effective size of the memory cells `c[t]` \nand gates `i[t]`, `f[t]` and `o[t]` can be much larger than the actual input `x[t]` and output `r[t]`.\nFor fixed `inputsize` and `outputsize`, the `SeqLSTMP` will be able to remember much more information than the `SeqLSTM`.\n\n\u003ca name='rnn.SeqGRU'\u003e\u003c/a\u003e\n## SeqGRU ##\n\nThis module is a faster version of `nn.Sequencer(nn.GRU(inputsize, outputsize))` :\n\n```lua\nseqGRU = nn.SeqGRU(inputsize, outputsize)\n``` \n\nUsage of SeqGRU differs from GRU in the same manner as SeqLSTM differs from LSTM. 
Therefore see [SeqLSTM](#rnn.SeqLSTM) for more details.\n\n\u003ca name='rnn.SeqBRNN'\u003e\u003c/a\u003e\n## SeqBRNN ##\n\n```lua\nbrnn = nn.SeqBRNN(inputSize, outputSize, [batchFirst], [merge])\n``` \n\nA bi-directional RNN that uses SeqLSTM. Internally contains a 'fwd' and 'bwd' module of SeqLSTM. Expects an input shape of `seqlen x batchsize x inputsize`.\nBy setting `batchFirst` to true, the input shape can be `batchsize x seqlen x inputsize`. The merge module defaults to `CAddTable()`, summing the outputs from each\noutput layer.\n\nExample:\n```lua\ninput = torch.rand(1, 1, 5)\nbrnn = nn.SeqBRNN(5, 5)\nprint(brnn:forward(input))\n``` \nPrints an output of a 1x1x5 tensor.\n\n\u003ca name='rnn.BiSequencer'\u003e\u003c/a\u003e\n## BiSequencer ##\nApplies encapsulated `fwd` and `bwd` rnns to an input sequence in forward and reverse order.\nIt is used for implementing Bidirectional RNNs and LSTMs.\n\n```lua\nbrnn = nn.BiSequencer(fwd, [bwd, merge])\n```\n\nThe input to the module is a sequence (a table) of tensors\nand the output is a sequence (a table) of tensors of the same length.\nApplies a `fwd` rnn (an [AbstractRecurrent](#rnn.AbstractRecurrent) instance) to each element in the sequence in\nforward order and applies the `bwd` rnn in reverse order (from last element to first element).\nThe `bwd` rnn defaults to:\n\n```lua\nbwd = fwd:clone()\nbwd:reset()\n```\n\nFor each step (in the original sequence), the outputs of both rnns are merged together using\nthe `merge` module (defaults to `nn.JoinTable(1,1)`). \nIf `merge` is a number, it specifies the [JoinTable](https://github.com/torch/nn/blob/master/doc/table.md#nn.JoinTable)\nconstructor's `nInputDim` argument. 
The `merge` module is then initialized as :\n\n```lua\nmerge = nn.JoinTable(1,merge)\n```\n\nInternally, the `BiSequencer` is implemented by decorating a structure of modules that makes \nuse of 3 Sequencers for the forward, backward and merge modules.\n\nSimilarly to a [Sequencer](#rnn.Sequencer), the sequences in a batch must have the same size.\nBut the sequence length of each batch can vary.\n\nNote : make sure you call `brnn:forget()` after each call to `updateParameters()`. \nAlternatively, one could call `brnn.bwdSeq:forget()` so that only the `bwd` rnn forgets.\nThis is the minimum requirement, as it would not make sense for the `bwd` rnn to remember future sequences.\n\n\n\u003ca name='rnn.BiSequencerLM'\u003e\u003c/a\u003e\n## BiSequencerLM ##\n\nApplies encapsulated `fwd` and `bwd` rnns to an input sequence in forward and reverse order.\nIt is used for implementing Bidirectional RNNs and LSTMs for Language Models (LM).\n\n```lua\nbrnn = nn.BiSequencerLM(fwd, [bwd, merge])\n```\n\nThe input to the module is a sequence (a table) of tensors\nand the output is a sequence (a table) of tensors of the same length.\nApplies a `fwd` rnn (an [AbstractRecurrent](#rnn.AbstractRecurrent) instance) to the \nfirst `N-1` elements in the sequence in forward order.\nApplies the `bwd` rnn in reverse order to the last `N-1` elements (from second-to-last element to first element).\nThis is the main difference between this module and the [BiSequencer](#rnn.BiSequencer).\nThe latter cannot be used for language modeling because the `bwd` rnn would be trained to predict the input it had just been fed.\n\n![BiDirectionalLM](doc/image/bidirectionallm.png)\n\nThe `bwd` rnn defaults to:\n\n```lua\nbwd = fwd:clone()\nbwd:reset()\n```\n\nWhile the `fwd` rnn will output representations for the last `N-1` steps,\nthe `bwd` rnn will output representations for the first `N-1` steps.\nThe missing outputs for each rnn (the first step for the `fwd`, the last step for the 
`bwd`)\nwill be filled with zero Tensors of the same size as the corresponding rnn's outputs.\nThis way they can be merged. If `nn.JoinTable` is used (the default), then the first \nand last output elements will be padded with zeros for the missing `fwd` and `bwd` rnn outputs, respectively.\n\nFor each step (in the original sequence), the outputs of both rnns are merged together using\nthe `merge` module (defaults to `nn.JoinTable(1,1)`). \nIf `merge` is a number, it specifies the [JoinTable](https://github.com/torch/nn/blob/master/doc/table.md#nn.JoinTable)\nconstructor's `nInputDim` argument. The `merge` module is then initialized as :\n\n```lua\nmerge = nn.JoinTable(1,merge)\n```\n\nSimilarly to a [Sequencer](#rnn.Sequencer), the sequences in a batch must have the same size.\nBut the sequence length of each batch can vary.\n\nNote that LMs implemented with this module will not be classical LMs as they won't measure the \nprobability of a word given the previous words. Instead, they measure the probability of a word\ngiven the surrounding words, i.e. context. 
While for mathematical reasons you may not be able to use this to measure the \nprobability of a sequence of words (like a sentence), \nyou can still measure the pseudo-likelihood of such a sequence (see [this](http://arxiv.org/pdf/1504.01575.pdf) for a discussion).\n\n\u003ca name='rnn.Repeater'\u003e\u003c/a\u003e\n## Repeater ##\nThis Module is a [decorator](http://en.wikipedia.org/wiki/Decorator_pattern) similar to [Sequencer](#rnn.Sequencer).\nIt differs in that the sequence length is fixed beforehand and the input is repeatedly forwarded \nthrough the wrapped `module` to produce an output table of length `nStep`:\n```lua\nr = nn.Repeater(module, nStep)\n```\nArgument `module` should be an `AbstractRecurrent` instance.\nThis is useful for implementing models like [RCNNs](http://jmlr.org/proceedings/papers/v32/pinheiro14.pdf),\nwhich are repeatedly presented with the same input.\n\n\u003ca name='rnn.RecurrentAttention'\u003e\u003c/a\u003e\n## RecurrentAttention ##\nReferences :\n  \n  * A. [Recurrent Models of Visual Attention](http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf)\n  * B. [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://incompleteideas.net/sutton/williams-92.pdf)\n  \nThis module can be used to implement the Recurrent Attention Model (RAM) presented in Ref. A :\n```lua\nram = nn.RecurrentAttention(rnn, action, nStep, hiddenSize)\n```\n\n`rnn` is an [AbstractRecurrent](#rnn.AbstractRecurrent) instance. \nIts input is `{x, z}` where `x` is the input to the `ram` and `z` is an \naction sampled from the `action` module. \nThe output size of the `rnn` must be equal to `hiddenSize`.\n\n`action` is a [Module](https://github.com/torch/nn/blob/master/doc/module.md#nn.Module) \nthat uses a [REINFORCE module](https://github.com/nicholas-leonard/dpnn#nn.Reinforce) (ref. 
B) like \n[ReinforceNormal](https://github.com/nicholas-leonard/dpnn#nn.ReinforceNormal), \n[ReinforceCategorical](https://github.com/nicholas-leonard/dpnn#nn.ReinforceCategorical), or \n[ReinforceBernoulli](https://github.com/nicholas-leonard/dpnn#nn.ReinforceBernoulli) \nto sample actions given the previous time-step's output of the `rnn`. \nDuring the first time-step, the `action` module is fed with a Tensor of zeros of size `input:size(1) x hiddenSize`.\nIt is important to understand that the sampled actions do not receive gradients \nbackpropagated from the training criterion. \nInstead, a reward is broadcast from a Reward Criterion like the [VRClassReward](https://github.com/nicholas-leonard/dpnn#nn.VRClassReward) Criterion to \nthe `action`'s REINFORCE module, which will backpropagate gradients computed from the `output` samples \nand the `reward`. \nTherefore, the `action` module's outputs are only used internally, within the RecurrentAttention module.\n\n`nStep` is the number of actions to sample, i.e. the number of elements in the `output` table.\n\n`hiddenSize` is the output size of the `rnn`. This variable is necessary \nto generate the zero Tensor to sample an action for the first step (see above).\n\nA complete implementation of Ref. A is available [here](examples/recurrent-visual-attention.lua).\n\n\u003ca name='rnn.MaskZero'\u003e\u003c/a\u003e\n## MaskZero ##\nThis module zeroes the `output` rows of the decorated module \nfor commensurate `input` rows which are tensors of zeros.\n\n```lua\nmz = nn.MaskZero(module, nInputDim)\n```\n\nThe `output` Tensor (or table thereof) of the decorated `module`\nwill have each row (samples) zeroed when the commensurate row of the `input` \nis a tensor of zeros. \n\nThe `nInputDim` argument must specify the number of non-batch dims \nin the first Tensor of the `input`. 
In the case of an `input` table,\nthe first Tensor is the first one encountered when doing a depth-first search.\n\nThis decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\nCaveat: `MaskZero` does not guarantee that the `output` and `gradInput` tensors of the internal modules \nof the decorated `module` will be zeroed as well when the `input` is zero. \n`MaskZero` only affects the immediate `gradInput` and `output` of the module that it encapsulates.\nHowever, for most modules, the gradient update for that time-step will be zero because \nbackpropagating a gradient of zeros will typically yield zeros all the way to the input.\nIn this respect, modules to avoid encapsulating inside a `MaskZero` are `AbstractRecurrent` \ninstances, as gradients flow between different time-steps internally. \nInstead, call the [AbstractRecurrent.maskZero](#rnn.AbstractRecurrent.maskZero) method\nto encapsulate the internal `recurrentModule`.\n\n\u003ca name='rnn.TrimZero'\u003e\u003c/a\u003e\n## TrimZero ##\n\nWARNING : only use this module if your input contains lots of zeros. \nIn almost all cases, [`MaskZero`](#rnn.MaskZero) will be faster, especially with CUDA.\n\nRef. A : [TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing](https://bi.snu.ac.kr/Publications/Conferences/Domestic/KIIS2016S_JHKim.pdf)\n\nThe usage is the same as for `MaskZero`.\n\n```lua\nmz = nn.TrimZero(module, nInputDim)\n```\n\nThe only difference from `MaskZero` is that it reduces computational costs by varying the batch size, if any, when sequences of varying lengths are provided in the input. \nNotice that when the lengths are consistent, `MaskZero` will be faster, because `TrimZero` has an operational cost. \n\nIn short, the result is the same as `MaskZero`'s. However, `TrimZero` is faster than `MaskZero` only when sentence lengths vary considerably.\n\nIn practice, e.g. 
in a language model, `TrimZero` is expected to be about 30% faster than `MaskZero`. (You can test this using `test/test_trimzero.lua`.)\n\n\u003ca name='rnn.LookupTableMaskZero'\u003e\u003c/a\u003e\n## LookupTableMaskZero ##\nThis module extends `nn.LookupTable` to support zero indexes. Zero indexes are forwarded as zero tensors.\n\n```lua\nlt = nn.LookupTableMaskZero(nIndex, nOutput)\n```\n\nThe `output` Tensor will have each row zeroed when the commensurate row of the `input` is a zero index. \n\nThis lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\n\u003ca name='rnn.MaskZeroCriterion'\u003e\u003c/a\u003e\n## MaskZeroCriterion ##\nThis criterion zeroes the `err` and `gradInput` rows of the decorated criterion \nfor commensurate `input` rows which are tensors of zeros.\n\n```lua\nmzc = nn.MaskZeroCriterion(criterion, nInputDim)\n```\n\nThe `gradInput` Tensor (or table thereof) of the decorated `criterion`\nwill have each row (samples) zeroed when the commensurate row of the `input` \nis a tensor of zeros. The `err` will also disregard such zero rows.\n\nThe `nInputDim` argument must specify the number of non-batch dims \nin the first Tensor of the `input`. In the case of an `input` table,\nthe first Tensor is the first one encountered when doing a depth-first search.\n\nThis decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.\n\n\u003ca name='rnn.SeqReverseSequence'\u003e\u003c/a\u003e\n## SeqReverseSequence ##\n\n```lua\nreverseSeq = nn.SeqReverseSequence(dim)\n```\n\nReverses an input tensor on a specified dimension. 
The reversal dimension can be no larger than three.\n\nExample:\n```lua\ninput = torch.Tensor({{1,2,3,4,5}, {6,7,8,9,10}})\nreverseSeq = nn.SeqReverseSequence(1)\nprint(reverseSeq:forward(input))\n\n-- Gives us an output of torch.Tensor({{6,7,8,9,10},{1,2,3,4,5}})\n```\n\n\u003ca name='rnn.SequencerCriterion'\u003e\u003c/a\u003e\n## SequencerCriterion ##\n\nThis Criterion is a [decorator](http://en.wikipedia.org/wiki/Decorator_pattern):\n\n```lua\nc = nn.SequencerCriterion(criterion, [sizeAverage])\n``` \n\nBoth the `input` and `target` are expected to be a sequence, either as a table or Tensor. \nFor each step in the sequence, the corresponding elements of the input and target \nwill be passed to the `criterion`.\nThe output of `forward` is the sum of all individual losses in the sequence. \nThis is useful when used in conjunction with a [Sequencer](#rnn.Sequencer).\n\nIf `sizeAverage` is `true` (default is `false`), the `output` loss and `gradInput` are averaged over the time-steps.\n\n\u003ca name='rnn.RepeaterCriterion'\u003e\u003c/a\u003e\n## RepeaterCriterion ##\n\nThis Criterion is a [decorator](http://en.wikipedia.org/wiki/Decorator_pattern):\n\n```lua\nc = nn.RepeaterCriterion(criterion)\n``` \n\nThe `input` is expected to be a sequence (table or Tensor). A single `target` is \nrepeatedly applied using the same `criterion` to each element in the `input` sequence.\nThe output of `forward` is the sum of all individual losses in the sequence.\nThis is useful for implementing models like [RCNNs](http://jmlr.org/proceedings/papers/v32/pinheiro14.pdf),\nwhich are repeatedly presented with the same target.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FElement-Research%2Frnn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FElement-Research%2Frnn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FElement-Research%2Frnn/lists"}