{"id":18682850,"url":"https://github.com/yandex/faster-rnnlm","last_synced_at":"2025-04-04T19:09:27.223Z","repository":{"id":35503245,"uuid":"39773115","full_name":"yandex/faster-rnnlm","owner":"yandex","description":"Faster Recurrent Neural Network Language Modeling Toolkit with Noise Contrastive Estimation and Hierarchical Softmax","archived":false,"fork":false,"pushed_at":"2022-04-26T17:52:26.000Z","size":423,"stargazers_count":560,"open_issues_count":32,"forks_count":140,"subscribers_count":47,"default_branch":"master","last_synced_at":"2025-03-28T18:12:03.777Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yandex.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-27T12:25:26.000Z","updated_at":"2025-03-16T00:50:02.000Z","dependencies_parsed_at":"2022-08-27T13:20:13.591Z","dependency_job_id":null,"html_url":"https://github.com/yandex/faster-rnnlm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex%2Ffaster-rnnlm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex%2Ffaster-rnnlm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex%2Ffaster-rnnlm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yandex%2Ffaster-rnnlm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yandex","download_url":"https://codeload.github.com/yandex/faster-rnnlm/tar.gz/refs/heads/master","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":247234921,"owners_count":20905854,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T10:13:00.698Z","updated_at":"2025-04-04T19:09:27.205Z","avatar_url":"https://github.com/yandex.png","language":"C++","funding_links":[],"categories":["Codes","Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# Faster RNNLM (HS/NCE) toolkit\nIn a nutshell, the goal of this project is to create an rnnlm implementation that can be trained on huge datasets (several billion words) and very large vocabularies (several hundred thousand words) and used in real-world ASR and MT problems.\nBesides, to achieve better results, this implementation supports such praised setups as ReLU+DiagonalInitialization [1], GRU [2], NCE [3], and RMSProp [4].\n\nHow fast is it?\nWell, on the One Billion Word Benchmark [8] and a 3.3GHz CPU, the program with standard parameters (sigmoid hidden layer of size 256 and hierarchical softmax) processes more than 250k words per second in 8 threads, i.e. 15 million words per minute.\nAs a result, an epoch takes less than one hour. 
Check the [Experiments section](#experiments) for more numbers and figures.\n\nThe distribution includes the `./run_benchmark.sh` script to compare training speed on your machine among several implementations.\nThe script downloads the Penn Treebank corpus and trains four models: Mikolov's rnnlm with class-based softmax from [here](http://www.fit.vutbr.cz/~imikolov/rnnlm/), Edrenkin's rnnlm with HS from the Kaldi project, faster-rnnlm with hierarchical softmax, and faster-rnnlm with noise contrastive estimation.\nNote that while models with class-based softmax can achieve slightly lower entropy than models with hierarchical softmax, their training is infeasible for large vocabularies.\nOn the other hand, NCE speed doesn't depend on the size of the vocabulary.\nWhat's more, models trained with NCE are comparable with class-based models in terms of resulting entropy.\n\n## Quick start\nRun `./build.sh` to download the Eigen library and build faster-rnnlm.\n\nTo train a simple model with GRU hidden units and Noise Contrastive Estimation, use the following command:\n\n   `./rnnlm -rnnlm model_name -train train.txt -valid validation.txt -hidden 128 -hidden-type gru -nce 20 -alpha 0.01`\n\nThe training, validation, and test files must contain one sentence per line. All distinct words that are found in the training file will be used for the nnet vocab; their counts will determine the Huffman tree structure and remain fixed for this nnet. If you prefer using a limited vocabulary (say, the top 1 million words), you should map all other words to \u003cunk\u003e or another token of your choice. A limited vocabulary is usually a good idea if it helps you to have enough training examples for each word.\n\nTo apply the model, use the following command:\n\n   `./rnnlm -rnnlm model_name -test train.txt`\n\nLogprobs (log10) of each sentence are printed to stdout. 
Entropy of the corpus in bits is printed to stderr.\n\n## Model architecture\nThe neural network has an input embedding layer, a few hidden layers, an output layer, and optional direct input-output connections.\n\n### Hidden layer\nAt the moment the following hidden layers are supported: sigmoid, tanh, relu, gru, gru-bias, gru-insyn, gru-full.\nThe first three types are quite standard.\nThe last four types are different modifications of the Gated Recurrent Unit. Namely, gru-insyn follows the formulas from [2]; gru-full adds bias terms for reset and update gates; gru uses identity matrices for input transformation without bias; gru-bias is gru with bias terms.\nThe fastest layer is relu, the slowest one is gru-full.\n\n### Output layer\nThe standard output layer for classification problems is softmax.\nHowever, as softmax outputs must be normalized, i.e. the sum over all classes must be one, its calculation is infeasible for a very large vocabulary.\nTo overcome this problem one can use either softmax factorization or implicit normalization.\nBy default, we approximate softmax via Hierarchical Softmax over a Huffman Tree [6].\nIt allows computing softmax in time logarithmic in the vocabulary size, but reduces the quality of the model.\nImplicit normalization means that one calculates the next word probability as in the full softmax case, but without explicit normalization over all the words.\nOf course, it is not guaranteed that such *probabilities* will sum to one.\nBut in practice the sum is quite close to one due to the custom loss function.\nCheck out [3] for more details.\n\n### Maximum entropy model\nAs was noted in [0], simultaneous training of a neural network together with a maximum entropy model could lead to significant improvement.\nIn a nutshell, the maxent model tries to approximate the probability of the target as a linear combination of its history features.\nE.g. 
in order to estimate the probability of word \"d\" in the sentence \"a b c d\", the model will sum the following features: f(\"d\") + f(\"c d\") + f(\"b c d\") + f(\"a b c d\").\nYou can use maxent with both HS and NCE output layers.\n\n## Experiments\nWe provide results of model evaluation on two popular datasets: PTB and One Billion Word Benchmark.\nCheck out [doc/RESULTS.md](doc/RESULTS.md) for reasonable parameters.\n\n### Penn Treebank Benchmark\nThe most popular corpus for LM benchmarks is the English Penn Treebank.\nIts train part contains a little less than 1 million words and the size of the vocabulary is 10k words.\nIn other words, it's akin to the Iris flower dataset.\nThe size of the vocabulary allows one to use a less efficient softmax approximation.\nWe compare faster-rnnlm with the [latest version](https://f25ea9ccb7d3346ce6891573d543960492b92c30.googledrive.com/host/0ByxdPXuxLPS5RFM5dVNvWVhTd0U/rnnlm-0.4b.tgz) of the rnnlm toolkit from [here](http://www.fit.vutbr.cz/~imikolov/rnnlm/).\nAs expected, class-based softmax works a little better than hierarchical softmax, but it is much slower.\nOn the other hand, perplexity for NCE and class-based softmax is comparable while training time differs significantly.\nWhat's more, training speed for class-based softmax will decrease with an increase in the size of the vocabulary, while NCE speed is unaffected by it.\n(At least, in theory; in practice, a bigger vocabulary will probably increase cache miss frequency.)\nFor a fair speed comparison we use only one thread for faster-rnnlm.\n\nNote. We use the following settings: learning_rate = 0.1, noise_samples=30 (for nce), bptt=32+8, threads=1 (for faster-rnnlm).\n![Time and perplexity for different implementations and softmax types](doc/ptb_class_vs_faster.png?raw=true)\n\nIt was shown that RNN models with sigmoid activation functions trained with the NCE criterion outperform ones trained with the CE criterion over approximated softmax (e.g. 
[3]).\nWe tried to reproduce these improvements using other popular architectures, namely, truncated ReLU, Structurally Constrained Recurrent Network [9] with 40 context units, and Gated Recurrent Unit [2].\nSurprisingly, not all types of hidden units benefit from NCE.\nTruncated ReLU achieves the lowest perplexity among all the units during CE training, and the highest during NCE training.\nWe used truncated ReLU as standard ReLU works even worse.\n\"Smart\" units (SCRN and GRU) demonstrate superior results.\n\nNote. We report the best perplexity after grid search using the following parameters: learning_rate = {0.01, 0.03, 0.1, 0.3, 1.0}, noise_samples = {10, 20, 60} (for nce only), bptt={32+8, 1+8}, diagonal_initialization={None, 0.1, 0.5, 0.9, 1.0}, L2 = {1e-5, 1e-6, 0}.\n![Hierarchical Softmax versus Noise Contrastive Estimation](doc/ptb_nce_vs_hs_per_size.png?raw=true)\n\nThe following figure shows the dependency between the number of noise samples and the final perplexity for different types of units.\nDashed lines indicate perplexity for models with Hierarchical Softmax.\nIt's easy to see that the more samples are used, the lower the final perplexity is.\nHowever, even 5 samples are enough for NCE to work better than HS.\nThe exception is relu-trunc, which couldn't be trained with NCE for any number of noise samples.\n\nNote. We report the best perplexity after grid search. 
The size of the hidden layer is 200.\n![Noise Contrastive Estimation with different count of noise samples](doc/ptb_nce_per_count.png?raw=true)\n\n\n### One Billion Word Benchmark\nFor the One Billion Word Benchmark we use the setup described in [8] using the [official scripts](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark).\nThere are around 0.8 billion words in the training corpus and 793471 words in the vocabulary (including \\\u003cs\\\u003e and \\\u003c/s\\\u003e words).\nWe use heldout-00000 for validation, and heldout-00001 for testing.\n\nHierarchical softmax versus Noise Contrastive Estimation.\nIn a nutshell, for bigger vocabularies the drawbacks of HS become more significant.\nAs a result, NCE training results in much smaller values of perplexity.\nIt's easy to see that the performance of Truncated ReLU on this dataset agrees with the experiments on PTB.\nNamely, an RNN with Truncated ReLU units could be trained more efficiently with CE, if the layer size is small.\nHowever, the relative performance of the other unit types has changed.\nIn contrast to the PTB experiments, on the One Billion Word corpus the simplest unit achieves the best quality.\n\nNote. We report the best perplexity on heldout-00001 after grid search over the learning_rate, bptt, and diagonal_initialization. 
We use 50 noise samples for NCE training.\n![Hierarchical Softmax versus Noise Contrastive Estimation](doc/1kkk_nce_vs_hs.png?raw=true)\n\nThe following graph demonstrates the dependency between the number of noise samples and the final perplexity.\nJust as in the case of PTB, 5 samples are enough for NCE to significantly outperform HS.\n![Noise Contrastive Estimation with different count of noise samples](doc/1kkk_nce_per_count.png?raw=true)\n\nOne important property of RNNLM models is that they are complementary to standard N-gram LMs.\nOne way to exploit this is to train a maxent model as a part of the neural network model.\nThat could be achieved by the --direct and --direct-order options.\nAnother way to achieve the same effect is to use an external language model.\nWe use the Interpolated KN 5-gram model that is shipped with the benchmark.\n\nThe maxent model significantly decreases perplexity for all hidden layer types and sizes.\nMoreover, it diminishes the impact of layer size.\nAs expected, the combination of RNNLM-ME and KN works better than either of them alone (perplexity of the KN model is 73).\n\nNote. We took the best performing models from the previous experiments and added a maxent layer of size 1000 and order 3.\n![Mixture of models](doc/1kkk_direct_vs_nodirect.png?raw=true)\n\n\n## Command line options\nWe opted to use command line options that are compatible with [Mikolov's rnnlm](http://www.fit.vutbr.cz/~imikolov/rnnlm/).\nAs a result, one can just replace the binary to switch between implementations.\n\nThe program has three modes, i.e. 
training, evaluation, and sampling.\n\nAll modes require a model name:\n\n```\n    --rnnlm \u003cfile\u003e\n      Path to model file\n```\n\nThe program will create \u003cfile\u003e and \u003cfile\u003e.nnet files (for storing vocab/counts in the text form and the net itself in binary form).\nIf the \u003cfile\u003e and \u003cfile\u003e.nnet already exist, the tool will attempt to load them instead of starting new training.\nIf the \u003cfile\u003e exists and \u003cfile\u003e.nnet doesn't, the tool will use the existing vocabulary and new weights.\n\nTo run the program in test mode, you must provide a test file. If you use NCE and would like to calculate entropy, you must use the --nce-accurate-test flag. All other options are ignored in test mode.\n\n```\n    --test \u003cfile\u003e\n      Test file\n    --nce-accurate-test (0 | 1)\n      Explicitly normalize output probabilities; use this option\n      to compute actual entropy (default: 0)\n```\n\nTo run the program in sampling mode, you must select a positive number of sentences to sample.\n\n```\n  --generate-samples \u003cint\u003e\n    Number of sentences to generate in sampling mode (default: 0)\n  --generate-temperature \u003cfloat\u003e\n    Softmax temperature (use lower values to get more robust results) (default: 1)\n```\n\nTo train a model, you must provide train and validation files.\n\n```\n  --train \u003cfile\u003e\n    Train file\n  --valid \u003cfile\u003e\n    Validation file (used for early stopping)\n```\n\nModel structure options\n\n```\n  --hidden \u003cint\u003e\n    Size of embedding and hidden layers (default: 100)\n  --hidden-type \u003cstring\u003e\n    Hidden layer activation (sigmoid, tanh, relu, gru, gru-bias, gru-insyn, gru-full)\n    (default: sigmoid)\n  --hidden-count \u003cint\u003e\n    Count of hidden layers; all hidden layers have the same type and size (default: 1)\n  --arity \u003cint\u003e\n    Arity of the HS tree; for HS mode only (default: 2)\n  --direct \u003cint\u003e\n    Size of maxent layer in 
millions (default: 0)\n  --direct-order \u003cint\u003e\n    Maximum order of ngram features (default: 0)\n```\n\nLearning a reverse model, i.e. a model that predicts words from the last one to the first one, could be useful for mixture.\n\n```\n  --reverse-sentence (0 | 1)\n    Predict sentence words in reversed order (default: 0)\n```\n\n\nThe performance does not scale linearly with the number of threads (it is sub-linear due to cache misses, false HogWild assumptions, etc).\nTesting, validation and sampling are always performed by a single thread regardless of this setting.\nAlso check out the \"Performance notes\" section.\n\n```\n  --threads \u003cint\u003e\n    Number of threads to use\n```\n\nBy default, recurrent weights are initialized using a uniform distribution.\nIn [1] another method to initialize weights was suggested, i.e. an identity matrix multiplied by some positive constant.\nThe option below corresponds to this constant.\n\n```\n  --diagonal-initialization \u003cfloat\u003e\n    Initialize recurrent matrix with x * I (x is the value and I is identity matrix)\n    Must be greater than zero to have any effect (default: 0)\n```\n\nOptimization options\n\n```\n  --rmsprop \u003cfloat\u003e\n    RMSprop coefficient; rmsprop=1 disables rmsprop and rmsprop=0 is equivalent to RMS\n    (default: 1)\n  --gradient-clipping \u003cfloat\u003e\n    Clip updates above the value (default: 1)\n  --learn-recurrent (0 | 1)\n    Learn hidden layer weights (default: 1)\n  --learn-embeddings (0 | 1)\n    Learn embedding weights (default: 1)\n  --alpha \u003cfloat\u003e\n    Learning rate for recurrent and embedding weights (default: 0.1)\n  --maxent-alpha \u003cfloat\u003e\n    Learning rate for maxent layer (default: 0.1)\n  --beta \u003cfloat\u003e\n    Weight decay for recurrent and embedding weights, i.e. L2-regularization\n    (default: 1e-06)\n  --maxent-beta \u003cfloat\u003e\n    Weight decay for maxent layer, i.e. 
L2-regularization (default: 1e-06)\n```\n\nThe program supports truncated back propagation through time.\nGradients from hidden to input are back propagated on each time step.\nHowever, gradients from hidden to previous hidden are propagated for bptt steps within each bptt-period block.\nThis trick could speed up training and mitigate gradient explosion.\nSee [7] for details.\nTo disable any truncation set bptt to zero.\n\n```\n  --bptt \u003cint\u003e\n    Length of truncated BPTT unfolding\n    Set to zero to back-propagate through entire sentence (default: 3)\n  --bptt-skip \u003cint\u003e\n    Number of steps without BPTT;\n    Doesn't have any effect if bptt is 0 (default: 10)\n```\n\nEarly stopping options (see [0]).\nLet `ratio' be the ratio of the previous epoch's validation entropy to the new one.\n\n```\n  --stop \u003cfloat\u003e\n    If `ratio' less than `stop' then start learning rate decay (default: 1.003)\n  --lr-decay-factor \u003cfloat\u003e\n    Learning rate decay factor (default: 2)\n  --reject-threshold \u003cfloat\u003e\n    If (what's more) `ratio' less than `reject-threshold' then purge the epoch\n    (default: 0.997)\n  --retry \u003cint\u003e\n    Stop training once `ratio' has hit `stop' at least `retry' times (default: 2)\n```\n\nNoise Contrastive Estimation is used iff the number of noise samples (--nce option) is greater than zero.\nOtherwise HS is used.\nA reasonable value for nce is 20.\n\n```\n  --nce \u003cint\u003e\n    Number of noise samples; if nce is positive then NCE is used instead of HS\n    (default: 0)\n  --use-cuda (0 | 1)\n    Use CUDA to compute validation entropy and test entropy in accurate mode,\n    i.e. if nce-accurate-test is true (default: 0)\n  --use-cuda-memory-efficient (0 | 1)\n    Do not copy the whole maxent layer on GPU. 
Slower, but could be useful to deal with huge\n    maxent layers (default: 0)\n  --nce-unigram-power \u003cfloat\u003e\n    Discount power for unigram frequency (default: 1)\n  --nce-lnz \u003cfloat\u003e\n    Ln of normalization constant (default: 9)\n  --nce-unigram-min-cells \u003cfloat\u003e\n    Minimum number of cells for each word in unigram table (works\n    akin to Laplacian smoothing) (default: 5)\n  --nce-maxent-model \u003cstring\u003e\n    Use the given model as a noise generator\n    The model must be a pure maxent model trained by the program (default: )\n```\n\nOther options\n\n```\n  --epoch-per-file \u003cint\u003e\n    Treat one pass over the train file as given number of epochs (default: 1)\n  --seed \u003cint\u003e\n    Random seed for weight initialization and sampling (default: 0)\n  --show-progress (0 | 1)\n    Show training progress (default: 1)\n  --show-train-entropy (0 | 1)\n    Show average entropy on train set for the first thread (default: 0)\n    Train entropy calculation doesn't work for NCE\n\n```\n\n\n## Performance notes\nTo speed up matrix operations we use [Eigen](http://eigen.tuxfamily.org/) (C++ template library for linear algebra).\nBesides, we use data parallelism with sentence-batch HogWild [5].\nThe best performance could be achieved if all the threads are bound to the same CPU (one thread per core). This could be done by means of the `taskset` tool (available by default in most Linux distros).\nE.g. if you have 2 CPUs and each CPU has 8 real cores + 8 hyper threading cores, you should use the following command:\n\n```\ntaskset -c 0,1,2,3,4,5,6,7 ./rnnlm -threads 8 ...\n```\n\nIn NCE mode CUDA is used to accelerate validation entropy calculation.\nOf course, if you don't have a GPU, you can use the CPU to calculate entropy, but it will take a lot of time.\n\n## Usage advice\n\n  - You don't need to repeat structural parameters (hidden, hidden-type, reverse, direct, direct-order) when using an existing model. They will be ignored. 
The vocabulary saved in the model will be reused.\n  - The vocabulary is built based on the training file on the first run of the tool for a particular model. The program will ignore sentences with OOVs at training time (or report them at test time).\n  - Vocabulary size plays a very small role in the performance (it is logarithmic in the size of the vocabulary due to the Huffman tree decomposition). Hidden layer size and the amount of training data are the main factors.\n  - Usually NCE works better than HS in terms of both PPL and WER.\n  - Direct connections could dramatically improve model quality, especially in the case of HS. Reasonable values to start from are `-direct 1000 -direct-order 4`.\n  - The model will be written to file after a training epoch if and only if its validation entropy improved compared to the previous epoch.\n  - It is a good idea to shuffle sentences in the set before splitting them into training and validation sets (GNU shuf \u0026 split are one possible choice for this). For huge datasets use the --epoch-per-file option.\n\n\n## References\n[0] Mikolov, T. (2012). Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.\n\n[1] Le, Q. V., Jaitly, N., \u0026 Hinton, G. E. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv preprint arXiv:1504.00941.\n\n[2] Chung, J., Gulcehre, C., Cho, K., \u0026 Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.\n\n[3] Chen, X., Liu, X., Gales, M. J. F., \u0026 Woodland, P. C. (2015). Recurrent neural network language model training with noise contrastive estimation for speech recognition.\n\n[4] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol.  4, 2012\n\n[5] Recht, B., Re, C., Wright, S., \u0026 Niu, F. (2011). 
Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (pp. 693-701).\n\n[6] Mikolov, T., Chen, K., Corrado, G., \u0026 Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.\n\n[7] Sutskever, I. (2013). Training recurrent neural networks (Doctoral dissertation, University of Toronto).\n\n[8] Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., \u0026 Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. [GitHub](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark)\n\n[9] Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., \u0026 Ranzato, M. A. (2014). Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyandex%2Ffaster-rnnlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyandex%2Ffaster-rnnlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyandex%2Ffaster-rnnlm/lists"}