{"id":16864422,"url":"https://github.com/stonesjtu/pytorch-nce","last_synced_at":"2025-04-06T03:10:07.276Z","repository":{"id":48522565,"uuid":"91752114","full_name":"Stonesjtu/Pytorch-NCE","owner":"Stonesjtu","description":"The Noise Contrastive Estimation for softmax output written in Pytorch","archived":false,"fork":false,"pushed_at":"2019-11-06T14:40:59.000Z","size":19634,"stargazers_count":318,"open_issues_count":3,"forks_count":45,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-30T02:08:16.439Z","etag":null,"topics":["importance-sampling","language-model","nce","nce-criterion","pytorch","softmax","speedup"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Stonesjtu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-19T01:18:57.000Z","updated_at":"2025-01-20T10:54:05.000Z","dependencies_parsed_at":"2022-09-17T05:00:15.549Z","dependency_job_id":null,"html_url":"https://github.com/Stonesjtu/Pytorch-NCE","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2FPytorch-NCE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2FPytorch-NCE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2FPytorch-NCE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stonesjtu%2FPytorch-NCE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Stonesjtu","download_url":"https://codeload.github.com/Stonesjtu/Pytorch-NCE/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247427006,"owners_count":20937201,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["importance-sampling","language-model","nce","nce-criterion","pytorch","softmax","speedup"],"created_at":"2024-10-13T14:42:15.493Z","updated_at":"2025-04-06T03:10:07.256Z","avatar_url":"https://github.com/Stonesjtu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"An NCE implementation in pytorch\n===\n\nAbout NCE\n---\n\nNoise Contrastive Estimation (NCE) is an approximation method that is used to work around\nthe huge computational cost of large softmax layer. The basic idea is to convert the prediction\nproblem into classification problem at training stage. It has been proved that these two criterions\nconverges to the same minimal point as long as noise distribution is close enough to real one.\n\nNCE bridges the gap between generative models and discriminative models, rather than simply speedup\nthe softmax layer. With NCE, you can turn almost anything into posterior with less effort (I think).\n\nRefs:\n\nNCE:\n\u003e http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf\n\nNCE on rnnlm:\n\u003e https://pdfs.semanticscholar.org/144e/357b1339c27cce7a1e69f0899c21d8140c1f.pdf\n\n### Comparison with other methods\n\nA review of softmax speedup methods:\n\u003e http://ruder.io/word-embeddings-softmax/\n\nNCE vs. IS (Importance Sampling): Nce is a binary classification while IS is sort of multi-class\nclassification problem.\n\u003e http://demo.clab.cs.cmu.edu/cdyer/nce_notes.pdf\n\nNCE vs. GAN (Generative Adversarial Network):\n\u003e https://arxiv.org/abs/1412.6515\n\n### On improving NCE\n\n#### Sampling methods\n\nIn NCE, unigram distribution is usually used to approximate the noise distribution because it's fast to\nsample from. Sampling from a unigram is equal to multinomial sampling, which is of complexity $O(\\log(N))$\nvia binary search tree. The cost of sampling becomes significant when noise ratio increases.\n\nSince the unigram distribution can be obtained before training and remains unchanged across training,\nsome works are proposed to make use of this property to speedup the sampling procedure. Alias method is\none of them.\n\n\u003cimg src=\"https://github.com/Stonesjtu/Pytorch-NCE/blob/master/res/alias.gif?raw=true\" alt=\"diagram of constructing auxiliary data structure\" height=\"200\" /\u003e\n\nBy constructing data structures, alias method can reduce the sampling complexity from $O(log(N))$ to $O(1)$,\nand it's easy to parallelize.\n\nRefs:\n\nalias method:\n\u003e https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/\n\n#### Generic NCE (full-NCE)\n\nConventional NCE only perform the contrasting on linear(softmax) layer, that is, given an input of a\nlinear layer, the model outputs are $p(noise|input)$ and $p(target|input)$. In fact NCE can be applied\nto more general situations where models are capable to output likelihood values for both real data and\nnoise data.\n\nIn this code base, I use a variant of generic NCE named full-NCE (f-NCE) to clarify. Unlike normal NCE,\nf-NCE samples the noises at input embedding.\n\nRefs:\n\nwhole sentence language model by IBM (ICASSP2018)\n\nBi-LSTM language model by speechlab,SJTU (ICSLP2016?)\n\n#### Batched NCE\n\nConventional NCE requires different noise samples per data token. Such computational pattern is not fully\nGPU-efficient because it needs batched matrix multiplication. A trick is to share the noise samples across\nthe whole mini-batch, thus sparse batched matrix multiplication is converted to more efficient\ndense matrix multiplication. The batched NCE is already supported by Tensorflow.\n\nA more aggressive approach is to called self contrasting (named by myself). Instead of sampling from noise\ndistribution, the noises are simply the other training tokens the within the same mini-batch.\n\nRef:\n\nbatched NCE\n\u003e https://arxiv.org/pdf/1708.05997.pdf\n\nself contrasting:\n\u003e https://www.isi.edu/natural-language/mt/simple-fast-noise.pdf\n\nRun the word language model example\n---\n\nThere's an example illustrating how to use the NCE module in `example` folder.\nThis example is forked from the pytorch/examples repo.\n\n### Requirements\n\nPlease run `pip install -r requirements` first to see if you have the required python lib.\n- `tqdm` is used for process bar during training\n- `dill` is a more flexible replacement for pickle\n\n### NCE related Arguments\n\n- `--nce`: whether to use NCE as approximation\n- `--noise-ratio \u003c50\u003e`: numbers of noise samples per batch, the noise is shared among the\ntokens in a single batch, for training speed.\n- `--norm-term \u003c9\u003e`: the constant normalization term `Ln(z)`\n- `--index-module \u003clinear\u003e`: index module to use for NCE module (currently\n\u003clinear\u003e and \u003cgru\u003e available, \u003cgru\u003e does not support PPL calculating )\n- `--train`: train or just evaluation existing model\n- `--vocab \u003cNone\u003e`: use vocabulary file if specified, otherwise use the words in train.txt\n- `--loss [full, nce, sampled, mix]`: choose one of the loss type for training, the loss is\nconverted to `full` for PPL evaluation automatically.\n\n### Examples\n\nRun NCE criterion with linear module:\n```bash\npython main.py --cuda --noise-ratio 10 --norm-term 9 --nce --train\n```\n\nRun NCE criterion with gru module:\n```bash\npython main.py --cuda --noise-ratio 10 --norm-term 9 --nce --train --index-module gru\n```\n\nRun conventional CE criterion:\n```bash\npython main.py --cuda --train\n```\n\n### A small benchmark in swbd+fisher dataset\n\nIt's a performance showcase. The dataset is not bundled in this repo however.\nThe model is trained on concatenated sentences,but the hidden states are not\npassed across batches. An `\u003cs\u003e` is inserted between sentences. The model is\nevaluated on `\u003cs\u003e` padded sentences separately.\n\nGenerally a model trained on concatenated sentences performs slightly worse than\nthe one trained on separate sentences. But we saves 50% of training time by reducing\nthe sentence padding operation.\n\n#### dataset statistics\n- training samples: 2200000 sentences, 22403872 words\n- built vocabulary size: ~30K\n\n#### testbed\n- 1080 Ti\n- i7 7700K\n- pytorch-0.4.0\n- cuda-8.0\n- cudnn-6.0.1\n\n#### how to run:\n```bash\npython main.py --train --batch-size 96 --cuda --loss nce --noise-ratio 500 --nhid 300 \\\n  --emsize 300 --log-interval 1000 --nlayers 1 --dropout 0 --weight-decay 1e-8 \\\n  --data data/swb --min-freq 3 --lr 2 --save nce-500-swb --concat\n```\n\n#### Running time\n- crossentropy: 6.5 mins/epoch (56K tokens/sec)\n- nce: 2 mins/epoch (187K tokens/sec)\n\n#### performance\n\nThe rescore is performed on swbd 50-best, thanks to HexLee.\n\n| training loss type | evaluation type | PPL     | WER                          |\n| :---:              | :---:           | :--:    | :--:                         |\n| 3gram              | normed          | ??      | 19.4                         |\n| CE(no concat)      | normed(full)    | 53      | 13.1                         |\n| CE                 | normed(full)    | 55      | 13.3                         |\n| NCE                | unnormed(NCE)   | invalid | 13.4                         |\n| NCE                | normed(full)    | 55      | 13.4                         |\n| importance sample  | normed(full)    | 55      | 13.4                         |\n| importance sample  | sampled(500)    | invalid | 19.0(worse than w/o rescore) |\n\n\n### File structure\n\n- `example/log/`: some log files of this scripts\n- `nce/`: the NCE module wrapper\n    - `nce/nce_loss.py`: the NCE loss\n    - `nce/alias_multinomial.py`: alias method sampling\n    - `nce/index_linear.py`: an index module used by NCE, as a replacement for normal Linear module\n    - `nce/index_gru.py`: an index module used by NCE, as a replacement for the whole language model module\n- `sample.py`: a simple script for NCE linear.\n- `example`: a word langauge model sample to use NCE as loss.\n    - `example/vocab.py`: a wrapper for vocabulary object\n    - `example/model.py`: the wrapper of all `nn.Module`s.\n    - `example/generic_model.py`: the model wrapper for index_gru NCE module\n    - `example/main.py`: entry point\n    - `example/utils.py`: some util functions for better code structure\n\n-----------------\n### Modified README from Pytorch/examples\n\nThis example trains a multi-layer LSTM on a language modeling task.\nBy default, the training script uses the PTB dataset, provided.\n\n```bash\npython main.py --train --cuda --epochs 6        # Train a LSTM on PTB with CUDA\n```\n\nThe model will automatically use the cuDNN backend if run on CUDA with\ncuDNN installed.\n\nDuring training, if a keyboard interrupt (Ctrl-C) is received,\ntraining is stopped and the current model is evaluated against the test dataset.\n\nThe `main.py` script accepts the following arguments:\n\n```bash\noptional arguments:\n  -h, --help         show this help message and exit\n  --data DATA        location of the data corpus\n  --emsize EMSIZE    size of word embeddings\n  --nhid NHID        humber of hidden units per layer\n  --nlayers NLAYERS  number of layers\n  --lr LR            initial learning rate\n  --lr-decay         learning rate decay when no progress is observed on validation set\n  --weight-decay     weight decay(L2 normalization)\n  --clip CLIP        gradient clipping\n  --epochs EPOCHS    upper epoch limit\n  --batch-size N     batch size\n  --dropout DROPOUT  dropout applied to layers (0 = no dropout)\n  --seed SEED        random seed\n  --cuda             use CUDA\n  --log-interval N   report interval\n  --save SAVE        path to save the final model\n  --bptt             max length of truncated bptt\n  --concat           use concatenated sentence instead of individual sentence\n```\n\n\n# CHANGELOG\n\n- 2019.09.09: Improve numeric stability by directly calculation on logits\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstonesjtu%2Fpytorch-nce","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstonesjtu%2Fpytorch-nce","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstonesjtu%2Fpytorch-nce/lists"}