{"id":13741069,"url":"https://github.com/moses-smt/nplm","last_synced_at":"2025-05-08T21:32:23.788Z","repository":{"id":141233317,"uuid":"39263575","full_name":"moses-smt/nplm","owner":"moses-smt","description":"Fork of http://nlg.isi.edu/software/nplm/ with some efficiency tweaks and adaptation for use in mosesdecoder.","archived":false,"fork":true,"pushed_at":"2015-09-03T17:15:56.000Z","size":2850,"stargazers_count":14,"open_issues_count":0,"forks_count":10,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-08-04T04:07:22.982Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"rsennrich/nplm","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moses-smt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-17T16:21:48.000Z","updated_at":"2024-02-09T18:04:15.000Z","dependencies_parsed_at":"2023-03-13T06:15:13.798Z","dependency_job_id":null,"html_url":"https://github.com/moses-smt/nplm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fnplm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fnplm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fnplm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moses-smt%2Fnplm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moses-smt","download_url":"https://codeload.github.com/moses-smt/nplm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224774612,"owners_count":17367764,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:55.109Z","updated_at":"2024-11-15T11:30:52.220Z","avatar_url":"https://github.com/moses-smt.png","language":"C++","funding_links":[],"categories":["Software"],"sub_categories":["Utilities"],"readme":"2013-07-30\n\nPrerequisites\n-------------\n\nBefore compiling, you must have the following:\n\nA C++ compiler and GNU make\n\nBoost 1.47.0 or later\nhttp://www.boost.org\n\nEigen 3.1.x\nhttp://eigen.tuxfamily.org\n\nOptional:\n\nIntel MKL 11.x\nhttp://software.intel.com/en-us/intel-mkl\nRecommended for better performance.\n\nPython 2.7.x, not 3.x\nhttp://python.org\n\nCython 0.19.x\nhttp://cython.org\nNeeded only for building Python bindings.\n\nBuilding\n--------\n\nTo compile, edit the Makefile to reflect the locations of the Boost\nand Eigen include directories.\n\nIf you want to use the Intel MKL library (recommended if you have it),\nuncomment the line\n    MKL=/path/to/mkl\nediting it to point to the MKL root directory.\n\nBy default, multithreading using OpenMP is enabled. To turn it off,\ncomment out the line\n    OMP=1\n\nThen run 'make install'. This creates several programs in the bin/\ndirectory and a library lib/neuralLM.a.\n\nNotes on particular configurations:\n\n- Intel C++ compiler and OpenMP. With version 12, you may get a\n  \"pragma not found\" error. This is reportedly fixed in ComposerXE\n  update 9.\n\n- Mac OS X and OpenMP. The Clang compiler (/usr/bin/c++) doesn't\n  support OpenMP. If the g++ that comes with XCode doesn't work\n  either, try the one installed by MacPorts (/opt/local/bin/g++ or\n  /opt/local/bin/g++-mp-*).\n\nTraining a language model\n-------------------------\n\nBuilding a language model requires some preprocessing. In addition to\nany preprocessing of your own (tokenization, lowercasing, mapping of\ndigits, etc.), prepareNeuralLM (run with --help for options) does the\nfollowing:\n\n- Splits into training and validation data. The training data is used\n  to actually train the model, while the validation data is used to\n  check its performance.\n- Creates a vocabulary of the k most frequent words, mapping all other\n  words to \u003cunk\u003e.\n- Adds start \u003cs\u003e and stop \u003c/s\u003e symbols.\n- Converts to numberized n-grams.\n\nA typical invocation would be:\n\n    prepareNeuralLM --train_text mydata.txt --ngram_size 3 \\\n                    --n_vocab 5000 --words_file words \\\n                    --train_file train.ngrams \\\n                    --validation_size 500 --validation_file validation.ngrams\n\nwhich would generate the files train.ngrams, validation.ngrams, and words.\n\nThese files are fed into trainNeuralNetwork (run with --help for\noptions). A typical invocation would be:\n\n    trainNeuralNetwork --train_file train.ngrams \\\n                       --validation_file validation.ngrams \\\n                       --num_epochs 10 \\\n                       --words_file words \\\n                       --model_prefix model\n\nAfter each pass through the data, the trainer will print the\nlog-likelihood of both the training data and validation data (higher\nis better) and generate a series of model files called model.1,\nmodel.2, and so on. You choose which model you want based on the\nvalidation log-likelihood.\n\nYou can find a working example in the example/ directory. The Makefile\nthere generates a language model from a raw text file.\n\nNotes:\n\n- Vocabulary. You should set --n_vocab to something less than the\n  actual vocabulary size of the training data (and will receive a\n  warning if it's not). Otherwise, no probability will be learned for\n  unknown words. On the other hand, there is no need to limit n_vocab\n  for the sake of speed. At present, we have tested it up to 100000.\n\n- Normalization. Most of the computational cost normally (no pun\n  intended) associated with a large vocabulary has to do with\n  normalization of the conditional probability distribution P(word |\n  context). The trainer uses noise-contrastive estimation to avoid\n  this cost during training (Gutmann and Hyvärinen, 2010), and, by\n  default, sets the normalization factors to one to avoid this cost\n  during testing (Mnih and Hinton, 2009).\n\n  If you set --normalization 1, the trainer will try to learn the\n  normalization factors, and you should accordingly turn on\n  normalization when using the resulting model. The default initial\n  value --normalization_init 0 should be fine; you can try setting it\n  a little higher, but not lower.\n\n- Validation. The trainer computes the log-likelihood of a validation\n  data set (which should be disjoint from the training data). You use\n  this to decide when to stop training, and the trainer also uses it\n  to throttle the learning rate. This computation always uses exact\n  normalization and is therefore much slower, per instance, than\n  training. Therefore, you should make the validation data\n  (--validation_size) as small as you can. (For example, Section 00 of\n  the Penn Treebank has about 2000 sentences and 50,000 words.)\n\nPython code\n-----------\n\nprepareNeuralLM.py performs the same function as prepareNeuralLM, but in\nPython. This may be handy if you want to make modifications.\n\nnplm.py is a pure Python module for reading and using language models\ncreated by trainNeuralNetwork. See testNeuralLM.py for example usage.\n\nIn src/python are Python bindings (using Cython) for the C++ code. To\nbuild them, run 'make python/nplm.so'.\n\nUsing in a decoder\n------------------\n\nTo use the language model in a decoder, include neuralLM.h and link\nagainst neuralLM.a. This provides a class nplm::neuralLM, with the\nfollowing methods:\n\n    void set_normalization(bool normalization);\n\nTurn normalization on or off (default: off). If normalization is off,\nthe probabilities output by the model will not be normalized. In\ngeneral, this means that summing over all possible words will not give\na probability of one. If normalization is on, computes exact\nprobabilities (too slow to be recommended for decoding).\n\n    void set_map_digits(char c);\n\nMap all digits (0-9) to the specified character. This should match\nwhatever mapping you used during preprocessing.\n\n    void set_log_base(double base);\n\nSet the base of the log-probabilities returned by lookup_ngram. The\ndefault is e (natural log), whereas most other language modeling\ntoolkits use base 10.\n\n    void read(const string \u0026filename);\n\nRead model from file.\n\n    int get_order();\n\nReturn the order of the language model.\n\n    int lookup_word(const string \u0026word);\n\nMap a word to an index for use with lookup_ngram().\n\n    double lookup_ngram(const vector\u003cint\u003e \u0026ngram);\n    double lookup_ngram(const int *ngram, int n);\n\nLook up the log-probability of ngram.\n\nEnd.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoses-smt%2Fnplm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoses-smt%2Fnplm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoses-smt%2Fnplm/lists"}