{"id":17044453,"url":"https://github.com/srush/mrf-lm","last_synced_at":"2025-04-30T09:17:16.026Z","repository":{"id":34500661,"uuid":"38441426","full_name":"srush/MRF-LM","owner":"srush","description":null,"archived":false,"fork":false,"pushed_at":"2015-07-20T03:20:22.000Z","size":2792,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-30T09:17:13.127Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/srush.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-02T15:41:47.000Z","updated_at":"2025-01-13T15:20:52.000Z","dependencies_parsed_at":"2022-09-06T04:11:08.527Z","dependency_job_id":null,"html_url":"https://github.com/srush/MRF-LM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/srush%2FMRF-LM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/srush%2FMRF-LM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/srush%2FMRF-LM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/srush%2FMRF-LM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/srush","download_url":"https://codeload.github.com/srush/MRF-LM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251674590,"owners_count":21625646,"icon_url":"https://github.com/github.png","version":null,"created_at
":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T09:34:28.733Z","updated_at":"2025-04-30T09:17:16.002Z","avatar_url":"https://github.com/srush.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MRF-LM\n## Fast Markov Random Field Language Models\n\n[Documentation in progress]\n\nAn implementation of a fast variational inference algorithm for Markov\nRandom Field language models as well as other Markov sequence models.\n\nThis algorithm implemented in this project is described in the paper\n\n    A Fast Variational Approach for Learning Markov Random Field Language Models\n    Yacine Jernite, Alexander M. Rush, and David Sontag.\n    Proceedings of ICML 2015.\n\nAvailable [here](http://people.seas.harvard.edu/~srush/icml15.pdf).\n\n## Building\n\nTo build the main C++ library, run\n\n    bash build.sh\n\nThis will build liblbfgs (needed for optimization) as well as the main\nexecutables. The package requires a C++ compiler with support for\nOpenMP.\n\n## Training a Language Model\n\nThe training procedure requires two steps.\n\nFirst you construct a moments file from the text data of interest. We include the\nstandard Penn Treebank language modelling data set as an example. This data is located under `lm_data/` . 
To extract moments from this data, run\n\n    python Moments.py --K 2 --train lm_data/ptb.train.txt --valid lm_data/ptb.valid.txt\n\n\nNext, run the main `mrflm` executable, providing the training moments, validation moments, and an output file for the model.\n\n    ./mrflm --train=lm_data/ptb.train.txt_moments_K2.dat --valid=lm_data/ptb.valid.txt_moments_K2.dat --output=lm.model\n\nThis command will train a language model, compute validation\nlog-likelihood, and write the parameters out to `lm.model`. (These\nparameter settings correspond to Figure 6 in the paper.)\n\n## MRF-LM\n\nThe main MRF executable has several options for controlling the\nmodel used, the training procedure, and the parameters of dual decomposition.\n\n    usage: ./mrflm --train=string --valid=string --output=string [options] ...\n    options:\n      --train           Training moments file. (string)\n      --valid           Validation moments file. (string)\n      -o, --output      Output file to write model to. (string)\n      -m, --model       Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters). (string [=LM])\n      -D, --dims        Size of embedding for low-rank MRF. (int [=100])\n      -c, --cores       Number of cores to use for OpenMP. (int [=20])\n      -d, --dual-rate   Dual decomposition subgradient rate (\\alpha_1). (double [=20])\n      --dual-iter       Dual decomposition subgradient epochs to run. (int [=500])\n      --mult-rate       Dual decomposition subgradient decay rate. (double [=0.5])\n      --keep-deltas     Keep dual delta values to hot-start between training epochs. (bool [=0])\n      -?, --help        print this message\n\nThere is a separate executable for testing the model after it is written.\n\n    usage: ./mrflm_test --model-name=string --valid=string [options] ...\n    options:\n      -o, --model-name      Output file to write model to. 
(string)\n      -m, --model           Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters), Tag (POS tagger). (string [=LM])\n          --valid           Validation moments file. (string)\n          --train           Training moments file. (string [=])\n          --embeddings      File to write word-embeddings to. (string [=])\n          --vocab           Word vocab file. (string [=])\n          --tag-features    Features for the tagging model. (string [=])\n          --tag-file        Tag test file. (string [=])\n          --tag-vocab       Tag vocab file. (string [=])\n      -c, --cores           Number of cores to use for OpenMP. (int [=20])\n      -?, --help            print this message\n\n\n## Training a Tagging Model\n\nThe tagging model can be trained in a very similar way. We assume that the data is in the CoNLL parsing format and is located under\nthe `tag_data` directory. To construct the moments, run the following command\n\n    python MomentsTag.py tag_data/ptb.train.txt tag_data/ptb.valid.txt tag_data/ptb.test.txt tag\n\n\nNext, run the main `mrflm` executable, providing the moments, the tag features, and the validation data.\n\n    ./mrflm --train=tag_data/ptb.train.txt.tag.counts  --valid=tag_data/ptb.valid.txt.tag.counts --output=tag.model --model=Tag --tag-features=tag_data/ptb.train.txt.tag.features --valid-tag=tag_data/ptb.valid.txt.tag.words\n\nThis command will train a tagging model, evaluate it on the validation data by running the Viterbi algorithm, and write the parameters out to `tag.model`.\n\n## Advanced Usage\n\n### Word Embeddings\n\nOnce a model is trained, `mrflm_test` can be used to view the embeddings produced by the model.\n\n    ./mrflm_test  --model-name=lm.model --embeddings embed --vocab lm_data/ptb.train.txt_vocab_K2.dat\n\nThis will output two files. The file `embed` will contain the word embedding vectors, one per line. 
The\nfile `embed.nn` will contain the 10 nearest neighbors for each word in the vocabulary.\n\n### Tagger\n\nThe tagging model can also be used after it is trained. To run the tagger on a data set (such as the test set), use the following command.\n\n    ./mrflm_test --model-name=tag.model   --model Tag --tag-file=tag_data/ptb.test.txt.tag.words --tag-features=tag_data/ptb.train.txt.tag.features --vocab=tag_data/ptb.train.txt.tag.names --tag-vocab=tag_data/ptb.train.txt.tag.tagnames --train=tag_data/ptb.train.txt.tag.counts  --cores=1\n\n\n### Moments File Format\n\nThe input to the main implementation is a file containing the moments of the lifted\nMRF. The moments file assumes the lifted graph is star-shaped with the central variable\nas index one. The format of the file is\n\n    {N = # of samples}\n    {L = # of variables}\n    {# of states of variable 1} {# of states of variable 2} ...(L columns)\n    {M1 = # of 2-\u003e1 pairs}\n    {State in 2} {State in 1} {Counts}\n    {State in 2} {State in 1} {Counts}\n    ...(M1 rows)\n    {M2 = # of 3-\u003e1 pairs}\n    {State in 3} {State in 1} {Counts}\n    {State in 3} {State in 1} {Counts}\n    ...(M2 rows)\n\nThis file format is used for both language modelling and tagging.\n\n### LM Moments File\n\nConsider a language modelling setup.\n\nFor example, let's say we were building a language model\nwith the training corpus:\n\n    the cat chased the mouse\n\nIf our model has context K = 2, then we transform the corpus to:\n\n    \u003cS\u003e \u003cS\u003e the cat chased the mouse\n\nAfter the transformation, the number of samples is N = 7, the number of\nvariables is L = K+1 = 3, the vocabulary size/number of states is V=5, and\nthe dictionary is:\n\n    \u003cS\u003e 1\n    the 2\n    cat 3\n    chased 4\n    mouse 5\n\nThe corresponding moments file would then look like:\n\n    7\n    3\n    5 5 5\n    7\n    5 1 1\n    1 1 1\n    1 2 1\n    2 3 1\n    3 4 1\n    4 2 1\n    2 5 1\n    7\n    2 1 1\n    5 1 1\n    1 2 1\n  
  1 3 1\n    2 4 1\n    3 2 1\n    4 5 1\n\n### Tagging Moments File\n\nNow consider a tagging setup. Let's say we were building a tagging\nmodel with the training corpus:\n\n    the/D cat/N chased/V the/D mouse/N\n\nIf our model has context K=1, M=3 (roughly corresponding to Figure 7 in the paper), then we transform the corpus to:\n\n    \u003cS\u003e/\u003cT\u003e the/D cat/N chased/V the/D mouse/N\n\nAfter the transformation, the number of samples is N = 6, the number of\nlifted variables is L = M + K+1 = 5, the number of tag states is T=4 and V=5 as above,\nand the tag dictionary is:\n\n    \u003cT\u003e 1\n    D 2\n    N 3\n    V 4\n\nThe corresponding moments file would then look like:\n\n    6\n    5\n    4 5 5 5\n    6\n    3 1 1\n    1 2 1\n    2 3 1\n    3 4 1\n    4 2 1\n    2 3 1\n    ...\n\n\n\n### Code Structure\n\nThe code is broken into three main classes:\n\n* `Train.h`: Generic L-BFGS training. Implements most of Algorithm 2.\n\n* `Inference.h`: Lifted inference on a star-shaped MRF. Implements Algorithm 1.\n\n* `Model.h`: Pairwise MRF parameters. Implements likelihood computation, gradient updates, and lifted structure.\n\nThe `Model.h` class is a full-rank MRF by default, but can be easily\nextended to allow for alternative parameterizations. See `LM.h` for the low-rank\nlanguage model with back-prop (Model 2 in the paper), and `Tag.h` for a feature-factorized\npart-of-speech tagging model.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrush%2Fmrf-lm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrush%2Fmrf-lm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrush%2Fmrf-lm/lists"}