https://github.com/srush/mrf-lm
- Host: GitHub
- URL: https://github.com/srush/mrf-lm
- Owner: srush
- License: lgpl-3.0
- Created: 2015-07-02T15:41:47.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2015-07-20T03:20:22.000Z (almost 11 years ago)
- Last Synced: 2025-04-30T09:17:13.127Z (about 1 year ago)
- Language: Shell
- Size: 2.66 MB
- Stars: 6
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MRF-LM
## Fast Markov Random Field Language Models
[Documentation in progress]
An implementation of a fast variational inference algorithm for Markov
Random Field language models as well as other Markov sequence models.
The algorithm implemented in this project is described in the paper:
A Fast Variational Approach for Learning Markov Random Field Language Models
Yacine Jernite, Alexander M. Rush, and David Sontag.
Proceedings of ICML 2015.
Available [here](http://people.seas.harvard.edu/~srush/icml15.pdf).
## Building
To build the main C++ library, run

    bash build.sh

This will build liblbfgs (needed for optimization) as well as the main
executables. The package requires a C++ compiler with support for
OpenMP.
## Training a Language Model
The training procedure requires two steps.
First, construct a moments file from the text data of interest. We include the
standard Penn Treebank language modelling data set as an example. This data is located under `lm_data/`. To extract moments from this file, run

    python Moments.py --K 2 --train lm_data/ptb.train.txt --valid lm_data/ptb.valid.txt

Next, run the main `mrflm` executable, providing the training moments, validation moments, and an output file for the model.

    ./mrflm --train=lm_data/ptb.train.txt_moments_K2.dat --valid=lm_data/ptb.valid.txt_moments_K2.dat --output=lm.model

This command will train a language model, compute validation
log-likelihood, and write the parameters out to `lm.model`. (These
parameter settings correspond to Figure 6 in the paper.)
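
The two steps can also be assembled from a script. The following is a minimal sketch, assuming the moments files follow the `<input>_moments_K<K>.dat` naming visible in the commands above; the helper names `moments_path` and `training_commands` are hypothetical and not part of this repository.

```python
def moments_path(text_path, K):
    """Moments file name for a text file, following the naming in the example commands."""
    return f"{text_path}_moments_K{K}.dat"


def training_commands(train, valid, output, K=2):
    """Return the two documented training steps as argv lists (nothing is executed)."""
    extract = ["python", "Moments.py", "--K", str(K),
               "--train", train, "--valid", valid]
    fit = ["./mrflm",
           f"--train={moments_path(train, K)}",
           f"--valid={moments_path(valid, K)}",
           f"--output={output}"]
    return [extract, fit]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(argv, check=True) to run them.
    for argv in training_commands("lm_data/ptb.train.txt",
                                  "lm_data/ptb.valid.txt", "lm.model"):
        print(" ".join(argv))
```

Keeping each command as a list avoids shell quoting problems if a data path contains spaces.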
## MRF-LM
The main MRF executable has several options for controlling the
model used, training procedure, and the parameters of dual decomposition.

    usage: ./mrflm --train=string --valid=string --output=string [options] ...
    options:
          --train          Training moments file. (string)
          --valid          Validation moments file. (string)
      -o, --output         Output file to write model to. (string)
      -m, --model          Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters). (string [=LM])
      -D, --dims           Size of embedding for low-rank MRF. (int [=100])
      -c, --cores          Number of cores to use for OpenMP. (int [=20])
      -d, --dual-rate      Dual decomposition subgradient rate (\alpha_1). (double [=20])
          --dual-iter      Dual decomposition subgradient epochs to run. (int [=500])
          --mult-rate      Dual decomposition subgradient decay rate. (double [=0.5])
          --keep-deltas    Keep dual delta values to hot-start between training epochs. (bool [=0])
      -?, --help           print this message

There is a separate executable for testing the model after it is written.

    usage: ./mrflm_test --model-name=string --valid=string [options] ...
    options:
      -o, --model-name      Output file to write model to. (string)
      -m, --model           Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters), Tag (POS tagger). (string [=LM])
          --valid           Validation moments file. (string)
          --train           Training moments file. (string [=])
          --embeddings      File to write word-embeddings to. (string [=])
          --vocab           Word vocab file. (string [=])
          --tag-features    Features for the tagging model. (string [=])
          --tag-file        Tag test file. (string [=])
          --tag-vocab       Tag vocab file. (string [=])
      -c, --cores           Number of cores to use for OpenMP. (int [=20])
      -?, --help            print this message

## Training a Tagging Model
The tagging model can be trained in a very similar way. We assume that the data is in the CoNLL parsing format and located under
the `tag_data` directory. To construct the moments, run the following command:

    python MomentsTag.py tag_data/ptb.train.txt tag_data/ptb.valid.txt tag_data/ptb.test.txt tag

Next, run the main `mrflm` executable, providing the training moments, the tag features, and the validation data.

    ./mrflm --train=tag_data/ptb.train.txt.tag.counts --valid=tag_data/ptb.valid.txt.tag.counts --output=tag.model --model=Tag --tag-features=tag_data/ptb.train.txt.tag.features --valid-tag=tag_data/ptb.valid.txt.tag.words

This command will train a tagging model, evaluate it on the validation data by running the Viterbi algorithm, and write the parameters out to `tag.model`.
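
For readers unfamiliar with the decoding step, here is a generic first-order Viterbi sketch in Python. It is not the decoder used by `mrflm`; the score tables `start`, `trans`, and `emit` are hypothetical log-scores chosen only for illustration.

```python
def viterbi(n_tags, start, trans, emit):
    """Return the highest-scoring tag sequence for one sentence.

    start[t]    : log-score of tag t at position 0
    trans[s][t] : log-score of moving from tag s to tag t
    emit[i][t]  : log-score of tag t for the word at position i
    """
    n = len(emit)
    if n == 0:
        return []
    best = [[0.0] * n_tags for _ in range(n)]  # best[i][t]: best score ending in t at i
    back = [[0] * n_tags for _ in range(n)]    # back[i][t]: previous tag on that best path
    for t in range(n_tags):
        best[0][t] = start[t] + emit[0][t]
    for i in range(1, n):
        for t in range(n_tags):
            prev = max(range(n_tags), key=lambda s: best[i - 1][s] + trans[s][t])
            best[i][t] = best[i - 1][prev] + trans[prev][t] + emit[i][t]
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    tag = max(range(n_tags), key=lambda t: best[n - 1][t])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]


if __name__ == "__main__":
    # Two tags, three words; the transition scores favour alternating tags.
    print(viterbi(2, [0.0, -1.0], [[-1.0, 0.0], [0.0, -1.0]], [[0.0, 0.0]] * 3))
```

The running time is O(n T^2) for a sentence of n words and T tags.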
## Advanced Usage
### Word Embeddings
Once a model is trained, `mrflm_test` can be used to view the embeddings produced by the model.

    ./mrflm_test --model-name=lm.model --embeddings embed --vocab lm_data/ptb.train.txt_vocab_K2.dat

This will output two files. The file `embed` will contain the word embedding vectors, one per line. The
file `embed.nn` will contain the 10 nearest neighbors for each word in the vocabulary.
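
A neighbor listing of this kind can be approximated from the vectors alone. The sketch below is illustrative, assuming each line of `embed` holds whitespace-separated floats in vocabulary order and that neighbors are ranked by cosine similarity. Neither assumption is confirmed by this README, and the function names are hypothetical.

```python
import math


def parse_vectors(lines):
    """Parse one whitespace-separated float vector per non-empty line (assumed layout)."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, words, k=10):
    """Map each word to its k most similar other words, best first."""
    result = {}
    for i, word in enumerate(words):
        scored = [(cosine(vectors[i], vectors[j]), words[j])
                  for j in range(len(words)) if j != i]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result[word] = [w for _, w in scored[:k]]
    return result


if __name__ == "__main__":
    demo = parse_vectors(["1.0 0.0", "0.9 0.1", "0.0 1.0"])
    print(nearest_neighbors(demo, ["cat", "mouse", "the"], k=1))
```

This brute-force loop is quadratic in the vocabulary size, which is acceptable for a vocabulary of a few thousand words but slow for much larger ones.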
### Tagger
The tagging model can also be used after it is trained. To run the tagger on a data set (such as the test set), use the following command.

    ./mrflm_test --model-name=tag.model --model Tag --tag-file=tag_data/ptb.test.txt.tag.words --tag-features=tag_data/ptb.train.txt.tag.features --vocab=tag_data/ptb.train.txt.tag.names --tag-vocab=tag_data/ptb.train.txt.tag.tagnames --train=tag_data/ptb.train.txt.tag.counts --cores=1

### Moments File Format
The input to the main implementation is a file containing the moments of the lifted
MRF. The moments file assumes the lifted graph is star-shaped with the central variable
as index one. The format of the file is:

    {N = # of samples}
    {L = # of variables}
    {# of states of variable 1} {# of states of variable 2} ...(L columns)
    {M1 = # of 2->1 pairs}
    {State in 2} {State in 1} {Counts}
    {State in 2} {State in 1} {Counts}
    ...(M1 rows)
    {M2 = # of 3->1 pairs}
    {State in 3} {State in 1} {Counts}
    {State in 3} {State in 1} {Counts}
    ...(M2 rows)

This file format is used for both language modelling and tagging.
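
The layout above can be read back with a few lines of Python. This is a minimal sketch, assuming the file is plain whitespace-separated integers with no comments or extra headers; `parse_moments` is a hypothetical helper, not part of this repository.

```python
def parse_moments(text):
    """Parse moments-file text into (N, state_counts, blocks).

    blocks[k] holds (neighbor_state, center_state, count) rows for the pairs
    between variable k + 2 and the central variable 1.
    """
    tokens = iter(text.split())
    n_samples = int(next(tokens))
    n_vars = int(next(tokens))
    state_counts = [int(next(tokens)) for _ in range(n_vars)]
    blocks = []
    for _ in range(n_vars - 1):  # one block per non-central variable
        n_rows = int(next(tokens))
        rows = [(int(next(tokens)), int(next(tokens)), int(next(tokens)))
                for _ in range(n_rows)]
        blocks.append(rows)
    return n_samples, state_counts, blocks


if __name__ == "__main__":
    example = "3\n2\n2 2\n2\n1 1 2\n2 1 1\n"
    print(parse_moments(example))
```

Reading token by token rather than line by line keeps the sketch tolerant of differences in line breaks, at the cost of not validating the row layout.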
### LM Moments File
As an example, suppose we are building a language model
with the training corpus:

    the cat chased the mouse

If the model has context K = 2, we pad the corpus with K start symbols (written `<S>` here):

    <S> <S> the cat chased the mouse

After the transformation the number of samples is N = 7, the number of
variables is L = K+1 = 3, the vocabulary size/number of states is V = 5, and
the dictionary is:

    <S> 1
    the 2
    cat 3
    chased 4
    mouse 5

The corresponding moments file would then look like:

    7
    3
    5 5 5
    7
    5 1 1
    1 1 1
    1 2 1
    2 3 1
    3 4 1
    4 2 1
    2 5 1
    7
    2 1 1
    5 1 1
    1 2 1
    1 3 1
    2 4 1
    3 2 1
    4 5 1

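
The numbers in this example can be reproduced with a short Python sketch. It is not the code in `Moments.py`. It assumes that the start symbol takes index 1, that words are numbered in order of first appearance, and that context positions wrap around the end of the corpus, all of which are inferred from the example rather than documented. The symbol name `<S>` and the function name `lm_moments` are placeholders.

```python
from collections import Counter


def lm_moments(words, K, start="<S>"):
    """Build moments-file text for one corpus with context size K."""
    # Start symbol gets index 1; other words are numbered by first appearance.
    vocab = {start: 1}
    for w in words:
        vocab.setdefault(w, len(vocab) + 1)
    ids = [1] * K + [vocab[w] for w in words]
    n = len(ids)
    lines = [str(n), str(K + 1), " ".join([str(len(vocab))] * (K + 1))]
    for k in range(1, K + 1):
        # Pair each token with the one k steps earlier, wrapping around
        # the corpus boundary (assumption inferred from the example).
        counts = Counter((ids[(i - k) % n], ids[i]) for i in range(n))
        lines.append(str(len(counts)))
        lines.extend(f"{prev} {cur} {c}" for (prev, cur), c in counts.items())
    return "\n".join(lines)


if __name__ == "__main__":
    print(lm_moments("the cat chased the mouse".split(), K=2))
```

Because `Counter` keeps insertion order, rows come out in corpus order, which matches the row order of the example.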
### Tagging Moments File
Now consider a tagging setup. Suppose we are building a tagging
model with the training corpus:

    the/D cat/N chased/V the/D mouse/N

If the model has context K = 1 and M = 3 (roughly corresponding to Figure 7 in the paper), we pad the corpus with a start word and a start tag (both written `<S>` here):

    <S>/<S> the/D cat/N chased/V the/D mouse/N

After the transformation the number of samples is N = 6, the number of
lifted variables is L = M + K + 1 = 5, the number of tag states is T = 4 and V = 5 as above,
and the tag dictionary is:

    <S> 1
    D 2
    N 3
    V 4

The corresponding moments file would then look like:

    6
    5
    4 5 5 5
    6
    3 1 1
    1 2 1
    2 3 1
    3 4 1
    4 2 1
    2 3 1
    ...

### Code Structure
The code is broken into three main classes:
* `Train.h`: Generic L-BFGS training. Implements most of Algorithm 2.
* `Inference.h`: Lifted inference on a star-shaped MRF. Implements Algorithm 1.
* `Model.h`: Pairwise MRF parameters. Implements likelihood computation, gradient updates, and lifted structure.

The `Model.h` class is a full-rank MRF by default, but can be easily
extended to allow for alternative parameterizations. See `LM.h` for the low-rank
language model with back-prop (Model 2 in the paper), and `Tag.h` for a
feature-factorized part-of-speech tagging model.
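
To see why a low-rank parameterization matters, compare parameter counts. The sketch below is a generic illustration, assuming each pairwise table over V states is factored into D-dimensional embeddings (one matrix for the central word and one per context position). The exact factorization in `LM.h` may differ, and all names and sizes here are hypothetical.

```python
def full_rank_params(V, K):
    """Full-rank model: one V x V table per context position."""
    return K * V * V


def low_rank_params(V, K, D):
    """Low-rank model: one V x D matrix for the central word plus one per context position."""
    return (K + 1) * V * D


def low_rank_score(center_vec, context_vec):
    """Pairwise log-potential as a dot product of two D-dimensional embeddings."""
    return sum(c * x for c, x in zip(center_vec, context_vec))


if __name__ == "__main__":
    V, K, D = 10000, 2, 100  # illustrative sizes; D matches the default of --dims
    print(full_rank_params(V, K), low_rank_params(V, K, D))
    print(low_rank_score([1.0, 2.0], [0.5, -1.0]))
```

Under these assumptions the low-rank model grows linearly in the vocabulary size, while the full-rank model grows quadratically.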