https://github.com/srush/mrf-lm

Last synced: about 1 year ago
JSON representation
Host: GitHub
URL: https://github.com/srush/mrf-lm
Owner: srush
License: lgpl-3.0
Created: 2015-07-02T15:41:47.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2015-07-20T03:20:22.000Z (almost 11 years ago)
Last Synced: 2025-04-30T09:17:13.127Z (about 1 year ago)
Language: Shell
Size: 2.66 MB
Stars: 6
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # MRF-LM

## Fast Markov Random Field Language Models

[Documentation in progress]

An implementation of a fast variational inference algorithm for Markov

Random Field language models as well as other Markov sequence models.

This algorithm implemented in this project is described in the paper

    A Fast Variational Approach for Learning Markov Random Field Language Models

    Yacine Jernite, Alexander M. Rush, and David Sontag.

    Proceedings of ICML 2015.

Available [here](http://people.seas.harvard.edu/~srush/icml15.pdf).

## Building

To build the main C++ library, run

    bash build.sh

This will build liblbfgs (needed for optimization) as well as the main

executables. The package requires a C++ compiler with support for

OpenMP.

## Training a Language Model

The training procedure requires two steps.

First you construct a moments file from the text data of interest. We include the

standard Penn Treebank language modelling data set as an example. This data is located under `lm_data/` . To extract moments from this file run

    python Moments.py --K 2 --train lm_data/ptb.train.txt --valid lm_data/ptb.valid.txt

Next run the main `mrflm` executable providing the training moments, validation moments, and an output file for the model.

    ./mrflm --train=lm_data/ptb.train.txt_moments_K2.dat --valid=lm_data/ptb.valid.txt_moments_K2.dat --output=lm.model

This command will train a language model, compute validation

log-likelihood, and write the parameters out to `lm.model`. (These

parameter settings will correspond to Figure 6 in the paper.)

## MRF-LM

The main MRF executable has several options for controlling the

model used, training procedure, and the parameters of dual decomposition.

    usage: ./mrflm --train=string --valid=string --output=string [options] ...

    options:

      --train           Training moments file. (string)

      --valid           Validation moments file. (string)

      -o, --output      Output file to write model to. (string)

      -m, --model       Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters). (string [=LM])

      -D, --dims        Size of embedding for low-rank MRF. (int [=100])

      -c, --cores       Number of cores to use for OpenMP. (int [=20])

      -d, --dual-rate   Dual decomposition subgradient rate (\alpha_1). (double [=20])

      --dual-iter       Dual decomposition subgradient epochs to run. (int [=500])

      --mult-rate       Dual decomposition subgradient decay rate. (double [=0.5])

      --keep-deltas     Keep dual delta values to hot-start between training epochs. (bool [=0])

      -?, --help        print this message

There is a separate executable for testing the model after it is written.

    usage: ./mrflm_test --model-name=string --valid=string [options] ...

    options:

      -o, --model-name      Output file to write model to. (string)

      -m, --model           Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters), Tag (POS tagger). (string [=LM])

          --valid           Validation moments file. (string)

          --train           Training moments file. (string [=])

          --embeddings      File to write word-embeddings to. (string [=])

          --vocab           Word vocab file. (string [=])

          --tag-features    Features for the tagging model. (string [=])

          --tag-file        Tag test file. (string [=])

          --tag-vocab       Tag vocab file. (string [=])

      -c, --cores           Number of cores to use for OpenMP. (int [=20])

      -?, --help            print this message

## Training a Tagging Model

The tagging model can be trained in a very similar way. We assume that the data is in the CoNLL parsing format and under

the `tag_data` directory. To construct the moments run the following command

    python MomentsTag.py tag_data/ptb.train.txt tag_data/ptb.valid.txt tag_data/ptb.test.txt tag

Next run the main `mrflm` executable providing the moments, the tag features, and validation in data.

    ./mrflm --train=tag_data/ptb.train.txt.tag.counts  --valid=tag_data/ptb.valid.txt.tag.counts --output=tag.model --model=Tag --tag-features=tag_data/ptb.train.txt.tag.features --valid-tag=tag_data/ptb.valid.txt.tag.words

This command will train a tagging model, compute validation by running the Viterbi algorithm, and write the parameters out to `tag.model`.

## Advanced Usage

### Word Embeddings

Once a model is trained `mrflm_test` can be used to view the embeddings produced by the model.

    ./mrflm_test  --model-name=lm.model --embeddings embed --vocab lm_data/ptb.train.txt_vocab_K2.dat

This will output two files. The file `embed` will contain the word embedding vectors one per line. The

file `embed.nn` will contain the 10 nearest neighbors for each word in the vocabulary.

### Tagger

The tagging model can also be used after it is trained. To run the tagger on a data set (such as test), use the following command.

    ./mrflm_test --model-name=tag.model   --model Tag --tag-file=tag_data/ptb.test.txt.tag.words --tag-features=tag_data/ptb.train.txt.tag.features --vocab=tag_data/ptb.train.txt.tag.names --tag-vocab=tag_data/ptb.train.txt.tag.tagnames --train=tag_data/ptb.train.txt.tag.counts  --cores=1

### Moments File Format

The input to the main implementation is a file containing the moments of the lifted

MRF. The moments file assumes the lifted graph is star-shaped with the central variable

as index one. The format of the file is

    {N = # of samples}

    {L = # of variables}

    {# of states of variable 1} {# of states of variable 2} ...(l columns)

    {M1 = # of 2->1 pairs}

    {State in 2} {State in 1} {Counts}

    {State in 2} {State in 1} {Counts}

    ...(M1 rows)

    {M2 = # of 3->1 pairs}

    {State in 3} {State in 1} {Counts}

    {State in 3} {State in 1} {Counts}

    ...(M2 rows)

This file format is used for both language modelling and tagging.

### LM Moments File

Consider a language modelling setup.

For example, let's say we were building a language model

with the training corpus:

    the cat chased the mouse

If our model has context K = 2, then we transform the corpus to:

      the cat chased the mouse

After the transformation the number of samples is N = 7, the number of

variables is L = K+1 = 3, the vocabulary size/number of states is V=5, and

the dictionary is:

     1

    the 2

    cat 3

    chased 4

    mouse 5

The corresponding moments file would then look like:

    7

    3

    5 5 5

    7

    5 1 1

    1 1 1

    1 2 1

    2 3 1

    3 4 1

    4 2 1

    2 5 1

    7

    2 1 1

    5 1 1

    1 2 1

    1 3 1

    2 4 1

    3 2 1

    4 5 1

### Tagging Moments File

Now consider a tagging setup. Let's say we were building a tagging

model with the training corpus:

    the/D cat/N chased/V the/D mouse/N

If our model has context K=1, M=3 (roughly corresponding to Figure~7 in the paper) then we transform the corpus to:

    / the/D cat/N chased/V the/D mouse/N

After the transformation the number of samples is N = 6, the number of

lifted variables is L = M + K+1 = 5, the number of tag states is T=4 and V=5 as above,

and the tag dictionary is:

     1

    D 2

    N 3

    V 4

The corresponding moments file would then look like:

    6

    5

    4 5 5 5

    6

    3 1 1

    1 2 1

    2 3 1

    3 4 1

    4 2 1

    2 3 1

    ...

### Code Structure

The code is broken into three main classes

* `Train.h`; Generic L-BFGS training. Implements most of Algorithm 2.

* `Inference.h`; Lifted inference on a star-shaped MRF. Implements Algorithm 1.

* `Model.h`; Pairwise MRF parameters. Implements likelihood computation, gradient updates, and lifted structure.

The `Model.h` class is a full-rank MRF by default, but can be easily

extended to allow for alternative parameterization. See `LM.h` for the low-rank

language model with back-prop (Model 2 in the paper), and `Tag.h` for a feature

factorized part-of-speech tagging model.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/srush/mrf-lm

Awesome Lists containing this project

README