https://github.com/blei-lab/ctm-c
This implements variational inference for the correlated topic model.
- Host: GitHub
- URL: https://github.com/blei-lab/ctm-c
- Owner: blei-lab
- Created: 2014-10-05T11:59:47.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-10-05T12:09:40.000Z (over 10 years ago)
- Last Synced: 2024-08-03T18:21:06.523Z (5 months ago)
- Language: C
- Size: 422 KB
- Stars: 21
- Watchers: 25
- Forks: 7
- Open Issues: 0
Metadata Files:
- Readme: README
Awesome Lists containing this project
- awesome-topic-models - ctm-c - C implementation of the correlated topic model by David Blei (Research Implementations / Embedding based Topic Models)
README
---------------------------
Correlated Topic Model in C
---------------------------

David M. Blei and John D. Lafferty
blei[at]cs.princeton.edu

(C) Copyright 2007, David M. Blei and John D. Lafferty

This file is part of CTM-C.
CTM-C is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.

CTM-C is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA

----

This is a C-implementation of the correlated topic model (CTM) from
Blei and Lafferty (2007). This code requires the GSL library.

Any questions or comments about this code should be sent to the topic
models mailing list, which is a forum for discussing topic models in
general. To join, go to http://lists.cs.princeton.edu and click on
"topic-models." DO NOT EMAIL EITHER OF THE AUTHORS WITH QUESTIONS
ABOUT THIS CODE. ALL QUESTIONS WILL BE ANSWERED ON THE MAILING LIST.

------------------------------------------------------------------------
TABLE OF CONTENTS
A. COMPILING
B. DATA FORMAT
C. MODEL ESTIMATION
D. MODEL EXAMINATION
   1. output of estimation
   2. viewing the topics with ctm-topics.py
   3. using lasso-graph.r
E. POSTERIOR INFERENCE ON NEW DOCUMENTS
------------------------------------------------------------------------
A. COMPILING

Type "make" in a shell. Note: the Makefile currently points to the
(inefficient) GSL version of the BLAS. You will probably want to
point to the BLAS library on your machine.

------------------------------------------------------------------------
B. DATA FORMAT

Under the CTM, the words of each document are assumed exchangeable.
Thus, each document is succinctly represented as a sparse vector of
word counts. The data is a file where each line is of the form:

     [M] [term_1]:[count_1] [term_2]:[count_2] ... [term_M]:[count_M]

* [M] is the number of unique terms in the document.

* [term_i] is an integer that identifies a term in the vocabulary.

* [count_i] is how many times that term appeared in the document.
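
As an illustration of the format above, here is a minimal Python sketch
that turns tokenized documents into data lines and reads a line back.
The helper names (build_vocab, format_doc, parse_doc) are hypothetical
and not part of CTM-C, and numbering terms from 0 in order of first
appearance is an assumption; this README does not say which indexing
ctm expects.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance.

    Assumption: ids start at 0. The README does not state the indexing base.
    """
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def format_doc(tokens, vocab):
    """Return one data line: [M] [term_1]:[count_1] ... [term_M]:[count_M]."""
    counts = Counter(vocab[tok] for tok in tokens)
    pairs = " ".join(f"{term}:{count}" for term, count in sorted(counts.items()))
    return f"{len(counts)} {pairs}".rstrip()


def parse_doc(line):
    """Read a data line back into {term_id: count}, checking the leading [M]."""
    fields = line.split()
    counts = {}
    for field in fields[1:]:
        term, count = field.split(":")
        counts[int(term)] = int(count)
    if int(fields[0]) != len(counts):
        raise ValueError("leading [M] does not match the number of term:count pairs")
    return counts


docs = [["topic", "model", "topic"], ["model", "graph"]]
vocab = build_vocab(docs)
lines = [format_doc(tokens, vocab) for tokens in docs]
print("\n".join(lines))
```

Writing the joined lines to a file would give something usable as the
data file, provided the vocabulary numbering matches what ctm expects.
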
------------------------------------------------------------------------
C. MODEL ESTIMATION

The command to estimate a model is:

     ./ctm est <dataset> <# topics> <rand/seed/model> <dir> <settings>

For example:

     ./ctm est my-training-data.dat 10 seed CTM10 settings.txt

- <dataset> is the file described above in part B.

- <# topics> is the desired number of topics into which to decompose
  the documents.

- <rand/seed/model> indicates how to initialize EM: randomly, seeded,
  or from a partially fit model. If from a model, type the name of
  the model into the command line, rather than the word "model." For
  example, if your model was in the directory "CTM100" and had the
  prefix "010" then you'd type "CTM100/010" for the starting point of
  EM. (We recommend using "seed" to begin with.)

- <dir> is the directory in which to place the files associated with
  this run of variational EM. (See part D below.)

- <settings> is a settings file. For example, the settings.txt file
  is good for EM and looks like this:

em max iter 1000
var max iter 20
cg max iter -1
em convergence 1e-3
var convergence 1e-6
cg convergence 1e-6
lag 10
covariance estimate mle

The first item ("em max iter") is the maximum number of EM
iterations.

The second item ("var max iter") is the maximum number of variational
iterations, i.e., passes through each variational parameter (-1
indicates to iterate until the convergence criterion is met).

The third item ("cg max iter") is the maximum number of conjugate
gradient iterations in fitting the variational mean and variance per
document.

Items 4-6 are convergence criteria for EM, variational inference,
and conjugate gradient, respectively.

The 7th item ("lag") is the interval, in EM iterations, at which a
version of the model is saved (with "lag 10", every 10th iteration).
This is useful, for example, if you want to monitor how the model
changes from iteration to iteration.

The 8th item ("covariance estimate") is the technique used to estimate
the covariance. The choices are "mle" or "shrinkage."
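
If you want to generate variants of the settings file (for example to
compare "mle" against "shrinkage"), a small script can help. The
sketch below is not part of CTM-C: render_settings is a hypothetical
helper, and it assumes the file is one "name value" pair per line in
the order of the settings.txt listing above. The order is kept fixed
because this README does not say whether ctm reads the items by name
or by position.

```python
# Names, order, and default values copied from the settings.txt listing.
SETTINGS = [
    ("em max iter", "1000"),
    ("var max iter", "20"),
    ("cg max iter", "-1"),
    ("em convergence", "1e-3"),
    ("var convergence", "1e-6"),
    ("cg convergence", "1e-6"),
    ("lag", "10"),
    ("covariance estimate", "mle"),
]


def render_settings(overrides=None):
    """Return settings-file text with the given items overridden (hypothetical helper)."""
    overrides = {key: str(value) for key, value in (overrides or {}).items()}
    unknown = sorted(set(overrides) - {key for key, _ in SETTINGS})
    if unknown:
        raise KeyError(f"unknown settings: {unknown}")
    if overrides.get("covariance estimate", "mle") not in ("mle", "shrinkage"):
        raise ValueError('covariance estimate must be "mle" or "shrinkage"')
    # Emit every item, in the original order, one per line.
    return "".join(f"{key} {overrides.get(key, default)}\n" for key, default in SETTINGS)


text = render_settings({"lag": 5, "covariance estimate": "shrinkage"})
print(text, end="")
```

The resulting text can be written to a file and passed as the last
argument of "./ctm est".
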

Additional R code is provided in this directory to implement L1
regularization of the topic covariance matrix as described in Blei
and Lafferty (2007).

------------------------------------------------------------------------
D. MODEL EXAMINATION

1. Once EM has converged, the model directory will be populated with
several files that can be used to examine the resulting model fit, for
example to make topic graph figures or compute similarity between
documents.

All the files are stored in row major format. They can be read into R
with the command:

     x <- matrix(scan(FILENAME), byrow=T, nrow=NR, ncol=NC)

where FILENAME is the file, NR is the number of rows, and NC is the
number of columns.

Let K be the number of topics and V be the number of words in the
vocabulary. The files are as follows:

final-cov.dat, final-inv-cov.dat, final-log-det-inv-cov: These are
files corresponding to the (K-1) x (K-1) covariance matrix between
topics. Note that this code implements the logistic normal where
a (K-1)-dimensional Gaussian is mapped to the simplex over the K
topics. (This is slightly different from the treatment in the paper,
where a K-dimensional Gaussian is mapped to that simplex.)

final-mu.dat: This is the K-1 mean vector of the logistic normal
over topic proportions.

final-log-beta.dat: This is a K x V topic matrix. The ith row
contains the log probabilities of the words for the ith topic.
Combined with a vector of words in order, this can be used to
inspect the top N words from each topic.

final-lambda.dat: This is a D x K matrix (D is the number of
documents) of the variational mean parameter for each document's
topic proportions.

final-nu.dat: This is the D x K matrix of the variational variance
parameter for each document in the collection.

likelihood.dat: This is a record of the likelihood bound at each
iteration of EM. The columns are: likelihood bound, convergence
criterion, time in seconds of the iteration, average number of
variational iterations per document, the percentage of documents
that reached the variational convergence criterion.

2. The script in ctm-topics.py lists the top N words from each topic.
To use, you need the NumPy package installed. Execute

     python ctm-topics.py final-log-beta.dat vocab.dat 25

where vocab.dat is a file with one word per line ordered according to
the numbering in your data. This will print out the top 25 words from
each topic.

3. Finally, the file lasso-graph.r provides R code to build graphs of
topics using the lasso. Details are in the file.

------------------------------------------------------------------------
E. POSTERIOR INFERENCE ON NEW DOCUMENTS

To perform posterior inference on a set of documents with the same
vocabulary, run the command

     ./ctm inf <dataset> <model-prefix> <results-prefix> <settings>

For example:

     ./ctm inf holdout.dat CTM10/final CTM10/holdout inf-settings.txt

This will result in a number of files with the prefix <results-prefix>.
They are as follows:

- inf-lambda.dat, inf-nu.dat: as above (see part D).
- inf-ctm-lhood: the likelihood bound for each document
- inf-phi-sums: A D x K matrix of the sum of the phi variables for
each document. This gives an idea about how many words are
associated with each topic.
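
To get a rough per-document picture from the phi sums, one option is
to normalize each row so that it sums to one. The Python sketch below
is not part of CTM-C: read_matrix and topic_shares are hypothetical
helpers, the sample text is made-up data standing in for a real file,
and it assumes the file holds whitespace-separated numbers in the same
row major layout as the estimation output described in part D.

```python
def read_matrix(text, nrow, ncol):
    """Reshape whitespace-separated numbers into nrow rows of ncol values (row major)."""
    values = [float(tok) for tok in text.split()]
    if len(values) != nrow * ncol:
        raise ValueError(f"expected {nrow * ncol} values, found {len(values)}")
    return [values[r * ncol:(r + 1) * ncol] for r in range(nrow)]


def topic_shares(phi_sums):
    """Divide each row by its total so entries give per-document topic shares."""
    shares = []
    for row in phi_sums:
        total = sum(row)
        # An all-zero row (e.g. an empty document) is left as zeros.
        shares.append([x / total for x in row] if total > 0 else [0.0] * len(row))
    return shares


# Made-up stand-in for the contents of a phi-sums file: D = 2 documents, K = 3 topics.
sample = "6.0 3.0 1.0\n0.5 0.5 3.0\n"
phi_sums = read_matrix(sample, nrow=2, ncol=3)
shares = topic_shares(phi_sums)
for d, row in enumerate(shares):
    print(d, [round(x, 3) for x in row])
```

For a real run, the text would come from reading the phi-sums file
written under your results prefix, with nrow set to the number of
documents and ncol to the number of topics. The same read_matrix call
mirrors the R command in part D for the other matrix files.
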