{"id":13696617,"url":"https://github.com/blei-lab/ctm-c","last_synced_at":"2025-04-23T20:25:57.820Z","repository":{"id":21495903,"uuid":"24814821","full_name":"blei-lab/ctm-c","owner":"blei-lab","description":"This implements variational inference for the correlated topic model. ","archived":false,"fork":false,"pushed_at":"2014-10-05T12:09:40.000Z","size":432,"stargazers_count":21,"open_issues_count":0,"forks_count":7,"subscribers_count":24,"default_branch":"master","last_synced_at":"2025-03-30T03:11:45.245Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blei-lab.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-10-05T11:59:47.000Z","updated_at":"2022-07-19T15:39:31.000Z","dependencies_parsed_at":"2022-08-21T01:40:11.873Z","dependency_job_id":null,"html_url":"https://github.com/blei-lab/ctm-c","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fctm-c","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fctm-c/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fctm-c/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fctm-c/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blei-lab","download_url":"https://codeload.github.com/blei-lab/ctm-c/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250507799,"owners_count":21
442100,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T18:00:43.569Z","updated_at":"2025-04-23T20:25:57.798Z","avatar_url":"https://github.com/blei-lab.png","language":"C","funding_links":[],"categories":["Research Implementations"],"sub_categories":["Embedding based Topic Models"],"readme":"---------------------------\nCorrelated Topic Model in C\n---------------------------\n\nDavid M. Blei and John D. Lafferty\nblei[at]cs.princeton.edu\n\n(C) Copyright 2007, David M. Blei and John D. Lafferty\n\nThis file is part of CTM-C.\n\nCTM-C is free software; you can redistribute it and/or modify it under\nthe terms of the GNU General Public License as published by the Free\nSoftware Foundation; either version 2 of the License, or (at your\noption) any later version.\n\nCTM-C is distributed in the hope that it will be useful, but WITHOUT\nANY WARRANTY; without even the implied warranty of MERCHANTABILITY or\nFITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License\nfor more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program; if not, write to the Free Software\nFoundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307\nUSA\n\n----\n\nThis is a C-implementation of the correlated topic model (CTM) from\nBlei and Lafferty (2007).  This code requires the GSL library.\n\nAny questions or comments about this code should be sent to the topic\nmodels mailing list, which is a forum for discussing topic models in\ngeneral.  
To join, go to http://lists.cs.princeton.edu and click on\n\"topic-models.\"  DO NOT EMAIL EITHER OF THE AUTHORS WITH QUESTIONS\nABOUT THIS CODE.  ALL QUESTIONS WILL BE ANSWERED ON THE MAILING LIST.\n\n------------------------------------------------------------------------\n\nTABLE OF CONTENTS\n\nA. COMPILING\n\nB. DATA FORMAT\n\nC. MODEL ESTIMATION\n\nD. MODEL EXAMINATION\n   1. output of estimation\n   2. viewing the topics with ctm-topics.py\n   3. using lasso-graph.r\n\nE. POSTERIOR INFERENCE ON NEW DOCUMENTS\n\n------------------------------------------------------------------------\n\nA. COMPILING\n\nType \"make\" in a shell.  Note: the Makefile currently points to the\n(inefficient) GSL version of the BLAS.  You will probably want to\npoint to the BLAS library on your machine.\n\n------------------------------------------------------------------------\n\nB. DATA FORMAT\n\nUnder the CTM, the words of each document are assumed exchangeable.\nThus, each document is succinctly represented as a sparse vector of\nword counts. The data is a file where each line is of the form:\n\n     [M] [term_1]:[count_1] [term_2]:[count_2] ...  [term_N]:[count_N]\n\n* [M] is the number of unique terms in the document\n\n* [term_i] is an integer associated with the i-th term in the\n  vocabulary.\n\n* [count_i] is how many times the i-th term appeared in the document.\n\n------------------------------------------------------------------------\n\nC. MODEL ESTIMATION\n\nThe command to estimate a model is:\n\n./ctm est \u003cdataset\u003e \u003c# topics\u003e \u003crand/seed/model\u003e \u003cdir\u003e \u003csettings\u003e\n\nFor example:\n\n./ctm est my-training-data.dat 10 seed CTM10 settings.txt\n\n- \u003cdataset\u003e is the file described above in part B.\n\n- \u003c# topics\u003e is the desired number of topics into which to decompose\n  the documents.\n\n- \u003crand/seed/model\u003e indicates how to initialize EM: randomly, seeded,\n  or from a partially fit model.  
If from a model, type the name of\n  the model into the command line, rather than the word \"model.\"  For\n  example, if your model was in the directory \"CTM100\" and had the\n  prefix \"010\" then you'd type \"CTM100/010\" for the starting point of\n  EM.  (We recommend using \"seed\" to begin with.)\n\n- \u003cdir\u003e is the directory in which to place the files associated with\n  this run of variational EM.  (See part D below.)\n\n- \u003csettings\u003e is a settings file.  For example, the settings.txt file\n  is good for EM and looks like this:\n\n                  em max iter 1000\n                  var max iter 20\n                  cg max iter -1\n                  em convergence 1e-3\n                  var convergence 1e-6\n                  cg convergence 1e-6\n                  lag 10\n                  covariance estimate mle\n\n  The first item (\"em max iter\") is the maximum number of EM\n  iterations.\n\n  The second item (\"var max iter\") is the maximum number of\n  variational iterations, i.e., passes through each variational\n  parameter.  (-1 means iterate until the convergence criterion is\n  met.)\n\n  The third item (\"cg max iter\") is the maximum number of conjugate\n  gradient iterations in fitting the variational mean and variance per\n  document.\n\n  Items 4-6 are convergence criteria for EM, variational inference,\n  and conjugate gradient, respectively.\n\n  The 7th item (\"lag\") is the multiple of iterations of EM after which\n  to save a version of the model.  This is useful, for example, if you\n  want to monitor how the model changes from iteration to iteration.\n\n  The 8th item (\"covariance estimate\") is the technique used to\n  estimate the covariance.  
The choices are \"mle\" or \"shrinkage.\"\n  Additional R code is provided in this directory to implement L1\n  regularization of the topic covariance matrix as described in Blei\n  and Lafferty (2007).\n\n------------------------------------------------------------------------\n\nD. MODEL EXAMINATION\n\n1. Once EM has converged, the model directory will be populated with\nseveral files that can be used to examine the resulting model fit, for\nexample to make topic graph figures or compute similarity between\ndocuments.\n\nAll the files are stored in row major format.  They can be read into R\nwith the command:\n\n     x \u003c- matrix(scan(FILENAME), byrow=T, nrow=NR, ncol=NC),\n\nwhere FILENAME is the file, NR is the number of rows, and NC is the\nnumber of columns.\n\nLet K be the number of topics and V be the number of words in the\nvocabulary.  The files are as follows:\n\n    final-cov.dat, final-inv-cov.dat, final-log-det-inv-cov: These are\n    files corresponding to the (K-1) x (K-1) covariance matrix between\n    topics.  Note that this code implements the logistic normal where\n    a (K-1)-dimensional Gaussian is mapped to the K-1 simplex.  (This\n    is slightly different from the treatment in the paper, where a\n    K-dimensional Gaussian is mapped to the K-1 simplex.)\n\n    final-mu.dat: This is the K-1 mean vector of the logistic normal\n    over topic proportions.\n\n    final-log-beta.dat: This is a K x V topic matrix.  The ith row\n    contains the log probabilities of the words for the ith topic.\n    Combined with a vector of words in order, this can be used to\n    inspect the top N words from each topic.\n\n    final-lambda.dat: This is a D x K matrix of the variational mean\n    parameter for each document's topic proportions.\n\n    final-nu.dat: This is the D x K matrix of the variational variance\n    parameter for each document in the collection.\n\n    likelihood.dat: This is a record of the likelihood bound at each\n    iteration of EM.  
The columns are: likelihood bound, convergence\n    criterion, time in seconds of the iteration, average number of\n    variational iterations per document, the percentage of documents\n    that reached the variational convergence criterion.\n\n2. The script in ctm-topics.py lists the top N words from each topic.\nTo use, you need the NumPy package installed.  Execute\n\n     python ctm-topics.py final-log-beta.dat vocab.dat 25\n\nwhere vocab.dat is a file with one word per line ordered according to\nthe numbering in your data.  This will print out the top 25 words from\neach topic.\n\n3. Finally, the file lasso-graph.r provides R code to build graphs of\ntopics using the lasso.  Details are in the file.\n\n------------------------------------------------------------------------\n\nE. POSTERIOR INFERENCE ON NEW DOCUMENTS\n\nTo perform posterior inference on a set of documents with the same\nvocabulary, run the command\n\n./ctm inf \u003cdataset\u003e \u003cmodel-prefix\u003e \u003cresults-prefix\u003e \u003csettings\u003e\n\nFor example:\n\n./ctm inf holdout.dat CTM10/final CTM10/holdout inf-settings.txt\n\nThis will result in a number of files with prefix \"results-prefix.\"\nThey are as follows:\n\n- inf-lambda.dat, inf-nu.dat: as above.\n\n- inf-ctm-lhood: the likelihood bound for each document\n\n- inf-phi-sums: A D x K matrix of the sum of the phi variables for\n  each document.  This gives an idea about how many words are\n  associated with each topic.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblei-lab%2Fctm-c","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblei-lab%2Fctm-c","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblei-lab%2Fctm-c/lists"}