{"id":13696892,"url":"https://github.com/anthonylife/discLDA","last_synced_at":"2025-05-03T17:32:30.262Z","repository":{"id":6016622,"uuid":"7240111","full_name":"anthonylife/discLDA","owner":"anthonylife","description":"discriminative LDA","archived":false,"fork":false,"pushed_at":"2012-12-28T11:03:27.000Z","size":404,"stargazers_count":3,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-03T18:21:23.174Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anthonylife.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-12-19T12:17:25.000Z","updated_at":"2014-07-31T14:18:12.000Z","dependencies_parsed_at":"2022-09-22T06:11:03.368Z","dependency_job_id":null,"html_url":"https://github.com/anthonylife/discLDA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anthonylife%2FdiscLDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anthonylife%2FdiscLDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anthonylife%2FdiscLDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anthonylife%2FdiscLDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anthonylife","download_url":"https://codeload.github.com/anthonylife/discLDA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224369812,"owners_count":1729996
1,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T18:00:49.251Z","updated_at":"2024-11-13T00:31:09.553Z","avatar_url":"https://github.com/anthonylife.png","language":"C++","funding_links":[],"categories":["Models"],"sub_categories":["Miscellaneous topic models"],"readme":"****************************************************************************\n\n                              GibbsLDA++\n       A C/C++ Implementation of Latent Dirichlet Allocation (LDA) \n       using Gibbs Sampling for Parameter Estimation and Inference\n\n                   http://gibbslda.sourceforge.net/\n\n                         Copyright (C) 2007 by\n                            Xuan-Hieu Phan\n            hieuxuan@ecei.tohoku.ac.jp or pxhieu@gmail.com\n                Graduate School of Information Sciences\n                          Tohoku University\n\n****************************************************************************\n\n\n                          TABLE OF CONTENTS\n\n1. Introduction\n   1.1. Description\n   1.2. News, Comments, and Bug Reports.\n   1.3. License   \n\n2. Compile GibbsLDA++\n   2.1. Download\n   2.2. Compiling\n\n3. How to Use GibbsLDA++\n   3.1. Command Line \u0026 Input Parameters\n      3.1.1. Parameter Estimation from Scratch\n      3.1.2. Parameter Estimation from a Previously Estimated Model\n      3.1.3. Inference for Previously Unseen (New) Data\n   3.2 Input Data Format\n   3.3. Outputs\n      3.3.1. Outputs of Gibbs Sampling Estimation of GibbsLDA++\n      3.3.2. 
Outputs of Gibbs Sampling Inference for Previously Unseen Data\n   3.4. Case Study\n\n4. Links, Acknowledgements, and References\n\n****************************************************************************\n\n\n1. Introduction\n\n\n  1.1. Description\n\n  GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) \n  using the Gibbs Sampling technique for parameter estimation and inference. It is \n  very fast and is designed to analyze hidden/latent topic structures of \n  large-scale datasets including very large collections of text/Web documents.\n\n  LDA was first introduced by David Blei et al. [Blei03]. There have been\n  several implementations of this model in C (using Variational Methods), \n  Java, and Matlab. We decided to release this implementation of LDA in C/C++ \n  using Gibbs Sampling to provide an alternative choice to the topic-model \n  community.\n\n  The release of GibbsLDA++ is useful for the following (potential) application\n  areas:\n\n    + Information Retrieval (analyzing semantic/latent topic/concept structures\n      of large text collections for more intelligent information searching).\n    + Document Classification/Clustering, Document Summarization, and Text/Web\n      Data Mining in general.\n    + Collaborative Filtering\n    + Content-based Image Clustering, Object Recognition, and other applications\n      of Computer Vision in general.\n    + Other potential applications in biological data.\n\n\n  1.2. News, Comments, and Bug Reports.\n\n  We highly appreciate any suggestions, comments, and bug reports.\n\n\n  1.3. 
License\n\n  GibbsLDA++ is a free software; you can redistribute it and/or modify it under \n  the terms of the GNU General Public License as published by the Free Software\n  Foundation; either version 2 of the License, or (at your option) any later \n  version.\n\n  GibbsLDA++ is distributed in the hope that it will be useful, but WITHOUT ANY \n  WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR \n  A PARTICULAR PURPOSE. See the GNU General Public License for more details.\n\n  You should have received a copy of the GNU General Public License along with\n  GibbsLDA++; if not, write to the Free Software Foundation, Inc., 59 Temple \n  Place, Suite 330, Boston, MA 02111-1307 USA.\n\n\n2. Compile GibbsLDA++\n\n\n  2.1. Download\n\n  You can find and download document, source code, and case studies of \n  GibbsLDA++ at:\n\n  http://gibbslda.sourceforge.net/\n  http://sourceforge.net/projects/gibbslda\n\n\n  2.2. Compiling \n\n  On Unix/Linux/Cygwin/MinGW environments:\n\n  + System requirements:\n\n    - A C/C++ compiler and the STL library. In the  Makefile, we use  g++ as \n    the default compiler command, if the C/C++ compiler on your system has \n    another name (e.g.,  cc, cpp, CC, CPP, etc.), you can modify the CC variable \n    in the Makefile in order to use make utility smoothly.\n    \n    - The computational time of GibbsLDA++ much depends on the size of input \n    data, the CPU speed, and the memory size. If your dataset is quite large \n    (e.g., larger than 100,000 documents or so), it is better to train GibbsLDA++ \n    on a minimum of 2GHz CPU, 1Gb RAM system. \n\n  + Untar and unzip GibbsLDA++:\n\n    $ gunzip GibbsLDA++.tar.gz\n    $ tar -xf GibbsLDA++.tar\n\n  + Go to home directory of GibbsLDA++ (i.e., GibbsLDA++ directory), type:\n\n    $ make clean\n    $ make all\n\n\n3. How to Use GibbsLDA++\n\n\n  3.1. 
Command Line \u0026 Input Parameters\n\n  After compiling GibbsLDA++, we have the \"lda\" executable file in the \"GibbsLDA++/src\" \n  directory. We use this for parameter estimation and inference for new data.\n\n\n  3.1.1. Parameter Estimation from Scratch\n\n    $ lda -est [-alpha \u003cdouble\u003e] [-beta \u003cdouble\u003e] [-ntopics \u003cint\u003e] \\\n      [-niters \u003cint\u003e] [-savestep \u003cint\u003e] [-twords \u003cint\u003e] -dfile \u003cstring\u003e\n    \n    in which (parameters in [] are optional):\n\n    -est: \n        ESTimate the LDA model from scratch\n\n    -alpha \u003cdouble\u003e: \n        The value of alpha, a hyper-parameter of LDA. The default value\n        of alpha is 50 / K (K is the number of topics). See [Griffiths04] \n        for a detailed discussion of choosing alpha and beta values.\n\n    -beta \u003cdouble\u003e:\n        The value of beta, also a hyper-parameter of LDA. Its default value\n        is 0.1.\n\n    -ntopics \u003cint\u003e:\n        The number of topics. Its default value is 100. This depends on the \n        input dataset. See [Griffiths04] and [Blei03] for a more careful \n        discussion of selecting the number of topics.\n\n    -niters \u003cint\u003e:\n        The number of Gibbs sampling iterations. The default value is 2000.\n\n    -savestep \u003cint\u003e:\n        The step (counted by the number of Gibbs sampling iterations) at which\n        the LDA model is saved to hard disk. The default value is 200.\n\n    -twords \u003cint\u003e:\n        The number of most likely words for each topic. The default value is zero.\n        If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++\n        will print out the list of the top 20 most likely words for each topic each \n        time it saves the model to hard disk according to the parameter \"savestep\" \n        above.\n\n    -dfile \u003cstring\u003e:\n        The input training data file. 
See Section 3.2 for a description of \n        input data format.\n\n\n  3.1.2. Parameter Estimation from a Previously Estimated Model\n \n    $ lda -estc -dir \u003cstring\u003e -model \u003cstring\u003e [-niters \u003cint\u003e] [-savestep \u003cint\u003e] \\\n      [-twords \u003cint\u003e]\n\n    in which (parameters in [] are optional):\n\n    -estc:\n        Continue to ESTimate the model from a previously estimated model.\n\n    -dir \u003cstring\u003e:\n        The directory containing the previously estimated model\n\n    -model \u003cstring\u003e:\n        The name of the previously estimated model. See Section 3.3 to know \n        the way GibbsLDA++ saves outputs on hard disk.\n\n    -niters \u003cint\u003e:\n        The number of Gibbs sampling iterations to continue estimating. The \n        default value is 2000.\n\n    -savestep \u003cint\u003e:\n        The step (counted by the number of Gibbs sampling iterations) at which\n        the LDA model is saved to hard disk. The default value is 200.\n\n    -twords \u003cint\u003e:\n        The number of most likely words for each topic. The default value is zero.\n        If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++\n        will print out the list of the top 20 most likely words for each topic each \n        time it saves the model to hard disk according to the parameter \"savestep\" \n        above.\n\n\n  3.1.3. Inference for Previously Unseen (New) Data\n\n    $ lda -inf -dir \u003cstring\u003e -model \u003cstring\u003e [-niters \u003cint\u003e] [-twords \u003cint\u003e] \\\n      -dfile \u003cstring\u003e\n\n    in which (parameters in [] are optional):\n\n    -inf: \n        Do INFerence for previously unseen (new) data using a previously estimated\n        LDA model.        \n\n    -dir \u003cstring\u003e:\n        The directory containing the previously estimated model\n\n    -model \u003cstring\u003e:\n        The name of the previously estimated model. 
See Section 3.3 to know \n        the way GibbsLDA++ saves outputs on hard disk.\n\n    -niters \u003cint\u003e:\n        The number of Gibbs sampling iterations for inference. The default value \n        is 20.\n\n    -twords \u003cint\u003e:\n        The number of most likely words for each topic of the new data. The \n        default value is zero. If you set this parameter to a value larger than zero,\n        e.g., 20, GibbsLDA++ will print out the list of the top 20 most likely words \n        for each topic after inference.\n\n    -dfile \u003cstring\u003e:\n        The file containing new data. See Section 3.2 for a description of input \n        data format.\n\n\n  3.2. Input Data Format\n\n  Both data for training/estimating the model and new data (i.e., previously \n  unseen data) have the same format as follows:\n\n    [M]\n    [document_1]\n    [document_2]\n    ...\n    [document_M]\n\n  in which the first line is the total number of documents [M]. Each line \n  after that is one document. [document_i] is the i^th document of the dataset \n  that consists of a list of Ni words/terms.\n\n    [document_i] = [word_i1] [word_i2] ... [word_iNi]\n\n  in which all [word_ij] (i=1..M, j=1..Ni) are text strings and they are \n  separated by the space character.\n\n  Note that the terms document and word here are abstract and should not only be \n  understood as normal text documents. This is because LDA can be used to discover \n  the underlying topic structures of any kind of discrete data. Therefore, \n  GibbsLDA++ is not limited to text and natural language processing but can \n  also be applied to other kinds of data like images and biological sequences. \n  Also, keep in mind that for text/Web data collections, we should first preprocess \n  the data (e.g., removing stop words and rare words, stemming, etc.) before \n  estimating with GibbsLDA++.\n\n\n  3.3. Outputs\n\n\n  3.3.1. 
Outputs of Gibbs Sampling Estimation of GibbsLDA++\n\n  Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:\n\n    \u003cmodel_name\u003e.others\n    \u003cmodel_name\u003e.phi  \n    \u003cmodel_name\u003e.theta\n    \u003cmodel_name\u003e.tassign\n    \u003cmodel_name\u003e.twords\n\n  in which:\n\n    + \u003cmodel_name\u003e:\n       is the name of an LDA model corresponding to the time step it was saved \n       on the hard disk. For example, the name of the model saved at the 400th \n       Gibbs sampling iteration will be \"model-00400\". Similarly, the model \n       saved at the 1200th iteration is \"model-01200\". The model name of the last\n       Gibbs sampling iteration is \"model-final\".\n\n    + \u003cmodel_name\u003e.others: \n       This file contains some parameters of the LDA model, such as:\n          alpha=?\n          beta=?\n          ntopics=? # i.e., the number of topics\n          ndocs=? # i.e., the number of documents\n          nwords=? # i.e., the vocabulary size\n          liter=? # i.e., the Gibbs sampling iteration at which the model was saved\n\n    + \u003cmodel_name\u003e.phi:\n       This file contains the word-topic distributions, \n       i.e., p(word_w | topic_t). Each line is a topic and each column is a word in \n       the vocabulary.\n\n    + \u003cmodel_name\u003e.theta:\n       This file contains the topic-document distributions, \n       i.e., p(topic_t | document_m). Each line is a document and each column is \n       a topic.\n       \n    + \u003cmodel_name\u003e.tassign:\n       This file contains the topic assignments for words in training data. Each \n       line is a document that consists of a list of \u003cword_ij\u003e:\u003ctopic of word_ij\u003e\n\n    + \u003cmodel_name\u003e.twords:\n       This file contains \u003ctwords\u003e most likely words of each topic. 
\u003ctwords\u003e is \n       specified in the command line (see Sections 3.1.1 and 3.1.2).\n\n  GibbsLDA++ also saves a file called \"wordmap.txt\" that contains the mapping between\n  words and their integer IDs. This is because GibbsLDA++ works internally with \n  integer IDs of words/terms instead of text strings.\n\n\n  3.3.2. Outputs of Gibbs Sampling Inference for Previously Unseen Data\n\n  The outputs of GibbsLDA++ inference are almost the same as those of the estimation\n  process except that the contents of those files are of the new data. The \n  \u003cmodel_name\u003e is exactly the same as the filename of the input (new) data.\n\n\n  3.4. Case Study\n\n  For example, we want to estimate an LDA model for a collection of documents stored\n  in a file called \"models/casestudy/trndocs.dat\" and then use that model to do \n  inference for new data stored in the file \"models/casestudy/newdocs.dat\".\n\n  We want to estimate for 100 topics with alpha = 0.5 and beta = 0.1. We want to \n  perform 1000 Gibbs sampling iterations, save a model at every 100 iterations, and\n  each time a model is saved, print out the list of 20 most likely words for each \n  topic. 
Supposing that we are now at the home directory of GibbsLDA++, we will \n  execute the following command to estimate the LDA model from scratch:\n\n     $ src/lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 \\\n       -twords 20 -dfile models/casestudy/trndocs.dat\n\n  Now look into the \"models/casestudy\" directory to see the outputs as described\n  in Section 3.3.1.\n\n  Now, we want to continue to perform another 800 Gibbs sampling iterations from the \n  previously estimated model \"model-01000\" with savestep=100 and twords=30. We run \n  the following command:\n\n     $ src/lda -estc -dir models/casestudy/ -model model-01000 -niters 800 \\\n       -savestep 100 -twords 30\n\n  Now, look into the casestudy directory to see the outputs.\n\n  Now, if we want to do inference (30 Gibbs sampling iterations) for the new data \n  \"newdocs.dat\" (note that the new data file is stored in the same directory as \n  the LDA models) using one of the previously estimated LDA models, for example \n  \"model-01800\", we run the following command:\n\n     $ src/lda -inf -dir models/casestudy/ -model model-01800 -niters 30 \\\n       -twords 20 -dfile newdocs.dat\n\n  Now, look into the casestudy directory to see the outputs of the inference:\n     + newdocs.dat.others\n     + newdocs.dat.phi\n     + newdocs.dat.tassign\n     + newdocs.dat.theta\n     + newdocs.dat.twords\n\n\n4. Links, Acknowledgements, and References \n\n\n  4.1. Links\n\n  Here are some pointers to other implementations of LDA\n  \n  - LDA-C (Variational Methods): \n    http://www.cs.princeton.edu/~blei/lda-c/index.html\n\n  - Matlab Topic Modeling: \n    http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm\n\n  - Java version of LDA-C and a short Java version of Gibbs Sampling for LDA: \n    http://www.arbylon.net/projects/\n\n  - LDA package (using Variational Methods, including C and Matlab):\n    http://chasen.org/~daiti-m/dist/lda/\n\n\n  4.2. 
References\n\n  - [Andrieu03] C. Andrieu, N.D. Freitas, A. Doucet, and M. Jordan: \"An \n    introduction to MCMC for machine learning\", Machine Learning (2003).\n\n  - [Blei03] D. Blei, A. Ng, and M. Jordan: \"Latent Dirichlet Allocation\",\n    JMLR (2003).\n\n  - [Blei07] D. Blei and J. Lafferty: \"A correlated topic model of Science\",\n    The Annals of Applied Statistics (2007).\n\n  - [Griffiths] T. Griffiths: \"Gibbs sampling in the generative model of\n    Latent Dirichlet Allocation\", Technical Report. \n    http://citeseer.ist.psu.edu/613963.html\n\n  - [Griffiths04] T. Griffiths and M. Steyvers: \"Finding scientific topics\",\n    Proc. of the National Academy of Sciences (2004).\n\n  - [Heinrich] G. Heinrich: \"Parameter estimation for text analysis\",\n    Technical Report, http://www.arbylon.net/publications/text-est.pdf\n\n  - [Hofmann99] T. Hofmann: \"Probabilistic latent semantic analysis\",\n    Proc. of UAI (1999).\n\n  - [Wei06] X. Wei and W.B. Croft: \"LDA-based document models for ad-hoc \n    retrieval\", Proc. of SIGIR (2006).\n\n\n  4.3. Acknowledgements\n\n  Our code is based on the Java code of Gregor Heinrich\n  (http://www.arbylon.net/projects/LdaGibbsSampler.java) and the theoretical\n  description of Gibbs Sampling for LDA in [Heinrich]. I would like to thank\n  Heinrich for sharing the code and a comprehensive technical report.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanthonylife%2FdiscLDA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanthonylife%2FdiscLDA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanthonylife%2FdiscLDA/lists"}