https://github.com/blei-lab/diln
This implements the discrete infinite logistic normal, a Bayesian nonparametric topic model that finds correlated topics.
- Host: GitHub
- URL: https://github.com/blei-lab/diln
- Owner: blei-lab
- License: lgpl-2.1
- Created: 2014-10-09T18:26:31.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2014-10-09T18:27:42.000Z (about 10 years ago)
- Last Synced: 2024-08-03T18:21:50.763Z (4 months ago)
- Language: C
- Size: 129 KB
- Stars: 6
- Watchers: 31
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.txt
- License: license.txt
Awesome Lists containing this project
- awesome-topic-models - diln - C implementation of Discrete Infinite Logistic Normal (with HDP option) by John Paisley (Research Implementations / Embedding based Topic Models)
README
-----------------------------------------------------------------------
The Discrete Infinite Logistic Normal (with HDP option) in C
-----------------------------------------------------------------------

(C) Copyright 2010, John Paisley, Chong Wang and David Blei
Written by John Paisley, [email protected].
This file is part of DILN-C
DILN-C is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or
(at your option) any later version.

DILN-C is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

-----------------------------------------------------------------------
This is a C implementation of the discrete infinite logistic normal (DILN)
for topic modeling. Variational Bayes is used for inference.

The hierarchical Dirichlet process (HDP) is also a model option.
In both model priors, the top-level is represented as a stick-breaking
Dirichlet process, and each second-level probability distribution is
represented as the normalization of a sequence of gamma random variables.

This code requires the GSL, http://www.gnu.org/software/gsl/
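
The two constructions named above can be sketched in a few lines of C.
This is an illustration only, not code taken from DILN-C: the function
names stick_breaking_weights and normalize_positive are hypothetical, and
a finite truncation to K topics is assumed. The first function turns
stick-breaking proportions into weights, p[k] = v[k] * prod_{j<k} (1 - v[j]);
the second normalizes a sequence of positive values (such as gamma draws)
into a probability vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of DILN-C).
 * Converts stick-breaking proportions v[0..K-1], each in [0,1], into
 * weights p[k] = v[k] * prod_{j<k} (1 - v[j]).
 * Returns the stick mass left over after K breaks. */
double stick_breaking_weights(const double *v, size_t K, double *p)
{
    double remaining = 1.0;              /* length of the unbroken stick */
    assert(v != NULL && p != NULL);
    for (size_t k = 0; k < K; k++) {
        p[k] = v[k] * remaining;         /* break off a fraction v[k]    */
        remaining *= 1.0 - v[k];         /* what is left for later k     */
    }
    return remaining;
}

/* Hypothetical helper (not part of DILN-C).
 * Normalizes positive values z[0..K-1] (e.g. gamma random variables)
 * so that theta sums to one. Returns 0 on success, -1 if the sum of z
 * is not positive. */
int normalize_positive(const double *z, size_t K, double *theta)
{
    double sum = 0.0;
    assert(z != NULL && theta != NULL);
    for (size_t k = 0; k < K; k++)
        sum += z[k];
    if (!(sum > 0.0))
        return -1;
    for (size_t k = 0; k < K; k++)
        theta[k] = z[k] / sum;
    return 0;
}
```

With a finite truncation the stick-breaking weights do not sum to one on
their own; the returned leftover mass makes that explicit.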
-----------------------------------------------------------------------
TABLE OF CONTENTS
A. COMPILING
B. DATA FORMAT
C. TRAINING ON A CORPUS
D. OUTPUT
E. FILES INCLUDED
-----------------------------------------------------------------------
A. COMPILING ***********************************************************
Type "make" in a shell. You will need to change the Makefile to
point to the GSL on your machine.

B. DATA FORMAT ********************************************************
This code uses the same data format as in CTM-C by David M. Blei.
A data file contains an entire corpus for training. Each line of a
data file represents a document as follows:

[M] [term_1]:[count_1] [term_2]:[count_2] ... [term_N]:[count_N]

[M]: The number of unique terms in the document
[term_i]: An integer associated with the i-th term in a vocabulary.
[count_i]: The number of times that the i-th term appears in the document.
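
As a rough illustration of the format, the sketch below parses a single
document line into term and count arrays. It is not the reader used in
importData.c; the function name parse_doc_line and the caller-provided
buffers are assumptions made for this example. It rejects lines with
fewer pairs than [M] announces and lines containing a non-positive count.

```c
#include <stdio.h>

/* Hypothetical helper (not part of DILN-C).
 * Parses "[M] [term_1]:[count_1] ... [term_M]:[count_M]" from `line`
 * into terms[] and counts[], which must hold at least max_terms entries.
 * Returns M on success, or -1 if the line is malformed, M exceeds
 * max_terms, or a count is not greater than zero. */
int parse_doc_line(const char *line, int *terms, int *counts, int max_terms)
{
    int m = 0, consumed = 0;

    /* Leading field: number of unique terms in the document. */
    if (sscanf(line, "%d%n", &m, &consumed) != 1 || m < 0 || m > max_terms)
        return -1;

    const char *p = line + consumed;
    for (int i = 0; i < m; i++) {
        int term = 0, count = 0, n = 0;
        /* Each pair is "term:count", separated from the next by a space. */
        if (sscanf(p, " %d:%d%n", &term, &count, &n) != 2 || count <= 0)
            return -1;
        terms[i] = term;
        counts[i] = count;
        p += n;                          /* advance past the parsed pair */
    }
    return m;
}
```

For example, the line "3 0:2 5:1 17:4" describes a document with three
unique terms: term 0 twice, term 5 once, and term 17 four times.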
Notes: [count_i] [term_i+1] are separated by a space. Only terms with
counts greater than zero should be included.

C. TRAINING ON A CORPUS ************************************************
Below is a list of inputs to DILNtm.exe
Command Line: DILNtm.exe argv[1] argv[2] argv[3] argv[4] argv[5] (optional)
argv[1] : corpus file
argv[2] : number of topics (must be > 2)
argv[3] : method (1 = DILN, 2 = HDP)
argv[4] : if argv[4] is an integer -> number of iterations
          if 0 < argv[4] < 1 -> error threshold (fractional change in bound)
argv[5] : Dirichlet base concentration parameter
          default = 0.5*|Vocab| -> Dir(0.5,...,0.5)

We currently do not provide the ability to do testing.
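
The dual meaning of argv[4] and the default for argv[5] can be sketched
as follows. This is not the parsing code in main.c: the type stop_rule
and the functions parse_stop_rule, fractional_change and
default_base_concentration are hypothetical names, and the exact formula
DILN-C uses for the fractional change in the bound is an assumption here.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical type (not part of DILN-C): exactly one field is set. */
typedef struct {
    int    max_iterations;   /* > 0 when the argument was an integer     */
    double bound_threshold;  /* > 0 when the argument was in (0, 1)      */
} stop_rule;

/* Interprets an argv[4]-style string. Returns 0 on success, -1 if the
 * string is neither a positive integer nor a fraction in (0, 1). */
int parse_stop_rule(const char *arg, stop_rule *out)
{
    char *end = NULL;
    double x = strtod(arg, &end);

    if (end == arg || *end != '\0')
        return -1;                       /* not a number at all          */

    out->max_iterations = 0;
    out->bound_threshold = 0.0;

    if (x > 0.0 && x < 1.0) {            /* fraction: error threshold    */
        out->bound_threshold = x;
        return 0;
    }
    if (x >= 1.0 && x <= (double)INT_MAX && x == (double)(long)x) {
        out->max_iterations = (int)x;    /* integer: iteration count     */
        return 0;
    }
    return -1;
}

/* Assumed definition: |(curr - prev) / prev|; returns 0 if prev is 0. */
double fractional_change(double prev_bound, double curr_bound)
{
    if (prev_bound == 0.0)
        return 0.0;
    double d = (curr_bound - prev_bound) / prev_bound;
    return d < 0.0 ? -d : d;
}

/* Default for argv[5] as stated above: 0.5 * |Vocab|. */
double default_base_concentration(int vocab_size)
{
    return 0.5 * (double)vocab_size;
}
```

Under these assumptions "200" would request 200 iterations, while "0.001"
would stop once the bound changes by less than 0.1% between iterations.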
D. OUTPUT **************************************************************
The code writes parameter values to individual csv files, one per parameter
(output files are named [name].txt). The output parameters are listed below.
(*) indicates that a parameter is not output for HDP.

--- Below, each column is a document and each row is a topic ---
A: matrix of posterior gamma parameters (first parameter)
B: matrix of posterior gamma parameters (second parameter)
*mu: matrix of log-normal vector posterior means (doc specific)
*sig: matrix of log-normal vector posterior variances (doc specific)
--------------------------------------------------------
*u: posterior mean of log-normal vectors
*Kern: posterior covariance matrix (kernel) for log-normal vectors
V: top-level stick-breaking proportions
Gam: posterior of topics. each row is a topic. each col is a word
Lbound: lower bound as a function of iteration
alpha: top-level scaling parameter
beta: second-level scaling parameter

E. FILES INCLUDED *******************************************************
main.c
DILNfunctions.c (.h) : functions specific to DILN (HDP) inference
gsl_wrapper.c (.h) : wrapper functions to interact with the gsl
importData.c (.h) : functions for importing (and exporting) data
settings.txt : Contains additional initializations and settings not input
in the command line. The default values are:

alpha_init = 20 (top-level scaling parameter initialization)
beta_init = 5 (second-level scaling parameter initialization)
bool_learn_alpha = 1 (a boolean indicating whether to learn alpha)
bool_learn_beta = 0 (a boolean indicating whether to learn beta)
Kmeans_iterations = 1 (number of Kmeans iterations for initialization)

Makefile : should be changed to point to the GSL on your machine
README.txt : this file
license.txt : gnu license