# Topic Models for Julia

Topic models are Bayesian, hierarchical mixture models of discrete data.
This package implements utilities for reading and manipulating data commonly
associated with topic models as well as inference and prediction procedures
for such models.

## Model description

The bulk of the package is designed for a particular topic model, Latent
Dirichlet Allocation (LDA, Blei et al., 2003). This model assumes a corpus
composed of a collection of bags of words; each bag of words is termed a
document. The space whence the words are drawn is termed the lexicon.

Formally, the model is defined as

```
For each topic k:
    phi_k ~ Dirichlet(beta)
For each document d:
    theta_d ~ Dirichlet(alpha)
    For each word w in document d:
        z ~ Multinomial(theta_d)
        w ~ Multinomial(phi_z)
```
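
As an illustration of this generative process (not part of the package), one
could sample a synthetic document using the Distributions.jl package. Here
`alpha` is a length-K vector of Dirichlet weights and `phi` is a hypothetical
K-by-V matrix whose rows are topic-word distributions:

```
using Distributions

# Illustrative sketch: draw one synthetic document of N word tokens
# from the LDA generative process described above.
function generateDocument(alpha, phi, N)
    theta = rand(Dirichlet(alpha))   # per-document topic proportions
    words = Int[]
    for _ in 1:N
        z = rand(Categorical(theta))                # topic assignment
        push!(words, rand(Categorical(phi[z, :])))  # word index into the lexicon
    end
    return words
end
```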

alpha and beta are hyperparameters of the model. The number of topics, K,
is a fixed parameter of the model, and w is observed. This package fits
the topics using collapsed Gibbs sampling (Griffiths and Steyvers, 2004).
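
The core of collapsed Gibbs sampling is a per-token update that resamples each
word's topic assignment conditioned on all other assignments. Below is a
minimal sketch of that update, assuming hypothetical count arrays
(`docTopicCounts`, `topicWordCounts`, `topicTotals`) from which the current
token has already been decremented; this is not the package's internal code:

```
# Sketch of one collapsed Gibbs update for word w in document d.
# V is the lexicon size; alpha is a length-K vector, beta a scalar.
function sampleTopic(d, w, docTopicCounts, topicWordCounts, topicTotals,
                     alpha, beta, V)
    K = length(topicTotals)
    p = zeros(K)
    for k in 1:K
        p[k] = (docTopicCounts[d, k] + alpha[k]) *
               (topicWordCounts[k, w] + beta) / (topicTotals[k] + V * beta)
    end
    # Draw a topic index proportional to the unnormalized weights.
    r = rand() * sum(p)
    cumulative = 0.0
    for k in 1:K
        cumulative += p[k]
        cumulative >= r && return k
    end
    return K
end
```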

## Package usage

We describe the functions of the package using an example. First we load
corpora from data files as follows:

```
testDocuments = readDocuments(open("cora.documents"))
testLexicon = readLexicon(open("cora.lexicon"))
```

These read files in LDA-C format. The lexicon file is assumed to have one
word per line. The document file consists of one document per line. Each
document consists of a collection of tuples; the first element of each tuple
is the word and the second is the number of times that word appears in the
document. Words are indicated by a zero-based index into the lexicon file.
The tuples are separated by spaces, and the entire line is prefixed by a
number indicating the number of tuples for that document, as in the example
below.
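
For instance, a document containing three distinct words, where lexicon entry
0 appears twice, entry 5 appears once, and entry 9 appears four times, would
be written as:

```
3 0:2 5:1 9:4
```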

With the documents loaded, we instantiate a model that we want to train:

```
model = Model(fill(0.1, 10), 0.01, length(testLexicon), testDocuments)
```

This is a model with 10 topics. alpha is set to a uniform Dirichlet prior
with 0.1 weight on each topic (the dimension of this variable is used
to determine the number of topics). The second parameter indicates that
the prior weight on phi (i.e. beta) should be set to 0.01. The third
parameter is the lexicon size, which we take from the lexicon we just
read. The fourth parameter is the collection of documents.

With the model defined, we can train it on the corpus:

```
trainModel(testDocuments, model, 30)
```

The trainModel function takes the corpus as the first argument, the model
as the second argument, and the number of iterations of collapsed Gibbs
sampling to perform as the third argument. The contents of the model
are mutated in place.

Finally, we can examine the output of the trained model using topTopicWords:

```
topWords = topTopicWords(model, testLexicon, 10)
```

This function retrieves the top words associated with each topic; this
serves as a useful summary of the model. The first parameter is the model,
the second is the lexicon backing the corpus, and the third parameter
is the number of words to retrieve for each topic. The output is an array
of arrays, one per topic, each listing that topic's words in descending
order of prevalence.
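
Since the result is an array of arrays of lexicon words, as described above,
it can be displayed however you like; for example:

```
for (k, words) in enumerate(topWords)
    println("Topic $k: ", join(words, ", "))
end
```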

## See also
The R package whence much of this code was derived:
https://github.com/slycoder/R-lda-deprecated.