https://github.com/kymmt90/lda

Latent Dirichlet Allocation in Java 8

Host: GitHub
URL: https://github.com/kymmt90/lda
Owner: kymmt90
Created: 2015-02-11T04:39:24.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2015-03-15T03:18:35.000Z (over 10 years ago)
Last Synced: 2025-02-24T03:58:17.285Z (4 months ago)
Language: Java
Homepage:
Size: 1.26 MB
Stars: 0
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

README

LDA in Java 8
=============

Latent Dirichlet Allocation in Java 8.

Latent Dirichlet Allocation (LDA) [Blei+ 2003] is a basic probabilistic topic model.

Please see the following for more details:

- [Latent Dirichlet allocation - Wikipedia, the free encyclopedia](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Currently, this software supports [collapsed Gibbs sampling](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf) [Griffiths and Steyvers 2004] for model inference.
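As a rough illustration of collapsed Gibbs sampling for LDA, the sketch below resamples the topic of each token with probability proportional to `(n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)`, where the counts exclude the token being resampled. It is a minimal sketch assuming symmetric, fixed `alpha` and `beta` and 0-based vocabulary indices. The class `CollapsedGibbsSketch` and all of its members are hypothetical names chosen for this example; this is not the implementation in this repository.

```java
import java.util.Random;

// Illustrative sketch only; class and field names are hypothetical, not this repository's API.
class CollapsedGibbsSketch {
    final int numTopics, numVocabs;
    final double alpha, beta;   // symmetric Dirichlet hyperparameters (assumed fixed)
    final int[][] docs;         // docs[d][i] = 0-based vocabulary index of token i in document d
    final int[][] z;            // z[d][i] = topic currently assigned to that token
    final int[][] nDocTopic;    // number of tokens in document d assigned to topic k
    final int[][] nTopicVocab;  // number of tokens of vocabulary v assigned to topic k
    final int[] nTopic;         // number of tokens assigned to topic k
    final Random rng;

    CollapsedGibbsSketch(int[][] docs, int numVocabs, int numTopics,
                         double alpha, double beta, long seed) {
        this.docs = docs;
        this.numVocabs = numVocabs;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        z = new int[docs.length][];
        nDocTopic = new int[docs.length][numTopics];
        nTopicVocab = new int[numTopics][numVocabs];
        nTopic = new int[numTopics];
        // Start from uniformly random topic assignments.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                count(d, docs[d][i], z[d][i], +1);
            }
        }
    }

    // Adds delta to every count affected by assigning topic k to vocabulary v in document d.
    private void count(int d, int v, int k, int delta) {
        nDocTopic[d][k] += delta;
        nTopicVocab[k][v] += delta;
        nTopic[k] += delta;
    }

    /** One pass over every token, resampling its topic from the collapsed conditional. */
    void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int v = docs[d][i];
                count(d, v, z[d][i], -1); // exclude the current token from the counts
                double sum = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = (nDocTopic[d][k] + alpha) * (nTopicVocab[k][v] + beta)
                            / (nTopic[k] + numVocabs * beta);
                    sum += p[k];
                }
                // Draw a topic with probability proportional to p[k].
                double u = rng.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[d][i] = k;
                count(d, v, k, +1);
            }
        }
    }

    /** Point estimate of the probability of vocabulary v under topic k. */
    double phi(int k, int v) {
        return (nTopicVocab[k][v] + beta) / (nTopic[k] + numVocabs * beta);
    }
}
```

Calling `sweep()` repeatedly plays the role of the sampling iterations; `phi(k, v)` then gives a smoothed estimate of each topic's vocabulary distribution.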

This repository includes a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets) [Lichman 2013].

Requirements
------------

- Java 8

- Apache Commons

  - Math

  - Lang

- Maven

For unit testing, these libraries are also needed:

- JUnit

- Mockito

Usage
-----

### Dataset Form

The bag-of-words dataset follows the format of the [Bag of Words Data Set](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words) in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.html).

The doc-vocab-count dataset has the following form:

    #Documents
    #Vocabularies
    #NonZeros
    docID vocabID count
    docID vocabID count
    ...
    docID vocabID count
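To make the layout concrete, here is a minimal sketch of a reader for this form: three header lines holding the counts, followed by one `docID vocabID count` triple per line. The class `DocVocabCountSketch` and its members are hypothetical names used only for illustration; this is not the reader shipped in this repository.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not this repository's dataset reader.
class DocVocabCountSketch {
    /** One "docID vocabID count" line. */
    static final class Entry {
        final int docId, vocabId, count;
        Entry(int docId, int vocabId, int count) {
            this.docId = docId;
            this.vocabId = vocabId;
            this.count = count;
        }
    }

    int numDocs, numVocabs, numNonZeros;
    final List<Entry> entries = new ArrayList<>();

    static DocVocabCountSketch read(Reader source) {
        DocVocabCountSketch data = new DocVocabCountSketch();
        try (BufferedReader in = new BufferedReader(source)) {
            // Header: #Documents, #Vocabularies, #NonZeros, one per line.
            data.numDocs = Integer.parseInt(in.readLine().trim());
            data.numVocabs = Integer.parseInt(in.readLine().trim());
            data.numNonZeros = Integer.parseInt(in.readLine().trim());
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");
                if (f.length != 3) continue; // skip blank or malformed lines
                data.entries.add(new Entry(Integer.parseInt(f[0]),
                                           Integer.parseInt(f[1]),
                                           Integer.parseInt(f[2])));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The header promises exactly #NonZeros triples.
        if (data.entries.size() != data.numNonZeros) {
            throw new IllegalArgumentException("expected " + data.numNonZeros
                    + " entries but found " + data.entries.size());
        }
        return data;
    }
}
```

For a file on disk, a reader from `java.nio.file.Files.newBufferedReader(Paths.get("path/to/doc-vocab-counts"))` could be passed to `read`.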

The vocabularies dataset has the following form:

    vocab1
    vocab2
    vocab3
    ...
    vocabN

The line number of each vocabulary is its `vocabID`.
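The mapping from line number to `vocabID` can be sketched as below. The sketch assumes IDs are 1-based (the first line is `vocabID` 1), which is an assumption of this example, and `VocabularySketch` is a hypothetical class, not part of this repository.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; assumes vocabID is the 1-based line number.
class VocabularySketch {
    private final List<String> words = new ArrayList<>();

    /** Builds the lookup from the lines of a vocabularies file, in file order. */
    VocabularySketch(List<String> lines) {
        for (String line : lines) {
            words.add(line.trim());
        }
    }

    /** Returns the vocabulary on line vocabId (1-based, by assumption). */
    String get(int vocabId) {
        return words.get(vocabId - 1);
    }

    int size() {
        return words.size();
    }
}
```

The lines of a real file could be obtained with `java.nio.file.Files.readAllLines(Paths.get("path/to/vocabs"))`.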

### Example

`lda.BagOfWords` reads a dataset from files.

A `lda.BagOfWords` object and the other parameters are passed to initialize `lda.LDA`.

For example:

    Dataset dataset = new Dataset("path/to/doc-vocab-counts", "path/to/vocabs");
    LDA lda = new LDA(0.1                    /* initial alpha */,
                      0.1                    /* initial beta */,
                      50                     /* the number of topics */,
                      dataset                /* bag-of-words */,
                      CGS                    /* use collapsed Gibbs sampler for inference */,
                      "path/to/properties"   /* properties file path */);
    lda.run();

The following items can be set in the properties file:

    numIteration=
    seed=
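As an illustration of how such a file can be read with the standard `java.util.Properties` class, consider the sketch below. `LdaPropertiesSketch` and its default values are hypothetical and chosen only for this example; the defaults used by this software, if any, are not stated here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative sketch only; the class name and default values are assumptions.
class LdaPropertiesSketch {
    static final int DEFAULT_NUM_ITERATION = 100; // hypothetical default
    static final long DEFAULT_SEED = 0L;          // hypothetical default

    final int numIteration;
    final long seed;

    LdaPropertiesSketch(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A missing key or an empty value ("seed=") falls back to the default.
        String iter = props.getProperty("numIteration", "").trim();
        String seedText = props.getProperty("seed", "").trim();
        numIteration = iter.isEmpty() ? DEFAULT_NUM_ITERATION : Integer.parseInt(iter);
        seed = seedText.isEmpty() ? DEFAULT_SEED : Long.parseLong(seedText);
    }
}
```

Fixing `seed` makes a sampling run reproducible, since a pseudo-random generator seeded with the same value produces the same sequence of draws.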

The resulting topics can be accessed as follows:

    List<Pair<String, Double>> vocabs
        = LDA.getVocabsSortedByPhi(0 /* = topic ID */);
    vocabs.get(0).getLeft();  // the vocabulary with the largest probability in topic 0
    vocabs.get(0).getRight(); // the probability of that vocabulary
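To show what a list sorted by phi looks like, the sketch below pairs each vocabulary with its probability in one topic and sorts the pairs in descending order of probability. It is a hypothetical stand-in, not the code behind `getVocabsSortedByPhi`: `SortedByPhiSketch` is an invented name, and `Map.Entry` from the JDK is used in place of a pair type so that the example needs no external library.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not this repository's implementation.
class SortedByPhiSketch {
    /**
     * Pairs vocabs.get(v) with phiOfTopic[v] and returns the pairs
     * ordered from the most probable vocabulary to the least probable.
     */
    static List<Map.Entry<String, Double>> sortByPhi(List<String> vocabs, double[] phiOfTopic) {
        List<Map.Entry<String, Double>> result = new ArrayList<>();
        for (int v = 0; v < phiOfTopic.length; v++) {
            result.add(new SimpleImmutableEntry<String, Double>(vocabs.get(v), phiOfTopic[v]));
        }
        // Descending order of probability.
        result.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return result;
    }
}
```

Here `getKey()` and `getValue()` play the roles of `getLeft()` and `getRight()` in the snippet above.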

Please see `example.Example#main` for more details.

Execute these commands in the `LDA` directory to build and run `example.Example#main`.

    $ mvn clean package dependency:copy-dependencies -DincludeScope=runtime
    $ java -jar target/LDA-.jar

License
-------

- [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)
