https://github.com/kymmt90/lda
Latent Dirichlet Allocation in Java 8
- Host: GitHub
- URL: https://github.com/kymmt90/lda
- Owner: kymmt90
- Created: 2015-02-11T04:39:24.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2015-03-15T03:18:35.000Z (almost 10 years ago)
- Last Synced: 2023-02-26T23:26:35.977Z (almost 2 years ago)
- Language: Java
- Homepage:
- Size: 1.26 MB
- Stars: 0
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
LDA in Java 8
=============

Latent Dirichlet Allocation in Java 8.

Latent Dirichlet Allocation (LDA) [Blei+ 2003] is a basic probabilistic topic model.
Please see the following for more details:

- [Latent Dirichlet allocation - Wikipedia, the free encyclopedia](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Currently, this software supports [collapsed Gibbs sampling](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf) [Griffiths and Steyvers 2004] for model inference.

This repository includes a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets) [Lichman 2013].
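
The collapsed Gibbs update from [Griffiths and Steyvers 2004] can be sketched in a few lines. The class below is a hypothetical, minimal illustration and not this repository's implementation: `CollapsedGibbsSketch`, its fields, and the 0-based `int[][]` corpus layout are all assumptions made for the example. For each token it removes the current assignment from the counts, samples a new topic with probability proportional to `(n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)`, and adds the token back.

```java
import java.util.Random;

/** Hypothetical minimal collapsed Gibbs sampler for LDA (illustration only). */
final class CollapsedGibbsSketch {
    final int numTopics;
    final int numVocabs;
    final double alpha;
    final double beta;
    final int[][] docs;            // docs[d][i]: 0-based vocabulary index of token i in document d
    final int[][] z;               // z[d][i]: topic currently assigned to that token
    final int[][] docTopicCount;   // [d][k]: tokens in document d assigned to topic k
    final int[][] topicVocabCount; // [k][v]: times vocabulary v is assigned to topic k
    final int[] topicCount;        // [k]: total tokens assigned to topic k
    private final Random random;

    CollapsedGibbsSketch(int[][] docs, int numVocabs, int numTopics,
                         double alpha, double beta, long seed) {
        this.docs = docs;
        this.numVocabs = numVocabs;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.random = new Random(seed);
        this.z = new int[docs.length][];
        this.docTopicCount = new int[docs.length][numTopics];
        this.topicVocabCount = new int[numTopics][numVocabs];
        this.topicCount = new int[numTopics];
        // Start from a uniformly random topic assignment.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = random.nextInt(numTopics);
                addToCounts(d, docs[d][i], z[d][i], 1);
            }
        }
    }

    private void addToCounts(int d, int v, int k, int delta) {
        docTopicCount[d][k] += delta;
        topicVocabCount[k][v] += delta;
        topicCount[k] += delta;
    }

    /** One sweep: resample the topic of every token once. */
    void sweep() {
        double[] weights = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int v = docs[d][i];
                addToCounts(d, v, z[d][i], -1); // exclude the current token
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    // The document-side denominator is constant in k, so it is omitted.
                    weights[k] = (docTopicCount[d][k] + alpha)
                            * (topicVocabCount[k][v] + beta)
                            / (topicCount[k] + numVocabs * beta);
                    total += weights[k];
                }
                // Draw a topic from the unnormalized weights.
                double u = random.nextDouble() * total;
                int k = 0;
                double cumulative = weights[0];
                while (u >= cumulative && k < numTopics - 1) {
                    k++;
                    cumulative += weights[k];
                }
                z[d][i] = k;
                addToCounts(d, v, k, 1);
            }
        }
    }

    /** Point estimate of the probability of vocabulary v under topic k. */
    double phi(int k, int v) {
        return (topicVocabCount[k][v] + beta) / (topicCount[k] + numVocabs * beta);
    }
}
```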

Requirements
------------

- Java 8
- Apache Commons
  - Math
  - Lang
- Maven

For unit testing, these libraries are also needed:

- JUnit
- Mockito

Usage
-----

### Dataset Form
The bag-of-words dataset format follows the [Bag of Words Data Set](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words) in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.html).

The doc-vocab-count dataset has the following form:

    #Documents
    #Vocabularies
    #NonZeros
    docID vocabID count
    docID vocabID count
    ...
    docID vocabID count

The vocabulary dataset has the following form:

    vocab1
    vocab2
    vocab3
    ...
    vocabN

The line number of each vocabulary is its `vocabID`.
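
Reading the doc-vocab-count layout described above could look roughly like this. `DocVocabCountReader` and its nested types are hypothetical names used for illustration and are not the repository's `lda.BagOfWords`. The sketch assumes whitespace-separated integer fields and that `#NonZeros` equals the number of `docID vocabID count` lines.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the doc-vocab-count format (illustration only). */
final class DocVocabCountReader {

    /** One "docID vocabID count" line, with the IDs stored exactly as written in the file. */
    static final class Entry {
        final int docId;
        final int vocabId;
        final int count;

        Entry(int docId, int vocabId, int count) {
            this.docId = docId;
            this.vocabId = vocabId;
            this.count = count;
        }
    }

    /** Header values plus all parsed entries. */
    static final class Corpus {
        final int numDocs;
        final int numVocabs;
        final List<Entry> entries;

        Corpus(int numDocs, int numVocabs, List<Entry> entries) {
            this.numDocs = numDocs;
            this.numVocabs = numVocabs;
            this.entries = entries;
        }
    }

    /**
     * Parses the lines of a doc-vocab-count file, for example the result of
     * Files.readAllLines(Paths.get("path/to/doc-vocab-counts")).
     */
    static Corpus parse(List<String> lines) {
        if (lines.size() < 3) {
            throw new IllegalArgumentException("Expected three header lines");
        }
        int numDocs = Integer.parseInt(lines.get(0).trim());
        int numVocabs = Integer.parseInt(lines.get(1).trim());
        int numNonZeros = Integer.parseInt(lines.get(2).trim());

        List<Entry> entries = new ArrayList<>();
        for (String line : lines.subList(3, lines.size())) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank lines, e.g. a trailing newline
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 'docID vocabID count': " + line);
            }
            entries.add(new Entry(Integer.parseInt(fields[0]),
                                  Integer.parseInt(fields[1]),
                                  Integer.parseInt(fields[2])));
        }
        if (entries.size() != numNonZeros) {
            throw new IllegalArgumentException(
                    "Header declares " + numNonZeros + " entries but found " + entries.size());
        }
        return new Corpus(numDocs, numVocabs, entries);
    }
}
```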
### Example
`lda.BagOfWords` reads a dataset from files.
The `lda.BagOfWords` object and the other parameters are passed to the `lda.LDA` constructor.
For example:

    Dataset dataset = new Dataset("path/to/doc-vocab-counts", "path/to/vocabs");
    LDA lda = new LDA(0.1 /* initial alpha */,
                      0.1 /* initial beta */,
                      50 /* the number of topics */,
                      dataset /* bag-of-words dataset */,
                      CGS /* use collapsed Gibbs sampler for inference */,
                      "path/to/properties" /* properties file path */);
    lda.run();

These items are available as properties:

    numIteration=
    seed=

The resulting topics can be inspected as follows:

    List<Pair<String, Double>> vocabs
        = LDA.getVocabsSortedByPhi(0 /* = topic ID */);
    vocabs.get(0).getLeft();  // the most probable vocabulary in topic 0
    vocabs.get(0).getRight(); // the probability of that vocabulary

Please see `example.Example#main` for more details.
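
A properties file holding `numIteration` and `seed` might be read as follows. This is a sketch using `java.util.Properties`, not the library's own loading code: `LdaPropertiesSketch` is a hypothetical name, and treating `numIteration` as an `int`, `seed` as a `long`, and both keys as required are assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical holder for the two documented properties (illustration only). */
final class LdaPropertiesSketch {
    final int numIteration;
    final long seed;

    private LdaPropertiesSketch(int numIteration, long seed) {
        this.numIteration = numIteration;
        this.seed = seed;
    }

    /** Loads a properties file from the given path and validates it. */
    static LdaPropertiesSketch load(String path) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(Paths.get(path))) {
            props.load(reader);
        }
        return from(props);
    }

    /** Validates already-loaded properties; both keys are assumed to be required. */
    static LdaPropertiesSketch from(Properties props) {
        int numIteration = Integer.parseInt(require(props, "numIteration"));
        long seed = Long.parseLong(require(props, "seed"));
        if (numIteration <= 0) {
            throw new IllegalArgumentException("numIteration must be positive");
        }
        return new LdaPropertiesSketch(numIteration, seed);
    }

    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }
}
```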
Execute these commands in the `LDA` directory to build and run `example.Example#main`:

    $ mvn clean package dependency:copy-dependencies -DincludeScope=runtime
    $ java -jar target/LDA-.jar

License
-------

- [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)