https://github.com/thu-ml/warplda
Cache efficient implementation for Latent Dirichlet Allocation
- Host: GitHub
- URL: https://github.com/thu-ml/warplda
- Owner: thu-ml
- License: mit
- Created: 2016-05-31T03:31:50.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2019-01-04T06:53:47.000Z (about 6 years ago)
- Last Synced: 2024-08-03T18:20:58.525Z (5 months ago)
- Language: C++
- Size: 45.9 KB
- Stars: 161
- Watchers: 13
- Forks: 55
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-topic-models - warpLDA - C++ cache efficient LDA implementation which samples each token in O(1) [:page_facing_up:](https://arxiv.org/pdf/1510.08628.pdf) (Models / Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf))
README
# WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation
## Introduction
WarpLDA is a cache-efficient implementation of Latent Dirichlet Allocation (LDA) that samples each token in O(1) time.
## Installation
Prerequisites:

* GCC (>=4.8.5)
* CMake (>=2.8.12)
* git
* libnuma
    - CentOS: `yum install libnuma-devel`
    - Ubuntu: `apt-get install libnuma-dev`

Clone this project

    git clone https://github.com/thu-ml/warplda

Install the third-party dependency

    ./get_gflags.sh

Download some data, and split it into training and test sets

    cd data
    wget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt
    head -n 900 ydir_1k.txt > ydir_train.txt
    tail -n 100 ydir_1k.txt > ydir_test.txt
    cd ..

Compile the project

    ./build.sh
    cd release/src
    make -j

## Quick-start

Format the data

    ./format -input ../../data/ydir_train.txt -prefix train
    ./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

    ./warplda --prefix train --k 100 --niter 300

Check the result. Each line describes a topic: its id, the number of tokens assigned to it, and its ten most frequent words with their probabilities.

    vim train.info.full.txt

Infer latent topics of some testing data.

    ./warplda --prefix test --model train.model --inference -niter 40 --perplexity 10

## Data format
The data format is identical to Yahoo! LDA. The input data is a text file with a number of lines, where each line is a document. The format of each line is

    id1 id2 word1 word2 word3 ...

id1, id2 are two string document identifiers, and each word is a string, separated by white space.
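
The line layout described above can be read with a few lines of Python. This is a minimal sketch for checking input files, not part of WarpLDA itself; the `Document` type and the `parse_line` function are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple


class Document(NamedTuple):
    """One input line: two string identifiers followed by the words."""
    id1: str
    id2: str
    words: List[str]


def parse_line(line: str) -> Document:
    """Split one input line into its two identifiers and its words."""
    fields = line.split()  # any run of whitespace separates fields
    if len(fields) < 2:
        raise ValueError("expected at least two document identifiers")
    return Document(fields[0], fields[1], fields[2:])


# Hypothetical example line, not taken from a real data set.
doc = parse_line("doc42 sports the team won the match")
print(doc.id1, doc.id2, len(doc.words))  # prints: doc42 sports 5
```

A document with both identifiers but no words parses to an empty word list, which may be worth filtering out before formatting the data.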
## Output format
WarpLDA generates a number of files:
#### `.vocab` (generated by `format`)

Each line of it is a word in the vocabulary.

#### `.info.full.txt` (generated by `warplda -estimate`)

The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and a number of most frequent words in the format `(probability, word)`. The number of most frequent words is controlled by `-ntop`. `.info.words.txt` is a simpler version which only contains words.

#### `.model` (generated by `warplda -estimate`)

The word-topic count matrix. The first line contains four integers. Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

    index:count index:count ...

For example, `0:2` on the first row means that the first word in the vocabulary is assigned to topic 0 two times.
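
To illustrate how the rows relate to the vocabulary, the sketch below parses `index:count` rows and lists the most frequent words of one topic. It assumes that row *i* of the matrix corresponds to line *i* of the `.vocab` file, as the example above suggests, and it skips the header line without interpreting it. The function names `parse_row` and `top_words` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_row(line: str) -> Dict[int, int]:
    """Parse one `index:count index:count ...` row into {topic: count}."""
    row = {}
    for entry in line.split():
        topic, count = entry.split(":")
        row[int(topic)] = int(count)
    return row


def top_words(model_lines: List[str], vocab: List[str],
              topic: int, n: int = 10) -> List[Tuple[str, int]]:
    """Return up to n (word, count) pairs with the highest count for a topic."""
    counts = Counter()
    # model_lines[0] is the header line; it is skipped, not interpreted.
    for word, line in zip(vocab, model_lines[1:]):
        count = parse_row(line).get(topic, 0)
        if count > 0:
            counts[word] = count
    return counts.most_common(n)


# Hypothetical toy data: 3 words, 2 topics, placeholder header line.
vocab = ["apple", "banana", "cherry"]
model_lines = ["header", "0:2 1:1", "1:4", "0:5"]
print(top_words(model_lines, vocab, topic=0))  # [('cherry', 5), ('apple', 2)]
```

Because the rows are sparse, a word that was never assigned to the requested topic simply does not appear in the result.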
#### `.z.estimate` (generated by `warplda -estimate`)
The topic assignments of each token in the libsvm format. Each line is a document,

    : : ...

#### `.z.inference` (generated by `warplda -inference`)

The format is the same as `.z.estimate`.

## Other features
* Use custom prefix for output `-prefix myprefix`
* Output perplexity every 10 iterations `-perplexity 10`
* Tune Dirichlet hyperparameters `-alpha 10 -beta 0.1`
* Use UCI machine learning repository data

        wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt
        wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz
        gunzip docword.nips.txt.gz
        ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
        head -n 1400 nips.txt > nips_train.txt
        tail -n 100 nips.txt > nips_test.txt

## License
MIT
## Reference
Please cite WarpLDA if you find it useful!

    @inproceedings{chen2016warplda,
      title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
      author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},
      booktitle={VLDB},
      year={2016}
    }