# WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation

## Introduction

WarpLDA is a cache-efficient implementation of Latent Dirichlet Allocation (LDA) which samples each token in O(1) time.
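For background, standard collapsed Gibbs samplers for LDA draw each token's topic from the following conditional (standard LDA notation, not specific to this repository):

$$p(z_{dn}=k \mid \text{rest}) \propto (C_{dk} + \alpha)\,\frac{C_{wk} + \beta}{C_k + V\beta}$$

where $C_{dk}$ is the number of tokens in document $d$ assigned to topic $k$, $C_{wk}$ is the number of times word $w$ is assigned to topic $k$, $C_k = \sum_w C_{wk}$, and $V$ is the vocabulary size. Evaluating this directly costs O(K) per token; WarpLDA instead uses cheap Metropolis-Hastings proposals together with cache-friendly delayed count updates, so each token is sampled in amortized O(1) (see the paper cited in the Reference section for details).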

## Installation
Prerequisites:

* GCC (>=4.8.5)
* CMake (>=2.8.12)
* git
* libnuma
  - CentOS: `yum install libnuma-devel`
  - Ubuntu: `apt-get install libnuma-dev`

Clone this project:

```bash
git clone https://github.com/thu-ml/warplda
cd warplda
```

Install the third-party dependency (gflags):

```bash
./get_gflags.sh
```

Download some data and split it into training and testing sets:

```bash
cd data
wget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..
```

Compile the project:

```bash
./build.sh
cd release/src
make -j
```

## Quick-start

Format the data:

```bash
./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test
```

Train the model:

```bash
./warplda --prefix train --k 100 --niter 300
```

Check the result. Each line is a topic: its id, the number of tokens assigned to it, and the ten most frequent words with their probabilities.

```bash
vim train.info.full.txt
```

Infer the latent topics of the testing data:

```bash
./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10
```

## Data format

The data format is identical to that of Yahoo! LDA. The input is a text file where each line is a document, in the format

```
id1 id2 word1 word2 word3 ...
```

where `id1` and `id2` are two string document identifiers, each word is a string, and all fields are separated by whitespace.
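For illustration, a (hypothetical) two-document input file could look like this:

```
doc001 cat01 apple banana apple orange
doc002 cat02 network neural network deep
```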

## Output format

WarpLDA generates a number of files:

#### `.vocab` (generated by `format`)
Each line is a word in the vocabulary.

#### `.info.full.txt` (generated by `warplda -estimate`)
The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and a number of most frequent words in the format `(probability, word)`. The number of most frequent words is controlled by `-ntop`. `.info.words.txt` is a simpler version which contains only the words.
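For illustration, a hypothetical line of `.info.full.txt` might look like this:

```
13 5012 (0.062, model) (0.054, data) (0.041, learning) ...
```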

#### `.model` (generated by `warplda -estimate`)
The word-topic count matrix. The first line contains four integers.

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format:

```
index:count index:count ...
```

For example, `0:2` in the first row means that the first word in the vocabulary is assigned to topic 0 two times.
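As an illustration, here is a minimal Python sketch (not part of WarpLDA) for loading the sparse rows of a `.model` file; the file name and the decision to keep the four-integer header as raw strings are assumptions made for the example:

```python
# load_model.py -- illustrative sketch, not part of WarpLDA.
# Parses a WarpLDA .model file: the first line is a four-integer header,
# and each remaining line holds one word's sparse "topic:count" pairs.

def load_model(path):
    with open(path) as f:
        header = f.readline().split()  # four integers, kept as strings here
        rows = []
        for line in f:
            counts = {}
            for pair in line.split():
                topic, count = pair.split(":")
                counts[int(topic)] = int(count)
            rows.append(counts)  # rows[i] = topic counts for word i
    return header, rows

if __name__ == "__main__":
    header, rows = load_model("train.model")
    print("header:", header, "vocabulary size:", len(rows))
    print("topic counts for the first word:", rows[0])
```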

#### `.z.estimate` (generated by `warplda -estimate`)
The topic assignments of each token in the libsvm format. Each line is a document:

```
<word id>:<topic id> <word id>:<topic id> ...
```

#### `.z.inference` (generated by `warplda -inference`)
The format is the same as `.z.estimate`.

## Other features

* Use a custom prefix for output: `-prefix myprefix`
* Output the perplexity every 10 iterations: `-perplexity 10`
* Tune the Dirichlet hyperparameters: `-alpha 10 -beta 0.1`
* Use UCI machine learning repository data:

```bash
wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt
wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz
gunzip docword.nips.txt.gz
./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
head -n 1400 nips.txt > nips_train.txt
tail -n 100 nips.txt > nips_test.txt
```

## License

MIT

## Reference

Please cite WarpLDA if you find it useful!

```bibtex
@inproceedings{chen2016warplda,
  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},
  booktitle={VLDB},
  year={2016}
}
```