Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dmlc/experimental-lda
Last synced: 27 days ago
- Host: GitHub
- URL: https://github.com/dmlc/experimental-lda
- Owner: dmlc
- License: other
- Created: 2015-05-17T18:28:52.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-06-23T02:02:42.000Z (over 8 years ago)
- Last Synced: 2024-10-30T07:47:40.323Z (about 1 month ago)
- Language: C++
- Size: 4.7 MB
- Stars: 127
- Watchers: 36
- Forks: 63
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-topic-models - dmlc - Single- and multi-threaded C++ implementations of [lightLDA](https://arxiv.org/pdf/1412.1576.pdf), [F+LDA](https://arxiv.org/pdf/1412.4986v1.pdf), [AliasLDA](https://dl.acm.org/doi/pdf/10.1145/2623330.2623756), forestLDA and many more (Models / Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf))
README
# Single Machine implementation of LDA
## Modules
1. `parallelLDA` contains various implementations of multi-threaded LDA
2. `singleLDA` contains various implementations of single-threaded LDA
3. `topwords` a tool to explore topics learnt by the LDA/HDP
4. `perplexity` a tool to calculate perplexity on another dataset using the word|topic matrix
5. `datagen` packages txt files for our program
6. `preprocessing` for converting from the UCI or cLDA format to a simple txt file with one document per line (see the illustration right after this list)
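As a rough illustration of the "one document per line" layout that `preprocessing` and `datagen` target, each line below stands for the entire bag of words of one document; whether the tools emit plain words or integer word IDs is an assumption, not something this README states:

```
the cat sat on the mat the cat
stock market prices fell sharply today
```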
## Organisation
1. All code is under `src`, within the respective module's folder
2. Template scripts for running the topic models are provided under `scripts`
3. `data` is a placeholder folder where the data should be put
4. `build` and `dist` folders will be created to hold the executables

## Requirements
1. gcc >= 5.0 or Intel® C++ Compiler 2016 for using C++14 features
2. split >= 8.21 (part of GNU coreutils)
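A quick way to check that the toolchain meets these requirements (assuming `gcc` and GNU `split` are installed and on the PATH):

```bash
gcc --version | head -n1     # needs 5.0 or newer for the C++14 features
split --version | head -n1   # GNU coreutils split, needs 8.21 or newer
```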
## How to use
We will show how to run our LDA on a [UCI bag of words dataset](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
1. First of all, compile by running make:
```bash
make
```
2. Download an example dataset from the UCI repository; a script is provided for this:
```bash
scripts/get_data.sh
```
3. Prepare the data for our program:
```bash
scripts/prepare.sh data/nytimes 1
```
For other datasets, replace `nytimes` with the dataset name or location.
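For instance, assuming the PubMed bag-of-words files were available under `data/pubmed` (the path is only an assumption for this example), the same step would be:

```bash
scripts/prepare.sh data/pubmed 1
```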
4. Run LDA!
```bash
scripts/lda_runner.sh
```
Inside `lda_runner.sh` all the parameters, e.g. the number of topics, the hyperparameters of the LDA, the number of threads etc., can be specified. By default the outputs are stored under `out/`. You can also specify which inference algorithm of LDA you want to run:
1. `simpleLDA`: Plain vanilla Gibbs sampling by [Griffiths04](http://www.pnas.org/content/101/suppl_1/5228.abstract)
2. `sparseLDA`: Sparse LDA of [Yao09](http://dl.acm.org/citation.cfm?id=1557121)
3. `aliasLDA`: Alias LDA
4. `FTreeLDA`: F++LDA (inspired by [Yu14](http://arxiv.org/abs/1412.4986))
5. `lightLDA`: light LDA of [Yuan14](http://arxiv.org/abs/1412.1576)
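All five are samplers for the same collapsed Gibbs conditional and differ mainly in how cheaply they can draw from it. For reference, in standard LDA notation (not taken from this repository's code), the probability of assigning topic $k$ to token $w_i$ of document $d$ is

$$
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \bigl(n_{d,k}^{\neg i} + \alpha\bigr)\,\frac{n_{k,w_i}^{\neg i} + \beta}{n_{k}^{\neg i} + V\beta}
$$

where $n_{d,k}$, $n_{k,w}$ and $n_k$ are the document-topic, topic-word and topic counts excluding token $i$, $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet hyperparameters. `sparseLDA`, `aliasLDA`, `FTreeLDA` and `lightLDA` are different data structures and proposal schemes for drawing from this distribution faster than the naive per-token scan over all topics.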
The makefile has some useful features:
- if you have the Intel® C++ Compiler, you can instead run
```bash
make intel
```
- or, if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), run
```bash
make inteltogether
```
- you can also selectively compile individual modules by specifying
```bash
make
```
- or clean individually by
```bash
make clean-
```

## Performance
Based on our evaluation, F++LDA works the best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge we obtained more than 25 million tokens per second. Below we provide a performance comparison against various inference procedures on publicly available datasets.
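Held-out perplexity is the usual measure here (the standard definition, not taken from this repository's code): with $N_d$ test tokens in document $d$,

$$
\text{perplexity} = \exp\!\left(-\frac{\sum_{d}\log p(\mathbf{w}_d)}{\sum_{d} N_d}\right),
$$

so lower is better; the `perplexity` tool computes it for a held-out dataset from a trained word|topic matrix.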
#### Datasets

| Dataset | V | L | D | L/V | L/D |
| ------------ | --------: | --------------: | -----------: | --------: | --------: |
| NY Times | 101,330 | 99,542,127 | 299,753 | 982.36 | 332.08 |
| PubMed | 141,043 | 737,869,085 | 8,200,000 | 5,231.52 | 89.98 |
| Wikipedia | 210,218 | 1,614,349,889 | 3,731,325 | 7,679.41 | 432.65 |

Experimental datasets and their statistics. `V` denotes vocabulary size, `L` denotes the number of training tokens, `D` denotes
the number of documents, `L/V` indicates the average number of occurrences of a word, `L/D` indicates the average length of a
document.
#### log-Perplexity with time