
# Single-machine implementation of LDA

## Modules
1. `parallelLDA` contains various implementations of multi-threaded LDA
2. `singleLDA` contains various implementations of single-threaded LDA
3. `topwords` a tool to explore topics learnt by LDA/HDP
4. `perplexity` a tool to calculate perplexity on another dataset using the word|topic matrix (see the formula after this list)
5. `datagen` packages txt files for our program
6. `preprocessing` for converting from UCI or cLDA format to a simple txt file with one document per line
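
For reference, the quantity the `perplexity` tool computes is, in its standard held-out form (a sketch of the usual definition; the tool's exact estimator may differ in details):

```latex
\mathrm{perplexity}(\mathcal{D}_{\mathrm{test}})
  = \exp\!\left(-\,\frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}
                        {\sum_{d=1}^{D} N_d}\right)
```

where `N_d` is the number of tokens in held-out document `d`; lower values mean the learnt word|topic matrix explains the unseen data better.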

## Organisation
1. All code is under `src`, within the respective module folders
2. Many template scripts for running the topic models are provided under `scripts`
3. `data` is a placeholder folder where the data should be put
4. The `build` and `dist` folders will be created to hold the executables

## Requirements
1. gcc >= 5.0 or Intel® C++ Compiler 2016 for using C++14 features
2. split >= 8.21 (part of GNU coreutils)

## How to use
We will show how to run our LDA on a [UCI bag-of-words dataset](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
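
For orientation: a UCI bag-of-words dataset consists of two files, `docword.<name>.txt` and `vocab.<name>.txt`. The `docword` file begins with three header lines giving the number of documents `D`, the vocabulary size `W`, and the number of nonzero counts `NNZ`, followed by one `docID wordID count` triple per line; the `vocab` file lists one word per line. The `preprocessing` module converts this into the one-document-per-line txt format our programs consume.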

1. First of all, compile by running `make`.

```bash
make
```

2. Download an example dataset from the UCI repository using the provided script.

```bash
scripts/get_data.sh
```

3. Prepare the data for our program

```bash
scripts/prepare.sh data/nytimes 1
```

For other datasets, replace `nytimes` with the dataset name or location.

4. Run LDA!

```bash
scripts/lda_runner.sh
```

All the parameters, e.g. the number of topics, the hyperparameters of the LDA, the number of threads, etc., can be specified inside `lda_runner.sh`. By default the outputs are stored under `out/`. You can also specify which LDA inference algorithm to run:
1. `simpleLDA`: Plain vanilla Gibbs sampling by [Griffiths04](http://www.pnas.org/content/101/suppl_1/5228.abstract)
2. `sparseLDA`: Sparse LDA of [Yao09](http://dl.acm.org/citation.cfm?id=1557121)
3. `aliasLDA`: Alias LDA
4. `FTreeLDA`: F++LDA (inspired by [Yu14](http://arxiv.org/abs/1412.4986)); see the sketch after this list
5. `lightLDA`: LightLDA of [Yuan14](http://arxiv.org/abs/1412.1576)
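
To give a feel for what makes the F+ tree approach fast: the sampler keeps per-topic weights in a complete binary tree whose internal nodes cache subtree sums, so drawing a topic and updating a single weight both take O(log K) rather than O(K). Below is a minimal illustrative sketch of such a structure (names and layout are ours, not the repository's actual implementation):

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

// Minimal F+ tree sketch (illustrative; not the repository's actual code).
// Leaves hold per-topic weights; each internal node stores the sum of its
// subtree, so both sampling and single-weight updates cost O(log K).
struct FPlusTree {
    std::size_t K;            // number of leaves, padded to a power of two
    std::vector<double> w;    // 1-based heap layout; w[1] is the total mass

    explicit FPlusTree(const std::vector<double>& weights) {
        K = 1;
        while (K < weights.size()) K <<= 1;        // pad to a power of two
        w.assign(2 * K, 0.0);
        for (std::size_t k = 0; k < weights.size(); ++k) w[K + k] = weights[k];
        for (std::size_t i = K - 1; i >= 1; --i)   // fill internal sums
            w[i] = w[2 * i] + w[2 * i + 1];
    }

    // Set topic k's weight and refresh the sums on its root path: O(log K).
    void update(std::size_t k, double value) {
        std::size_t i = K + k;
        w[i] = value;
        for (i >>= 1; i >= 1; i >>= 1) w[i] = w[2 * i] + w[2 * i + 1];
    }

    // Draw topic k with probability w[k] / total by walking down: O(log K).
    template <class RNG>
    std::size_t sample(RNG& rng) {
        double u = std::uniform_real_distribution<double>(0.0, w[1])(rng);
        std::size_t i = 1;
        while (i < K) {
            if (u < w[2 * i]) i = 2 * i;            // descend left
            else { u -= w[2 * i]; i = 2 * i + 1; }  // descend right
        }
        return i - K;                               // leaf index = topic id
    }
};

int main() {
    std::mt19937 rng(42);
    FPlusTree tree({0.5, 2.0, 0.25, 1.25});  // unnormalised topic weights
    tree.update(2, 3.0);                     // a topic count changed
    std::cout << "sampled topic: " << tree.sample(rng) << '\n';
}
```

In a real Gibbs sweep the tree is updated incrementally as topic counts change; the point of the structure is that both sampling and updating stay logarithmic in the number of topics.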

The makefile has some useful features:

- if you have the Intel® C++ Compiler, you can instead run

```bash
make intel
```

- or if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), then run

```bash
make inteltogether
```

- Also you can selectively compile an individual module by specifying its name:

```bash
make <module>
```

- or clean an individual module by:

```bash
make clean-<module>
```

## Performance
Based on our evaluation, F++LDA works best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge instance we obtained more than 25 million tokens per second; at that rate a single sweep over the roughly 100-million-token NY Times corpus takes about four seconds. Below we provide a performance comparison against various inference procedures on publicly available datasets.

#### Datasets

| Dataset | V | L | D | L/V | L/D |
| ------------ | --------: | --------------: | -----------: | --------: | --------: |
| NY Times | 101,330 | 99,542,127 | 299,753 | 982.36 | 332.08 |
| PubMed | 141,043 | 737,869,085 | 8,200,000 | 5,231.52 | 89.98 |
| Wikipedia | 210,218 | 1,614,349,889 | 3,731,325 | 7,679.41 | 432.65 |

Experimental datasets and their statistics. `V` denotes vocabulary size, `L` denotes the number of training tokens, `D` denotes
the number of documents, `L/V` indicates the average number of occurrences of a word, `L/D` indicates the average length of a
document.

#### log-Perplexity with time