{"id":13474327,"url":"https://github.com/dmlc/experimental-lda","last_synced_at":"2025-08-30T23:35:56.206Z","repository":{"id":32203817,"uuid":"35777509","full_name":"dmlc/experimental-lda","owner":"dmlc","description":null,"archived":false,"fork":false,"pushed_at":"2016-06-23T02:02:42.000Z","size":4930,"stargazers_count":127,"open_issues_count":2,"forks_count":59,"subscribers_count":35,"default_branch":"master","last_synced_at":"2025-04-22T12:13:24.973Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmlc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-05-17T18:28:52.000Z","updated_at":"2024-01-04T15:59:55.000Z","dependencies_parsed_at":"2022-07-30T05:47:52.241Z","dependency_job_id":null,"html_url":"https://github.com/dmlc/experimental-lda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dmlc/experimental-lda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmlc%2Fexperimental-lda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmlc%2Fexperimental-lda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmlc%2Fexperimental-lda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmlc%2Fexperimental-lda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmlc","download_url":"https://codeload.github.com/dmlc/experimental-lda/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmlc%2Fexperimental-lda/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259301677,"owners_count":22836976,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T16:01:11.448Z","updated_at":"2025-06-11T17:03:54.773Z","avatar_url":"https://github.com/dmlc.png","language":"C++","readme":"# Single Machine implementation of LDA\n\n## Modules\n1. `parallelLDA` contains various implementation of multi threaded LDA\n2. `singleLDA` contains various implementation of single threaded LDA\n3. `topwords` a tool to explore topics learnt by the LDA/HDP\n4. `perplexity` a tool to calculate perplexity on another dataset using word|topic matrix\n5. `datagen` packages txt files for our program\n6. `preprocessing` for converting from UCI or cLDA to simple txt file having one document per line\n\n\n## Organisation\n1. All codes are under `src` within respective folder\n2. For running Topic Models many template scripts are provided under `scripts`\n3. `data` is a placeholder folder where to put the data\n4. `build` and `dist` folder will be created to hold the executables\n\n\n## Requirements\n1. gcc \u003e= 5.0 or Intel\u0026reg; C++ Compiler 2016 for using C++14 features\n2. split \u003e= 8.21 (part of GNU coreutils)\n\n## How to use\nWe will show how to run our LDA on an [UCI bag of words dataset](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)\n\n1. First of all compile by hitting make\n\n   ```bash\n     make\n   ```\n\n2. Download example dataset from UCI repository. For this a script has been provided.\n\n   ```bash\n     scripts/get_data.sh\n   ```\n\n3. Prepare the data for our program\n\n   ```bash\n     scripts/prepare.sh data/nytimes 1\n   ```\n\n   For other datasets replace nytimes with dataset name or location.\n\n4. Run LDA!\n\n   ```bash\n     scripts/lda_runner.sh\n   ```\n\n   Inside the `lda_runner.sh` all the parameters e.g. number of topics, hyperparameters of the LDA, number of threads etc. can be specified. By default the outputs are stored under `out/`. Also you can specify which inference algorithm of LDA you want to run:\n   1. `simpleLDA`: Plain vanilla Gibbs sampling by [Griffiths04](http://www.pnas.org/content/101/suppl_1/5228.abstract)\n   2. `sparseLDA`: Sparse LDA of [Yao09](http://dl.acm.org/citation.cfm?id=1557121)\n   3. `aliasLDA`: Alias LDA\n   4. `FTreeLDA`: F++LDA (inspired from [Yu14](http://arxiv.org/abs/1412.4986)\n   5. `lightLDA`: light LDA of [Yuan14](http://arxiv.org/abs/1412.1576)\n\nThe make file has some useful features:\n\n- if you have Intel\u0026reg; C++ Compiler, then you can instead\n\n   ```bash\n     make intel\n   ```\n\n- or if you want to use Intel\u0026reg; C++ Compiler's cross-file optimization (ipo), then hit\n   \n   ```bash\n     make inteltogether\n   ```\n\n- Also you can selectively compile individual modules by specifying\n\n   ```bash\n     make \u003cmodule-name\u003e\n   ```\n\n- or clean individually by\n\n   ```bash\n     make clean-\u003cmodule-name\u003e\n   ```\n\n## Performance\nBased on our evaluation F++LDA works the best in terms of both speed and perplexity on a held-out dataset. For example on Amazon EC2 c4.8xlarge, we obtained more than 25 million/tokens per second. Below we provide performance comparison against various inference procedures on publicaly available datasets.\n\n#### Datasets\n\n|  Dataset     |  V        |  L              |  D           |  L/V      |  L/D      |\n| ------------ | --------: | --------------: | -----------: | --------: | --------: |\n|  NY Times    |  101,330  |  99,542,127     |  299,753     |  982.36   |  332.08   |\n|  PubMed      |  141,043  |  737,869,085    |  8,200,000   |  5,231.52\t|\t89.98    |\n|  Wikipedia   |\t210,218  |  1,614,349,889  |  3,731,325   |  7,679.41\t|\t432.65 \t|\n\n   Experimental datasets and their statistics. `V` denotes vocabulary size, `L` denotes the number of training tokens, `D` denotes\n   the number of documents, `L/V` indicates the average number of occurrences of a word, `L/D` indicates the average length of a\n   document.\n  \n#### log-Perplexity with time\n\n\u003cimg src=https://raw.githubusercontent.com/dmlc/experimental-lda/master/nytimesllh_v_time.jpg width=400/\u003e\n\u003cimg src=https://raw.githubusercontent.com/dmlc/experimental-lda/master/pubmedllh_v_time.jpg width=400/\u003e\n\u003cimg src=https://raw.githubusercontent.com/dmlc/experimental-lda/master/wikillh_v_time.jpg width=400/\u003e\n","funding_links":[],"categories":["C++","Models"],"sub_categories":["Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmlc%2Fexperimental-lda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmlc%2Fexperimental-lda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmlc%2Fexperimental-lda/lists"}