{"id":13696431,"url":"https://github.com/thu-ml/warplda","last_synced_at":"2025-05-03T17:31:10.823Z","repository":{"id":99804953,"uuid":"60055640","full_name":"thu-ml/warplda","owner":"thu-ml","description":"Cache efficient implementation for Latent Dirichlet Allocation","archived":false,"fork":false,"pushed_at":"2019-01-04T06:53:47.000Z","size":47,"stargazers_count":161,"open_issues_count":4,"forks_count":55,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-08-03T18:20:58.525Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2016-05-31T03:31:50.000Z","updated_at":"2024-06-25T22:07:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"bd572c85-8903-46dc-9a84-4a6a813a57a1","html_url":"https://github.com/thu-ml/warplda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fwarplda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fwarplda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fwarplda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fwarplda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thu-ml","download_url":"https://codeload.github.com/thu-ml/warplda/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.
com","kind":"github","repositories_count":224369532,"owners_count":17299917,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T18:00:39.962Z","updated_at":"2024-11-13T00:30:29.557Z","avatar_url":"https://github.com/thu-ml.png","language":"C++","funding_links":[],"categories":["Models"],"sub_categories":["Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)"],"readme":"# WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation\n\n## Introduction\n\nWarpLDA is a cache-efficient implementation of Latent Dirichlet Allocation that samples each token in O(1) time.\n\n## Installation\nPrerequisites:\n\n* GCC (\u003e=4.8.5)\n* CMake (\u003e=2.8.12)\n* git\n* libnuma \n  - CentOS: `yum install libnuma-devel`\n  - Ubuntu: `apt-get install libnuma-dev`\n\nClone this project\n\n\tgit clone https://github.com/thu-ml/warplda\n\nInstall the third-party dependency\n\n\t./get_gflags.sh\n\nDownload some data and split it into training and test sets\n\n\tcd data\n\twget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt\n\thead -n 900 ydir_1k.txt \u003e ydir_train.txt\n\ttail -n 100 ydir_1k.txt \u003e ydir_test.txt\n\tcd ..\n\nCompile the project\n\n\t./build.sh\n\tcd release/src\n\tmake -j\n\n## Quick-start\n\nFormat the data\n\n\t./format -input ../../data/ydir_train.txt -prefix train\n\t./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test\n\nTrain the model\n\n\t./warplda --prefix train --k 100 --niter 300\n\nCheck the result. 
Each line describes a topic: its id, the number of tokens assigned to it, and its ten most frequent words with their probabilities.\n\n\tvim train.info.full.txt\n\nInfer latent topics of some testing data.\n\n\t./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10\n\n## Data format\n\nThe data format is identical to that of Yahoo! LDA. The input data is a text file in which each line is a document. The format of each line is\n\n    id1 id2 word1 word2 word3 ...\n\n`id1` and `id2` are two string document identifiers, and each word is a string. All fields are separated by whitespace.\n\n## Output format\n\nWarpLDA generates a number of files:\n\n#### `.vocab` (generated by `format`)\nEach line is a word in the vocabulary.\n\n#### `.info.full.txt` (generated by `warplda -estimate`)\nThe most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and its most frequent words in the format `(probability, word)`. The number of most frequent words is controlled by `-ntop`. `.info.words.txt` is a simpler version which only contains words.\n\n#### `.model` (generated by `warplda -estimate`)\nThe word-topic count matrix. The first line contains four values:\n\n\t\u003csize of vocabulary\u003e \u003cnumber of topics\u003e \u003calpha\u003e \u003cbeta\u003e\n\nEach of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format:\n\n\t\u003cnumber of elements\u003e index:count index:count ...\n\nFor example, `0:2` in the first row means that the first word in the vocabulary is assigned to topic 0 twice.\n\n#### `.z.estimate` (generated by `warplda -estimate`)\nThe topic assignments of each token in the libsvm format. 
Each line is a document:\n\n\t\u003cnumber of tokens\u003e \u003cword id\u003e:\u003ctopic id\u003e \u003cword id\u003e:\u003ctopic id\u003e ...\n\n#### `.z.inference` (generated by `warplda -inference`)\nThe format is the same as `.z.estimate`.\n\n## Other features\n\n* Use a custom prefix for output: `-prefix myprefix`\n* Output perplexity every 10 iterations: `-perplexity 10`\n* Tune Dirichlet hyperparameters: `-alpha 10 -beta 0.1`\n* Use UCI machine learning repository data:\n\n\t\twget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt\n\t\twget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz\n\t\tgunzip docword.nips.txt.gz\n\t\t./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt\n\t\thead -n 1400 nips.txt \u003e nips_train.txt\n\t\ttail -n 100 nips.txt \u003e nips_test.txt\n\n## License\n\nMIT\n\n## Reference\n\nPlease cite WarpLDA if you find it useful!\n\n\t@inproceedings{chen2016warplda,\n\t  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},\n\t  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},\n\t  booktitle={VLDB},\n\t  year={2016}\n\t}\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2Fwarplda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-ml%2Fwarplda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2Fwarplda/lists"}