{"id":13697194,"url":"https://github.com/blei-lab/hdp","last_synced_at":"2025-07-14T11:10:24.750Z","repository":{"id":29553162,"uuid":"33092348","full_name":"blei-lab/hdp","owner":"blei-lab","description":"Hierarchical Dirichlet processes. Topic models where the data determine the number of topics. This implements Gibbs sampling.","archived":false,"fork":false,"pushed_at":"2017-02-21T21:08:34.000Z","size":48,"stargazers_count":149,"open_issues_count":6,"forks_count":47,"subscribers_count":46,"default_branch":"master","last_synced_at":"2025-06-23T20:58:27.677Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blei-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-29T22:34:26.000Z","updated_at":"2025-05-23T03:05:26.000Z","dependencies_parsed_at":"2022-08-31T04:31:44.224Z","dependency_job_id":null,"html_url":"https://github.com/blei-lab/hdp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/blei-lab/hdp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fhdp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fhdp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fhdp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fhdp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blei-lab","download_url":"https://codeload.github.com/blei-lab/hdp/tar.gz/refs/heads/master","sbom_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/blei-lab%2Fhdp/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265281402,"owners_count":23739875,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T18:00:53.823Z","updated_at":"2025-07-14T11:10:24.719Z","avatar_url":"https://github.com/blei-lab.png","language":"C++","funding_links":[],"categories":["Research Implementations"],"sub_categories":["Embedding based Topic Models"],"readme":"# Hierarchical Dirichlet Process (with Split-Merge Operations)\n\n**********************************************************************\n\n(C) Copyright 2010, Chong Wang and David Blei. Written by [Chong Wang](http://www.cs.princeton.edu/~chongw/index.html).\n\nThis is a C++ implementation of the hierarchical Dirichlet process for topic modeling.\n\n## README\n\n\nNB: The split-merge algorithm is preliminary. Note that this code requires the GNU Scientific Library (GSL), http://www.gnu.org/software/gsl/\n\n-----------------------------------------------------------------------------------------\n\n\nTABLE OF CONTENTS\n\n\nA. COMPILING\n\nB. POSTERIOR INFERENCE\n\nC. INFERENCE ON NEW DATA\n\nD. PARAMETER SETTINGS\n\nE. PRINTING TOPICS\n\n-----------------------------------------------------------------------------------------\n\n\nA. COMPILING\n\nType \"make\" in a shell. Make sure the GSL is installed. You may need to edit\nthe Makefile to match your system.\n\n\nB. 
POSTERIOR INFERENCE\n\nThe following shows an example of performing posterior inference on a set of documents:\n\nhdp --algorithm train --data data --directory train_dir\n\n\nData format\n\n--data points to a file where each line is of the form (the LDA-C format):\n\n     [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]\n\nwhere [M] is the number of unique terms in the document, and the\n[count] associated with each term is how many times that term appeared\nin the document.\n\nThe sampler writes the following files to the --directory:\n\n*-topics.dat: the word counts for each topic, with each line as a topic\n\n*-word-assignments.dat: each word's assignment to a topic and a table,\nin an R-friendly format with the columns\nd w z t\n\nd: document id\nw: word id\nz: topic index\nt: table index (document level only; irrelevant if you only analyze the topics)\n\n*.bin: the binary model file used for inference on new data.\n\nstate.log: various information to monitor the Markov chain.\n\nFor more parameter settings, run:\nhdp --help\n\nNote: some parameters for split-merge are hard-coded at the beginning of the\nhdp.cpp file.\n\n-----------------------------------------------------------------------------------------\n\nC. 
INFERENCE ON NEW DATA\n\nTo perform inference on a different set of data (in the same format as before), run:\n\nhdp --algorithm test --data data --saved_model saved_model --directory test_dir \n\nwhere --saved_model is the binary file from the posterior inference on training data.\n     \nThe sampler writes the following files to the --directory:\n\ntest-*-topics.dat: the word counts for each topic, with each line as a topic\n\ntest*-word-assignments.dat: each word's assignment to a topic and a table,\nin an R-friendly format.\n\ntest.log: various information to monitor the Markov chain.\n\ntest-*.bin: the binary model file used for inference on newer data.\n\nFor more parameter settings, run:\nhdp --help\n\n-----------------------------------------------------------------------------------------\n\n\nD. PARAMETER SETTINGS\n\nThe meaning of the parameters is the same as in the following paper:\n\nY. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes.\nJournal of the American Statistical Association, 101(476):1566-1581, 2006.\n\n-----------------------------------------------------------------------------------------\n\nE. PRINTING TOPICS\n\nAn R script (print.topics.R) is included to print topics. Make sure it is\nexecutable (chmod +x print.topics.R). For example,\n\nprint.topics.R mode-topics.dat vocab.dat topics.dat 10\n\nwill produce a topic list with the top 10 words selected. For help, run:\n\nprint.topics.R\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblei-lab%2Fhdp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblei-lab%2Fhdp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblei-lab%2Fhdp/lists"}