{"id":13696509,"url":"https://github.com/thu-ml/BigTopicModel","last_synced_at":"2025-05-03T17:31:21.819Z","repository":{"id":69910578,"uuid":"65353206","full_name":"thu-ml/BigTopicModel","owner":"thu-ml","description":"Big Topic Model is a fast engine for running large-scale Topic Models. ","archived":false,"fork":false,"pushed_at":"2017-03-25T05:50:09.000Z","size":474,"stargazers_count":22,"open_issues_count":7,"forks_count":6,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-05-03T09:51:51.030Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2016-08-10T05:29:39.000Z","updated_at":"2025-01-07T02:45:12.000Z","dependencies_parsed_at":null,"dependency_job_id":"aa27e0e7-95ae-4871-988f-bbd578c36a01","html_url":"https://github.com/thu-ml/BigTopicModel","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FBigTopicModel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FBigTopicModel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FBigTopicModel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FBigTopicModel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thu-ml","download_url":"https://codeload.github.com/thu-ml/BigTopicModel/tar.gz/refs/heads
/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252226723,"owners_count":21714858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T18:00:41.648Z","updated_at":"2025-05-03T17:31:21.079Z","avatar_url":"https://github.com/thu-ml.png","language":"C++","funding_links":[],"categories":["Models"],"sub_categories":["Miscellaneous topic models"],"readme":"# BigTopicModel\n\nBig Topic Model is a fast engine for running large-scale topic models. It uses a hybrid data- and model-parallel mechanism for acceleration and performs 3~5 times faster than state-of-the-art systems on some general datasets. It supports a set of topic models including LDA, DTM, MedLDA and RTM.\n\n## Requirement\n\nBig Topic Model depends on several third-party libraries, including Google glog, gflags, dSFMT and Eigen. Currently our tests compile and run mainly with Intel libraries (Intel® Parallel Studio XE Professional Edition for C++ Linux* 2016), but other libraries such as OpenMPI can also be used.\n\n### Getting Intel software Toolkit\n\nGo to http://software.intel.com to download the Intel software Toolkit, then source the compilervars script:\n\n```\nsource /opt/intel/2016/compilers_and_libraries/linux/bin/compilervars.sh intel64\n```\n\n### Getting third party dependencies\n\nMake sure you can access github.com. 
To get third-party dependencies, run:\n\n```\n$ ./set_third_party.sh\n```\n\n(You do not need to do this if you already have all the third-party dependencies.)\n\n## Compilation\n\nFirst, run build.sh under the root directory to generate the release and debug folders.\n```\n$ ./build.sh\n```\nSecond, use make to compile.\n```\n$ cd release/\n$ make\n```\n\n## Data Preprocessing\n\n### Setting the data directory\n\nTo link the data under the root directory, run:\n```\n$ ln -sf  where/you/store/the/data/ data\n```\n\n### Preprocessing the data\n\nTo run on a cluster, the data must be partitioned. Unlike other current cluster computing systems, Big Topic Model segments data not only by documents but also by words, that is, a two-dimensional partition. The data format therefore differs from the traditional SVM data format, and the data must be preprocessed by distributing words randomly to each group before running on a cluster.\n\nUse split-input-data.sh in the \u003ckbd\u003esrc\u003c/kbd\u003e folder to divide the raw data into groups of equal size for the cluster to format.\n\nThis yields a package of data for each machine. Use format.py in the \u003ckbd\u003esrc\u003c/kbd\u003e folder to partition the data according to the parallelism.\n\n## Running the model\n\nModify run.sh under the \u003ckbd\u003erelease\u003c/kbd\u003e folder to run your model.\n```\n$ cd release/\n$ ./run.sh\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2FBigTopicModel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-ml%2FBigTopicModel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2FBigTopicModel/lists"}