{"id":13696360,"url":"https://github.com/hncuong/topicmodel-lib","last_synced_at":"2025-05-03T17:30:57.528Z","repository":{"id":62584875,"uuid":"81194120","full_name":"hncuong/topicmodel-lib","owner":"hncuong","description":"A Python library for topic modeling.","archived":false,"fork":false,"pushed_at":"2020-06-24T12:05:36.000Z","size":12743,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-05T03:04:23.666Z","etag":null,"topics":["latent-dirichlet-allocation","machine-learning","topic-modeling"],"latest_commit_sha":null,"homepage":"https://test-dslab.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hncuong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-02-07T10:06:45.000Z","updated_at":"2020-07-24T08:18:58.000Z","dependencies_parsed_at":"2022-11-03T22:00:54.129Z","dependency_job_id":null,"html_url":"https://github.com/hncuong/topicmodel-lib","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hncuong%2Ftopicmodel-lib","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hncuong%2Ftopicmodel-lib/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hncuong%2Ftopicmodel-lib/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hncuong%2Ftopicmodel-lib/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hncuong","download_url":"https://codeload.github.com/hncuong/topicmod
el-lib/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252226640,"owners_count":21714835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["latent-dirichlet-allocation","machine-learning","topic-modeling"],"created_at":"2024-08-02T18:00:38.838Z","updated_at":"2025-05-03T17:30:56.608Z","avatar_url":"https://github.com/hncuong.png","language":"Python","funding_links":[],"categories":["Models"],"sub_categories":["Latent Dirichlet Allocation (LDA) [:page_facing_up:](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)"],"readme":"topicmodel-lib\r\n================\r\n\r\n[![GitHub release](https://img.shields.io/badge/release-1.0.0-yellow.svg)]()[![Wheel](https://img.shields.io/pypi/wheel/gensim.svg)](https://pypi.python.org/pypi/tmlib) \r\n[![Mailing List](https://img.shields.io/badge/-Mailing%20List-lightgrey.svg)](https://groups.google.com/forum/#!forum/dslab-tmlib)\r\n[![License](https://img.shields.io/packagist/l/doctrine/orm.svg)]()\r\n\r\ntopicmodel-lib is a Python library for *topic modeling* - a field which provides an efficient way to discover hidden structures/semantics in massive data. Latent Dirichlet Allocation (LDA) is a popular model in this field and we focus on methods for learning LDA by online or stream scheme.\r\n\r\nFeatures\r\n--------\r\n\r\n- Our library provides efficient algorithms for learning LDA from large-scale data. 
It includes state-of-the-art learning methods\r\n- We also implement some algorithms in Cython (a programming language that makes writing C extensions for the Python language as easy as Python itself) to increase their speed\r\n- We've also designed a visualization module to help users understand and explore the results that the model discovers after learning\r\n\r\nGetting started in 30s\r\n----------------------\r\n\r\n**Training data**\r\n\r\nBecause the model has to be learned from massive data, loading the whole training set into memory is a bad idea. Online/streaming learning algorithms are therefore usually preferred in this case. Training data should be stored in a file in a specific format. Our library supports 3 training data formats; here we'll demo with the [ap corpus](https://github.com/hncuong/topicmodel-lib/tree/master/examples/ap/data)\r\n\r\n**Tutorial**\r\n\r\nFirst, the `DataSet` class provides functions for processing the training data:\r\n\r\n```python\r\n\u003e\u003e\u003e from tmlib.datasets import DataSet\r\n\r\n\u003e\u003e\u003e data = DataSet('ap_train_raw.txt', batch_size=100, passes=5, shuffle_every=2)\r\n```\r\n\r\nLearning LDA with the Online VB method ([Hoffman, 2010](http://www.cs.columbia.edu/~blei/papers/HoffmanBleiBach2010c.pdf)):\r\n\r\n```python\r\n\u003e\u003e\u003e from tmlib.lda import OnlineVB\r\n\u003e\u003e\u003e onlinevb = OnlineVB(data, num_topics=20)\r\n\u003e\u003e\u003e model = onlinevb.learn_model()\r\n```\r\n\r\nYou can see the topics discovered by Online VB:\r\n\r\n```python\r\n\u003e\u003e\u003e model.print_top_words(5, data.vocab_file, display_result='screen')\r\n```\r\n\r\nFor a more in-depth tutorial about topicmodel-lib, see the documentation. \r\nIn the [examples folder](https://github.com/hncuong/topicmodel-lib/tree/master/examples) of the repository, you will find example code as well as training data.\r\n
You can run a demo to understand how the library works\r\n\r\n\r\nInstallation\r\n------------\r\n\r\n**Dependencies**\r\n\r\nTo use the library, your computer must have all of the following installed first:\r\n\r\n- Linux OS (Stable on Ubuntu)\r\n- Python version 2 (stable on version 2.7)\r\n- Numpy \u003e= 1.8\r\n- Scipy \u003e= 0.10\r\n- nltk (Natural Language Toolkit)\r\n- Cython\r\n- Pandas \u003e= 0.20\r\n\r\n**User Installation**\r\n\r\n- Installing with pip\r\n\r\n      $ sudo pip install tmlib\r\n\r\n\r\n- Installing by running the setup file\r\n\r\n  After downloading the library, install it by running setup.py in the topicmodel-lib folder as follows:\r\n\r\n  First, build the necessary packages:\r\n\r\n      $ python setup.py build_ext --inplace\r\n    \r\n  or if you need permission to build:\r\n  \r\n      $ sudo python setup.py build_ext --inplace\r\n    \r\n  After that, install the library on your computer:\r\n  \r\n      $ sudo python setup.py install\r\n\r\nDocumentation\r\n-------------\r\n\r\nSee details at http://test-dslab.readthedocs.io\r\n\r\nSupport\r\n-------\r\n\r\nIf you have an open-ended or research question, you can join and contact us via: \r\n\r\n- [Google Group](https://groups.google.com/forum/#!forum/dslab-tmlib)\r\n- [Facebook Group](https://www.facebook.com/groups/465441110326541/?ref=group_browse_new)\r\n\r\nContributors:\r\n\r\n- VuVanTu\r\n- KhangTruong\r\n- HaNhatCuong\r\n- TungDoan\r\n\r\nLicense\r\n-------\r\n\r\nThe project is licensed under the MIT license.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhncuong%2Ftopicmodel-lib","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhncuong%2Ftopicmodel-lib","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhncuong%2Ftopicmodel-lib/lists"}