{"id":13696714,"url":"https://github.com/jrmazarura/GPM","last_synced_at":"2025-05-03T17:32:00.595Z","repository":{"id":45300733,"uuid":"301347181","full_name":"jrmazarura/GPM","owner":"jrmazarura","description":null,"archived":false,"fork":false,"pushed_at":"2022-07-13T11:49:53.000Z","size":1019,"stargazers_count":13,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-26T09:08:12.749Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jrmazarura.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-05T08:52:08.000Z","updated_at":"2024-03-12T13:05:19.000Z","dependencies_parsed_at":"2022-09-02T13:00:12.266Z","dependency_job_id":null,"html_url":"https://github.com/jrmazarura/GPM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrmazarura%2FGPM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrmazarura%2FGPM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrmazarura%2FGPM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrmazarura%2FGPM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jrmazarura","download_url":"https://codeload.github.com/jrmazarura/GPM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252226804,"owners_count":21714871,"icon_url":"https://github.com/githu
b.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T18:00:45.478Z","updated_at":"2025-05-03T17:31:59.781Z","avatar_url":"https://github.com/jrmazarura.png","language":"Python","funding_links":[],"categories":["Models"],"sub_categories":["Topic Models for short documents"],"readme":"# GPyM_TM\n\n**GPyM_TM** is a Python package to perform topic modelling, either through the use of the Dirichlet multinomial mixture model (GSDMM) [1] or the [Gamma Poisson mixture model](https://www.hindawi.com/journals/mpe/2020/4728095/) (GPM) [2]. Each of the above models is available within the package in a separate class, namely GSDMM and GPM, respectively. The package is also available on [Pypi](https://pypi.org/project/GPyM-TM/3.0.1/).\n\n## Preamble  \nThe aim of topic modelling is to extract latent topics from large corpora. GSDMM [1] and GPM [2] assume each document belongs to a single topic, which is a suitable assumption for some short texts. Given an initial number of topics, K, this algorithm clusters documents and extracts the topical structures present within the corpus. If K is set to a high value, then the models will also automatically learn the number of clusters.\n\n[1]\t[Yin, J. and Wang, J., 2014, August. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 233-242)](https://dl.acm.org/doi/abs/10.1145/2623330.2623715?casa_token=lSSGu4bHw6wAAAAA:iDc8SAzLNC-zOySLwkDJBe3L17Wht7WiQe5JXVd0sy7_dEBbU10C8y8mhcidwUu_9Dl4kMhEfvE)\n\n[2] [Mazarura, J., de Waal, A. 
and de Villiers, P., 2020. A Gamma-Poisson Mixture Topic Model for Short Text. Mathematical Problems in Engineering, 2020](https://www.hindawi.com/journals/mpe/2020/4728095/)\n\nFurther details about the GPM can be found in my thesis [here](https://repository.up.ac.za/handle/2263/78519).\n\n## Getting Started:\n\nThe package is available [online](https://pypi.org/project/GPyM-TM/) for use within Python 3 environments.\n\nThe package can be installed with a standard `pip` command: \n\n`pip install GPyM-TM`\n\n## Prerequisites:\n\nThe package depends on the following modules (`random`, `math` and `re` are part of the Python standard library): \n\n* numpy\n* random\n* math\n* pandas\n* re\n* nltk\n* gensim\n* scipy\n\n# GSDMM\n\n## Function and class description:\n\nThe class is named **GSDMM**, while the function itself is named **DMM**.\n\nThe function takes up to 6 arguments: 2 are required and the remaining 4 are optional. \n\n### The required arguments are: \n\n* **corpus** - the text corpus, cleaned and loaded into Python. That is, all text should be lowercase, and all punctuation and numbers should have been removed. \n* **nTopics** - the number of topics.\n\n### The optional arguments are:\n\n* **alpha**, **beta** - these are the distribution-specific parameters. (**The default for both of these parameters is 0.1.**)\n* **nTopWords** - number of top words per topic. (**The default is 10.**)  \n* **iters** - number of Gibbs sampler iterations. (**The default is 15.**)\n\n## Output:\n\nThe function provides several components of output, namely:\n* **psi** - topic x word matrix.\n* **theta** - document x topic matrix.\n* **topics** - the top words per topic. 
\n* **assignments** - the topic numbers of selected topics only, as well as the final topic assignments.\n* **Final k** - the final number of selected topics.\n* **coherence** - the coherence score, which is a performance measure.\n* **selected_theta**\n* **selected_psi**\n\n# GPM\n\n## Function and class description:\n\nBoth the class and the function itself are named **GPM**.\n\nThe function takes up to 8 arguments: 2 are required and the remaining 6 are optional. \n\n### The required arguments are: \n\n* **corpus** - the text corpus, cleaned and loaded into Python. That is, all text should be lowercase, and all punctuation and numbers should have been removed. \n* **nTopics** - the number of topics.\n\n### The optional arguments are:\n\n* **alpha**, **beta** and **gam** - these are the distribution-specific parameters. (**The defaults for these parameters are alpha = 0.001, beta = 0.001 and gam = 0.1, respectively.**)\n* **nTopWords** - number of top words per topic. (**The default is 10.**)  \n* **iters** - number of Gibbs sampler iterations. (**The default is 15.**)\n* **N** - a parameter used to normalize the document lengths, which is required for the Poisson model.\n\n## Output:\n\nThe function provides several components of output, namely:\n* **psi** - topic x word matrix.\n* **theta** - document x topic matrix.\n* **topics** - the top words per topic. 
\n* **assignments** - the topic numbers of selected topics only, as well as the final topic assignments.\n* **Final k** - the final number of selected topics.\n* **coherence** - the coherence score, which is a performance measure.\n* **selected_theta**\n* **selected_psi**\n\n# Example Usage:\n\nA more comprehensive [tutorial](https://github.com/CAIR-ZA/GPyM_TM/blob/master/Tutorial.ipynb) is also available.\n\n### Installation:\n\nRun the following command in a terminal:\n\n`pip install GPyM-TM`\n\n### Implementation:\n\nImport the classes into the relevant Python script with the following: \n\n`from GPyM_TM import GSDMM`\n\n`from GPyM_TM import GPM`\n\n\u003e Call the class:\n\n#### Possible examples of calling the GSDMM function are as follows:\n\n`data_DMM = GSDMM.DMM(corpus, nTopics)`\n\n`data_DMM = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters = 5)`\n\n#### Possible examples of calling the GPM function are as follows:\n\n`data_GPM = GPM.GPM(corpus, nTopics)`\n\n`data_GPM = GPM.GPM(corpus, nTopics, alpha = 0.002, beta = 0.03, gam = 0.06, nTopWords = 12, iters = 7, N = 8)`\n\n### Results:\n\nThe output obtained for the Dirichlet multinomial mixture model appears as follows: \n\n![Post](/Images/Post.png)\n\nThe output obtained for the Gamma Poisson mixture model appears as follows:\n\n![poisson](/Images/poisson.png)\n\n## Built With:\n\n[Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) - Hosted notebook environment\n\n[Python](https://www.python.org/) - Programming language of choice\n\n[PyPI](https://pypi.org/) - Distribution\n\n## Authors:\n\n[Jocelyn Mazarura](https://github.com/jrmazarura/GPM)\n\n\n## Co-Authors:\n\nI would like to extend a special thank you to my colleagues [Alta de Waal](https://github.com/altadewaal) and [Ricardo Marques](https://github.com/RicSalgado). 
None of this would have been possible without either of you.\n\nThank you!\n\n## License:\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n\n## Acknowledgments:\n\nUniversity of Pretoria \n![Tuks Logo](/Images/UPlogohighres.jpg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrmazarura%2FGPM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjrmazarura%2FGPM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrmazarura%2FGPM/lists"}