Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jrmazarura/GPM
- Host: GitHub
- URL: https://github.com/jrmazarura/GPM
- Owner: jrmazarura
- License: mit
- Created: 2020-10-05T08:52:08.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-07-13T11:49:53.000Z (over 2 years ago)
- Last Synced: 2024-08-08T15:44:13.843Z (4 months ago)
- Language: Python
- Size: 995 KB
- Stars: 13
- Watchers: 1
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-topic-models - GPyM_TM - Python implementation of DMM and Poisson model (Models / Topic Models for short documents)
README
# GPyM_TM
**GPyM_TM** is a Python package for topic modelling, using either the Dirichlet multinomial mixture model (GSDMM) [1] or the [Gamma Poisson mixture model](https://www.hindawi.com/journals/mpe/2020/4728095/) (GPM) [2]. Each model is available within the package as a separate class, namely GSDMM and GPM, respectively. The package is also available on [PyPI](https://pypi.org/project/GPyM-TM/3.0.1/).
## Preamble
The aim of topic modelling is to extract latent topics from large corpora. GSDMM [1] and GPM [2] assume each document belongs to a single topic, which is a suitable assumption for many short texts. Given an initial number of topics, K, these models cluster the documents and extract the topical structures present within the corpus. If K is set to a high value, the models will also automatically learn the number of clusters.

[1] [Yin, J. and Wang, J., 2014, August. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233-242)](https://dl.acm.org/doi/abs/10.1145/2623330.2623715?casa_token=lSSGu4bHw6wAAAAA:iDc8SAzLNC-zOySLwkDJBe3L17Wht7WiQe5JXVd0sy7_dEBbU10C8y8mhcidwUu_9Dl4kMhEfvE)
[2] [Mazarura, J., de Waal, A. and de Villiers, P., 2020. A Gamma-Poisson Mixture Topic Model for Short Text. Mathematical Problems in Engineering, 2020](https://www.hindawi.com/journals/mpe/2020/4728095/)
Further details about the GPM can be found in my thesis [here](https://repository.up.ac.za/handle/2263/78519).
## Getting Started:
The package is available [online](https://pypi.org/project/GPyM-TM/) for use within Python 3 environments.
It can be installed with a standard pip command:
`pip install GPyM-TM`
## Prerequisites:
The package has several dependencies, namely:
* numpy
* random
* math
* pandas
* re
* nltk
* gensim
* scipy

# GSDMM
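Both classes expect the corpus to already be cleaned: lowercase, with punctuation and numbers stripped, as described under the required arguments below. A minimal cleaning sketch using only the standard library; `clean_document` is an illustrative helper, not part of the package:

```python
import re

def clean_document(text):
    """Lowercase a document and drop punctuation and digits,
    leaving alphabetic tokens separated by single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace non-letters with spaces
    return " ".join(text.split())          # collapse repeated whitespace

docs = ["Topic-Modelling, for SHORT texts!", "Clustering many tiny documents."]
corpus = [clean_document(d) for d in docs]
# corpus -> ["topic modelling for short texts", "clustering many tiny documents"]
```

Depending on the corpus, further steps such as stopword removal or lemmatization (e.g. via nltk or gensim, which are already dependencies) may also be worthwhile.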
## Function and class description:
The class is named **GSDMM**, while the function itself is named **DMM**.
The function takes 6 possible arguments, two of which are required and four of which are optional.
### The required arguments are:
* **corpus** - the text corpus, cleaned and loaded into Python. That is, the text should be all lowercase, with punctuation and numbers removed.
* **nTopics** - the number of topics.

### The optional arguments are:
* **alpha**, **beta** - the distribution-specific parameters. (**The default for both parameters is 0.1.**)
* **nTopWords** - the number of top words per topic. (**The default is 10.**)
* **iters** - the number of Gibbs sampler iterations. (**The default is 15.**)

## Output:
The function provides several components of output, namely:
* **psi** - topic x word matrix.
* **theta** - document x topic matrix.
* **topics** - the top words per topic.
* **assignments** - the topic numbers of selected topics only, as well as the final topic assignments.
* **Final k** - the final number of selected topics.
* **coherence** - the coherence score, which is a performance measure.
* **selected_theta**
* **selected_psi**

# GPM
## Function and class description:
The class is named **GPM**, while the function itself is named **GPM**.
The function takes 8 possible arguments, two of which are required and six of which are optional.
### The required arguments are:
* **corpus** - the text corpus, cleaned and loaded into Python. That is, the text should be all lowercase, with punctuation and numbers removed.
* **nTopics** - the number of topics.

### The optional arguments are:
* **alpha**, **beta** and **gam** - the distribution-specific parameters. (**The defaults are alpha = 0.001, beta = 0.001 and gam = 0.1, respectively.**)
* **nTopWords** - the number of top words per topic. (**The default is 10.**)
* **iters** - the number of Gibbs sampler iterations. (**The default is 15.**)
* **N** - a parameter used to normalize the document lengths, which is required for the Poisson model.

## Output:
The function provides several components of output, namely:
* **psi** - topic x word matrix.
* **theta** - document x topic matrix.
* **topics** - the top words per topic.
* **assignments** - the topic numbers of selected topics only, as well as the final topic assignments.
* **Final k** - the final number of selected topics.
* **coherence** - the coherence score, which is a performance measure.
* **selected_theta**
* **selected_psi**

# Example Usage:
A more comprehensive [tutorial](https://github.com/CAIR-ZA/GPyM_TM/blob/master/Tutorial.ipynb) is also available.
### Installation:
Run the following command within a Python command window:
`pip install GPyM-TM`
### Implementation:
Import the package into the relevant Python script with the following:
`from GPyM_TM import GSDMM`
`from GPyM_TM import GPM`

Call the class:
#### Possible examples of calling the GSDMM function are as follows:
`data_DMM = GSDMM.DMM(corpus, nTopics)`
`data_DMM = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters = 5)`
#### Possible examples of calling the GPM function are as follows:
`data_GPM = GPM.GPM(corpus, nTopics)`
`data_GPM = GPM.GPM(corpus, nTopics, alpha = 0.002, beta = 0.03, gam = 0.06, nTopWords = 12, iters = 7, N = 8)`
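Putting the pieces together, a minimal end-to-end sketch. The cleaning step runs with the standard library alone; the fitting lines assume GPyM-TM has been installed and are shown commented out. `raw_docs` and `clean` are illustrative names, not part of the package:

```python
import re

raw_docs = [
    "Topic models extract latent topics from large corpora!",
    "Short texts are assumed to belong to a single topic.",
]

def clean(text):
    # lowercase, then replace anything that is not a letter or a space
    return " ".join(re.sub(r"[^a-z\s]", " ", text.lower()).split())

corpus = [clean(d) for d in raw_docs]
nTopics = 5  # an upper bound; the models can select fewer topics

# Fitting (requires `pip install GPyM-TM`):
# from GPyM_TM import GSDMM, GPM
# data_DMM = GSDMM.DMM(corpus, nTopics)
# data_GPM = GPM.GPM(corpus, nTopics, alpha=0.002, beta=0.03, gam=0.06)
```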
### Results:
The output obtained for the Dirichlet multinomial mixture model appears as follows:
![Post](/Images/Post.png)
The output obtained for the Poisson model appears as follows:
![poisson](/Images/poisson.png)
## Built With:
[Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) - Development environment
[Python](https://www.python.org/) - Programming language of choice
[PyPI](https://pypi.org/) - Distribution
## Authors:
[Jocelyn Mazarura](https://github.com/jrmazarura/GPM)
## Co-Authors:
I would like to extend a special thank you to my colleagues [Alta de Waal](https://github.com/altadewaal) and [Ricardo Marques](https://github.com/RicSalgado). None of this would have been possible without you.
Thank you!
## License:
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments:
University of Pretoria
![Tuks Logo](/Images/UPlogohighres.jpg)