https://github.com/centre-for-humanities-computing/tweetopic
Blazing fast topic modelling for short texts.
- Host: GitHub
- URL: https://github.com/centre-for-humanities-computing/tweetopic
- Owner: centre-for-humanities-computing
- License: mit
- Created: 2022-08-30T08:47:33.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2025-10-06T17:58:59.000Z (8 months ago)
- Last Synced: 2025-10-10T15:16:37.004Z (8 months ago)
- Topics: dirichlet-process-mixtures, dmm, gibbs-sampling, gsdmm, machine-learning, mcmc, nlp, python, scikit-learn, topic-modeling, tweet, tweet-analysis, visualization
- Language: Python
- Homepage: https://centre-for-humanities-computing.github.io/tweetopic/
- Size: 2.2 MB
- Stars: 33
- Watchers: 0
- Forks: 4
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: citation.cff

# tweetopic
:zap: Blazing Fast topic modelling over short texts in Python
## Features
- Fast :zap:
- Scalable :collision:
- High consistency and coherence :dart:
- High quality topics :fire:
- Easy visualization and inspection :eyes:
- Full scikit-learn compatibility :nut_and_bolt:
#### New in version 0.4.0 ✨
You can now pass `random_state` to topic models to make your results reproducible.
```python
from tweetopic import DMM
model = DMM(10, random_state=42)
```
## 🛠 Installation
Install from PyPI:
```bash
pip install tweetopic
```
## 👩‍💻 Usage ([documentation](https://centre-for-humanities-computing.github.io/tweetopic/))
Set up a topic model pipeline for a corpus of short texts:
```python
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
```
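The `min_df=15, max_df=0.1` settings prune the vocabulary by document frequency: in scikit-learn, an integer bound is an absolute number of documents and a float bound is a proportion of the corpus. The pure-Python sketch below illustrates that rule on a toy corpus. `filter_vocabulary` is a hypothetical helper written for this illustration, not part of scikit-learn or tweetopic, and it assumes naive whitespace tokenization instead of `CountVectorizer`'s tokenizer.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within the bounds.

    Hypothetical helper: an int bound is an absolute document count,
    a float bound is a proportion of the corpus.
    """
    n_docs = len(docs)
    # Count in how many documents each term occurs (not total occurrences).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


toy_docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]
# "the" occurs in 4 of 4 documents (above max_df=0.75), while "dog" and
# "bird" occur in only one document (below min_df=2), so they are dropped.
kept_terms = filter_vocabulary(toy_docs, min_df=2, max_df=0.75)
print(kept_terms)  # ['cat', 'flew', 'sat']
```

With `min_df=15`, a term has to occur in at least 15 documents, so on a very small corpus the vocabulary can end up empty. Lowering the threshold is a reasonable choice when experimenting with a handful of texts.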
Fit the pipeline on an iterable of short texts (`texts` below stands for your own corpus, for example a list of strings):
```python
pipeline.fit(texts)
```
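For intuition about what fitting a DMM involves, here is a minimal pure-Python sketch of the collapsed Gibbs sampler described by Yin & Wang (2014). It is not tweetopic's implementation. The function name `gsdmm_sketch` and its arguments are made up for illustration, and documents are assumed to be pre-tokenized lists of words.

```python
import random
from collections import Counter


def gsdmm_sketch(docs, n_components=4, n_iterations=20, alpha=0.1, beta=0.1, seed=0):
    """Cluster tokenized documents with a collapsed Gibbs sampler for DMM.

    Illustrative only: every document belongs to exactly one cluster.
    Returns one cluster label per document.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    n_docs = len(docs)

    # Sufficient statistics kept for each cluster.
    docs_in_cluster = [0] * n_components   # number of documents
    words_in_cluster = [0] * n_components  # number of word tokens
    word_counts = [Counter() for _ in range(n_components)]  # per-word counts

    def move(doc, cluster, sign):
        """Add (sign=+1) or remove (sign=-1) a document from a cluster."""
        docs_in_cluster[cluster] += sign
        words_in_cluster[cluster] += sign * len(doc)
        for word in doc:
            word_counts[cluster][word] += sign

    # Random initial assignment.
    labels = [rng.randrange(n_components) for _ in docs]
    for doc, label in zip(docs, labels):
        move(doc, label, +1)

    for _ in range(n_iterations):
        for i, doc in enumerate(docs):
            # Take the document out of its cluster before resampling it.
            move(doc, labels[i], -1)
            weights = []
            for z in range(n_components):
                # Prior term: clusters holding more documents are more likely.
                weight = (docs_in_cluster[z] + alpha) / (n_docs - 1 + n_components * alpha)
                # Likelihood term: one factor per token of the document.
                seen = Counter()
                for pos, word in enumerate(doc):
                    weight *= word_counts[z][word] + beta + seen[word]
                    weight /= words_in_cluster[z] + vocab_size * beta + pos
                    seen[word] += 1
                weights.append(weight)
            # Sample the new cluster proportionally to the unnormalized weights.
            labels[i] = rng.choices(range(n_components), weights=weights)[0]
            move(doc, labels[i], +1)
    return labels


toy_corpus = [
    ["cat", "purr", "cat"],
    ["cat", "meow"],
    ["stock", "market", "crash"],
    ["market", "stock", "trade"],
]
labels = gsdmm_sketch(toy_corpus, n_components=3, n_iterations=30, seed=42)
print(labels)  # one cluster index per document
```

Because a cluster becomes more attractive as it gathers documents, some of the `n_components` clusters can end up empty in this sketch, so the number of components behaves more like an upper bound than an exact topic count.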
To investigate the internal structure of topics and their relations to words and individual documents, we recommend using [topicwizard](https://github.com/x-tabdeveloping/topic-wizard).
Install it from PyPI:
```bash
pip install topic-wizard
```
Then visualize your topic model:
```python
import topicwizard
topicwizard.visualize(pipeline=pipeline, corpus=texts)
```

## 🎓 References
- Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In _Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining_ (pp. 233–242). Association for Computing Machinery.