https://github.com/centre-for-humanities-computing/tweetopic

Blazing fast topic modelling for short texts.
dirichlet-process-mixtures dmm gibbs-sampling gsdmm machine-learning mcmc nlp python scikit-learn topic-modeling tweet tweet-analysis visualization

Host: GitHub
URL: https://github.com/centre-for-humanities-computing/tweetopic
Owner: centre-for-humanities-computing
License: mit
Created: 2022-08-30T08:47:33.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-10-07T18:33:32.000Z (4 months ago)
Last Synced: 2024-12-27T04:06:15.154Z (26 days ago)
Topics: dirichlet-process-mixtures, dmm, gibbs-sampling, gsdmm, machine-learning, mcmc, nlp, python, scikit-learn, topic-modeling, tweet, tweet-analysis, visualization
Language: Python
Homepage: https://centre-for-humanities-computing.github.io/tweetopic/
Size: 2.19 MB
Stars: 31
Watchers: 0
Forks: 3
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: citation.cff

# tweetopic

:zap: Blazing Fast topic modelling over short texts in Python




[![PyPI version](https://badge.fury.io/py/tweetopic.svg)](https://pypi.org/project/tweetopic/)

[![pip downloads](https://img.shields.io/pypi/dm/tweetopic.svg)](https://pypi.org/project/tweetopic/)

[![python version](https://img.shields.io/badge/Python-%3E=3.7-blue)](https://github.com/centre-for-humanities-computing/tweetopic)

[![Code style: black](https://img.shields.io/badge/Code%20Style-Black-black)](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html)










## Features

- Fast :zap:

- Scalable :collision:

- High consistency and coherence :dart:

- High quality topics :fire:

- Easy visualization and inspection :eyes:

- Full scikit-learn compatibility :nut_and_bolt:

#### New in version 0.4.0 ✨

You can now pass `random_state` to topic models to make your results reproducible.

```python
from tweetopic import DMM

model = DMM(10, random_state=42)
```

## 🛠 Installation

Install from PyPI:

```bash
pip install tweetopic
```

## 👩‍💻 Usage ([documentation](https://centre-for-humanities-computing.github.io/tweetopic/))

Train a topic model on a corpus of short texts:

```python
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting the document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
```
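For intuition about what the `dmm` step estimates, here is a minimal pure-Python sketch of the collapsed Gibbs sampler for a Dirichlet Multinomial Mixture, following the formulation in Yin & Wang (2014). It is not tweetopic's implementation: the function name `gsdmm_sketch`, its `seed` argument, and the toy documents are hypothetical and exist only for illustration. The parameters `n_components`, `n_iterations`, `alpha`, and `beta` are named to mirror the snippet above, on the assumption that they play the same roles.

```python
import math
import random
from collections import Counter


def gsdmm_sketch(docs, n_components, n_iterations=30, alpha=0.1, beta=0.1, seed=0):
    """Toy collapsed Gibbs sampler for a Dirichlet Multinomial Mixture.

    docs is a list of token lists; returns one cluster label per document.
    Hypothetical helper for illustration, not part of tweetopic.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same labels
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]

    m = [0] * n_components                          # documents per cluster
    n = [0] * n_components                          # tokens per cluster
    nw = [Counter() for _ in range(n_components)]   # word counts per cluster

    def move(d, z, sign):
        """Add (sign=+1) or remove (sign=-1) document d from cluster z."""
        m[z] += sign
        n[z] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            nw[z][w] += sign * c

    # Random initial assignment of every document to a cluster.
    labels = [rng.randrange(n_components) for _ in docs]
    for d, z in enumerate(labels):
        move(d, z, +1)

    for _ in range(n_iterations):
        for d in range(len(docs)):
            move(d, labels[d], -1)  # take the document out of its cluster
            log_p = []
            for z in range(n_components):
                # Cluster popularity term (constant denominator dropped).
                lp = math.log(m[z] + alpha)
                # Word-similarity term, computed in log space for stability.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[z][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[z] + vocab_size * beta + i)
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_components), weights=weights)[0]
            move(d, labels[d], +1)  # put it back under the sampled label
    return labels


# Toy corpus: already tokenized short texts.
toy_docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "vet"],
    ["stock", "market", "bank"],
    ["bank", "loan", "market"],
]
toy_labels = gsdmm_sketch(toy_docs, n_components=3, n_iterations=10, seed=1)
```

Each document receives exactly one cluster label, which is the assumption that makes this family of models a natural fit for short texts. Clusters that lose all their documents stay empty with high probability, so `n_components` acts as an upper bound on the number of topics rather than an exact count.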

You may fit the model with a stream of short texts:

```python
pipeline.fit(texts)
```
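The snippet above leaves `texts` undefined. One way to supply it lazily is a generator, sketched below. The helper `iter_texts` is hypothetical and not part of tweetopic; the sketch assumes the vectorizer reads the iterable in a single pass, which is how scikit-learn's `CountVectorizer` is expected to behave.

```python
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield non-empty, whitespace-stripped texts one at a time.

    Hypothetical helper: `lines` can be any iterable of strings, for
    example an open file handle with one short text per line.
    """
    for line in lines:
        text = line.strip()
        if text:
            yield text


# In-memory stand-in for a file with one short text per line.
raw_lines = ["  first short text \n", "\n", "second short text\n"]
texts = list(iter_texts(raw_lines))
```

A generator is exhausted after one pass, so materialize it with `list(...)`, as done here, whenever the same corpus is needed again later, for instance for visualization.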

To investigate the internal structure of topics and their relations to words and individual documents, we recommend using [topicwizard](https://github.com/x-tabdeveloping/topic-wizard).

Install it from PyPI:

```bash
pip install topic-wizard
```

Then visualize your topic model:

```python
import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)
```

![topicwizard visualization](docs/_static/topicwizard.png)

## 🎓 References

- Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In _Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining_ (pp. 233–242). Association for Computing Machinery.