https://github.com/terencicp/mastodon-topics

Topic modeling of Mastodon posts using BERTopic and LLM summarization. Results can be viewed in a Streamlit app.
https://github.com/terencicp/mastodon-topics

bertopic mastodon streamlit summarization topic-modeling

Last synced: 2 months ago
JSON representation

Topic modeling of Mastodon posts using BERTopic and LLM summarization. Results can be viewed in a Streamlit app.

Host: GitHub
URL: https://github.com/terencicp/mastodon-topics
Owner: terencicp
License: gpl-3.0
Created: 2025-12-01T11:24:05.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-12-08T05:35:13.000Z (7 months ago)
Last Synced: 2025-12-09T22:56:09.595Z (7 months ago)
Topics: bertopic, mastodon, streamlit, summarization, topic-modeling
Language: Python
Homepage: https://mastodon-topics.streamlit.app/
Size: 321 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Mastodon topic modeling

This project implements an end-to-end pipeline for discovering and summarizing popular topics on the Mastodon social network. It combines BERTopic for topic modeling with large language models to generate interpretable summaries, providing daily insights into what the Mastodon community is discussing.

View the results at [mastodon-topics.streamlit.app](https://mastodon-topics.streamlit.app/)

View the development notebooks at [github.com/terencicp/social-network-topic-modeling](https://github.com/terencicp/social-network-topic-modeling)

## Architecture

The pipeline consists of three main components:
- **pipeline/collect.py**: Runs continuously to collect Mastodon posts
- **pipeline/run_pipeline.py**: Runs once daily to generate topic summaries
- **app/streamlit.py**: Used to visualize the results of run_pipeline.py

![Pipeline diagram](./pipeline-diagram.svg)

The main files are:

**collect.py** - Data collection
- Fetches public posts from mastodon.social API
- Anonymizes data and stores it in MongoDB

**preprocess.py** - Data preprocessing
- Filters english-language posts
- Aggregates textual content from multiple fields

**model.py** - Topic modeling
- Generates Gemma-300m embeddings
- Trains BERTopic models with optimized parameters

**summarize.py** - LLM summarization
- Selects top topics by popularity
- Generates structured summaries using Qwen3 8b

**streamlit.py** - Streamlit app
- Displays topic summaries

## Requirements

### System dependencies
- Python 3.11
- MongoDB
- Ollama

### Python dependencies
- Pipeline: pipeline/requirements.txt
- Streamlit: app/requirements.txt

### Environment variables
- MASTODON_HASHING_KEY for data anonymization
- HUGGINGFACE_TOKEN from your HuggingFace account for Gemma embeddings

## Acknowledgments

This project builds upon the work of several amazing open-source projects and research teams:

- **[Mastodon](https://joinmastodon.org/)**: decentralized social network
- **[Gemma embeddings](https://ai.google.dev/gemma)**: embedding models from Google DeepMind
- **[BERTopic](https://maartengr.github.io/BERTopic/)**: topic modeling framework by Maarten Grootendorst
- **[Qwen3](https://qwen.ai/)**: LLM from Alibaba Cloud

Please make a donation to Mastodon if you use its API to offset the server costs.

## Contact

LinkedIn: [Terenci Claramunt](https://www.linkedin.com/in/terenci/)

This project was developed as part of my Applied Data Science Degree thesis.

Feel free to reach out for questions about the methodology or implementation.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/terencicp/mastodon-topics

Awesome Lists containing this project

README