https://github.com/huggingface/text-clustering
Easily embed, cluster and semantically label text datasets
https://github.com/huggingface/text-clustering
Last synced: 8 months ago
JSON representation
Easily embed, cluster and semantically label text datasets
- Host: GitHub
- URL: https://github.com/huggingface/text-clustering
- Owner: huggingface
- License: apache-2.0
- Created: 2024-01-12T18:03:00.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-28T16:30:36.000Z (about 2 years ago)
- Last Synced: 2025-09-30T18:02:31.678Z (8 months ago)
- Language: Python
- Homepage:
- Size: 771 KB
- Stars: 577
- Watchers: 35
- Forks: 44
- Open Issues: 8
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Text Clustering
The Text Clustering repository contains tools to easily embed and cluster texts as well as label clusters semantically. This repository is a work in progress and serves as a minimal codebase that can be modified and adapted to other use cases.
Clustering of texts in the Cosmopedia dataset.
## How it works
The pipeline consists of several distinct blocks that can be customized and the whole pipeline can run in a few minutes on a consumer laptop. Each block uses existing standard methods and works quite robustly.
Text clustering pipeline.
## Install
Install the following libraries to get started:
```bash
pip install scikit-learn umap-learn sentence_transformers faiss-cpu plotly matplotlib datasets
```
Clone this repository and navigate to the folder:
```bash
git clone https://github.com/huggingface/text-clustering.git
cd text-clustering
```
## Usage
Run pipeline and visualize results:
```python
from src.text_clustering import ClusterClassifier
from datasets import load_dataset
SAMPLE = 100_000
texts = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train").select(range(SAMPLE))["text"]
cc = ClusterClassifier(embed_device="mps")
# run the pipeline:
embs, labels, summaries = cc.fit(texts)
# show the results
cc.show()
# save
cc.save("./cc_100k")
```
Load classifier and run inference:
```python
from src.text_clustering import ClusterClassifier
cc = ClusterClassifier(embed_device="mps")
# load state
cc.load("./cc_100k")
# visualize
cc.show()
# classify new texts with k-nearest neighbour search
cluster_labels, embeddings = cc.infer(some_texts, top_k=1)
```
If you want to reproduce the color scheme in the plot above you can add the following code before you run `cc.show()`:
```python
from cycler import cycler
import matplotlib.pyplot as plt
default_cycler = (cycler(color=[
"0F0A0A",
"FF6600",
"FFBE00",
"496767",
"87A19E",
"FF9200",
"0F3538",
"F8E08E",
"0F2021",
"FAFAF0"])
)
plt.rc('axes', prop_cycle=default_cycler)
```
If you would like to customize the plotting further the easiest way is to customize or overwrite the `_show_mpl` and `_show_plotly` methods.
You can also run the pipeline using a script with:
```bash
# run a new pipeline
python run_pipeline.py --mode run --save_load_path './cc_100k' --n_samples 100000 --build_hf_ds
# load existing pipeline
python run_pipeline.py --mode load --save_load_path './cc_100k' --build_hf_ds
# inference mode on new texts from an input dataset
python run_pipeline.py --mode infer --save_load_path './cc_100k' --n_samples --input_dataset
```
The `build_hf_ds` flag builds and pushes HF datasets, for the files and clusters, that can be directly used in the FW visualization space. In `infer` mode, we push the clusters dataset by default.
You can also change how the clusters are labeled (multiple topics (default) vs single topic with an educational score) using the flag `--topic_mode`.
## Examples
Check the `examples` folder for an example of clustering and topic labeling applied to the [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText/) dataset, utilizing [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)'s web labeling approach.