https://github.com/huggingface/text-clustering

Easily embed, cluster and semantically label text datasets

Host: GitHub
URL: https://github.com/huggingface/text-clustering
Owner: huggingface
License: apache-2.0
Created: 2024-01-12T18:03:00.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-03-28T16:30:36.000Z (11 months ago)
Last Synced: 2024-08-01T16:30:38.193Z (7 months ago)
Language: Python
Homepage:
Size: 771 KB
Stars: 404
Watchers: 33
Forks: 27
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE

# Text Clustering

The Text Clustering repository contains tools to easily embed and cluster texts as well as label clusters semantically. This repository is a work in progress and serves as a minimal codebase that can be modified and adapted to other use cases.



Clustering of texts in the Cosmopedia dataset.

## How it works

The pipeline consists of several distinct blocks, each of which can be customized. The whole pipeline runs in a few minutes on a consumer laptop. Each block relies on existing standard methods and works robustly.



Text clustering pipeline.
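
To make the block structure concrete, here is a minimal sketch of an embed, cluster and label pipeline in plain Python. It is not the repository's implementation: the word-count embedding, the greedy threshold clustering and the most-frequent-word labels are toy stand-ins chosen so the example runs without dependencies, and all function names (`embed`, `cluster`, `label_clusters`, `run_pipeline`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: L2-normalised word-count vector stored as a dict."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {word: v / norm for word, v in counts.items()}


def cosine(a, b):
    """Cosine similarity of two normalised sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def cluster(vectors, threshold=0.3):
    """Toy clustering: join the first cluster whose seed is similar enough."""
    seeds, labels = [], []
    for vec in vectors:
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing seed is close enough, so start a new cluster.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def label_clusters(texts, labels, n_words=2):
    """Toy labelling: most frequent words (longer than 3 letters) per cluster."""
    words = {}
    for text, cluster_id in zip(texts, labels):
        tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3]
        words.setdefault(cluster_id, Counter()).update(tokens)
    return {cid: [w for w, _ in c.most_common(n_words)] for cid, c in words.items()}


def run_pipeline(texts):
    """Chain the three blocks; each one can be swapped independently."""
    vectors = [embed(t) for t in texts]
    labels = cluster(vectors)
    return labels, label_clusters(texts, labels)


texts = [
    "matrix algebra and matrix determinants",
    "solving matrix equations with algebra",
    "baking bread with yeast and flour",
    "bread recipes using flour and butter",
]
labels, summaries = run_pipeline(texts)
print(labels)
print(summaries)
```

Because each stage is a separate function with a simple input and output, any one of them can be replaced (for example by a sentence-embedding model or a density-based clusterer) without touching the others.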

## Install 

Install the following libraries to get started:

```bash
pip install scikit-learn umap-learn sentence_transformers faiss-cpu plotly matplotlib datasets
```

Clone this repository and navigate to the folder:

```bash
git clone https://github.com/huggingface/text-clustering.git
cd text-clustering
```

## Usage

Run the pipeline and visualize the results:

```python
from datasets import load_dataset

from src.text_clustering import ClusterClassifier

SAMPLE = 100_000

# load the first SAMPLE texts of the dataset
texts = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train").select(range(SAMPLE))["text"]

# "mps" targets Apple Silicon GPUs; pick the device available on your machine
cc = ClusterClassifier(embed_device="mps")

# run the pipeline
embs, labels, summaries = cc.fit(texts)

# show the results
cc.show()

# save the classifier state
cc.save("./cc_100k")
```
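
Once `fit` has returned, a common next step is to group the input texts by cluster. The sketch below assumes that `labels` is a sequence of integer cluster ids aligned one-to-one with `texts`; this is an assumption about the return value, not something the README states, and `group_by_cluster` is a hypothetical helper. Stand-in data is used so the snippet runs on its own.

```python
from collections import defaultdict


def group_by_cluster(texts, labels):
    """Map each cluster id to the list of texts assigned to it."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[int(label)].append(text)
    return dict(groups)


# Stand-in data; in practice pass `texts` and the `labels` returned by cc.fit().
# Treating -1 as a "noise / unassigned" id is an assumption borrowed from
# density-based clusterers and may not match the repository's behaviour.
texts = ["a", "b", "c", "d"]
labels = [0, 1, 0, -1]

groups = group_by_cluster(texts, labels)
for cluster_id, members in sorted(groups.items()):
    print(cluster_id, len(members))
```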

Load a saved classifier and run inference:

```python
from src.text_clustering import ClusterClassifier

cc = ClusterClassifier(embed_device="mps")

# load the saved state
cc.load("./cc_100k")

# visualize
cc.show()

# placeholder input: replace with your own list of texts
some_texts = ["An introduction to linear algebra."]

# classify new texts with k-nearest neighbour search
cluster_labels, embeddings = cc.infer(some_texts, top_k=1)
```
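
For intuition, k-nearest neighbour classification of a new embedding can be sketched as follows. This is a brute-force illustration in plain Python, not the code behind `cc.infer`; the install list includes `faiss-cpu`, which suggests the repository uses an index for the search, but that is an assumption. The helper names and the toy 2-D embeddings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_label(query, embeddings, labels, top_k=1):
    """Return the majority label among the top_k most similar stored embeddings."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:top_k])
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings with their cluster labels.
stored_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
stored_labels = [0, 0, 1, 1]

label_a = knn_label([0.8, 0.2], stored_embeddings, stored_labels, top_k=1)
label_b = knn_label([0.2, 0.7], stored_embeddings, stored_labels, top_k=3)
print(label_a, label_b)
```

With `top_k=1` the new text simply inherits the label of its single closest neighbour; larger values trade some sensitivity for robustness through majority voting.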

To reproduce the color scheme in the plot above, add the following code before you run `cc.show()`:

```python
from cycler import cycler
import matplotlib.pyplot as plt

# hex colors used for the cluster plot, applied in this order
default_cycler = cycler(color=[
    "#0F0A0A",
    "#FF6600",
    "#FFBE00",
    "#496767",
    "#87A19E",
    "#FF9200",
    "#0F3538",
    "#F8E08E",
    "#0F2021",
    "#FAFAF0",
])

plt.rc('axes', prop_cycle=default_cycler)
```

To customize the plotting further, the easiest way is to modify or override the `_show_mpl` and `_show_plotly` methods.

You can also run the pipeline using a script with:

```bash
# run a new pipeline
python run_pipeline.py --mode run --save_load_path './cc_100k' --n_samples 100000 --build_hf_ds

# load existing pipeline
python run_pipeline.py --mode load --save_load_path './cc_100k' --build_hf_ds

# inference mode on new texts from an input dataset
# (pass your own values to --n_samples and --input_dataset)
python run_pipeline.py --mode infer --save_load_path './cc_100k' --n_samples --input_dataset
```

The `--build_hf_ds` flag builds and pushes HF datasets for the files and clusters, which can be used directly in the FW visualization space. In `infer` mode, the clusters dataset is pushed by default.

You can also change how the clusters are labeled with the `--topic_mode` flag: multiple topics (the default) or a single topic with an educational score.

## Examples

Check the `examples` folder for an example of clustering and topic labeling applied to the [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText/) dataset, utilizing [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)'s web labeling approach.