https://github.com/alex-j-b/hac-text-clustering

This is the HAC algorithm that I'm using to group newspaper articles by news story. You can adapt it to pretty much any type of text.
clustering-algorithm hac hierarchical-clustering kmeans kmeans-algorithm kmeans-clustering news newspaper silhouette-score text-clustering

Host: GitHub
URL: https://github.com/alex-j-b/hac-text-clustering
Owner: alex-j-b
Created: 2020-07-22T23:41:59.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-07-22T23:57:28.000Z (over 5 years ago)
Last Synced: 2025-09-15T02:53:14.111Z (3 months ago)
Topics: clustering-algorithm, hac, hierarchical-clustering, kmeans, kmeans-algorithm, kmeans-clustering, news, newspaper, silhouette-score, text-clustering
Language: Python
Homepage:
Size: 6.84 KB
Stars: 3
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# HAC-Text-Clustering
This is the HAC algorithm that I'm using to group newspaper articles by news story. You can adapt it to pretty much any type of text.
HAC stands for "Hierarchical Agglomerative Clustering"; it worked out better for me than KMeans.
It uses the silhouette score to find the best number of clusters, k.

Table of Contents
---------------------------
process_text.py

Tokenize and stem the data to get a meaningful list of words for the clusterer.
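
As a rough illustration of this step, the sketch below tokenizes with a regular expression, drops stopwords, and stems with NLTK's PorterStemmer. The function name `preprocess`, the tiny `STOPWORDS` set, and the regex are assumptions made for the example; they are not taken from process_text.py.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "are", "is", "of", "to", "in"}

_stemmer = PorterStemmer()


def preprocess(text):
    """Return the lowercased, stemmed, non-stopword tokens of `text`."""
    # Keep runs of ASCII letters only; digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [_stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS]


print(preprocess("The markets are running"))
```

A hand-written stopword set is used here so the example runs without downloading any NLTK corpora.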

silhouette_best_k.py

Find the best number of clusters, k, based on the highest silhouette score.
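
The idea can be sketched as a loop over candidate values of k that keeps the one with the highest silhouette score. This is a minimal sketch, not the code in silhouette_best_k.py: the function name `best_k`, the range limits, and the use of AgglomerativeClustering inside the loop are assumptions (the repository's script mentions a Birch threshold, so it may cluster differently at this stage).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def best_k(X, k_min=2, k_max=10):
    """Return (k, score) for the k with the highest silhouette score."""
    best, best_score = None, float("-inf")
    # silhouette_score needs between 2 and n_samples - 1 distinct labels.
    upper = min(k_max, X.shape[0] - 1)
    for k in range(k_min, upper + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best, best_score = k, score
    return best, best_score


# Toy data: three tight, well-separated groups of four points each.
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X = np.vstack([square, square + [10, 10], square + [20, 0]])

k, score = best_k(X)
print(k, round(score, 3))
```

On real text data the scores are usually much less clear-cut than on this toy example, so it can help to print the score for every k and inspect the curve.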

main.py

Find the best k with silhouette score.
Apply AgglomerativeClustering with the best k.
Return clusters.
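
The three steps above could look roughly like the sketch below. It is written under assumptions: the function name `cluster_texts`, the default TfidfVectorizer settings, and the toy articles are illustrative and not taken from main.py.

```python
from collections import defaultdict

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_texts(texts, k_max=10):
    """Group `texts` into clusters, choosing k by silhouette score."""
    if len(texts) < 3:
        raise ValueError("need at least 3 texts to compare cluster counts")

    # AgglomerativeClustering expects a dense array, hence .toarray().
    X = TfidfVectorizer().fit_transform(texts).toarray()

    # Step 1: find the labelling with the best silhouette score.
    best_labels, best_score = None, float("-inf")
    for k in range(2, min(k_max, len(texts) - 1) + 1):
        # Step 2: apply AgglomerativeClustering with this k.
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score

    # Step 3: return the clusters as {label: [texts]}.
    clusters = defaultdict(list)
    for text, label in zip(texts, best_labels):
        clusters[int(label)].append(text)
    return dict(clusters)


articles = [
    "stock market falls as investors sell shares",
    "investors buy shares after stock market rally",
    "football team wins league title match",
    "league match ends as football team celebrates title",
    "storm brings heavy rain and flooding to the coast",
    "coast flooding continues after heavy storm rain",
]
for label, members in cluster_texts(articles).items():
    print(label, members)
```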

You might need to adapt a few parameters to your type of dataset. Here are some changes that you can try:

process_text.py
Add more stopwords.
Replace or remove the stemmer.

silhouette_best_k.py
Change the Birch threshold.
Change the silhouette_score metric.

main.py
Change TfidfVectorizer max_df and/or min_df.
Try other AgglomerativeClustering affinity and linkage options (in recent scikit-learn versions the affinity parameter is named metric).
Replace AgglomerativeClustering by KMeans.
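
To make some of these knobs concrete, here is a hedged sketch that sets them explicitly on a toy corpus. All values (max_df=0.8, threshold=0.3, k=2, cosine distance, average linkage) are arbitrary choices for illustration, and how Birch is actually used in silhouette_best_k.py is an assumption.

```python
import inspect

from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

texts = [
    "stock market falls as investors sell shares",
    "investors buy shares after stock market rally",
    "shares rise while stock market investors cheer",
    "football team wins league title match",
    "league match ends as football team celebrates title",
    "football fans cheer team after title match win",
]

# max_df drops terms found in more than 80% of documents;
# min_df=1 keeps every remaining term, however rare.
X = TfidfVectorizer(max_df=0.8, min_df=1).fit_transform(texts).toarray()
k = 2  # fixed here for brevity; normally chosen by silhouette score

# The distance argument is `metric` in newer scikit-learn releases and
# `affinity` in older ones, so pick whichever this installation accepts.
params = inspect.signature(AgglomerativeClustering).parameters
distance_arg = "metric" if "metric" in params else "affinity"
hac_labels = AgglomerativeClustering(
    n_clusters=k, linkage="average", **{distance_arg: "cosine"}
).fit_predict(X)

# Birch: a smaller threshold yields more, tighter subclusters.
birch_labels = Birch(threshold=0.3, n_clusters=k).fit_predict(X)

# KMeans as a drop-in alternative to AgglomerativeClustering.
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Score each labelling with a non-default silhouette metric.
scores = {
    name: silhouette_score(X, labels, metric="cosine")
    for name, labels in [
        ("hac", hac_labels),
        ("birch", birch_labels),
        ("kmeans", kmeans_labels),
    ]
}
print(scores)
```

Comparing the scores side by side is one way to judge whether a parameter change helps on your own data.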

Project Requirements
----------------------------

Python 3
pip install -r requirements.txt
(nltk, scikit_learn)
