{"id":25819020,"url":"https://github.com/pngo1997/document-clustering-using-k-means","last_synced_at":"2026-05-18T19:35:06.839Z","repository":{"id":275033776,"uuid":"924860756","full_name":"pngo1997/Document-Clustering-using-K-Means","owner":"pngo1997","description":"Performs unsupervised clustering on text documents.","archived":false,"fork":false,"pushed_at":"2025-01-31T18:28:43.000Z","size":1428,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-31T19:27:31.250Z","etag":null,"topics":["clustering","kmeans-clustering","python","sparse-matrix","text-clustering","unsupervised-learning","wordcloud","wordcloud-visualization"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pngo1997.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-30T19:23:52.000Z","updated_at":"2025-01-31T18:28:46.000Z","dependencies_parsed_at":"2025-01-31T19:37:37.552Z","dependency_job_id":null,"html_url":"https://github.com/pngo1997/Document-Clustering-using-K-Means","commit_stats":null,"previous_names":["pngo1997/text-clustering-using-k-means","pngo1997/document-clustering-using-k-means"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FDocument-Clustering-using-K-Means","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FDocument-Clustering-using-K-Means/tags","releases_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FDocument-Clustering-using-K-Means/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pngo1997%2FDocument-Clustering-using-K-Means/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pngo1997","download_url":"https://codeload.github.com/pngo1997/Document-Clustering-using-K-Means/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241122317,"owners_count":19913455,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","kmeans-clustering","python","sparse-matrix","text-clustering","unsupervised-learning","wordcloud","wordcloud-visualization"],"created_at":"2025-02-28T08:14:24.383Z","updated_at":"2026-05-18T19:35:01.821Z","avatar_url":"https://github.com/pngo1997.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🏗️ Document Clustering using K-Means\n\n## 📜 Overview  \nThis project performs **unsupervised clustering** on a subset of the **20 Newsgroups dataset**. The subset contains **2,500 documents**, each belonging to one of five categories:  \n- `0` - Windows  \n- `1` - Cryptography  \n- `2` - Christianity  \n- `3` - Hockey  \n- `4` - For Sale  \n\nThe documents are represented as a **sparse term-document matrix** over **9,328 unique terms** (stems). The goal is to apply **K-Means clustering**, analyze cluster characteristics, and compare the clustering results to the original categories.  
\n\n## 🎯 Problem Explanation  \nThe project includes the following key tasks:  \n1. **Implement a custom Cosine distance function** (1 - Cosine similarity) for K-Means.  \n2. **Preprocess the dataset** (transpose the term-document matrix, random train-test split, and TF-IDF transformation).  \n3. **Perform K-Means clustering** on the transformed dataset using different values of K (from **4 to 8**) and analyze cluster properties.  \n4. **Evaluate cluster quality** using **Completeness and Homogeneity scores**.  \n5. **Classify new documents** from the test set based on **Cosine similarity to cluster centroids**.  \n6. **Generate Word Clouds** for each cluster.  \n\n## 🛠️ Implementation Details  \n### 1. Custom Cosine Similarity Distance Function\n- K-Means normally uses **Euclidean distance**, but for sparse, high-dimensional text data, **Cosine similarity** is often a better fit because it compares the direction of the term-weight vectors rather than their length.\n- A custom distance function is implemented that computes `Cosine Distance = 1 - Cosine Similarity`.\n### 2. Data Preprocessing\n- Transpose the term-document matrix (documents as rows, terms as columns).\n- Randomly split the dataset into 80% training data (used for clustering) and 20% test data (used for classification).\n- Apply TF-IDF transformation to the data.\n### 3. K-Means Clustering\n- Run K-Means clustering on the training data for different values of K (4 to 8).\n- Experiment with multiple random initializations to identify the best clustering structure.\n- Extract the top N terms per cluster based on:\n  - **Cluster Document Frequency (DF):** percentage of documents in the cluster that contain the term.\n  - **Centroid TF-IDF Weight:** mean TF-IDF weight of the term in the cluster.\n- The final results summarize which terms define each cluster.\n### 4. Clustering Evaluation\n- Compute Completeness and Homogeneity scores to measure clustering quality. Homogeneity is high when each cluster contains documents from a single category; Completeness is high when all documents of a category fall in the same cluster.\n- Both scores range from 0 to 1, and higher scores indicate better alignment between clusters and true labels.\n- Experiment with different values of K to find the optimal number of clusters.\n### 5. 
Classifying Test Data Using Cosine Similarity\n- Assign each document in the test set to the cluster whose centroid has the highest Cosine similarity to it.\n- Output the predicted cluster label and the similarity score for each document.\n### 6. Word Cloud Generation\n- Create word clouds for each cluster to visually represent important terms.\n\n## 🚀 Technologies Used\n- **Python** (for text processing and clustering).\n- **NumPy \u0026 Pandas** (for data manipulation).\n- **Scikit-learn** (for TF-IDF transformation and clustering evaluation).\n- **Matplotlib \u0026 Seaborn** (for visualizing cluster statistics).\n- **WordCloud** (for generating cluster word clouds).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpngo1997%2Fdocument-clustering-using-k-means","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpngo1997%2Fdocument-clustering-using-k-means","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpngo1997%2Fdocument-clustering-using-k-means/lists"}