https://github.com/xvxvdee/cps803-finalproject
This repo contains code and data for solving a practical clustering problem using the Daily Kos blog entries dataset. The dataset consists of 3420 documents, a vocabulary of 6906 terms, and 467,714 words. The goal is to cluster the documents into meaningful groups based on their content.
- Host: GitHub
- URL: https://github.com/xvxvdee/cps803-finalproject
- Owner: xvxvdee
- Created: 2023-12-25T23:42:41.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-25T23:48:46.000Z (almost 2 years ago)
- Last Synced: 2025-02-09T21:15:37.204Z (8 months ago)
- Topics: machine-learning, machine-learning-algorithms, nlp-machine-learning, python
- Language: Jupyter Notebook
- Homepage:
- Size: 8.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CPS803-FinalProject
# Solving a Practical Clustering Problem: Exploring the Daily Kos Dataset
This project is a machine learning assignment from Toronto Metropolitan University. The goal is to apply the KMeans algorithm to cluster a bag-of-words dataset from the Daily Kos political blog.
## Data
The data consists of two files: the bag-of-words file in sparse format and the vocabulary file. The repository's sample dataset contains 3420 documents, a vocabulary of 6906 terms, and 467,714 total words. The vocabulary was built by tokenizing each document and eliminating stop words; a token was added to the vocabulary only if it occurred more than ten times across the corpus.
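Below is a minimal sketch of reading this sparse format, assuming the standard UCI Bag of Words layout (three header lines D, W, NNZ followed by `docID wordID count` triples) and the conventional file names `docword.kos.txt` and `vocab.kos.txt`; the repository's actual file names and paths may differ.

```python
from scipy.sparse import csr_matrix

def load_bag_of_words(docword_path="docword.kos.txt", vocab_path="vocab.kos.txt"):
    """Load a UCI-style sparse bag-of-words file into a docs x terms count matrix."""
    with open(vocab_path) as f:
        vocab = [line.strip() for line in f]
    with open(docword_path) as f:
        n_docs = int(f.readline())    # D: number of documents
        n_terms = int(f.readline())   # W: vocabulary size
        _ = int(f.readline())         # NNZ: number of nonzero counts
        rows, cols, counts = [], [], []
        for line in f:
            doc_id, word_id, count = map(int, line.split())
            rows.append(doc_id - 1)   # file indices are 1-based
            cols.append(word_id - 1)
            counts.append(count)
    return csr_matrix((counts, (rows, cols)), shape=(n_docs, n_terms)), vocab
```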
## Methods
The pipeline for preprocessing the data includes:
- Building each post from the bag-of-words file
- Cleaning the content by replacing underscores, eliminating words that contain numbers, and stemming the vocabulary
- Vectorizing the text using the TF-IDF vectorizer
- Reducing the dimensionality using PCA

To cluster the bag of words, the KMeans algorithm was applied. To choose the optimal number of clusters, the Elbow method was used, which calculates the Sum of Squared Errors (SSE) for different values of k.
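Here is a minimal, self-contained sketch of that pipeline. The `posts` variable (the list of post strings rebuilt from the bag-of-words file) and the parameter choices (50 PCA components, k ranging from 2 to 10) are illustrative assumptions, not values taken from the repository; note that scikit-learn's PCA needs a dense array, so the TF-IDF matrix is densified (TruncatedSVD is the usual sparse alternative).

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# `posts` is assumed to be the list of post strings reconstructed from the
# bag-of-words file (see the Data section); parameters are illustrative.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(posts)

# PCA requires dense input; TruncatedSVD is the usual choice for sparse TF-IDF.
pca = PCA(n_components=50, random_state=0)
X_reduced = pca.fit_transform(X.toarray())

# Elbow method: compute the SSE (KMeans inertia) for a range of k values
# and look for the "bend" in the resulting curve.
sse = []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    sse.append(km.inertia_)

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()
```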
## Results
The optimal number of clusters was found to be four, based on analysis of the SSE plot and of the words in each cluster. The clusters were labeled as follows (a sketch for inspecting each cluster's top terms follows the list):
- Cluster 0: General politics and news
- Cluster 1: Iraq war and foreign policy
- Cluster 2: US elections and candidates
- Cluster 3: Bush administration and criticism
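One way to arrive at such labels is to inspect the highest-weighted terms in each cluster. The sketch below reuses `tfidf`, `pca`, and `X_reduced` from the pipeline above and maps the k=4 centroids back into term space via `pca.inverse_transform`; this is an assumed approach for illustration, not necessarily the repository's exact method.

```python
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reduced)
terms = np.array(tfidf.get_feature_names_out())

# Approximate each centroid in the original TF-IDF term space, then print
# the ten highest-weighted terms as a human-readable cluster summary.
centroids = pca.inverse_transform(km.cluster_centers_)
for i, centroid in enumerate(centroids):
    top_terms = terms[np.argsort(centroid)[::-1][:10]]
    print(f"Cluster {i}: {', '.join(top_terms)}")
```

## Conclusions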
The project demonstrated the use of KMeans to cluster a text-based dataset, and highlighted the importance of considering factors beyond the Elbow method when choosing the number of clusters. The clusters revealed meaningful patterns and topics that reflect the character of the Daily Kos blog.