https://github.com/jfdev001/k-means

(Naive) K-Means implementation for continuous N-dimensional data from scratch.

Host: GitHub
URL: https://github.com/jfdev001/k-means
Owner: jfdev001
Created: 2021-11-27T18:46:46.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2021-11-29T20:38:14.000Z (over 3 years ago)
Last Synced: 2025-02-08T13:45:34.123Z (5 months ago)
Language: Python
Size: 2.12 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# K-Means Unsupervised Clustering

The [K-Means Algorithm](https://en.wikipedia.org/wiki/K-means_clustering) is used to generate _K_ clusters whose centroids capture information about high-dimensional data with (or without) labels. This repository uses labeled data to assess the performance of the algorithm, which essentially frames the problem as a semi-supervised task.
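For orientation, the core loop of naive k-means can be sketched in a few lines of NumPy. This is a minimal illustration and not the code in `kmeans.py`: the function name `naive_kmeans`, its parameters, and the empty-cluster handling are assumptions made for this example. It initializes centroids by sampling rows of the data without replacement, then alternates between assigning samples to their nearest centroid and recomputing each centroid as the mean of its members.

```python
import numpy as np


def naive_kmeans(X, k, seed=0, max_iters=100):
    """Hypothetical sketch of Lloyd-style k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct rows sampled without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def assign(cents):
        # Euclidean distance from every sample to every centroid: shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iters):
        labels = assign(centroids)
        # Recompute each centroid as the mean of its members.
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the centroids that are returned.
    return centroids, assign(centroids)
```

Calling `naive_kmeans(X, k=3, seed=0)` on a float array `X` returns the centroid array and one integer cluster label per row.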

# Installation

This repository makes use of the standard data analysis/scientific computing libraries:

`pip install numpy pandas matplotlib`

# Testing

To train and subsequently test the k-means algorithm, use the command-line interface for `kmeans.py`:

`python kmeans.py 0 3 data/iris-data.txt data/iris-data.txt`

For information about the arguments to any `.py` script, run:

`python name_of_script.py -h`

# Analysis

A summary of results and analysis is in `report/report.pdf`; however, the commands to reproduce the figures are available below.

If running on a Unix system, strip any Windows carriage returns from the scripts with `sed -i -e 's/\r$//' parallelize.bash` and `sed -i -e 's/\r$//' split.bash`. Both `.bash` files must also be executable, which can be done with `chmod +x parallelize.bash` and `chmod +x split.bash`.

For analysis of the iris dataset, use the following bash command:

```
for num_cluster in {1..140}; do for ((seed=0; seed<100; seed++)); do echo "cat data/iris-data.txt | ./split.bash 10 python kmeans.py $seed $num_cluster --percentage True --precision 3"; done | ./parallelize.bash; done >> stats/iris_out.txt
```

For analysis of the cancer dataset, use the following bash command:

```
for num_cluster in {1..95}; do for ((seed=0; seed<100; seed++)); do echo "cat data/cancer-data.txt | ./split.bash 10 python kmeans.py $seed $num_cluster --percentage True --precision 3"; done | ./parallelize.bash; done >> stats/cancer_out.txt
```

# Future Work

The k-means algorithm spends significant time computing Euclidean distances, so variants that use caching and the triangle inequality could accelerate it. Moreover, different initialization strategies for the centroids could be explored, since the current repo simply initializes them at random by sampling points from the dataset (without replacement for _k>1_).
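As one example of an alternative initialization strategy, the sketch below shows k-means++ style seeding. This technique is not part of the repository; it is named here only as a commonly used option, and the function name `kmeans_plus_plus_init` and its parameters are hypothetical. The first centroid is drawn uniformly from the data, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np


def kmeans_plus_plus_init(X, k, seed=0):
    """Hypothetical k-means++ style seeding for an (n_samples, n_features) array.

    Assumes X contains at least k distinct rows; otherwise the
    probabilities below would divide by zero.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # First centroid: a uniformly random sample.
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared Euclidean distance from each sample to its nearest chosen centroid.
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Samples far from all current centroids are more likely to be picked;
        # already-chosen samples have distance 0 and so probability 0.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.asarray(centroids, dtype=float)
```

The returned array could replace the uniform random sample as the starting centroids of a k-means loop, leaving the assignment and update steps unchanged.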
