[![GitHub license](https://img.shields.io/github/license/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/blob/main/LICENCE)
[![GitHub forks](https://img.shields.io/github/forks/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/network)
[![GitHub stars](https://img.shields.io/github/stars/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/stargazers)
[![GitHub issues](https://img.shields.io/github/issues/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/issues)
![GitHub repo size](https://img.shields.io/github/repo-size/CascadingRadium/CUDA-Hungarian-Clustering)
![GitHub last commit](https://img.shields.io/github/last-commit/CascadingRadium/CUDA-Hungarian-Clustering)

# CUDA-Hungarian-Clustering
A GPU-Accelerated Clustering Algorithm that uses the Hungarian method

Written in CUDA and C++

Introduction:
- An (almost) parameterless clustering algorithm
- Input is a single CSV file; the output is a file named 'output.csv' containing the full original data plus an extra 'label' column that specifies which cluster/group each row belongs to
- Does not need any prior knowledge of the number of clusters/groups present in the dataset
- Produces results similar to those obtained from spectral clustering, but without requiring the number of clusters as a parameter
- Combines the work of two research papers:
  - A hierarchical clustering algorithm based on the Hungarian method, Pattern Recognition Letters (2008) (https://doi.org/10.1016/j.patrec.2008.04.003)
  - GPU-accelerated Hungarian algorithms for the Linear Assignment Problem, Parallel Computing (2016) (https://doi.org/10.1016/j.parco.2016.05.012)
- Mainly used to find the number of groups in a dataset, with each group being a set of 'similar' rows, much like DBSCAN

Execution instructions:

```
nvcc Clustering.cu

./a.out [INPUT_FILE] [PARAMETER] [Number of Columns from right to skip/ignore] [Number of Rows from top to skip/ignore]

python3 plot_output.py

```

INPUT FILE - Any file (.xlsx, .csv) that can be opened in spreadsheet software such as LibreOffice Calc or MS Excel.

PARAMETER - Integer value, in the range [0,8] for most inputs (must be tuned manually); 7 works for most datasets, independent of the real number of clusters in the dataset.

The other two command-line arguments filter out the label column(s) and the column-header row(s), respectively, before the raw data is passed to the model.

Constraints:
- The input file must contain only numeric columns (float/integer)
- The input file must not contain any NaN or null values; the dataset must be cleaned beforehand
- Parameter tuning is only possible if a rough estimate of the number of values the label can take is known; otherwise, pure unsupervised clustering without any tuning can be done by simply setting the parameter to 7
- Sensitive to noise
- The parameter, being fully independent of the dataset, cannot be estimated and is mostly tuned by trial and error, but it almost always takes a value in the range [0,10]

Working Example:

```
nvcc Clustering.cu
./a.out data_banknote_authentication.csv 10 1 1
```
This uses parameter value 10 to cluster the input .csv file into some number of groups and outputs a file named 'output.csv' with an additional 'label' column giving the group ID of each row, i.e. the group it belongs to.

Sample output images, generated from the datasets in the TestedDataset directory (plots data0 through data20; see the repository README for the images).