https://github.com/cascadingradium/cuda-hungarian-clustering
A GPU-Accelerated Clustering Algorithm that uses the Hungarian method
clustering cpp cuda gpu hungarian-algorithm parallel-computing
- Host: GitHub
- URL: https://github.com/cascadingradium/cuda-hungarian-clustering
- Owner: CascadingRadium
- License: mit
- Created: 2022-06-03T17:25:54.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-01-01T05:20:29.000Z (about 2 years ago)
- Last Synced: 2023-03-06T21:55:25.636Z (almost 2 years ago)
- Topics: clustering, cpp, cuda, gpu, hungarian-algorithm, parallel-computing
- Language: Cuda
- Homepage:
- Size: 7.72 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
[![GitHub license](https://img.shields.io/github/license/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/blob/main/LICENCE)
[![GitHub forks](https://img.shields.io/github/forks/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/network)
[![GitHub stars](https://img.shields.io/github/stars/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/stargazers)
[![GitHub issues](https://img.shields.io/github/issues/CascadingRadium/CUDA-Hungarian-Clustering)](https://github.com/CascadingRadium/CUDA-Hungarian-Clustering/issues)
![GitHub repo size](https://img.shields.io/github/repo-size/CascadingRadium/CUDA-Hungarian-Clustering)
![GitHub last commit](https://img.shields.io/github/last-commit/CascadingRadium/CUDA-Hungarian-Clustering)
# CUDA-Hungarian-Clustering
A GPU-Accelerated Clustering Algorithm that uses the Hungarian method.
Written in CUDA and C++.
Introduction:
- Parameterless (Almost) Clustering Algorithm
- Input is a single CSV file and the output will be a file named 'output.csv' which has the full original data + an extra 'label' column that specifies what cluster/group it belongs to.
- Does not need any prior knowledge of the number of clusters/groups present in the dataset
- Results similar to ones obtained from Spectral Clustering (but without the requirement of the number of clusters parameter)
- Combines the work of two research papers:
  - A hierarchical clustering algorithm based on the Hungarian method, Pattern Recognition Letters (2008) (https://doi.org/10.1016/j.patrec.2008.04.003)
  - GPU-accelerated Hungarian algorithms for the Linear Assignment Problem, Parallel Computing (2016) (https://doi.org/10.1016/j.parco.2016.05.012)
- Mainly used to find the number of groups in the dataset, each group being a set of 'similar' rows, much as DBSCAN does

Execution instructions:
```
nvcc Clustering.cu
./a.out [INPUT_FILE] [PARAMETER] [Number of Columns from right to skip/ignore] [Number of Rows from top to skip/ignore]
python3 plot_output.py
```
INPUT FILE - Any file (.xlsx, .csv) that can be opened in spreadsheet software like LibreOffice Calc or MS Excel.
PARAMETER - Integer value, in the range [0,8] for most inputs, that must be tuned manually. 7 works for most datasets, independent of the real number of clusters in the dataset.
The other two command-line arguments filter out the label column and the column header row, respectively, before the raw data is passed on to the model.
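To illustrate what the two skip arguments mean, here is a minimal C++ sketch of a loader that drops rows from the top and columns from the right before keeping the numeric data. It is not taken from Clustering.cu: the function name `loadNumericCsv` and its parameter names are hypothetical, and the sketch assumes plain comma-separated fields with no quoting.

```cpp
#include <cmath>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not the repository's parser): reads
// comma-separated numeric data, ignoring the first `skipTopRows` lines and
// the last `skipRightCols` fields of every remaining line.
std::vector<std::vector<double>> loadNumericCsv(std::istream& in,
                                                std::size_t skipRightCols,
                                                std::size_t skipTopRows) {
    std::vector<std::vector<double>> rows;
    std::string line;
    std::size_t lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        if (lineNo <= skipTopRows) continue;  // e.g. the column header row
        if (!line.empty() && line.back() == '\r') line.pop_back();  // CRLF files
        if (line.empty()) continue;

        // Split the line on commas.
        std::vector<std::string> fields;
        std::stringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ',')) fields.push_back(field);

        if (fields.size() <= skipRightCols)
            throw std::runtime_error("line " + std::to_string(lineNo) +
                                     ": too few columns");
        fields.resize(fields.size() - skipRightCols);  // e.g. drop the label column

        std::vector<double> row;
        for (const std::string& f : fields) {
            std::size_t used = 0;
            // std::stod throws std::invalid_argument for empty or non-numeric text.
            double v = std::stod(f, &used);
            // Reject trailing junk and NaN values.
            if (used != f.size() || std::isnan(v))
                throw std::runtime_error("line " + std::to_string(lineNo) +
                                         ": invalid numeric value '" + f + "'");
            row.push_back(v);
        }
        rows.push_back(row);
    }
    return rows;
}
```

Under these assumptions, calling `loadNumericCsv(stream, 1, 1)` on a file with one header row and a trailing label column returns only the feature values, which is the effect of passing `1 1` as the last two command-line arguments.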
Constraints:
- The input file should only have numeric columns (float/ integer)
- The input file should not have any NaN or null values - Dataset cleaning must be done prior
- Parameter tuning is only possible if a rough estimate of the number of distinct values the label can take is known; otherwise, a purely unsupervised clustering without any tuning can be done by setting the parameter to 7
- Sensitive to noise
- The parameter is fully independent of the dataset, so it cannot be estimated and is mostly tuned by trial and error, but it almost always takes a value in the range [0,10]

Working Example:
```
nvcc Clustering.cu
./a.out data_banknote_authentication.csv 10 1 1
```
This uses parameter 10 to cluster the input .csv file into some number of groups and writes a file named 'output.csv', which has an additional column called label holding the group ID of the group each row belongs to.

Sample output images - using datasets in the TestedDataset directory: