https://github.com/jermp/kmeans
A small, header-only, parallel implementation of kmeans clustering for arbitrary-long byte vectors.
- Host: GitHub
- URL: https://github.com/jermp/kmeans
- Owner: jermp
- Created: 2023-10-06T10:02:02.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2026-05-10T08:07:07.000Z (22 days ago)
- Last Synced: 2026-05-10T09:28:35.517Z (22 days ago)
- Topics: kmeans-clustering, kmeans-plus-plus
- Language: C++
- Homepage:
- Size: 53.7 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# KMeans for byte vectors
A small, header-only, parallel implementation of kmeans clustering for arbitrary-long byte vectors.
The code is inspired by the [dkm](https://github.com/genbattle/dkm) library, but uses
`std::thread` with a thread pool instead of OpenMP.
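
As a rough illustration of the `std::thread` approach, the sketch below parallelizes the assignment step of kmeans (labeling each vector with its closest centroid) by giving each thread a contiguous chunk of vectors. It is not taken from this repository: the helper names `squared_distance` and `assign_labels` are hypothetical, and for brevity it spawns plain threads on every call instead of reusing a thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <thread>
#include <vector>

// Squared Euclidean distance between two p-byte vectors.
inline std::uint64_t squared_distance(const std::uint8_t* a, const std::uint8_t* b,
                                      std::size_t p) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i != p; ++i) {
        const std::int64_t diff = std::int64_t(a[i]) - std::int64_t(b[i]);
        sum += std::uint64_t(diff * diff);
    }
    return sum;
}

// Hypothetical helper (not the library's API): label each point with the index
// of its closest centroid. `points` holds n*p bytes and `centroids` holds k*p
// bytes, both stored one vector after another.
inline std::vector<std::uint32_t> assign_labels(const std::vector<std::uint8_t>& points,
                                                const std::vector<std::uint8_t>& centroids,
                                                std::size_t p, std::size_t num_threads) {
    assert(p > 0);
    const std::size_t n = points.size() / p;
    const std::size_t k = centroids.size() / p;
    std::vector<std::uint32_t> labels(n, 0);

    // Use at least one thread and never more threads than points.
    num_threads = std::max<std::size_t>(1, std::min(num_threads, n));
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    // Each thread writes only labels[begin, end), so no locking is needed.
    auto work = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
            for (std::size_t c = 0; c != k; ++c) {
                const std::uint64_t d =
                    squared_distance(&points[i * p], &centroids[c * p], p);
                if (d < best) {
                    best = d;
                    labels[i] = std::uint32_t(c);
                }
            }
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t != num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        threads.emplace_back(work, begin, end);
    }
    for (auto& th : threads) th.join();
    return labels;
}
```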
Compiling the code
------------------
The code is tested on Linux with `gcc` and on macOS with `clang`.
To build the code, [`CMake`](https://cmake.org/) is required.
First clone the repository with

    git clone --recursive https://github.com/jermp/kmeans.git

If you forgot `--recursive` when cloning, do

    git submodule update --init --recursive

before compiling.
To compile the code for a release environment (see the file `CMakeLists.txt` for the compilation flags used), run the following from the top-level `kmeans` directory:

    mkdir build
    cd build
    cmake ..
    make -j
For a testing environment, use the following instead:

    mkdir debug_build
    cd debug_build
    cmake .. -D CMAKE_BUILD_TYPE=Debug -D KMEANS_USE_SANITIZERS=On
    make -j
Examples
--------
The tool `tools/cluster` clusters a collection of byte vectors.
It expects the input collection, here `vectors.bin`, to be a binary file laid out as follows: the
first 8 bytes encode the number of bytes per vector, say `p`; the next 8 bytes encode the number
of vectors in the collection, say `n`; the remaining `np` bytes hold the `n` vectors one after another, `p` bytes each.

    ./cluster -i vectors.bin -k 16 -d 0.0 -s 13 -t 8 > labels.txt
    ./cluster -i vectors.bin -m 7 -d 0.001 -s 13 -t 8 --mse 500 --mcs 10 > labels.txt
    ./cluster -i vectors.bin -m 7 -d 0.001 -s 13 -t 8 --mse 50 --mcs 1 > labels.txt
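
To make the input layout concrete, here is a minimal sketch that writes and reads a file with that structure (`p`, then `n`, then `np` bytes). It is not part of the repository: `byte_collection`, `write_collection` and `read_collection` are hypothetical names, and it assumes that `p` and `n` are unsigned 64-bit integers stored in the machine's native byte order, which the description above does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>
#include <vector>

// A collection of n byte vectors of p bytes each, stored contiguously.
struct byte_collection {
    std::uint64_t p = 0;             // bytes per vector
    std::uint64_t n = 0;             // number of vectors
    std::vector<std::uint8_t> data;  // n * p bytes; vector i starts at data[i * p]
};

// Write p (8 bytes), n (8 bytes), then the n * p payload bytes.
// Assumption: both integers are written in native byte order.
inline void write_collection(const std::string& path, const byte_collection& c) {
    assert(c.data.size() == c.n * c.p);
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(&c.p), sizeof(c.p));
    out.write(reinterpret_cast<const char*>(&c.n), sizeof(c.n));
    out.write(reinterpret_cast<const char*>(c.data.data()),
              static_cast<std::streamsize>(c.data.size()));
    if (!out) throw std::runtime_error("write failed: " + path);
}

// Read the header back, then the n * p payload bytes it announces.
inline byte_collection read_collection(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    byte_collection c;
    in.read(reinterpret_cast<char*>(&c.p), sizeof(c.p));
    in.read(reinterpret_cast<char*>(&c.n), sizeof(c.n));
    if (!in) throw std::runtime_error("truncated header: " + path);
    c.data.resize(c.n * c.p);
    in.read(reinterpret_cast<char*>(c.data.data()),
            static_cast<std::streamsize>(c.data.size()));
    if (!in) throw std::runtime_error("truncated payload: " + path);
    return c;
}
```

For example, a collection with `p = 3` and `n = 2` produces a file of 16 + 6 = 22 bytes under these assumptions.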