https://github.com/jermp/kmeans

A small, header-only, parallel implementation of kmeans clustering for arbitrarily long byte vectors.

Host: GitHub
URL: https://github.com/jermp/kmeans
Owner: jermp
Created: 2023-10-06T10:02:02.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2026-05-10T08:07:07.000Z (22 days ago)
Last Synced: 2026-05-10T09:28:35.517Z (22 days ago)
Topics: kmeans-clustering, kmeans-plus-plus
Language: C++
Homepage:
Size: 53.7 KB
Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# KMeans for byte vectors

A small, header-only, parallel implementation of kmeans clustering for arbitrarily long byte vectors.

The code is inspired by the [dkm](https://github.com/genbattle/dkm) library, but
modified to use `std::thread` with a thread pool rather than OpenMP.
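
To illustrate the kind of work that gets parallelized, here is a minimal sketch of the kmeans assignment step split across `std::thread` workers. It is not the code of this library: the names `squared_distance` and `assign_labels` are hypothetical, squared Euclidean distance is an assumption, and for brevity the sketch spawns one thread per chunk instead of reusing a thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <thread>
#include <vector>

// Hypothetical helper (not part of the library): squared Euclidean
// distance between two byte vectors of length p.
inline uint64_t squared_distance(const uint8_t* a, const uint8_t* b, size_t p) {
    uint64_t sum = 0;
    for (size_t i = 0; i != p; ++i) {
        const int64_t d = int64_t(a[i]) - int64_t(b[i]);
        sum += uint64_t(d * d);
    }
    return sum;
}

// Hypothetical sketch of the assignment step: label each point with the
// index of its closest centroid. Points and centroids are stored
// contiguously, p bytes each. Each worker owns a disjoint range of points,
// so no locking is needed when writing the labels.
inline std::vector<uint32_t> assign_labels(const std::vector<uint8_t>& points,
                                           const std::vector<uint8_t>& centroids,
                                           size_t p, size_t num_threads) {
    assert(p > 0);
    const size_t n = points.size() / p;     // number of points
    const size_t k = centroids.size() / p;  // number of centroids
    std::vector<uint32_t> labels(n, 0);

    // Use at least one thread and never more threads than points.
    num_threads = std::max<size_t>(1, std::min(num_threads, n));
    const size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (size_t t = 0; t != num_threads; ++t) {
        const size_t begin = std::min(n, t * chunk);
        const size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&, begin, end]() {
            for (size_t i = begin; i != end; ++i) {
                uint64_t best = std::numeric_limits<uint64_t>::max();
                for (size_t c = 0; c != k; ++c) {
                    const uint64_t d =
                        squared_distance(&points[i * p], &centroids[c * p], p);
                    if (d < best) {
                        best = d;
                        labels[i] = uint32_t(c);
                    }
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return labels;
}
```

A thread pool avoids paying the thread creation cost at every iteration, which matters because kmeans repeats this step many times.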

Compiling the code
------------------

The code is tested on Linux with `gcc` and on macOS with `clang`.
To build the code, [`CMake`](https://cmake.org/) is required.

First clone the repository with

    git clone --recursive https://github.com/jermp/kmeans.git

If you forgot `--recursive` when cloning, do

    git submodule update --init --recursive

before compiling.

To compile the code in release mode (see `CMakeLists.txt` for the compilation flags used), run the following from within the parent `kmeans` directory:

    mkdir build
    cd build
    cmake ..
    make -j

For a testing environment, use the following instead:

    mkdir debug_build
    cd debug_build
    cmake .. -D CMAKE_BUILD_TYPE=Debug -D KMEANS_USE_SANITIZERS=On
    make -j

Examples
--------

The tool `tools/cluster` can be used to cluster a collection of byte vectors.
We assume the input collection `vectors.bin` is a binary file laid out as follows: the
first 8 bytes encode the number of bytes per vector, say `p`; the next 8 bytes encode the number
of vectors in the collection, say `n`; these are followed by `p` bytes for each vector (a total of `n * p` bytes).

    ./cluster -i vectors.bin -k 16 -d 0.0 -s 13 -t 8 > labels.txt

    ./cluster -i vectors.bin -m 7 -d 0.001 -s 13 -t 8 --mse 500 --mcs 10 > labels.txt

    ./cluster -i vectors.bin -m 7 -d 0.001 -s 13 -t 8 --mse 50 --mcs 1 > labels.txt
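
As a possible way to produce an input file in the layout described above, the following sketch writes and reads back such a collection. The function names `write_vectors` and `read_vectors` are hypothetical and not part of this repository. The sketch also assumes that `p` and `n` are stored as `uint64_t` in the machine's native byte order, which the description above does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: write a collection of vectors, p bytes each.
// Layout: 8 bytes for p, 8 bytes for n, then n * p payload bytes.
// Assumption: both header fields are uint64_t in native byte order.
inline void write_vectors(const std::string& path, uint64_t p,
                          const std::vector<uint8_t>& data) {
    assert(p > 0 && data.size() % p == 0);
    const uint64_t n = data.size() / p;
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(&p), sizeof(p));
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    out.write(reinterpret_cast<const char*>(data.data()),
              std::streamsize(data.size()));
    if (!out) throw std::runtime_error("write failed: " + path);
}

// Hypothetical helper: read the collection back, returning the n * p
// payload bytes and storing the header fields in p and n.
inline std::vector<uint8_t> read_vectors(const std::string& path,
                                         uint64_t& p, uint64_t& n) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    in.read(reinterpret_cast<char*>(&p), sizeof(p));
    in.read(reinterpret_cast<char*>(&n), sizeof(n));
    if (!in) throw std::runtime_error("truncated header: " + path);
    std::vector<uint8_t> data(n * p);
    in.read(reinterpret_cast<char*>(data.data()),
            std::streamsize(data.size()));
    if (!in) throw std::runtime_error("truncated payload: " + path);
    return data;
}
```

The `i`-th vector of the collection then occupies the bytes `data[i * p]` up to, but excluding, `data[(i + 1) * p]`.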
