https://github.com/ltla/cppkmeans
C++ port of R's Hartigan-Wong implementation
https://github.com/ltla/cppkmeans
Last synced: about 1 year ago
JSON representation
C++ port of R's Hartigan-Wong implementation
- Host: GitHub
- URL: https://github.com/ltla/cppkmeans
- Owner: LTLA
- License: mit
- Created: 2021-07-17T10:36:39.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2025-04-02T04:41:33.000Z (about 1 year ago)
- Last Synced: 2025-04-02T05:27:00.564Z (about 1 year ago)
- Language: C++
- Homepage: https://ltla.github.io/CppKmeans/
- Size: 8.82 MB
- Stars: 5
- Watchers: 2
- Forks: 3
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# C++ library for k-means



[](https://codecov.io/gh/LTLA/CppKmeans)
## Overview
This repository contains a header-only C++ library for k-means clustering.
Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007).
Refinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm.
The Hartigan-Wong implementation is derived from the Fortran code in the R **stats** package, heavily refactored for more idiomatic C++.
## Quick start
**kmeans** is a header-only library, so it can be easily used by just `#include`ing the relevant source files and running `compute()`:
```cpp
#include "kmeans/Kmeans.hpp"
int ndim = 5;
int nobs = 1000;
std::vector matrix(ndim * nobs); // column-major ndim x nobs matrix of coordinates
// Wrap your matrix in a SimpleMatrix.
kmeans::SimpleMatrix<
int, /* type for the column index */
double /* type of the data */
> kmat(ndim, nobs, matrix.data());
auto res = kmeans::compute(
kmat,
// initialize with kmeans++
kmeans::InitializeKmeanspp<
/* column index type */ int,
/* input matrix data type */ double,
/* cluster ID type */ int,
/* centroid type */ double
>(),
// refine with Lloyd's algorithm
kmeans::RefineLloyd<
/* column index type */ int,
/* input matrix data type */ double,
/* cluster ID type */ int,
/* centroid type */ double
>(),
ncenters
);
res.centers; // Matrix of centroid coordinates, stored in column-major format
res.clusters; // Vector of cluster assignments
res.details; // Details from the clustering algorithm
```
See the [reference documentation](https://ltla.github.io/CppKmeans) for more details.
## Changing parameters
We can tune the clustering by passing options into the constructors of the relevant classes:
```cpp
kmeans::InitializeVariancePartitionOptions vp_opt;
vp_opt.optimize_partition = false;
kmeans::InitializeVariancePartition vp(vp_opt);
kmeans::RefineLloydOptions ll_opt;
ll_opt.max_iterations = 10;
ll_opt.num_threads = 3;
kmeans::RefineLloyd ll(ll_opt);
auto res2 = kmeans::compute(kmat, pp, ll, ncenters);
```
The initialization and refinement classes can themselves be swapped at run-time via pointers to their respective interfaces.
This design also allows the **kmeans** library to be easily extended to additional methods from third-party developers.
```cpp
std::unique_ptr > init_ptr;
if (init_method == "random") {
init_ptr.reset(new kmeans::InitializeRandom);
} else if (init_method == "kmeans++") {
kmeans::InitializeKmeansppOptions opt;
opt.seed = 42;
init_ptr.reset(new kmeans::InitializeKmeanspp(opt));
} else {
// do something else
}
std::unique_ptr > ref_ptr;
if (ref_method == "random") {
kmeans::RefineLloydOptions opt;
opt.max_iterations = 10;
ref_ptr.reset(new kmeans::RefineLloyd(opt));
} else {
kmeans::RefineHartiganWongOptions opt;
opt.max_iterations = 100;
opt.max_quick_transfer_iterations = 1000;
ref_ptr.reset(new kmeans::RefineHartiganWong(opt));
}
auto res3 = kmeans::compute(kmat, *init_ptr, *ref_ptr, ncenters);
```
Template parameters can also be altered to control the input and output data types.
As shown above, these should be set consistently for all classes used in `compute()`.
While `int` and `double` are suitable for most cases, advanced users may wish to use other types.
For example, we might consider the following parametrization for various reasons:
```cpp
kmeans::InitializeKmeanspp<
/* If our input data has too many observations to fit into an 'int', we
* might need to use a 'size_t' instead.
*/
size_t,
/* Perhaps our input data is in single-precision floating point to save
* space and to speed up processing.
*/
float,
/* If we know that we will never ask for more than 255 clusters, we can use
* a smaller integer for the cluster IDs to save space.
*/
uint8_t,
/* We still want our centroids and distances to be computed in high
* precision, even though the input data is only single precision.
*/
double
> initpp();
```
## Other bits and pieces
If we want the within-cluster sum of squares, this can be easily computed from the output of `compute()`:
```cpp
std::vector wcss(ncenters);
kmeans::compute_wcss(
kmat,
ncenters,
res.centers.data(),
res.clusters.data(),
wcss.data()
);
```
If we already allocated arrays for the centroids and clusters, we can fill the arrays directly.
This allows us to skip a copy when interfacing with other languages that manage their own memory (e.g., R, Python).
```cpp
std::vector centers(ndim * ncenters);
std::vector clusters(nobs);
auto deets = kmeans::compute(
kmat,
kmeans::InitializeRandom(), // random initialization
kmeans::RefineHartiganWong(), // refine with Hartigan-Wong
ncenters
centers.data(),
clusters.data()
);
```
## Building projects
### CMake with `FetchContent`
If you're using CMake, you just need to add something like this to your `CMakeLists.txt`:
```cmake
include(FetchContent)
FetchContent_Declare(
kmeans
GIT_REPOSITORY https://github.com/LTLA/CppKmeans
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(kmeans)
```
Then you can link to **kmeans** to make the headers available during compilation:
```cmake
# For executables:
target_link_libraries(myexe ltla::kmeans)
# For libaries
target_link_libraries(mylib INTERFACE ltla::kmeans)
```
### CMake with `find_package()`
To install the library, clone an appropriate version of this repository and run:
```sh
mkdir build && cd build
cmake .. -DKMEANS_TESTS=OFF
cmake --build . --target install
```
Then we can use `find_package()` as usual:
```cmake
find_package(ltla_kmeans CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE ltla::kmeans)
```
By default, this will use `FetchContent` to fetch all external dependencies (see [`extern/CMakeLists.txt`](extern/CMakeLists.txt) for a list).
If you want to install them manually, use `-DKMEANS_FETCH_EXTERN=OFF`.
### Manual
If you're not using CMake, the simple approach is to just copy the files in `include/` - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's `-I`.
This requires the external dependencies listed in [`extern/CMakeLists.txt`](extern/CMakeLists.txt), which also need to be made available during compilation.
## References
Hartigan, J. A. and Wong, M. A. (1979).
Algorithm AS 136: A K-means clustering algorithm.
_Applied Statistics_ 28, 100-108.
Arthur, D. and Vassilvitskii, S. (2007).
k-means++: the advantages of careful seeding.
_Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms_, 1027-1035.
Su, T. and Dy, J. G. (2007).
In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering,
_Intelligent Data Analysis_ 11, 319-338.