{"id":18046597,"url":"https://github.com/ltla/cppkmeans","last_synced_at":"2025-04-10T04:44:30.652Z","repository":{"id":43869995,"uuid":"386908901","full_name":"LTLA/CppKmeans","owner":"LTLA","description":"C++ port of R's Hartigan-Wong implementation","archived":false,"fork":false,"pushed_at":"2025-04-02T04:41:33.000Z","size":9244,"stargazers_count":5,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-02T05:27:00.564Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://ltla.github.io/CppKmeans/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LTLA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-17T10:36:39.000Z","updated_at":"2025-02-05T20:57:36.000Z","dependencies_parsed_at":"2024-08-28T06:29:23.022Z","dependency_job_id":"748741a9-a4e8-4ae3-a146-57973935ea7b","html_url":"https://github.com/LTLA/CppKmeans","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2FCppKmeans","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2FCppKmeans/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2FCppKmeans/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2FCppKmeans/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LTLA","download_url":"https://codeload.github.com/LTLA/CppKmeans/tar.gz/ref
s/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248161233,"owners_count":21057552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-30T19:08:04.697Z","updated_at":"2025-04-10T04:44:30.643Z","avatar_url":"https://github.com/LTLA.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# C++ library for k-means\n\n![Unit tests](https://github.com/LTLA/CppKmeans/actions/workflows/run-tests.yaml/badge.svg)\n![Documentation](https://github.com/LTLA/CppKmeans/actions/workflows/doxygenate.yaml/badge.svg)\n![stats comparison](https://github.com/LTLA/CppKmeans/actions/workflows/compare-kmeans.yaml/badge.svg)\n[![Codecov](https://codecov.io/gh/LTLA/CppKmeans/branch/master/graph/badge.svg?token=7S231XHC0Q)](https://codecov.io/gh/LTLA/CppKmeans)\n\n## Overview\n\nThis repository contains a header-only C++ library for k-means clustering.\nInitialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007).\nRefinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm.\nThe Hartigan-Wong implementation is derived from the Fortran code in the R **stats** package, heavily refactored for more idiomatic C++.\n\n## Quick start\n\n**kmeans** is a header-only library, so it can be easily used by just `#include`ing the relevant source files and running `compute()`:\n\n```cpp\n#include \"kmeans/Kmeans.hpp\"\n\nint ndim = 5;\nint nobs = 
1000;\nstd::vector\u003cdouble\u003e matrix(ndim * nobs); // column-major ndim x nobs matrix of coordinates\n\n// Wrap your matrix in a SimpleMatrix.\nkmeans::SimpleMatrix\u003c\n    int, /* type for the column index */\n    double /* type of the data */\n\u003e kmat(ndim, nobs, matrix.data());\n\nauto res = kmeans::compute(\n    kmat,\n    // initialize with kmeans++\n    kmeans::InitializeKmeanspp\u003c\n        /* column index type */ int,\n        /* input matrix data type */ double, \n        /* cluster ID type */ int, \n        /* centroid type */ double\n    \u003e(),\n    // refine with Lloyd's algorithm\n    kmeans::RefineLloyd\u003c\n        /* column index type */ int,\n        /* input matrix data type */ double, \n        /* cluster ID type */ int, \n        /* centroid type */ double\n    \u003e(),\n    ncenters \n);\n\nres.centers; // Matrix of centroid coordinates, stored in column-major format\nres.clusters; // Vector of cluster assignments\nres.details; // Details from the clustering algorithm\n```\n\nSee the [reference documentation](https://ltla.github.io/CppKmeans) for more details.\n\n## Changing parameters \n\nWe can tune the clustering by passing options into the constructors of the relevant classes:\n\n```cpp\nkmeans::InitializeVariancePartitionOptions vp_opt;\nvp_opt.optimize_partition = false;\nkmeans::InitializeVariancePartition\u003cint, double, int, double\u003e vp(vp_opt);\n\nkmeans::RefineLloydOptions ll_opt;\nll_opt.max_iterations = 10;\nll_opt.num_threads = 3;\nkmeans::RefineLloyd\u003cint, double, int, double\u003e ll(ll_opt);\n\nauto res2 = kmeans::compute(kmat, vp, ll, ncenters);\n```\n\nThe initialization and refinement classes can themselves be swapped at run-time via pointers to their respective interfaces.\nThis design also allows the **kmeans** library to be easily extended to additional methods from third-party developers.\n\n```cpp\nstd::unique_ptr\u003ckmeans::Initialize\u003cint, double, int, double\u003e \u003e 
init_ptr;\nif (init_method == \"random\") {\n    init_ptr.reset(new kmeans::InitializeRandom\u003cint, double, int, double\u003e);\n} else if (init_method == \"kmeans++\") {\n    kmeans::InitializeKmeansppOptions opt;\n    opt.seed = 42;\n    init_ptr.reset(new kmeans::InitializeKmeanspp\u003cint, double, int, double\u003e(opt));\n} else {\n    // do something else\n}\n\nstd::unique_ptr\u003ckmeans::Refine\u003cint, double, int, double\u003e \u003e ref_ptr;\nif (ref_method == \"lloyd\") {\n    kmeans::RefineLloydOptions opt;\n    opt.max_iterations = 10;\n    ref_ptr.reset(new kmeans::RefineLloyd\u003cint, double, int, double\u003e(opt));\n} else {\n    kmeans::RefineHartiganWongOptions opt;\n    opt.max_iterations = 100;\n    opt.max_quick_transfer_iterations = 1000;\n    ref_ptr.reset(new kmeans::RefineHartiganWong\u003cint, double, int, double\u003e(opt));\n}\n\nauto res3 = kmeans::compute(kmat, *init_ptr, *ref_ptr, ncenters);\n```\n\nTemplate parameters can also be altered to control the input and output data types.\nAs shown above, these should be set consistently for all classes used in `compute()`. 
\nWhile `int` and `double` are suitable for most cases, advanced users may wish to use other types.\nFor example, we might consider the following parametrization for various reasons:\n\n```cpp\nkmeans::InitializeKmeanspp\u003c\n    /* If our input data has too many observations to fit into an 'int', we\n     * might need to use a 'size_t' instead.\n     */\n    size_t,\n\n    /* Perhaps our input data is in single-precision floating point to save\n     * space and to speed up processing.\n     */\n    float, \n\n    /* If we know that we will never ask for more than 255 clusters, we can use\n     * a smaller integer for the cluster IDs to save space.\n     */\n    uint8_t, \n\n    /* We still want our centroids and distances to be computed in high\n     * precision, even though the input data is only single precision.\n     */\n    double \n\u003e initpp;\n```\n\n## Other bits and pieces\n\nIf we want the within-cluster sum of squares, this can be easily computed from the output of `compute()`:\n\n```cpp\nstd::vector\u003cdouble\u003e wcss(ncenters);\nkmeans::compute_wcss(\n    kmat, \n    ncenters, \n    res.centers.data(), \n    res.clusters.data(), \n    wcss.data()\n);\n```\n\nIf we already allocated arrays for the centroids and clusters, we can fill the arrays directly.\nThis allows us to skip a copy when interfacing with other languages that manage their own memory (e.g., R, Python).\n\n```cpp\nstd::vector\u003cdouble\u003e centers(ndim * ncenters);\nstd::vector\u003cint\u003e clusters(nobs);\n\nauto deets = kmeans::compute(\n    kmat,\n    kmeans::InitializeRandom(), // random initialization\n    kmeans::RefineHartiganWong(), // refine with Hartigan-Wong \n    ncenters,\n    centers.data(),\n    clusters.data()\n);\n```\n\n## Building projects \n\n### CMake with `FetchContent`\n\nIf you're using CMake, you just need to add something like this to your `CMakeLists.txt`:\n\n```cmake\ninclude(FetchContent)\n\nFetchContent_Declare(\n  kmeans \n  GIT_REPOSITORY 
https://github.com/LTLA/CppKmeans\n  GIT_TAG master # or any version of interest\n)\n\nFetchContent_MakeAvailable(kmeans)\n```\n\nThen you can link to **kmeans** to make the headers available during compilation:\n\n```cmake\n# For executables:\ntarget_link_libraries(myexe ltla::kmeans)\n\n# For libraries:\ntarget_link_libraries(mylib INTERFACE ltla::kmeans)\n```\n\n### CMake with `find_package()`\n\nTo install the library, clone an appropriate version of this repository and run:\n\n```sh\nmkdir build \u0026\u0026 cd build\ncmake .. -DKMEANS_TESTS=OFF\ncmake --build . --target install\n```\n\nThen we can use `find_package()` as usual:\n\n```cmake\nfind_package(ltla_kmeans CONFIG REQUIRED)\ntarget_link_libraries(mylib INTERFACE ltla::kmeans)\n```\n\nBy default, this will use `FetchContent` to fetch all external dependencies (see [`extern/CMakeLists.txt`](extern/CMakeLists.txt) for a list).\nIf you want to install them manually, use `-DKMEANS_FETCH_EXTERN=OFF`.\n\n### Manual\n\nIf you're not using CMake, the simple approach is to just copy the files in `include/` - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's `-I`.\nThis requires the external dependencies listed in [`extern/CMakeLists.txt`](extern/CMakeLists.txt), which also need to be made available during compilation.\n\n## References\n\nHartigan, J. A. and Wong, M. A. (1979).\nAlgorithm AS 136: A K-means clustering algorithm.\n_Applied Statistics_ 28, 100-108.\n\nArthur, D. and Vassilvitskii, S. (2007). \nk-means++: the advantages of careful seeding.\n_Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms_, 1027-1035.\n\nSu, T. and Dy, J. G. 
(2007).\nIn Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering.\n_Intelligent Data Analysis_ 11, 319-338.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fltla%2Fcppkmeans","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fltla%2Fcppkmeans","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fltla%2Fcppkmeans/lists"}