{"id":13935023,"url":"https://github.com/src-d/kmcuda","last_synced_at":"2025-04-08T09:06:41.703Z","repository":{"id":39420395,"uuid":"61807798","full_name":"src-d/kmcuda","owner":"src-d","description":"Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA","archived":false,"fork":false,"pushed_at":"2022-10-11T16:58:55.000Z","size":718,"stargazers_count":826,"open_issues_count":47,"forks_count":146,"subscribers_count":27,"default_branch":"master","last_synced_at":"2025-04-01T08:30:54.538Z","etag":null,"topics":["afk-mc2","cuda","hacktoberfest","kmeans","knn-search","machine-learning","python","yinyang"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-06-23T13:35:55.000Z","updated_at":"2025-03-31T11:25:59.000Z","dependencies_parsed_at":"2022-09-20T02:57:06.437Z","dependency_job_id":null,"html_url":"https://github.com/src-d/kmcuda","commit_stats":null,"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fkmcuda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fkmcuda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fkmcuda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fkmcuda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d/kmcuda/tar.gz/refs/heads/m
aster","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247809964,"owners_count":20999816,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["afk-mc2","cuda","hacktoberfest","kmeans","knn-search","machine-learning","python","yinyang"],"created_at":"2024-08-07T23:01:21.553Z","updated_at":"2025-04-08T09:06:41.676Z","avatar_url":"https://github.com/src-d.png","language":"Jupyter Notebook","readme":"[![Build Status](https://travis-ci.org/src-d/kmcuda.svg?branch=master)](https://travis-ci.org/src-d/kmcuda) [![PyPI](https://img.shields.io/pypi/v/libKMCUDA.svg)](https://pypi.python.org/pypi/libKMCUDA) [![10.5281/zenodo.286944](https://zenodo.org/badge/DOI/10.5281/zenodo.286944.svg)](https://doi.org/10.5281/zenodo.286944)\n\n\"Yinyang\" K-means and K-nn using NVIDIA CUDA\n============================================\n\nK-means implementation is based on [\"Yinyang K-Means: A Drop-In Replacement\nof the Classic K-Means with Consistent Speedup\"](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf).\nWhile it introduces some overhead and many conditional clauses\nwhich are bad for CUDA, it still shows 1.6-2x speedup against the Lloyd\nalgorithm. 
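The speedup comes from triangle-inequality bounds on point-to-centroid distances, which let most points skip the exact distance computations in an iteration. Below is a minimal CPU-only sketch of that filtering idea in NumPy (a simplified single-lower-bound variant, illustrative only, not kmcuda's CUDA implementation; the function name is hypothetical):

```python
import numpy as np

def lloyd_with_bounds(X, k, iters=15, seed=0):
    # Simplified illustration of the bound-based filtering behind Yinyang
    # K-means: a point whose upper bound on the distance to its own centroid
    # stays below the lower bound on the distance to every other centroid
    # cannot change clusters, so its exact distances are not recomputed.
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = X[rng.choice(n, k, replace=False)].copy()
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    assign = d.argmin(1)
    upper = d[np.arange(n), assign]       # upper bound to the own centroid
    d[np.arange(n), assign] = np.inf
    lower = d.min(1)                      # lower bound to every other centroid
    skipped = 0
    for _ in range(iters):
        moved = np.stack([X[assign == j].mean(0) if np.any(assign == j)
                          else centroids[j] for j in range(k)])
        shift = np.linalg.norm(moved - centroids, axis=1)
        centroids = moved
        upper += shift[assign]            # own centroid may have moved away
        lower -= shift.max()              # any other may have moved closer
        stale = upper > lower             # only these need exact distances
        skipped += int(np.count_nonzero(~stale))
        if stale.any():
            d = np.linalg.norm(X[stale, None] - centroids[None], axis=2)
            rows = np.arange(d.shape[0])
            assign[stale] = d.argmin(1)
            upper[stale] = d[rows, assign[stale]]
            d[rows, assign[stale]] = np.inf
            lower[stale] = d.min(1)
    return centroids, assign, skipped
```

After the first few iterations most points pass the filter and are skipped; the real implementation keeps per-group lower bounds instead of a single global one, which tightens the filter further.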
K-nearest neighbors employ the same triangle inequality idea and\nrequire precalculated centroids and cluster assignments, similar to the flattened\nball tree.\n\n| [Benchmarks](#benchmarks) | sklearn KMeans | KMeansRex | KMeansRex OpenMP | Serban | kmcuda | kmcuda 2 GPUs |\n|---------------------------|----------------|-----------|------------------|--------|--------|---------------|\n| speed                     | 1x             | 4.5x      | 8.2x             | 15.5x  | 17.8x  | 29.8x         |\n| memory                    | 1x             | 2x        | 2x               | 0.6x   | 0.6x   | 0.6x          |\n\nTechnically, this project is a shared library which exports two functions\ndefined in `kmcuda.h`: `kmeans_cuda` and `knn_cuda`.\nIt has built-in Python3 and R native extension support, so you can\n`from libKMCUDA import kmeans_cuda` or `dyn.load(\"libKMCUDA.so\")`.\n\n[![source{d}](img/sourced.png)](http://sourced.tech)\n\u003cp align=\"right\"\u003e\u003ca href=\"img/kmeans_image.ipynb\"\u003eHow was this created?\u003c/a\u003e\u003c/p\u003e\n\nTable of contents\n-----------------\n* [K-means](#k-means)\n* [K-nn](#k-nn)\n* [Notes](#notes)\n* [Building](#building)\n   * [macOS](#macos)\n* [Testing](#testing)\n* [Benchmarks](#benchmarks)\n   * [100,000x256@1024](#100000x2561024)\n      * [Configuration](#configuration)\n      * [Contestants](#contestants)\n      * [Data](#data)\n      * [Notes](#notes-1)\n   * [8,000,000x256@1024](#8000000x2561024)\n      * [Data](#data-1)\n      * [Notes](#notes-2)\n* [Python examples](#python-examples)\n   * [K-means, L2 (Euclidean) distance](#k-means-l2-euclidean-distance)\n   * [K-means, angular (cosine) distance + average](#k-means-angular-cosine-distance--average)\n   * [K-nn](#k-nn-1)\n* [Python API](#python-api)\n* [R examples](#r-examples)\n   * [K-means](#k-means-1)\n   * [K-nn](#k-nn-2)\n* [R API](#r-api)\n* [C examples](#c-examples)\n* [C API](#c-api)\n* [License](#license)\n\nK-means\n-------\nThe major difference 
between this project and others is that kmcuda is\noptimized for low memory consumption and a large number of clusters. E.g.,\nkmcuda can sort 4M samples in 480 dimensions into 40000 clusters (if you\nhave several days and 12 GB of GPU memory); 300K samples are grouped\ninto 5000 clusters in 4½ minutes on NVIDIA Titan X (15 iterations); 3M samples\nand 1000 clusters take 20 minutes (33 iterations). Yinyang can be\nturned off to save GPU memory, but then the slower Lloyd algorithm is used.\nFour centroid initialization schemes are supported: random, k-means++,\n[AFKMC2](http://olivierbachem.ch/files/afkmcmc-oral-pdf.pdf) and import.\nTwo distance metrics are supported: L2 (the usual one) and angular\n(arccos of the scalar product). L1 is in development.\n16-bit float support delivers 2x memory compression. If you have several GPUs,\nthey can be used together, giving a corresponding linear speedup for\neither Lloyd or Yinyang.\n\nThe code has been thoroughly tested to yield bit-to-bit identical\nresults from Yinyang and Lloyd. \"Fast and Provably Good Seedings for k-Means\" was adapted from\n[the reference code](https://github.com/obachem/kmc2).\n\nRead the articles: [1](http://blog.sourced.tech/post/towards_kmeans_on_gpu/),\n[2](https://blog.sourced.tech/post/kmcuda4/).\n\nK-nn\n----\nCentroid distance matrix C\u003csub\u003eij\u003c/sub\u003e is calculated together with clusters'\nradii R\u003csub\u003ei\u003c/sub\u003e (the maximum distance from the centroid to the corresponding\ncluster's members). Given sample S in cluster A, we avoid calculating the distances from S\nto another cluster B's members if C\u003csub\u003eAB\u003c/sub\u003e - SA - R\u003csub\u003eB\u003c/sub\u003e is greater\nthan the current maximum K-nn distance. This resembles the [ball tree\nalgorithm](http://scikit-learn.org/stable/modules/neighbors.html#ball-tree).\n\nThe implemented algorithm is tolerant to NaNs. 
There are two variants depending\non whether k is small enough to fit the sample's neighbors into CUDA shared memory.\nInternally, the neighbors list is a [binary heap](https://en.wikipedia.org/wiki/Binary_heap),\nwhich reduces the complexity multiplier from O(k) to O(log k).\n\nThe implementation yields identical results to `sklearn.neighbors.NearestNeighbors`\nexcept in cases where adjacent distances are equal and the order is undefined.\nThat is, the returned indices are sorted in the increasing order of the\ncorresponding distances.\n\nNotes\n-----\nLloyd is tolerant to samples with NaN features while Yinyang is not.\nIt may happen that some of the resulting clusters contain zero elements.\nIn such cases, their features are set to NaN.\n\nThe angular (cosine) distance metric effectively results in Spherical K-Means behavior.\nThe samples **must** be normalized to L2 norm equal to 1 before clustering;\nthis is not done automatically. The actual formula is:\n\n![D(A, B)=\\arccos\\left(\\frac{A\\cdot B}{|A||B|}\\right)](img/latex_angular.png)\n\nIf you get OOM with the default parameters, set `yinyang_t` to 0, which\nforces Lloyd. `verbosity` 2 will print the memory allocation statistics\n(all GPU allocation happens at startup).\n\nData type is either 32- or 16-bit float. The number of samples is limited to 2^32,\nclusters to 2^32 and features to 2^16 (2^17 for fp16). Besides, the product of\nthe number of clusters and the number of features may not exceed 2^32.\n\nIn the case of 16-bit floats, the reduced precision often leads to a slightly\nincreased number of iterations; Yinyang is especially sensitive to that.\nIn some cases, there may be overflows and the clustering may fail completely.\n\nBuilding\n--------\n```\ngit clone https://github.com/src-d/kmcuda\ncd kmcuda/src\ncmake -DCMAKE_BUILD_TYPE=Release . \u0026\u0026 make\n```\nIt requires cudart 8.0 / Pascal and an OpenMP 4.0-capable compiler. 
The build has\nbeen tested primarily on Linux, but it works on macOS too with some bells and whistles\n(see the \"macOS\" subsection).\nIf you do not want to build the Python native module, add `-D DISABLE_PYTHON=y`.\nIf you do not want to build the R native module, add `-D DISABLE_R=y`.\nIf CUDA is not automatically found, add `-D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-8.0`\n(change the path to the actual one). By default, CUDA kernels are compiled for\narchitecture 60 (Pascal). It is possible to override this via `-D CUDA_ARCH=52`,\nbut fp16 support will be disabled then.\n\nPython users:\n```\nCUDA_ARCH=61 pip install libKMCUDA\n# replace 61 with your device version\n```\n\nOr install it from source:\n```\nCUDA_ARCH=61 pip install git+https://github.com/src-d/kmcuda.git#subdirectory=src\n# replace 61 with your device version\n```\n\nBinary Python packages are quite hard to provide because they depend on CUDA and device architecture versions. PRs welcome!\n\n#### macOS\nThe macOS build is tricky but possible. The instructions below reflect the state of affairs one year ago and may be out of date.\nPlease help with updates!\n\nInstall [Homebrew](http://brew.sh/) and the [Command Line Developer Tools](https://developer.apple.com/download/more/)\nwhich are compatible with your CUDA installation. E.g., CUDA 8.0 does not support\nthe latest Xcode 8.x and works with 7.3.1 and below. 
Install `clang` with OpenMP support\nand Python with numpy:\n```\nbrew install llvm --with-clang\nbrew install python3\npip3 install numpy\n```\nExecute this configuration command, then build kmcuda with `make`:\n```\nCC=/usr/local/opt/llvm/bin/clang CXX=/usr/local/opt/llvm/bin/clang++ LDFLAGS=-L/usr/local/opt/llvm/lib/ cmake -DCMAKE_BUILD_TYPE=Release .\n```\nFinally, rename \\*.dylib to \\*.so so that Python is able to import the native extension:\n```\nmv libKMCUDA.{dylib,so}\n```\n\nTesting\n-------\n`test.py` contains the unit tests based on [unittest](https://docs.python.org/3/library/unittest.html).\nThey require either [cuda4py](https://github.com/ajkxyz/cuda4py) or [pycuda](https://github.com/inducer/pycuda) and\n[scikit-learn](http://scikit-learn.org/stable/).\n`test.R` contains the R integration tests and should be run with `Rscript`.\n\nBenchmarks\n----------\n\n### 100000x256@1024\n|            | sklearn KMeans | KMeansRex | KMeansRex OpenMP | Serban | kmcuda | kmcuda 2 GPUs |\n|------------|----------------|-----------|------------------|--------|--------|---------------|\n| time, s    | 164            | 36        | 20               | 10.6   | 9.2    | 5.5           |\n| memory, GB | 1              | 2         | 2                | 0.6    | 0.6    | 0.6           |\n\n#### Configuration\n* 16-core (32 threads) Intel Xeon E5-2620 v4 @ 2.10GHz\n* 256 GB RAM Samsung M393A2K40BB1\n* Nvidia Titan X 2016\n\n#### Contestants\n* [sklearn.cluster.KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)@0.18.1; `KMeans(n_clusters=1024, init=\"random\", max_iter=15, random_state=0, n_jobs=1, n_init=1)`.\n* [KMeansRex](https://github.com/michaelchughes/KMeansRex)@288c40a with `-march=native` and Eigen 3.3; `KMeansRex.RunKMeans(data, 1024, Niter=15, initname=b\"random\")`.\n* KMeansRex with additional `-fopenmp`.\n* [Serban KMeans](https://github.com/serban/kmeans)@83e76bf built for arch 6.1; `./cuda_main  -b -i serban.bin 
-n 1024 -t 0.0028 -o`\n* kmcuda v6.1 built for arch 6.1; `libKMCUDA.kmeans_cuda(dataset, 1024, tolerance=0.002, seed=777, init=\"random\", verbosity=2, yinyang_t=0, device=0)`\n* kmcuda running on 2 GPUs.\n\n#### Data\n100000 random samples uniformly distributed between 0 and 1 in 256 dimensions.\n\n#### Notes\n100000 is the maximum size Serban KMeans can handle.\n\n### 8000000x256@1024\n|            | sklearn KMeans | KMeansRex | KMeansRex OpenMP | Serban | kmcuda 2 GPUs | kmcuda Yinyang 2 GPUs |\n|------------|----------------|-----------|------------------|--------|---------------|-----------------------|\n| time       | please no      | -         | 6h 34m           | fail   | 44m           | 36m                   |\n| memory, GB | -              | -         | 205              | fail   | 8.7           | 10.4                  |\n\nkmeans++ initialization, 93 iterations (1% reassignments equivalent).\n\n#### Data\n8,000,000 secret production samples.\n\n#### Notes\nKMeansRex ate 205 GB of RAM at peak; it uses dynamic memory, so its usage constantly\nbounced between 100 GB and 200 GB.\n\nContributions\n-------------\n\n...are welcome! 
See [CONTRIBUTING](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md).\n\nLicense\n-------\n\n[Apache 2.0](LICENSE.md)\n\nPython examples\n---------------\n\n#### K-means, L2 (Euclidean) distance\n\n```python\nimport numpy\nfrom matplotlib import pyplot\nfrom libKMCUDA import kmeans_cuda\n\nnumpy.random.seed(0)\narr = numpy.empty((10000, 2), dtype=numpy.float32)\narr[:2500] = numpy.random.rand(2500, 2) + [0, 2]\narr[2500:5000] = numpy.random.rand(2500, 2) - [0, 2]\narr[5000:7500] = numpy.random.rand(2500, 2) + [2, 0]\narr[7500:] = numpy.random.rand(2500, 2) - [2, 0]\ncentroids, assignments = kmeans_cuda(arr, 4, verbosity=1, seed=3)\nprint(centroids)\npyplot.scatter(arr[:, 0], arr[:, 1], c=assignments)\npyplot.scatter(centroids[:, 0], centroids[:, 1], c=\"white\", s=150)\n```\nYou should see something like this:\n![Clustered dots](img/cls_euclidean.png)\n\n#### K-means, angular (cosine) distance + average\n\n```python\nimport numpy\nfrom matplotlib import pyplot\nfrom libKMCUDA import kmeans_cuda\n\nnumpy.random.seed(0)\narr = numpy.empty((10000, 2), dtype=numpy.float32)\nangs = numpy.random.rand(10000) * 2 * numpy.pi\nfor i in range(10000):\n    arr[i] = numpy.sin(angs[i]), numpy.cos(angs[i])\ncentroids, assignments, avg_distance = kmeans_cuda(\n    arr, 4, metric=\"cos\", verbosity=1, seed=3, average_distance=True)\nprint(\"Average distance between centroids and members:\", avg_distance)\nprint(centroids)\npyplot.scatter(arr[:, 0], arr[:, 1], c=assignments)\npyplot.scatter(centroids[:, 0], centroids[:, 1], c=\"white\", s=150)\n```\nYou should see something like this:\n![Clustered dots](img/cls_angular.png)\n\n#### K-nn\n\n```python\nimport numpy\nfrom libKMCUDA import kmeans_cuda, knn_cuda\n\nnumpy.random.seed(0)\narr = numpy.empty((10000, 2), dtype=numpy.float32)\nangs = numpy.random.rand(10000) * 2 * numpy.pi\nfor i in range(10000):\n    arr[i] = numpy.sin(angs[i]), numpy.cos(angs[i])\nca = kmeans_cuda(arr, 4, metric=\"cos\", verbosity=1, 
seed=3)\nneighbors = knn_cuda(10, arr, *ca, metric=\"cos\", verbosity=1, device=1)\nprint(neighbors[0])\n```\nYou should see\n```\nreassignments threshold: 100\nperforming kmeans++...\ndone\ntoo few clusters for this yinyang_t =\u003e Lloyd\niteration 1: 10000 reassignments\niteration 2: 926 reassignments\niteration 3: 416 reassignments\niteration 4: 187 reassignments\niteration 5: 87 reassignments\ninitializing the inverse assignments...\ncalculating the cluster radiuses...\ncalculating the centroid distance matrix...\nsearching for the nearest neighbors...\ncalculated 0.276552 of all the distances\n[1279 1206 9846 9886 9412 9823 7019 7075 6453 8933]\n```\n\nPython API\n----------\n```python\ndef kmeans_cuda(samples, clusters, tolerance=0.01, init=\"k-means++\",\n                yinyang_t=0.1, metric=\"L2\", average_distance=False,\n                seed=time(), device=0, verbosity=0)\n```\n**samples** numpy array of shape \\[number of samples, number of features\\]\n            or tuple(raw device pointer (int), device index (int), shape (tuple(number of samples, number of features\\[, fp16x2 marker\\]))).\n            In the latter case, negative device index means host pointer. Optionally,\n            the tuple can be 2 items longer with preallocated device pointers for\n            centroids and assignments. dtype must be either float16 or\n            convertible to float32.\n\n**clusters** integer, the number of clusters.\n\n**tolerance** float, if the relative number of reassignments drops below this value,\n              algorithm stops.\n\n**init** string or numpy array, sets the method for centroids initialization,\n         may be \"k-means++\", \"afk-mc2\", \"random\" or numpy array of shape\n         \\[**clusters**, number of features\\]. dtype must be float32.\n\n**yinyang_t** float, the relative number of cluster groups, usually 0.1.\n              0 disables Yinyang refinement.\n\n**metric** str, the name of the distance metric to use. 
The default is Euclidean (L2),\n           it can be changed to \"cos\" to switch the algorithm to Spherical K-means\n           with the angular distance. Please note that samples *must* be normalized\n           in the latter case.\n\n**average_distance** boolean, the value indicating whether to calculate\n                     the average distance between cluster elements and\n                     the corresponding centroids. Useful for finding\n                     the best K. Returned as the third tuple element.\n\n**seed** integer, random generator seed for reproducible results.\n\n**device** integer, bitwise OR-ed CUDA device indices, e.g. 1 means first device,\n           2 means second device, 3 means using first and second device. Special\n           value 0 enables all available devices. The default is 0.\n\n**verbosity** integer, 0 means complete silence, 1 means mere progress logging,\n              2 means lots of output.\n\n**return** tuple(centroids, assignments, \\[average_distance\\]).\n           If **samples** was a numpy array or a host pointer tuple, the types\n           are numpy arrays, otherwise, raw pointers (integers) allocated on the\n           same device. If **samples** is float16, the returned centroids are\n           float16 too.\n\n```python\ndef knn_cuda(k, samples, centroids, assignments, metric=\"L2\", device=0, verbosity=0)\n```\n**k** integer, the number of neighbors to search for each sample. Must be ≤ 2\u003csup\u003e16\u003c/sup\u003e.\n\n**samples** numpy array of shape \\[number of samples, number of features\\]\n            or tuple(raw device pointer (int), device index (int), shape (tuple(number of samples, number of features\\[, fp16x2 marker\\]))).\n            In the latter case, negative device index means host pointer. Optionally,\n            the tuple can be 1 item longer with the preallocated device pointer for\n            neighbors. 
dtype must be either float16 or convertible to float32.\n\n**centroids** numpy array with precalculated clusters' centroids (e.g., using\n              K-means/kmcuda/kmeans_cuda()). dtype must match **samples**.\n              If **samples** is a tuple then **centroids** must be a length-2\n              tuple, the first element is the pointer and the second is the\n              number of clusters. The shape is (number of clusters, number of features).\n\n**assignments** numpy array with sample-cluster associations. dtype is expected\n                to be compatible with uint32. If **samples** is a tuple then\n                **assignments** is a pointer. The shape is (number of samples,).\n\n**metric** str, the name of the distance metric to use. The default is Euclidean (L2),\n           it can be changed to \"cos\" to change the algorithm to Spherical K-means\n           with the angular distance. Please note that samples *must* be normalized\n           in the latter case.\n\n**device** integer, bitwise OR-ed CUDA device indices, e.g. 1 means first device,\n           2 means second device, 3 means using first and second device. Special\n           value 0 enables all available devices. The default is 0.\n\n**verbosity** integer, 0 means complete silence, 1 means mere progress logging,\n              2 means lots of output.\n\n**return** neighbor indices. If **samples** was a numpy array or\n            a host pointer tuple, the return type is numpy array, otherwise, a\n            raw pointer (integer) allocated on the same device. 
The shape is\n            (number of samples, k).\n\nR examples\n----------\n#### K-means\n```R\ndyn.load(\"libKMCUDA.so\")\nsamples = replicate(4, runif(16000))\nresult = .External(\"kmeans_cuda\", samples, 50, tolerance=0.01,\n                   seed=777, verbosity=1, average_distance=TRUE)\nprint(result$average_distance)\nprint(result$centroids[1:10,])\nprint(result$assignments[1:10])\n```\n\n#### K-nn\n```R\ndyn.load(\"libKMCUDA.so\")\nsamples = replicate(4, runif(16000))\ncls = .External(\"kmeans_cuda\", samples, 50, tolerance=0.01,\n                seed=777, verbosity=1)\nresult = .External(\"knn_cuda\", 20, samples, cls$centroids, cls$assignments,\n                   verbosity=1)\nprint(result[1:10,])\n```\n\nR API\n-----\n```R\nfunction kmeans_cuda(\n    samples, clusters, tolerance=0.01, init=\"k-means++\", yinyang_t=0.1,\n    metric=\"L2\", average_distance=FALSE, seed=Sys.time(), device=0, verbosity=0)\n```\n**samples** real matrix of shape \\[number of samples, number of features\\]\n            or list of real matrices which are rbind()-ed internally. No more\n            than INT32_MAX samples and UINT16_MAX features are supported.\n\n**clusters** integer, the number of clusters.\n\n**tolerance** real, if the relative number of reassignments drops below this value,\n              algorithm stops.\n\n**init** character vector or real matrix, sets the method for centroids initialization,\n         may be \"k-means++\", \"afk-mc2\", \"random\" or real matrix, of shape\n         \\[**clusters**, number of features\\].\n\n**yinyang_t** real, the relative number of cluster groups, usually 0.1.\n              0 disables Yinyang refinement.\n\n**metric** character vector, the name of the distance metric to use. The default\n           is Euclidean (L2), it can be changed to \"cos\" to change the algorithm\n           to Spherical K-means with the angular distance. 
Please note that\n           samples *must* be normalized in the latter case.\n\n**average_distance** logical, the value indicating whether to calculate\n                     the average distance between cluster elements and\n                     the corresponding centroids. Useful for finding\n                     the best K. Returned as the third list element.\n\n**seed** integer, random generator seed for reproducible results.\n\n**device** integer, bitwise OR-ed CUDA device indices, e.g. 1 means first device,\n           2 means second device, 3 means using first and second device. Special\n           value 0 enables all available devices. The default is 0.\n\n**verbosity** integer, 0 means complete silence, 1 means mere progress logging,\n              2 means lots of output.\n\n**return** list(centroids, assignments\\[, average_distance\\]). Indices in\n           assignments start from 1.\n\n```R\nfunction knn_cuda(k, samples, centroids, assignments, metric=\"L2\", device=0, verbosity=0)\n```\n**k** integer, the number of neighbors to search for each sample. Must be ≤ 2\u003csup\u003e16\u003c/sup\u003e.\n\n**samples** real matrix of shape \\[number of samples, number of features\\]\n            or list of real matrices which are rbind()-ed internally.\n            In the latter case, it is possible to pass in more than INT32_MAX\n            samples.\n\n**centroids** real matrix with precalculated clusters' centroids (e.g., using\n              kmeans() or kmeans_cuda()).\n\n**assignments** integer vector with sample-cluster associations. Indices start\n                from 1.\n\n**metric** character vector, the name of the distance metric to use. The default is Euclidean (L2),\n                can be changed to \"cos\" to behave as Spherical K-means with the\n                angular distance. Please note that samples *must* be normalized in that\n                case.\n\n**device** integer, bitwise OR-ed CUDA device indices, e.g. 
1 means first device, 2 means second device,\n           3 means using first and second device. Special value 0 enables all available devices.\n           The default is 0.\n\n**verbosity** integer, 0 means complete silence, 1 means mere progress logging,\n              2 means lots of output.\n\n**return** integer matrix with neighbor indices. The shape is (number of samples, k).\n           Indices start from 1.\n\nC examples\n----------\n`example.c`:\n```C\n#include \u003cassert.h\u003e\n#include \u003cstdint.h\u003e\n#include \u003cstdio.h\u003e\n#include \u003cstdlib.h\u003e\n#include \u003ckmcuda.h\u003e\n\n// ./example /path/to/data \u003cnumber of clusters\u003e\nint main(int argc, const char **argv) {\n  assert(argc == 3);\n  // we open the binary file with the data\n  // [samples_size][features_size][samples_size x features_size]\n  FILE *fin = fopen(argv[1], \"rb\");\n  assert(fin);\n  uint32_t samples_size, features_size;\n  assert(fread(\u0026samples_size, sizeof(samples_size), 1, fin) == 1);\n  assert(fread(\u0026features_size, sizeof(features_size), 1, fin) == 1);\n  uint64_t total_size = ((uint64_t)samples_size) * features_size;\n  float *samples = malloc(total_size * sizeof(float));\n  assert(samples);\n  assert(fread(samples, sizeof(float), total_size, fin) == total_size);\n  fclose(fin);\n  int clusters_size = atoi(argv[2]);\n  // we will store cluster centers here\n  float *centroids = malloc(clusters_size * features_size * sizeof(float));\n  assert(centroids);\n  // we will store assignments of every sample here\n  uint32_t *assignments = malloc(((uint64_t)samples_size) * sizeof(uint32_t));\n  assert(assignments);\n  float average_distance;\n  KMCUDAResult result = kmeans_cuda(\n      kmcudaInitMethodPlusPlus, NULL,  // kmeans++ centroids initialization\n      0.01,                            // less than 1% of the samples are reassigned in the end\n      0.1,                             // activate Yinyang refinement with 0.1 threshold\n      
kmcudaDistanceMetricL2,          // Euclidean distance\n      samples_size, features_size, clusters_size,\n      0xDEADBEEF,                      // random generator seed\n      0,                               // use all available CUDA devices\n      -1,                              // samples are supplied from host\n      0,                               // not in float16x2 mode\n      1,                               // moderate verbosity\n      samples, centroids, assignments, \u0026average_distance);\n  free(samples);\n  free(centroids);\n  free(assignments);\n  assert(result == kmcudaSuccess);\n  printf(\"Average distance between a centroid and the corresponding \"\n         \"cluster members: %f\\n\", average_distance);\n  return 0;\n}\n```\nBuild:\n```\ngcc -std=c99 -O2 example.c -I/path/to/kmcuda.h/dir -L/path/to/libKMCUDA.so/dir -l KMCUDA -Wl,-rpath,. -o example\n```\nRun:\n```\n./example serban.bin 1024\n```\nThe file format is the same as in [serban/kmeans](https://github.com/serban/kmeans/blob/master/README#L113).\n\nC API\n-----\n```C\nKMCUDAResult kmeans_cuda(\n    KMCUDAInitMethod init, const void *init_params, float tolerance,\n    float yinyang_t, KMCUDADistanceMetric metric, uint32_t samples_size,\n    uint16_t features_size, uint32_t clusters_size, uint32_t seed,\n    uint32_t device, int32_t device_ptrs, int32_t fp16x2, int32_t verbosity,\n    const float *samples, float *centroids, uint32_t *assignments,\n    float *average_distance)\n```\n**init** specifies the centroids initialization method: k-means++, afk-mc2, random or import\n         (in the latter case, **centroids** is read).\n\n**init_params** pointer to the optional parameters of the initialization method;\n                pass NULL for the defaults (as in the example above).\n\n**tolerance** if the number of reassignments drops below this ratio, stop.\n\n**yinyang_t** the relative number of cluster groups, usually 0.1.\n\n**metric** The distance metric to use. The default is Euclidean (L2), can be\n           changed to cosine to behave as Spherical K-means with the angular\n           distance. 
Please note that samples *must* be normalized in that case.\n\n**samples_size** number of samples.\n\n**features_size** number of features. If fp16x2 is set, this is one half of the real number of features.\n\n**clusters_size** number of clusters.\n\n**seed** random generator seed passed to srand().\n\n**device** CUDA device OR-ed indices - usually 1. For example, 1 means using first device,\n           2 means second device, 3 means first and second device (2x speedup). Special\n           value 0 enables all available devices.\n\n**device_ptrs** configures the location of input and output. If it is negative,\n                samples and returned arrays are on host, otherwise, they belong to the\n                corresponding device. E.g., if device_ptrs is 0, **samples** is expected\n                to be a pointer to device #0's memory and the resulting **centroids** and\n                **assignments** are expected to be preallocated on device #0 as well.\n                Usually this value is -1.\n\n**fp16x2** activates fp16 mode: two half-floats are packed into a single 32-bit float,\n           features_size becomes effectively 2 times bigger, and the returned\n           centroids are fp16x2 too.\n\n**verbosity** 0 - no output; 1 - progress output; \u003e=2 - debug output.\n\n**samples** input array of size samples_size x features_size in row major format.\n\n**centroids** output array of centroids of size clusters_size x features_size\n              in row major format.\n\n**assignments** output array of cluster indices for each sample of size\n                samples_size x 1.\n\n**average_distance** output mean distance between cluster elements and\n                     the corresponding centroids. 
If nullptr, not calculated.\n\nReturns KMCUDAResult (see `kmcuda.h`).\n\n```C\nKMCUDAResult knn_cuda(\n    uint16_t k, KMCUDADistanceMetric metric, uint32_t samples_size,\n    uint16_t features_size, uint32_t clusters_size, uint32_t device,\n    int32_t device_ptrs, int32_t fp16x2, int32_t verbosity,\n    const float *samples, const float *centroids, const uint32_t *assignments,\n    uint32_t *neighbors);\n```\n**k** integer, the number of neighbors to search for each sample.\n\n**metric** The distance metric to use. The default is Euclidean (L2), can be\n           changed to cosine to behave as Spherical K-means with the angular\n           distance. Please note that samples *must* be normalized in that case.\n\n**samples_size** number of samples.\n\n**features_size** number of features. If fp16x2 is set, this is one half of the real number of features.\n\n**clusters_size** number of clusters.\n\n**device** CUDA device OR-ed indices - usually 1. For example, 1 means using first device,\n           2 means second device, 3 means first and second device (2x speedup). Special\n           value 0 enables all available devices.\n\n**device_ptrs** configures the location of input and output. If it is negative,\n                samples, centroids, assignments and the returned array are on host,\n                otherwise, they belong to the corresponding device.\n                E.g., if device_ptrs is 0, **samples**, **centroids** and\n                **assignments** are expected to be pointers to device #0's memory\n                and the resulting **neighbors** is expected to be preallocated on\n                device #0 as well. 
Usually this value is -1.\n\n**fp16x2** activates fp16 mode: two half-floats are packed into a single 32-bit float,\n           features_size becomes effectively 2 times bigger, and it affects **samples**\n           and **centroids**.\n\n**verbosity** 0 - no output; 1 - progress output; \u003e=2 - debug output.\n\n**samples** input array of size samples_size x features_size in row major format.\n\n**centroids** input array of centroids of size clusters_size x features_size\n              in row major format.\n\n**assignments** input array of cluster indices for each sample of size\n                samples_size x 1.\n\n**neighbors** output array with the nearest neighbors of size\n              samples_size x k in row major format.\n\nReturns KMCUDAResult (see `kmcuda.h`).\n\n#### README {#ignore_this_doxygen_anchor}\n","funding_links":[],"categories":["Jupyter Notebook","Software"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fkmcuda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fkmcuda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fkmcuda/lists"}