{"id":44324852,"url":"https://github.com/scrayil/k-means","last_synced_at":"2026-02-11T07:19:31.511Z","repository":{"id":178988497,"uuid":"643423661","full_name":"Scrayil/k-means","owner":"Scrayil","description":"This project consists in the implementation of the K-Means and Mini-Batch K-Means clustering algorithms. This is not to be considered as the final and most efficient algorithm implementation as the objective here is to make a clear omparison between the sequential and parallel execution of the clustering steps.","archived":false,"fork":false,"pushed_at":"2023-07-09T08:24:41.000Z","size":30479,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-12-19T15:36:19.363Z","etag":null,"topics":["benchmark","centroids","clustering","clusters","euclidean-distances","gpu","gpu-programming","k-means","k-means-clustering","machine-learning","mini-batch-kmeans","mini-batching","multithreading","parallel-computing","parallel-programming","perfomance-analysis","speedup","unsupervised-learning","unsupervised-machine-learning"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Scrayil.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-05-21T05:34:56.000Z","updated_at":"2023-10-25T13:57:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"6e1a22a9-b696-4432-93c9-5a7c9a0443bd","html_url":"https://github.com/Scrayil/k-means","commit_stats":{"total_commits":49,"total_committers":2,"mean_commits":24.5,"dds":"0.20408163265306123","last_synced_commit":"a688c250caa23933cfe6ae7b10
6bccd988815783"},"previous_names":["scrayil/k-means"],"tags_count":0,"template":null,"template_full_name":null,"purl":"pkg:github/Scrayil/k-means","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Scrayil%2Fk-means","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Scrayil%2Fk-means/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Scrayil%2Fk-means/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Scrayil%2Fk-means/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Scrayil","download_url":"https://codeload.github.com/Scrayil/k-means/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Scrayil%2Fk-means/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29329492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-11T06:13:03.264Z","status":"ssl_error","status_checked_at":"2026-02-11T06:12:55.843Z","response_time":97,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","centroids","clustering","clusters","euclidean-distances","gpu","gpu-programming","k-means","k-means-clustering","machine-learning","mini-batch-kmeans","mini-batching","multithreading","parallel-computing","parallel-programming","perfomance-analysis","speedup","unsupervised-learning","unsupervised-machine-learning"],"created_at":"2026-02-11T07:19:31.042Z","updated_at":"2026-02-11T07:19:31.505Z","avatar_url":"https://github.com/Scrayil.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# K_Means\n\nThis project consists in the implementation of the K-Means and Mini-Batch K-Means clustering algorithms.  \nThis is not to be considered as the final and most efficient algorithm implementation as the objective here is to make a clear comparison between the sequential and parallel execution of the clustering steps.   \nThis project offers two different implementations:\n- the first one follows a sequential execution by relying entirely on the CPU for the computation  \n- the second one takes advantage of the GPU capabilities to achieve parallelism\n\nBy building the project only one executable is generated. The program relies on a configuration file in which it's possible to select which implementations to run and to specify some parameters for the clustering algorithm.  
\nIt is possible to limit the maximum number of records processed from the given dataset and to set parameters such as the desired number of clusters and the maximum tolerance used to evaluate overall convergence.\nA specific random seed can optionally be set; it is used by both implementations during their initialization phases. This is done on purpose, for consistency when comparing the two.  \nIf no seed is specified, one is generated automatically and shared by both.  \n\nThe following animation shows some of the clustering steps in the K-Means algorithm:  \n\n![Quick animation of the clustering steps for the K-Means algorithm](https://github.com/Scrayil/k-means/blob/af8a170da15aa8a0e4d70493d9dd5bfd40b3e72e/report/media/images/k-means-5-clusters-animation.gif)  \n*Visual representation of 5 different clusters from “K-Means clustering and Voronoi sets”,\nhttps://freakonometrics.hypotheses.org/19156. Accessed 05 July 2023.*\n\n## Requirements  \n\nBefore building the project, complete the following steps:\n1.  The dataset is large and is stored with Git LFS. Make sure to install [Git-LFS](https://git-lfs.com/) if necessary.\n2.  Install the appropriate CUDA libraries on your system. See: [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit)\n3.  Change the CUDA properties inside the \"CMakeLists.txt\" file according to your GPU characteristics.\n4.  Set the environment variable required to locate your nvcc compiler like this:\n    ~~~bash\n    export CUDACXX=/usr/local/cuda/bin/nvcc\n    ~~~\n5.  While testing and evaluating performance on your machine, adjust the number of threads (co-workers) to match your GPU resources. 
You can find that variable inside the \"k_means.cuh\" file here: [parallel version](https://github.com/Scrayil/k-means/tree/af8a170da15aa8a0e4d70493d9dd5bfd40b3e72e/parallel)\n\n## Reporting  \nThe aim of the project was to compare the two implementations, highlight any limitations and evaluate the performance benefits resulting from GPU multithreading.\nFor this purpose, a dedicated, ready-to-use benchmarking dataset has been randomly generated.  \nNote that **no data pre-processing strategy** has been applied here.  \nBoth the measured [results](https://github.com/Scrayil/k-means/tree/af8a170da15aa8a0e4d70493d9dd5bfd40b3e72e/results) and the [report](https://github.com/Scrayil/k-means/tree/af8a170da15aa8a0e4d70493d9dd5bfd40b3e72e/report) are included in this repository.\n\nThe execution report is structured as follows:  \n\n~~~csv\nLine 1: version,elapsed_time,n_data_points,n_features,n_clusters,max_tolerance,total_iterations,random_seed,centroids_data_path\nLine 2: sequential,14.485,1000,2,20,0,26,2793709286,\"/home/scrayil/Desktop/dev/University/projects/PPFML/K_Means/results/centroids/sequential_23-06-22T17:51:14_2793709286.json\"\nLine 3: parallel,61.298,1000,2,20,0,26,2793709286,\"/home/scrayil/Desktop/dev/University/projects/PPFML/K_Means/results/centroids/parallel_23-06-22T17:51:14_2793709286.json\"\n~~~\n\n## Notes\n\nWhen a particular seed is specified, the **generation of the centroids' filenames** guarantees uniqueness as long as two separate executions of the main program do not occur within the same second.  \nIf two executions occur within the same second (same timestamp), uniqueness is guaranteed only if they use different seeds (the default behavior).  \n\nThis software includes third-party code for parsing JSON and CSV files.  
\n- The CSV parser is taken from [AriaFallah](https://github.com/AriaFallah/csv-parser.git)\n- The JSON parser is taken from [nlohmann](https://github.com/nlohmann/json.git)\n\n## License\nCopyright 2023 Mattia Bennati  \nLicensed under the GNU GPL V2: https://www.gnu.org/licenses/old-licenses/gpl-2.0.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrayil%2Fk-means","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrayil%2Fk-means","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrayil%2Fk-means/lists"}