{"id":17100867,"url":"https://github.com/joshlk/k-means-constrained","last_synced_at":"2025-05-15T12:05:38.963Z","repository":{"id":41180233,"uuid":"150868923","full_name":"joshlk/k-means-constrained","owner":"joshlk","description":"K-Means clustering - constrained with minimum and maximum cluster size. Documentation: https://joshlk.github.io/k-means-constrained","archived":false,"fork":false,"pushed_at":"2025-02-06T15:32:24.000Z","size":6870,"stargazers_count":209,"open_issues_count":0,"forks_count":44,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-14T20:57:34.041Z","etag":null,"topics":["clustering","k-means","kmeans-constrained","maximum-cluster-sizes","minimum-cluster-sizes","ml","optimization","python"],"latest_commit_sha":null,"homepage":"https://github.com/joshlk/k-means-constrained","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joshlk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-09-29T13:53:48.000Z","updated_at":"2025-04-11T14:01:59.000Z","dependencies_parsed_at":"2022-08-15T16:30:31.570Z","dependency_job_id":"9a12a8a0-8963-4743-82b2-2f60e274299c","html_url":"https://github.com/joshlk/k-means-constrained","commit_stats":{"total_commits":197,"total_committers":4,"mean_commits":49.25,"dds":0.3350253807106599,"last_synced_commit":"65e22120d37e7c878d91d7fc73cb6380d581ec4f"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jos
hlk%2Fk-means-constrained","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshlk%2Fk-means-constrained/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshlk%2Fk-means-constrained/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshlk%2Fk-means-constrained/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joshlk","download_url":"https://codeload.github.com/joshlk/k-means-constrained/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254337613,"owners_count":22054253,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","k-means","kmeans-constrained","maximum-cluster-sizes","minimum-cluster-sizes","ml","optimization","python"],"created_at":"2024-10-14T15:22:51.940Z","updated_at":"2025-05-15T12:05:33.894Z","avatar_url":"https://github.com/joshlk.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI](https://img.shields.io/pypi/v/k-means-constrained)](https://pypi.org/project/k-means-constrained/)\n![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)\n[![Build](https://github.com/joshlk/k-means-constrained/actions/workflows/build_wheels.yml/badge.svg)](https://github.com/joshlk/k-means-constrained/actions/workflows/build_wheels.yml)\n[**Documentation**](https://joshlk.github.io/k-means-constrained/)\n\n# k-means-constrained\nK-means clustering implementation whereby a minimum and/or maximum size for 
each\ncluster can be specified.\n\nThis K-means implementation modifies the cluster assignment step (E in EM)\nby formulating it as a Minimum Cost Flow (MCF) linear network\noptimisation problem. This is then solved with a cost-scaling\npush-relabel algorithm, using [Google OR-Tools'\n`SimpleMinCostFlow`](https://developers.google.com/optimization/flow/mincostflow),\nwhich is a fast C++ implementation.\n\nThis package is inspired by [Bradley et al.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-65.pdf).\nThe original Minimum Cost Flow (MCF) network proposed by Bradley et al.\nhas been modified so that maximum cluster sizes can also be specified along\nwith minimum cluster sizes.\n\nThe code is based on [scikit-learn's `KMeans`](https://scikit-learn.org/0.19/modules/generated/sklearn.cluster.KMeans.html)\nand implements the same [API with modifications](https://joshlk.github.io/k-means-constrained/).\n\nRef:\n1. [Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. \"Constrained k-means clustering.\"\n    Microsoft Research, Redmond (2000): 1-8.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-65.pdf)\n2. [Google's SimpleMinCostFlow C++ implementation](https://github.com/google/or-tools/blob/master/ortools/graph/min_cost_flow.h)\n\n# Installation\nYou can install k-means-constrained from PyPI:\n\n```\npip install k-means-constrained\n```\n\nIt is supported on Python 3.10, 3.11 and 3.12. Previous versions of k-means-constrained support older versions of Python and NumPy.\n\n# Example\n\nMore details can be found in the [API documentation](https://joshlk.github.io/k-means-constrained/).\n\n```python\n\u003e\u003e\u003e from k_means_constrained import KMeansConstrained\n\u003e\u003e\u003e import numpy as np\n\u003e\u003e\u003e X = np.array([[1, 2], [1, 4], [1, 0],\n...                [4, 2], [4, 4], [4, 0]])\n\u003e\u003e\u003e clf = KMeansConstrained(\n...     n_clusters=2,\n...     
size_min=2,\n...     size_max=5,\n...     random_state=0\n... )\n\u003e\u003e\u003e clf.fit_predict(X)\narray([0, 0, 0, 1, 1, 1], dtype=int32)\n\u003e\u003e\u003e clf.cluster_centers_\narray([[ 1.,  2.],\n       [ 4.,  2.]])\n\u003e\u003e\u003e clf.labels_\narray([0, 0, 0, 1, 1, 1], dtype=int32)\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eCode only\u003c/summary\u003e\n    \n```\nfrom k_means_constrained import KMeansConstrained\nimport numpy as np\nX = np.array([[1, 2], [1, 4], [1, 0],\n              [4, 2], [4, 4], [4, 0]])\nclf = KMeansConstrained(\n    n_clusters=2,\n    size_min=2,\n    size_max=5,\n    random_state=0\n)\nclf.fit_predict(X)\nclf.cluster_centers_\nclf.labels_\n```\n    \n\u003c/details\u003e\n\n# Time complexity and runtime\n\nk-means-constrained is a more complex algorithm than vanilla k-means and therefore takes longer to execute and scales worse.\n\nGiven $n$ data points and $c$ clusters, the time complexity is:\n* k-means: $\\mathcal{O}(nc)$\n* k-means-constrained\u003csup\u003e1\u003c/sup\u003e: $\\mathcal{O}((n^3c+n^2c^2+nc^3)\\log(n+c))$\n\nThis assumes a constant number of algorithm iterations and data-point features/dimensions.\n\nWhen $n$ is of the same order as $c$ ($n \\sim c$), these become:\n* k-means: $\\mathcal{O}(n^2)$\n* k-means-constrained\u003csup\u003e1\u003c/sup\u003e: $\\mathcal{O}(n^4\\log(n))$\n\nBelow is a runtime comparison between k-means and k-means-constrained in which the number of iterations, initializations, multi-process pool size and dimension size are fixed. The number of clusters is always one-tenth the number of data points ($n=10c$). 
As the complexity expressions above show, the runtime is independent of the minimum or maximum cluster size, so neither constraint is applied in the comparison below.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/joshlk/k-means-constrained/master/etc/execution_time.png\" alt=\"Data-points vs execution time for k-means vs k-means-constrained. Data-points=10*clusters. No min/max constraints\" width=\"50%\" height=\"50%\"\u003e\n\u003c/p\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eSystem details\u003c/summary\u003e\n    \n* OS: Linux-5.15.0-75-generic-x86_64-with-glibc2.35\n* CPU: AMD EPYC 7763 64-Core Processor\n* CPU cores: 120\n* k-means-constrained version: 0.7.3\n* numpy version: 1.24.2\n* scipy version: 1.11.1\n* ortools version: 9.6.2534\n* joblib version: 1.3.1\n* sklearn version: 1.3.0\n\u003c/details\u003e\n\n---\n\n\u003csup\u003e1\u003c/sup\u003e: [OR-Tools states](https://developers.google.com/optimization/reference/graph/min_cost_flow) the time complexity of its cost-scaling push-relabel algorithm for the min-cost flow problem as $\\mathcal{O}(n^2m\\log(nC))$, where $n$ is the number of nodes, $m$ is the number of edges and $C$ is the maximum absolute edge cost.\n\n# Change log\n\n* v0.7.5 fix comment in README on Python version that is supported\n* v0.7.4 compatible with Numpy +v2.1.1. Added Python 3.12 support and dropped Python 3.8 and 3.9 support (due to Numpy). Linux ARM support has been dropped as we use GitHub runners to build the package and ARM machines were being emulated using QEMU. This, however, was producing numerical errors. 
GitHub should natively support Ubuntu ARM images soon and then we can start to re-build them.\n* v0.7.3 compatible with Numpy v1.23.0 to 1.26.4\n\n# Citations\n\nIf you use this software in your research, please use the following citation:\n\n```\n@software{Levy-Kramer_k-means-constrained_2018,\n  author = {Levy-Kramer, Josh},\n  month = apr,\n  title = {{k-means-constrained}},\n  url = {https://github.com/joshlk/k-means-constrained},\n  year = {2018}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshlk%2Fk-means-constrained","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoshlk%2Fk-means-constrained","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshlk%2Fk-means-constrained/lists"}