{"id":16346395,"url":"https://github.com/milesgranger/gap_statistic","last_synced_at":"2025-04-04T13:12:10.758Z","repository":{"id":44725093,"uuid":"57036078","full_name":"milesgranger/gap_statistic","owner":"milesgranger","description":"Dynamically get the suggested clusters in the data for unsupervised learning.","archived":false,"fork":false,"pushed_at":"2024-07-31T13:08:38.000Z","size":401,"stargazers_count":219,"open_issues_count":7,"forks_count":46,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-28T12:07:37.120Z","etag":null,"topics":["cluster","cluster-count","clustering","kmeans","python","scikit-learn","unsupervised","unsupervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/milesgranger.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-MIT","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-04-25T11:22:18.000Z","updated_at":"2025-03-27T07:33:18.000Z","dependencies_parsed_at":"2023-02-12T09:00:50.670Z","dependency_job_id":"41f35747-c095-420f-b320-2765a98f98bd","html_url":"https://github.com/milesgranger/gap_statistic","commit_stats":{"total_commits":129,"total_committers":6,"mean_commits":21.5,"dds":"0.046511627906976716","last_synced_commit":"7fe220653311dc4961ce0571ed43064adca9142d"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milesgranger%2Fgap_statistic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milesgranger%2Fgap_sta
tistic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milesgranger%2Fgap_statistic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/milesgranger%2Fgap_statistic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/milesgranger","download_url":"https://codeload.github.com/milesgranger/gap_statistic/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247182353,"owners_count":20897380,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","cluster-count","clustering","kmeans","python","scikit-learn","unsupervised","unsupervised-learning"],"created_at":"2024-10-11T00:35:14.693Z","updated_at":"2025-04-04T13:12:10.739Z","avatar_url":"https://github.com/milesgranger.png","language":"Rust","funding_links":[],"categories":["Clustering"],"sub_categories":[],"readme":"### Python implementation of the [Gap Statistic](http://www.web.stanford.edu/~hastie/Papers/gap.pdf)\n\n[![PythonCI](https://github.com/milesgranger/gap_statistic/workflows/PythonCI/badge.svg?branch=master)](https://github.com/milesgranger/gap_statistic/actions?query=branch=master)\n[![RustCI](https://github.com/milesgranger/gap_statistic/workflows/RustCI/badge.svg?branch=master)](https://github.com/milesgranger/gap_statistic/actions?query=branch=master)\n\n[![Downloads](http://pepy.tech/badge/gap-stat)](http://pepy.tech/project/gap-stat)\n[![Coverage 
Status](https://coveralls.io/repos/github/milesgranger/gap_statistic/badge.svg)](https://coveralls.io/github/milesgranger/gap_statistic)\n[![Code Health](https://landscape.io/github/milesgranger/gap_statistic/master/landscape.svg?style=flat)](https://landscape.io/github/milesgranger/gap_statistic/master)\n[![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)\n\n---\n\n### Maintenance mode\n\nI've lost interest/time in developing this further, other things have taken priority for some time now. However, all is not lost. I will be willing to review/comment on any issues/PRs but will not complete any fixes or feature requests myself. \n\n---\n\n#### Purpose\nDynamically identify the suggested number of clusters in a data-set\nusing the gap statistic.\n\n---\n\n### Full example available in a notebook [HERE](Example.ipynb)\n\n---\n#### Install:  \nBleeding edge:\n```commandline\npip install git+https://github.com/milesgranger/gap_statistic.git\n```\n\nPyPI:  \n```commandline\npip install --upgrade gap-stat\n```\n\nWith Rust extension:\n```commandline\npip install --upgrade gap-stat[rust]\n```\n\n\n---\n#### Uninstall:\n```commandline\npip uninstall gap-stat\n```\n\n---\n\n### Methodology:\n\nThis package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in [\"Estimating the number of clusters in a data set via the gap statistic\"](https://statweb.stanford.edu/~gwalther/gap) (Tibshirani et al.).\n\nThe methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:\n\n  - Taking the `k` maximizing the Gap value, which is calculated for each `k`. 
This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.\n  - Taking the smallest `k` such that Gap(k) \u003e= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure `diff = Gap(k) - Gap(k+1) + s(k+1)` is calculated for each `k`; the parallel here, then, is to take the smallest `k` for which `diff` is non-negative. Note that in some cases this can be true for the entire range of `k`.\n  - Taking the `k` maximizing the Gap\\* value, an alternative measure suggested in [\"A comparison of Gap statistic definitions with and\nwith-out logarithm function\"](https://core.ac.uk/download/pdf/12172514.pdf) by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistic suffers, and can also suggest an optimal value for `k` in cases where Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap\\* in the case of overlapped clusters, due to its tendency to overestimate the number of clusters.\n\nNote that none of the above methods is guaranteed to find an optimal value for `k`, and that they often contradict one another. Rather, they can provide more information on which to base your choice of `k`, which should take numerous other factors into account.\n\n---\n\n### Use:\n\nFirst, construct an `OptimalK` object. Optional initialization parameters are:\n\n  - `n_jobs` - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.\n  - `parallel_backend` - Possible values are `joblib`, `rust` or `multiprocessing` for the built-in Python backend. If `parallel_backend == 'rust'` it will use all cores.\n  - `clusterer` - Takes a custom clusterer function to be used when clustering. 
See the example notebook for more details.\n  - `clusterer_kwargs` - Any keyword arguments to be forwarded to the custom clusterer function on each call.\n\nAn example initialization:\n```python\nfrom gap_statistic import OptimalK\noptimalK = OptimalK(n_jobs=4, parallel_backend='joblib')\n```\n\n\nAfter the object is created, it can be called like a function, and provided with a dataset for which the optimal K is found and returned. Parameters are:\n\n  - `X` - A pandas dataframe or numpy array of data points of shape `(n_samples, n_features)`.\n  - `n_refs` - The number of random reference data sets to use as inertia reference to actual data. Optional.\n  - `cluster_array` - A 1-dimensional iterable of integers; each representing `n_clusters` to try on the data. Optional.\n\nFor example:\n```python\nimport numpy as np\nn_clusters = optimalK(X, cluster_array=np.arange(1, 15))\n```\n\nAfter performing the search procedure, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the `gap_df` attribute of the `OptimalK` object:\n\n```python\noptimalK.gap_df.head()\n```\n\nThe columns of the dataframe are:\n\n  - `n_clusters` - The number of clusters for which the statistics in this row were calculated.\n  - `gap_value` - The Gap value for this `n`.\n  - `gap*` - The Gap\\* value for this `n`.\n  - `ref_dispersion_std` - The standard deviation of the reference distributions for this `n`.\n  - `sk` - The standard error of the Gap statistic for this `n`.\n  - `sk*` - The standard error of the Gap\\* statistic for this `n`.\n  - `diff` - The diff value for this `n` (see the methodology section for details).\n  - `diff*` - The diff\\* value for this `n` (corresponding to the diff value for Gap\\*).\n\n\nAdditionally, the relation between the above measures and the number of clusters can be plotted by calling the `OptimalK.plot_results()` method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which prints four plots:\n\n  - A 
plot of the Gap value versus n, the number of clusters.\n  - A plot of diff versus n.\n  - A plot of the Gap\\* value versus n, the number of clusters.\n  - A plot of the diff\\* value versus n.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmilesgranger%2Fgap_statistic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmilesgranger%2Fgap_statistic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmilesgranger%2Fgap_statistic/lists"}