https://github.com/ndgigliotti/cluster-tuner
A GridSearchCV-like hyperparameter tuner for clustering algorithms.
- Host: GitHub
- URL: https://github.com/ndgigliotti/cluster-tuner
- Owner: ndgigliotti
- License: bsd-3-clause
- Created: 2021-12-02T23:32:31.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2026-02-01T19:11:46.000Z (4 months ago)
- Last Synced: 2026-02-02T01:55:09.779Z (4 months ago)
- Topics: clustering, gridsearchcv, hyperparameter-tuning, parameter-search, scikit-learn, scikit-learn-compatible, unsupervised
- Language: Python
- Homepage:
- Size: 188 KB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# cluster-tuner
A GridSearchCV-like hyperparameter tuner for clustering algorithms.
## Installation
```bash
pip install cluster-tuner
```
**Requirements:** Python >= 3.10, scikit-learn >= 1.6
## Purpose
This project provides a simple, scikit-learn-compatible hyperparameter tuning tool for clustering. It's intended for situations where predicting clusters for new data points is a low priority. Many clustering algorithms in scikit-learn are **transductive**, meaning they are not designed to be applied to new observations. Even when using an **inductive** algorithm like KMeans, you might not need to predict clusters for new data—or prediction might be a lower priority than finding the best clusters.
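scikit-learn's `GridSearchCV` relies on cross-validation: it fits on training folds and scores predictions on held-out folds, which presupposes an inductive model. Transductive clusterers cannot be evaluated that way, so an alternative tool is necessary.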
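The transductive/inductive distinction is visible directly in scikit-learn's API. As a small illustration (the `has_predict` dictionary is just a name chosen for this sketch), the snippet below checks which estimators expose a `predict` method:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Inductive estimators expose predict(); transductive ones only label
# the data they were fitted on (via fit_predict / labels_).
has_predict = {}
for estimator in (KMeans(n_clusters=3), DBSCAN(), AgglomerativeClustering()):
    name = type(estimator).__name__
    has_predict[name] = hasattr(estimator, "predict")
    print(f"{name}: predict available = {has_predict[name]}")
```

`KMeans` can assign new points to the nearest centroid, while `DBSCAN` and `AgglomerativeClustering` have no `predict` method, so they cannot be scored on a held-out fold.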
## `ClusterTuner`
The `ClusterTuner` class is a hyperparameter search tool for clustering algorithms. It fits one model per hyperparameter combination and selects the best. The implementation is derived from scikit-learn's `GridSearchCV`, but without cross-validation. It works with clustering-specific scorers and doesn't always require a target variable, since metrics like silhouette, Calinski-Harabasz, and Davies-Bouldin are designed for unsupervised evaluation.
The interface is largely the same as `GridSearchCV`. Results are stored in the `results_` attribute (`cv_results_` also works as an alias for compatibility).
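To make the "one fit per combination, no cross-validation" idea concrete, here is a minimal sketch of the same search written by hand with plain scikit-learn. It is not `ClusterTuner`'s actual implementation; the dataset, the grid, and names such as `results` and `best_params` are assumptions made for illustration.

```python
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Toy data, assumed for this sketch only
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

base = KMeans(n_init=10, random_state=0)
param_grid = {"n_clusters": [2, 3, 4, 5]}

results = []
for params in ParameterGrid(param_grid):
    model = clone(base).set_params(**params)
    labels = model.fit_predict(X)        # fit on all the data, no folds
    score = silhouette_score(X, labels)  # score the same data it was fitted on
    results.append((score, params))

# Keep the combination with the highest score
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

Each candidate is fitted once on the full dataset and scored on that same data, which is the key difference from a cross-validated search.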
### Basic Usage
```python
from sklearn.cluster import DBSCAN
from cluster_tuner import ClusterTuner

# X is your feature matrix, shape (n_samples, n_features)
tuner = ClusterTuner(
    DBSCAN(),
    param_grid={'eps': [0.3, 0.5, 0.7], 'min_samples': [5, 10]},
    scoring='silhouette',
)
tuner.fit(X)

print(tuner.best_params_)
print(tuner.best_score_)
labels = tuner.labels_

# Access detailed results (single-metric uses 'test_score')
print(tuner.results_['test_score'])
```
### Key Parameters
- **`scoring`**: Metric name (string), callable, or list/dict for multi-metric evaluation.
- **`refit`** (default=True): Whether to refit the best estimator on the full dataset. For multi-metric, must be a string specifying which metric to use.
- **`max_noise`** (default=0.1): Maximum allowed ratio of noise points (label=-1). Fits exceeding this threshold receive `error_score`.
- **`min_cluster_size`** (default=3): Minimum allowed size for the smallest cluster. Fits with smaller clusters receive `error_score`.
- **`error_score`** (default=np.nan): Value to assign when a fit fails or violates constraints. Use `'raise'` to raise exceptions instead.
- **`n_jobs`**: Number of parallel jobs (-1 for all CPUs).
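The `max_noise` and `min_cluster_size` constraints can be pictured as a check on the label vector produced by each fit. The helper below is a hypothetical sketch of that check, not the library's code; in particular, treating an all-noise labeling as a violation is an assumption.

```python
from collections import Counter


def violates_constraints(labels, max_noise=0.1, min_cluster_size=3):
    """Hypothetical helper: True if a (non-empty) labeling breaks either constraint."""
    labels = [int(label) for label in labels]
    counts = Counter(labels)
    # Label -1 marks noise points (as produced by DBSCAN, for example)
    noise_ratio = counts.pop(-1, 0) / len(labels)
    if noise_ratio > max_noise:
        return True
    # Assumption: a labeling with no clusters at all also counts as a violation
    if not counts:
        return True
    # Compare the smallest remaining cluster against the minimum size
    return min(counts.values()) < min_cluster_size


print(violates_constraints([0] * 5 + [1] * 5))   # False: no noise, clusters of 5
print(violates_constraints([0] * 8 + [-1] * 2))  # True: 20% noise exceeds 10%
print(violates_constraints([0] * 9 + [1]))       # True: a cluster of size 1
```

A fit flagged this way would receive `error_score` instead of a metric value, per the parameter descriptions above.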
### Multi-Metric Scoring
Evaluate multiple metrics simultaneously using a list, tuple, or dict:
```python
tuner = ClusterTuner(
    DBSCAN(),
    param_grid={'eps': [0.3, 0.5, 0.7]},
    scoring=['silhouette', 'calinski_harabasz', 'neg_davies_bouldin'],
    refit='silhouette',  # Required: which metric to use for selecting best
)
tuner.fit(X)

# Results use 'test_' prefix for each metric
print(tuner.results_['test_silhouette'])
print(tuner.results_['test_calinski_harabasz'])
print(tuner.results_['test_neg_davies_bouldin'])
```
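For intuition about what a multi-metric search records, the sketch below computes the same three metrics by hand with scikit-learn and uses one of them to pick a winner, mirroring the role of `refit`. The dataset, the `candidates` list, and the hand-built `results` dictionary are assumptions for illustration; only the `test_` key prefix is taken from the example above.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy data, assumed for this sketch only
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

metrics = {
    "silhouette": silhouette_score,
    "calinski_harabasz": calinski_harabasz_score,
    # Negated so that higher is better, like the other two
    "neg_davies_bouldin": lambda X, labels: -davies_bouldin_score(X, labels),
}

candidates = [2, 3, 4, 5]
results = {f"test_{name}": [] for name in metrics}
for k in candidates:
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    for name, func in metrics.items():
        results[f"test_{name}"].append(func(X, labels))

# Only the metric named by `refit` decides which candidate wins
refit = "silhouette"
scores = results[f"test_{refit}"]
best_k = candidates[scores.index(max(scores))]
print(best_k)
```

The other metrics are still recorded for every candidate, so they can be inspected afterwards even though they do not influence the selection.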
### Supervised Scoring
When ground truth labels are available, use supervised metrics:
```python
from sklearn.cluster import KMeans

tuner = ClusterTuner(
    KMeans(n_init='auto'),
    param_grid={'n_clusters': [2, 3, 4, 5]},
    scoring='adjusted_rand',
)
tuner.fit(X, y=y_true)  # Pass ground truth labels
print(tuner.best_score_)  # Adjusted Rand Index
```
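Supervised clustering metrics compare partitions rather than label values, so the arbitrary cluster IDs a clusterer assigns do not need to match the ground-truth IDs. A small sketch with hand-written toy labels (assumed for illustration) shows this for the Adjusted Rand Index:

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 1]
same_partition_renamed = [1, 1, 1, 0, 0, 0]  # identical grouping, IDs swapped
one_point_moved = [0, 0, 1, 1, 1, 1]         # one sample in the wrong group

ari_renamed = adjusted_rand_score(y_true, same_partition_renamed)
ari_moved = adjusted_rand_score(y_true, one_point_moved)

print(ari_renamed)  # 1.0: a perfect match despite the different IDs
print(ari_moved < ari_renamed)  # True
```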
### Pipeline Support
`ClusterTuner` works with scikit-learn pipelines:
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KMeans(n_init='auto'),
)

tuner = ClusterTuner(
    pipe,
    param_grid={'kmeans__n_clusters': [2, 3, 4, 5]},
    scoring='silhouette',
)
tuner.fit(X)
```
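The `kmeans__n_clusters` key follows scikit-learn's `<step>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. The following sketch uses only scikit-learn to list the step names and tunable keys; `step_names` and `tunable` are names chosen for this example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), PCA(n_components=10), KMeans(n_init="auto"))

# make_pipeline derives step names from the lowercased class names
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'pca', 'kmeans']

# Every '<step>__<parameter>' key is a valid param_grid entry
tunable = sorted(p for p in pipe.get_params() if "__" in p)
print([p for p in tunable if p.startswith("kmeans__")])
```

Parameters of earlier steps, such as `pca__n_components`, can be added to the same grid in the same way.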
## Scorers
To choose a scorer, pass the string name of a clustering metric to `scoring`, e.g., `'silhouette'`, `'calinski_harabasz'`, or `'adjusted_rand'` (the `_score` suffix is optional).
### Recognized Scorer Names
**Unsupervised metrics** (no ground truth required):
- `'silhouette'` / `'silhouette_score'`
- `'silhouette_euclidean'` / `'silhouette_score_euclidean'`
- `'silhouette_cosine'` / `'silhouette_score_cosine'`
- `'neg_davies_bouldin'` / `'neg_davies_bouldin_score'`
- `'calinski_harabasz'` / `'calinski_harabasz_score'`
**Supervised metrics** (require ground truth labels `y`):
- `'mutual_info'` / `'mutual_info_score'`
- `'normalized_mutual_info'` / `'normalized_mutual_info_score'`
- `'adjusted_mutual_info'` / `'adjusted_mutual_info_score'`
- `'rand'` / `'rand_score'`
- `'adjusted_rand'` / `'adjusted_rand_score'`
- `'completeness'` / `'completeness_score'`
- `'fowlkes_mallows'` / `'fowlkes_mallows_score'`
- `'homogeneity'` / `'homogeneity_score'`
- `'v_measure'` / `'v_measure_score'`
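The `_euclidean` and `_cosine` silhouette variants listed above presumably correspond to the `metric` argument of scikit-learn's `silhouette_score`; that mapping is an assumption, not something this README states. The sketch below shows the underlying scikit-learn call on toy data (also assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data, assumed for this sketch only
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Same labels, different distance used to judge cohesion and separation
sil_euclidean = silhouette_score(X, labels, metric="euclidean")
sil_cosine = silhouette_score(X, labels, metric="cosine")
sil_default = silhouette_score(X, labels)  # scikit-learn defaults to euclidean

print(round(sil_euclidean, 3), round(sil_cosine, 3))
```

Cosine distance ignores vector magnitude, so it is typically the better fit for data such as normalized text embeddings, while Euclidean distance suits features on comparable scales.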
### Naming Convention
Following sklearn's convention, metrics where **lower is better** use a `neg_` prefix. The score is negated internally so that higher values always indicate better clustering:
- `'neg_davies_bouldin'` — Davies-Bouldin index (lower raw values = better separation)
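As a quick illustration of why the negation matters, the sketch below compares a sensible labeling with random labels on toy data (all names and data here are assumptions for the example). The raw Davies-Bouldin index is lower for the better labeling, so negating it lets a "pick the maximum" rule work for every metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Well-separated toy blobs, assumed for this sketch only
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

good_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
random_labels = np.random.RandomState(0).randint(0, 3, size=len(X))

raw_good = davies_bouldin_score(X, good_labels)
raw_random = davies_bouldin_score(X, random_labels)

# Negate so that "higher is better" holds, as with the neg_ prefix
neg_good, neg_random = -raw_good, -raw_random
print(raw_good < raw_random)   # the lower raw value marks the better labeling
print(neg_good > neg_random)   # after negation, the higher value wins
```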
### Custom Scorers
Create custom scorers using `make_scorer`:
```python
from cluster_tuner import ClusterTuner, make_scorer

# Unsupervised scorer: score_func(X, labels)
def my_metric(X, labels):
    return some_score  # placeholder: compute a float from X and labels

scorer = make_scorer(my_metric, ground_truth=False)

# Supervised scorer: score_func(y_true, labels)
def my_supervised_metric(y_true, labels):
    return some_score  # placeholder: compute a float from y_true and labels

scorer = make_scorer(my_supervised_metric, ground_truth=True)

# estimator and param_grid are placeholders for your own model and grid
tuner = ClusterTuner(estimator, param_grid, scoring=scorer)
```
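As one possible concrete `score_func(X, labels)`, here is a sketch of a silhouette variant that ignores noise points. The function name `silhouette_without_noise`, the toy data, and the choice to return `-1.0` when fewer than two clusters remain are all assumptions for illustration, not part of the library.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_without_noise(X, labels):
    """Hypothetical metric: silhouette computed on non-noise points only."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    keep = labels != -1  # drop points labeled as noise
    n_clusters = len(np.unique(labels[keep]))
    # silhouette_score needs 2 <= n_clusters <= n_samples - 1
    if n_clusters < 2 or n_clusters >= keep.sum():
        return -1.0  # assumed worst-case value for degenerate labelings
    return float(silhouette_score(X[keep], labels[keep]))


# Two tight groups plus one far-away outlier marked as noise
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [50.0, -50.0]])
labels = np.array([0, 0, 0, 1, 1, 1, -1])

score = silhouette_without_noise(X, labels)
print(round(score, 3))
```

A function like this could then be wrapped with `make_scorer(silhouette_without_noise, ground_truth=False)` and passed as `scoring`, following the pattern shown above.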
## Caveats
### Comparing Clustering Algorithms
Consider your dataset and goals before comparing clustering algorithms. A higher score doesn't necessarily mean a better choice—different algorithms have [different benefits, drawbacks, and use cases](https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods).
## Credits
Most of the credit goes to the scikit-learn developers for the engineering behind the search estimators.