https://github.com/bgu-cs-vil/pdc-dp-means
"Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation" [Dinari and Freifeld, UAI 2022]
- Host: GitHub
- URL: https://github.com/bgu-cs-vil/pdc-dp-means
- Owner: BGU-CS-VIL
- License: bsd-3-clause
- Created: 2022-06-16T10:30:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-20T21:02:24.000Z (6 months ago)
- Last Synced: 2024-09-30T09:20:21.429Z (4 months ago)
- Topics: clustering, dpmeans, kmeans, machine-learning, minibatch, scikit-learn
- Language: Python
- Homepage:
- Size: 554 KB
- Stars: 16
- Watchers: 2
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Parallel Delayed Cluster DP-Means
[Paper](https://openreview.net/pdf?id=rnzVBD8jqlq)
### Introduction
The PDC-DP-Means package provides a highly optimized implementation of the DP-Means algorithm. It introduces a new parallel algorithm, Parallel Delayed Cluster DP-Means (PDC-DP-Means), and a MiniBatch implementation for additional speed. Both target scalable, efficient clustering when the number of clusters is unknown. Beyond the speed improvements, PDC-DP-Means supports an optional online mode for real-time data processing, and its scikit-learn-like interface is designed for easy integration into existing data workflows. PDC-DP-Means also outperforms other nonparametric methods, which demonstrates its efficiency and scalability.
See the paper for more details.
### Installation
`pip install pdc-dp-means`
### Quick Start
```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from pdc_dp_means import DPMeans

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply DPMeans clustering
dpmeans = DPMeans(n_clusters=1, n_init=10, delta=10)  # n_init and delta parameters
dpmeans.fit(X)

# Predict the cluster for each data point
y_dpmeans = dpmeans.predict(X)

# Plot clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=y_dpmeans, s=50, cmap='viridis')
centers = dpmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()
```
One thing to note is that we replace the `\lambda` parameter from the paper with `delta` in the code, as `lambda` is a reserved word in Python.
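
To give a feel for what `delta` controls, here is a minimal pure-Python sketch of the classic serial DP-Means procedure (Kulis and Jordan, 2012). It is not this package's parallel or MiniBatch implementation. The names `dp_means`, `squared_distance`, `mean_point`, and `POINTS` are hypothetical and exist only for this illustration. The sketch assumes the threshold is compared against the squared Euclidean distance; check the package documentation for the exact convention it uses.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dp_means(points, delta, max_iter=100):
    """Classic serial DP-Means sketch; `delta` plays the role of lambda."""
    centers = [mean_point(points)]  # start with one cluster at the global mean
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > delta:
                # Farther than delta from every center: open a new cluster here.
                centers.append(tuple(p))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: each center becomes the mean of its members.
        # Clusters that lost all members are dropped and labels are renumbered.
        members = {}
        for i, k in enumerate(labels):
            members.setdefault(k, []).append(points[i])
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centers = [mean_point(members[k]) for k in kept]
        labels = [remap[k] for k in labels]
        if not changed:
            break
    return centers, labels


# Three small, well-separated groups of five points each (assumed toy data).
OFFSETS = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
BLOB_CENTERS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
POINTS = [(cx + dx, cy + dy) for cx, cy in BLOB_CENTERS for dx, dy in OFFSETS]

for delta in (0.1, 9, 1000):
    centers, _ = dp_means(POINTS, delta)
    print(f"delta={delta}: {len(centers)} clusters")
# On this toy data the loop reports 15, 3 and 1 clusters respectively.
```

A small `delta` makes new clusters cheap, so every point ends up in its own cluster. A large `delta` makes them expensive, so everything collapses into one. In practice `delta` should be chosen relative to the scale of the data.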
### Usage
Please refer to the documentation: https://pdc-dp-means.readthedocs.io/en/latest/
### Paper Code
Please refer to https://github.com/BGU-CS-VIL/pdc-dp-means/tree/main/paper_code for the code used in the paper.
### Citing this work
If you use this code for your work, please cite the following:
```
@inproceedings{dinari2022revisiting,
title={Revisiting {DP}-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation},
author={Dinari, Or and Freifeld, Oren},
booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
year={2022}
}
```
### License
Our code is licensed under the BSD-3-Clause license.