https://github.com/boredoms/prone
PRONE - PRojected ONE-dimensional clustering algorithm for k-means
- Host: GitHub
- URL: https://github.com/boredoms/prone
- Owner: boredoms
- Created: 2023-10-12T14:33:35.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-14T18:33:35.000Z (about 1 year ago)
- Last Synced: 2025-03-15T07:18:11.173Z (7 months ago)
- Topics: cpp, data-science, k-means-clustering, neurips-2023, python
- Language: C++
- Size: 10.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# `PRONE` (PRojected ONE-dimensional clustering)
This repository contains a ready-to-use reference implementation of the `PRONE` algorithm, published alongside our paper "Simple, Scalable and Effective Clustering via One-Dimensional Projections".
It can be installed and used as a Python package via `pip`: run `pip install .` in the `prone` directory. The dependencies are `numpy` and `cython`. To run the tests for the C++ code, you need `CMake` installed. If you also want the packages for the demo, run `pip install .[demo]`. The demo shows how to run `PRONE` and how to use it to create a coreset for use with scikit-learn.
## Usage
After installing, import the `prone` function from the `prone` module (see `demo.py`); it can then be used for k-means clustering, as sketched below. The strength of our algorithm is that it produces an approximate solution with provable guarantees much faster than previous seeding algorithms for k-means. This makes it particularly well suited for combining with sensitivity sampling to create a coreset: that approach significantly "compresses" the input data in the first stage of a clustering pipeline, which speeds up downstream tasks.
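A minimal sketch of this workflow, assuming `prone(X, k)` takes a data matrix and the number of clusters and returns a cluster assignment (the exact signature may differ; `demo.py` is authoritative):

```python
import numpy as np
from prone import prone

# Synthetic data: 10,000 points in 20 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))

# Assumed call: prone(X, k) returning one cluster label per point.
# Check demo.py for the actual signature and return values.
labels = prone(X, 5)
```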
You can also import the `coreset` function to compute a coreset of your dataset. Its parameters are the dataset, the number of clusters, and the desired coreset size. It returns two arrays: one containing the indices of the coreset points in the dataset, and one containing the corresponding weights. A sketch of this path follows.
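A hedged example, assuming the argument order `coreset(X, k, m)` matches the description above (the names here are illustrative); scikit-learn's `KMeans` accepts the returned weights through its `sample_weight` parameter:

```python
import numpy as np
from sklearn.cluster import KMeans
from prone import coreset

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 20))
k, m = 10, 1_000  # number of clusters, coreset size

# Assumed signature per the description above: returns indices into X
# and one weight per selected point.
indices, weights = coreset(X, k, m)

# Cluster the small weighted coreset instead of the full dataset.
km = KMeans(n_clusters=k, n_init=10).fit(X[indices], sample_weight=weights)
centers = km.cluster_centers_  # approximate centers for the full data
```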
## Future work
Parallelism and performance improvements.
## Citing
If you use this library in your own research, please cite our NeurIPS paper (as of January 5th, 2024, this is the preprint citation, since the proceedings are not out yet):
```
@inproceedings{charikar2023simple,
title={Simple, Scalable and Effective Clustering via One-Dimensional Projections},
author={Charikar, Moses and Henzinger, Monika and Hu, Lunjia and V{\"o}tsch, Maximilian and Waingarten, Erik},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}
```