https://github.com/cmtt/kmpp
k-means clustering algorithm with k-means++ initialization.
https://github.com/cmtt/kmpp
clustering-algorithm javascript kmeans-algorithm kmeansplusplus
Last synced: 4 days ago
JSON representation
k-means clustering algorithm with k-means++ initialization.
- Host: GitHub
- URL: https://github.com/cmtt/kmpp
- Owner: cmtt
- Created: 2012-08-15T18:45:03.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2022-12-03T12:11:37.000Z (over 2 years ago)
- Last Synced: 2025-06-05T00:14:45.520Z (12 days ago)
- Topics: clustering-algorithm, javascript, kmeans-algorithm, kmeansplusplus
- Language: JavaScript
- Size: 1.2 MB
- Stars: 32
- Watchers: 6
- Forks: 9
- Open Issues: 11
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
Awesome Lists containing this project
README
# kmpp
[](https://travis-ci.org/cmtt/kmpp)
When dealing with lots of data points, clustering algorithms may be used to group them. The k-means algorithm partitions _n_ data points into _k_ clusters and finds the centroids of these clusters incrementally.
The algorithm assigns data points to the closest cluster, and the centroids of each cluster are re-calculated. These steps are repeated until the centroids do not changing anymore.
The basic k-means algorithm is initialized with _k_ centroids at random positions. This implementation addresses some disadvantages of the arbitrary initialization method with the k-means++ algorithm (see "Further reading" at the end).
## Installation
## Installing via npm
Install kmpp as Node.js module via NPM:
````bash
$ npm install kmpp
````## Example
```javascript
var kmpp = require('kmpp');kmpp([
[x1, y1, ...],
[x2, y2, ...],
[x3, y3, ...],
...
], {
k: 4
});// =>
// { converged: true,
// centroids: [[xm1, ym1, ...], [xm2, ym2, ...], [xm3, ym3, ...]],
// counts: [ 7, 6, 7 ],
// assignments: [ 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 ]
// }
```## API
### `kmpp(points[, opts)`
Exectes the k-means++ algorithm on `points`.
Arguments:
- `points` (`Array`): An array-of-arrays containing the points in format `[[x1, y1, ...], [x2, y2, ...], [x3, y3, ...], ...]`
- `opts`: object containing configuration parameters. Parameters are
- `distance` (`function`): Optional function that takes two points and returns the distance between them.
- `initialize` (`Boolean`): Perform initialization. If false, uses the initial state provided in `centroids` and `assignments`. Otherwise discards any initial state and performs initialization.
- `k` (`Number`): number of centroids. If not provided, `sqrt(n / 2)` is used, where `n` is the number of points.
- `kmpp` (`Boolean`, default: `true`): If true, uses k-means++ initialization. Otherwise uses naive random assignment.
- `maxIterations` (`Number`, default: `100`): Maximum allowed number of iterations.
- `norm` (`Number`, default: `2`): L-norm used for distance computation. `1` is Manhattan norm, `2` is Euclidean norm. Ignored if `distance` function is provided.
- `centroids` (`Array`): An array of centroids. If `initialize` is false, used as initialization for the algorithm, otherwise overwritten in-place if of the correct size.
- `assignments` (`Array`): An array of assignments. Used for initialization, otherwise overwritten.
- `counts` (`Array`): An output array used to avoid extra allocation. Values are discarded and overwritten.Returns an object containing information about the centroids and point assignments. Values are:
- `converged`: `true` if the algorithm converged successfully
- `centroids`: a list of centroids
- `counts`: the number of points assigned to each respective centroid
- `assignments`: a list of integer assignments of each point to the respective centroid
- `iterations`: number of iterations used# Credits
* [Jared Harkins](https://github.com/hDeraj) improved the performance by
reducing the amount of function calls, reverting to Manhattan distance
for measurements and improved the random initialization by choosing from
points* [Ricky Reusser](https://github.com/rreusser) refactored API
# Further reading
* [Wikipedia: k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)
* [Wikipedia: Determining the number of clusters in a data set](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set)
* [k-means++: The advantages of careful seeding, Arthur Vassilvitskii](http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf)
* [k-means++: The advantages of careful seeding, Presentation by Arthur Vassilvitskii (Presentation)](http://theory.stanford.edu/~sergei/slides/BATS-Means.pdf)# License
© 2017-2019. MIT License.