https://github.com/danieltizon/clusteringmetrics


clustering scala spark


# Spark Clustering Metrics

I have implemented several well-known clustering evaluation indexes in Spark 2.1.0,
using the Scala API. These indexes can help you choose the best number of clusters,
"k", for your dataset.

The internal indexes implemented are:

* Ball and Hall Index (Ball Index) - 1965
* Calinski and Harabasz Index (CH Index) - 1974
* Davies and Bouldin Index (DB Index) - 1979
* Hartigan Index (Hartigan Index) - 1975
* Krzanowski and Lai Index (KL Index) - 1988
* Ratkowsky and Lance Index (Ratkowsky Index) - 1978
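
To illustrate what an internal index measures, here is a minimal sketch of the Calinski and Harabasz Index written with plain Scala collections. It is not the code from this project and does not use Spark; the names `ChIndexSketch`, `calinskiHarabasz`, `sqDist` and `centroid` are hypothetical. The index is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom, so higher values suggest more compact and better separated clusters.

```scala
object ChIndexSketch {
  // Hypothetical point type for this sketch: a dense vector of coordinates.
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty collection of points.
  def centroid(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // CH = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster
  // dispersion and W the within-cluster dispersion. Requires 1 < k < n.
  def calinskiHarabasz(points: Seq[Point], labels: Seq[Int]): Double = {
    require(points.size == labels.size, "one label per point")
    val n = points.size
    // Group the points by their cluster label.
    val clusters = points.zip(labels).groupBy(_._2).values.map(_.map(_._1)).toSeq
    val k = clusters.size
    require(k > 1 && k < n, "need 1 < k < n")
    val global = centroid(points)
    // B: cluster sizes times squared distance from cluster centroid to global centroid.
    val between = clusters.map(c => c.size * sqDist(centroid(c), global)).sum
    // W: squared distances from each point to its own cluster centroid.
    val within = clusters.map { c =>
      val m = centroid(c)
      c.map(p => sqDist(p, m)).sum
    }.sum
    (between / (k - 1)) / (within / (n - k))
  }
}
```

For the four points (0, 0), (0, 2), (10, 0), (10, 2) with labels 0, 0, 1, 1, B is 100 and W is 4, so the sketch returns 50.0.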

These indexes have been tested using the Iris dataset and the R package NbClust.

I have also developed an external index (useful when you already know that some
elements must go in the same group or in different groups):

* Rand Index (Rand Index) - 1971
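
As a rough illustration, the Rand Index can be sketched in plain Scala as the fraction of element pairs on which two groupings agree. This is not the project's Spark implementation; the names `RandIndexSketch` and `randIndex` are hypothetical, and the sketch assumes a full reference label is available for every element.

```scala
object RandIndexSketch {
  // Fraction of element pairs on which two labelings agree: either both
  // put the pair in the same group, or both put it in different groups.
  def randIndex(predicted: Seq[Int], reference: Seq[Int]): Double = {
    require(predicted.size == reference.size, "labelings must have equal length")
    val n = predicted.size
    require(n >= 2, "need at least two elements")
    val p = predicted.toIndexedSeq
    val r = reference.toIndexedSeq
    // Enumerate every unordered pair (i, j) with i < j and keep the agreeing ones.
    val agreements = (for {
      i <- 0 until n
      j <- (i + 1) until n
      if (p(i) == p(j)) == (r(i) == r(j))
    } yield 1).sum
    // Divide by the total number of pairs, n * (n - 1) / 2.
    agreements.toDouble / (n.toLong * (n - 1) / 2)
  }
}
```

The result lies between 0 and 1, and it does not depend on how the groups are numbered: `randIndex(Seq(0, 0, 1, 1), Seq(1, 1, 0, 0))` is 1.0. The pairwise loop is quadratic in the number of elements, so it is only meant for small examples.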

I have created a function that lets you estimate the best "k" for your dataset.
You can see an example in clustering.test.TestDataIris (the Iris dataset used for
testing is included in the project).
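
The general idea of picking "k" from index values can be sketched as follows. This is not the function from this project; `BestKSketch`, `bestK` and `Direction` are hypothetical names, and the sketch assumes the index has already been computed for each candidate "k". It covers only the two simplest rules: the CH Index is maximized and the DB Index is minimized. Other indexes may use different rules, for example ones based on the change between successive values of "k".

```scala
object BestKSketch {
  // Whether larger or smaller index values indicate a better partition.
  sealed trait Direction
  case object HigherIsBetter extends Direction // e.g. CH Index
  case object LowerIsBetter extends Direction  // e.g. DB Index

  // Returns the candidate k with the best index value.
  // `scores` maps each candidate k to the index value computed for it.
  def bestK(scores: Map[Int, Double], direction: Direction): Int = {
    require(scores.nonEmpty, "need at least one candidate k")
    direction match {
      case HigherIsBetter => scores.maxBy(_._2)._1
      case LowerIsBetter  => scores.minBy(_._2)._1
    }
  }
}
```

For example, `BestKSketch.bestK(Map(2 -> 10.0, 3 -> 25.0, 4 -> 18.0), BestKSketch.HigherIsBetter)` returns 3.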

Any suggestions or fixes are appreciated.

You can contact me by email: [email protected]

Thanks!