https://github.com/danieltizon/clusteringmetrics

clustering scala spark


Host: GitHub
URL: https://github.com/danieltizon/clusteringmetrics
Owner: DanielTizon
License: apache-2.0
Created: 2016-11-16T17:53:52.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2017-06-14T16:32:28.000Z (about 8 years ago)
Last Synced: 2025-04-02T05:54:33.396Z (3 months ago)
Topics: clustering, scala, spark
Language: Scala
Size: 84 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

README

# Spark Clustering Metrics

I have implemented several well-known clustering evaluation indexes in Spark 2.1.0
using the Scala API. These indexes can help you choose the best number of clusters
"k" for a dataset.

The internal indexes implemented are:

* Ball and Hall Index (Ball Index) - 1965

* Calinski and Harabasz Index (CH Index) - 1974

* Davies and Bouldin Index (DB Index) - 1979

* Hartigan Index (Hartigan Index) - 1975

* Krzanowski and Lai Index (KL Index) - 1988

* Ratkowsky and Lance Index (Ratkowsky Index) - 1978

These indexes have been tested using the Iris dataset and the R package NbClust.
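
To give an idea of what an internal index measures, here is a minimal sketch of the Calinski and Harabasz Index written with plain Scala collections. It is not the Spark implementation from this project: the object name `CalinskiHarabaszSketch`, the `Point` alias and the `chIndex` function are assumptions made for illustration. It uses the usual definition, the between-cluster sum of squares divided by (k - 1) over the within-cluster sum of squares divided by (n - k), where higher values suggest a better partition.

```scala
object CalinskiHarabaszSketch {
  // Hypothetical representation: one dense vector of features per point.
  type Point = Vector[Double]

  // Component-wise mean of a non-empty collection of points.
  private def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.length).toVector

  // Squared Euclidean distance between two points.
  private def sqDist(p: Point, q: Point): Double =
    p.zip(q).map { case (x, y) => (x - y) * (x - y) }.sum

  /** CH = (B / (k - 1)) / (W / (n - k)), with B the between-cluster and
    * W the within-cluster sum of squares. labels(i) is the cluster of points(i). */
  def chIndex(points: Seq[Point], labels: Seq[Int]): Double = {
    require(points.length == labels.length, "one label per point is required")
    val n = points.length
    val clusters = points.zip(labels).groupBy(_._2).values.map(_.map(_._1)).toSeq
    val k = clusters.length
    require(k > 1 && n > k, "need at least 2 clusters and more points than clusters")

    val overall = mean(points)
    // Each cluster contributes its size times the squared distance
    // from its centroid to the overall centroid.
    val between = clusters.map(c => c.length * sqDist(mean(c), overall)).sum
    // Sum of squared distances from each point to its own cluster centroid.
    val within = clusters.map { c =>
      val m = mean(c)
      c.map(sqDist(_, m)).sum
    }.sum

    (between / (k - 1)) / (within / (n - k))
  }
}
```

For two tight, well-separated clusters the value is large; if every cluster collapses to a single location, the within-cluster sum is zero and the result is infinite.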

I have also developed an external index (useful when you already know that certain
elements must go in the same group or in different groups):

* Rand Index (Rand Index) - 1971
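
As an illustration of the idea, the Rand Index can be sketched in plain Scala as the fraction of element pairs on which two labelings agree (both put the pair together, or both keep it apart). This is a quadratic-time sketch for small inputs, not the Spark code from this project; `RandIndexSketch` and `randIndex` are hypothetical names.

```scala
object RandIndexSketch {
  /** Rand index between two labelings of the same n elements:
    * (pairs together in both + pairs apart in both) / (n * (n - 1) / 2). */
  def randIndex(labelsA: Seq[Int], labelsB: Seq[Int]): Double = {
    require(labelsA.length == labelsB.length, "labelings must have the same length")
    val n = labelsA.length
    require(n >= 2, "need at least two elements to form a pair")
    val a = labelsA.toIndexedSeq
    val b = labelsB.toIndexedSeq

    // Count the pairs (i, j), i < j, on which both labelings make the same decision.
    val agreements = (for {
      i <- 0 until n
      j <- i + 1 until n
      if (a(i) == a(j)) == (b(i) == b(j))
    } yield 1).sum

    agreements.toDouble / (n.toLong * (n - 1) / 2)
  }
}
```

The result is 1.0 when the two partitions are identical up to a renaming of the cluster labels, and it decreases as they disagree on more pairs.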

I have created a function that lets you estimate the best "k" for your dataset.
You can see an example in clustering.test.TestDataIris (the Iris dataset used for
testing is included in the project).
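
The selection step itself can be pictured with a small sketch. The function below is not the one shipped in this project (see clustering.test.TestDataIris for the real usage); `BestKSketch`, `bestK` and the score values in the usage note are assumptions for illustration only. It takes one index value per candidate "k" and picks the best one, assuming the index is either maximized (as with the CH Index) or minimized (as with the DB Index). Other indexes use different decision rules that this sketch does not cover.

```scala
object BestKSketch {
  /** Picks the candidate k with the best score.
    * scoresByK maps each candidate number of clusters to its index value. */
  def bestK(scoresByK: Map[Int, Double], higherIsBetter: Boolean): Int = {
    require(scoresByK.nonEmpty, "need at least one candidate k")
    if (higherIsBetter) scoresByK.maxBy(_._2)._1 // e.g. CH Index
    else scoresByK.minBy(_._2)._1                // e.g. DB Index
  }
}
```

For example, with made-up scores `Map(2 -> 120.5, 3 -> 310.2, 4 -> 250.0)` and `higherIsBetter = true`, the sketch selects k = 3.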

Any suggestions or fixes are appreciated.

You can contact me by email: [email protected]

Thanks!
