Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/danieltizon/clusteringmetrics
- Host: GitHub
- URL: https://github.com/danieltizon/clusteringmetrics
- Owner: DanielTizon
- License: apache-2.0
- Created: 2016-11-16T17:53:52.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-06-14T16:32:28.000Z (over 7 years ago)
- Last Synced: 2024-12-09T19:42:01.685Z (about 1 month ago)
- Topics: clustering, scala, spark
- Language: Scala
- Size: 84 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Spark Clustering Metrics
I have implemented several well-known clustering evaluation indexes on Spark 2.1.0,
using the Scala API. These indexes can help you choose the best number of clusters ("k")
for your dataset.

The internal indexes implemented are:

* Ball and Hall Index (Ball Index) - 1965
* Calinski and Harabasz Index (CH Index) - 1974
* Davies and Bouldin Index (DB Index) - 1979
* Hartigan Index (Hartigan Index) - 1975
* Krzanowski and Lai Index (KL Index) - 1988
* Ratkowsky and Lance Index (Ratkowsky Index) - 1978

These indexes have been tested using the Iris dataset and the R package NbClust.
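
To illustrate how an internal index can guide the choice of "k", here is a minimal plain-Scala sketch of the Calinski and Harabasz index, which is usually maximized over the candidate values of "k". It is not this project's Spark implementation: the object `ChIndexSketch` and its function names are hypothetical, it works on small in-memory collections, and it assumes the cluster labels for each candidate "k" have already been produced by some clustering step.

```scala
object ChIndexSketch {
  type Point = Vector[Double]

  // Component-wise mean of a non-empty group of points.
  def centroid(points: Seq[Point]): Point = {
    val dim = points.head.length
    Vector.tabulate(dim)(d => points.map(_(d)).sum / points.length)
  }

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // CH = (B / (k - 1)) / (W / (n - k)), where B is the between-cluster and
  // W the within-cluster sum of squares. Requires 1 < k < n.
  // If every cluster is a single repeated point, W is 0 and the result is Infinity.
  def calinskiHarabasz(points: Seq[Point], labels: Seq[Int]): Double = {
    require(points.length == labels.length, "one label per point")
    val n = points.length
    val clusters = points.zip(labels).groupBy(_._2).values.map(_.map(_._1)).toSeq
    val k = clusters.length
    require(k > 1 && k < n, "need 1 < k < n")
    val global = centroid(points)
    val within = clusters.map { c =>
      val m = centroid(c)
      c.map(p => sqDist(p, m)).sum
    }.sum
    val between = clusters.map(c => c.length * sqDist(centroid(c), global)).sum
    (between / (k - 1)) / (within / (n - k))
  }

  // Given one labeling per candidate k, returns the k with the largest CH value.
  def bestK(points: Seq[Point], candidates: Map[Int, Seq[Int]]): Int =
    candidates.maxBy { case (_, labels) => calinskiHarabasz(points, labels) }._1
}
```

For example, `ChIndexSketch.bestK(points, Map(2 -> labelsForK2, 3 -> labelsForK3))` returns whichever candidate scores higher. Other indexes use different selection rules (the Davies and Bouldin index, for instance, is minimized), so the `maxBy` step is specific to CH.
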
I have also developed an external index (useful when you already know that certain
elements must go in the same group or in different groups):

* Rand Index (Rand Index) - 1971
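
The Rand index compares two labelings of the same elements by counting the pairs on which they agree: pairs grouped together in both, or separated in both. The sketch below is a plain-Scala illustration of that definition, not this project's Spark code. The object `RandIndexSketch` is a hypothetical name, and the pairwise loop is quadratic in the number of elements, so it only suits small inputs.

```scala
object RandIndexSketch {
  // Fraction of element pairs on which two labelings agree. A pair counts as
  // an agreement when both labelings put it in the same group, or both put
  // it in different groups. The result lies in [0, 1].
  def randIndex(first: Seq[Int], second: Seq[Int]): Double = {
    require(first.length == second.length, "labelings must cover the same elements")
    require(first.length >= 2, "need at least one pair of elements")
    val a = first.toIndexedSeq
    val b = second.toIndexedSeq
    val n = a.length
    val agreements = (0 until n).map { i =>
      ((i + 1) until n).count(j => (a(i) == a(j)) == (b(i) == b(j))).toLong
    }.sum
    val totalPairs = n.toLong * (n - 1) / 2
    agreements.toDouble / totalPairs
  }
}
```

Because only pair membership matters, the label values themselves are irrelevant: a labeling and a renamed copy of it (say, with 0 and 1 swapped) are treated as identical.
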
I have created a function that lets you estimate the best "k" for your dataset.
You can see an example in clustering.test.TestDataIris (the Iris dataset used for testing is
included in the project).

Any suggestion or fix will be appreciated.
You can contact me by email: [email protected]

Thanks!