Generic Clustering algorithm for Apache Spark deployment
- Host: GitHub
- URL: https://github.com/pierrekieffer/genericunsupervisedmachinelearning
- Owner: PierreKieffer
- License: apache-2.0
- Created: 2019-04-02T14:34:15.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-05-01T13:52:57.000Z (about 6 years ago)
- Last Synced: 2023-03-07T01:31:41.522Z (about 2 years ago)
- Topics: kmeans, machine-learning, mllib, silhouette, spark
- Language: Scala
- Size: 17.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# GenericUnsupervisedMachineLearning
This application provides a Generic Clustering algorithm for Apache Spark deployment.

## Concept
This application allows you to quickly deploy an unsupervised machine learning algorithm based on the k-means clustering method.
The main objective of the algorithm is to adapt to the input data and to compute the optimal number of clusters before the clustering is applied. The application is divided into three distinct parts (a minimal sketch of the full flow follows the list):
- The first part is data import and preprocessing:
The algorithm applies a transformation to the features to obtain the vectors that will be the input of the clustering algorithm.
The chosen clustering algorithm is k-means.
- The second part is the application of the Silhouette method:
The goal is to compute the optimal cluster number (optimalK) based on the input data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation): for a point i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points of its own cluster and b(i) is the mean distance to the points of the nearest other cluster. The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
The algorithm computes the Silhouette score for each K between 2 and MAX_K (a value chosen in the config file, default = 20) and selects the optimal K by comparing the scores.
- The third part is the k-means clustering application with optimalK:
The output dataframe is saved with a new column "Cluster" to indicate the cluster of each record.
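A minimal sketch of this three-part flow, assuming Spark ML's `VectorAssembler`, `KMeans`, and `ClusteringEvaluator` (whose default metric is the silhouette score); the column names, the `maxK` default, and the object name are illustrative, not the repository's actual identifiers:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

object ClusteringSketch {

  def run(df: DataFrame, maxK: Int = 20): DataFrame = {
    // Part 1: preprocessing -- assemble the (assumed numeric) feature
    // columns into the single vector column expected by KMeans
    val assembler = new VectorAssembler()
      .setInputCols(df.columns)
      .setOutputCol("features")
    val vectors = assembler.transform(df).cache()

    // Part 2: Silhouette method -- score each K in [2, maxK] and keep
    // the K with the highest silhouette value
    val evaluator = new ClusteringEvaluator().setPredictionCol("Cluster")
    val optimalK = (2 to maxK).maxBy { k =>
      val model = new KMeans().setK(k).setPredictionCol("Cluster").fit(vectors)
      evaluator.evaluate(model.transform(vectors))
    }

    // Part 3: final k-means run with optimalK; the "Cluster" column
    // holds each record's assigned cluster
    new KMeans()
      .setK(optimalK)
      .setPredictionCol("Cluster")
      .fit(vectors)
      .transform(vectors)
  }
}
```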
Moreover, a report describing the different clusters and their characteristics is provided by the GenerateReport object.
The report computes the percentage of each categorical feature value in each cluster, and the median of each numeric feature in each cluster.
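A minimal sketch of how such per-cluster statistics could be computed with Spark SQL; the object and column names are hypothetical, and `percentile_approx` gives an approximate median:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr, round}

object GenerateReportSketch {

  // Percentage of each value of a categorical column within each cluster
  def categoricalShare(df: DataFrame, colName: String): DataFrame = {
    val counts = df.groupBy("Cluster", colName).count().withColumnRenamed("count", "n")
    val totals = df.groupBy("Cluster").count().withColumnRenamed("count", "total")
    counts.join(totals, "Cluster")
      .withColumn("percentage", round(col("n") / col("total") * 100, 2))
      .select("Cluster", colName, "percentage")
  }

  // Approximate median of a numeric column within each cluster
  def numericMedian(df: DataFrame, colName: String): DataFrame =
    df.groupBy("Cluster")
      .agg(expr(s"percentile_approx($colName, 0.5)").as(s"median_$colName"))
}
```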
## How to run
- Set the variables in the config.yml file
- Run:
- `spark-submit --class main.Main --master local[*] /PathToJAR/.../GenericUnsupervisedMachineLearning-assembly-0.1.jar /pathToConfig/config.yml`
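The README does not enumerate the config keys; as a purely hypothetical sketch, a config.yml might look like the following, where only MAX_K is mentioned above and the other keys are assumptions:

```yaml
# Hypothetical config.yml; only MAX_K appears in this README,
# the input/output keys are illustrative assumptions.
INPUT_PATH: /data/input.csv
OUTPUT_PATH: /data/clustered_output
MAX_K: 20
```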
## How to build
### sbt
The build.sbt file is provided.
You need to create, under the project/ directory:
- a build.properties file with the sbt version
- an assembly.sbt file with the sbt-assembly version matching that sbt version

To build and package, run the `assembly` command inside the sbt shell.
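For example, assuming sbt 1.2.8 and sbt-assembly 0.14.9 (illustrative versions, since the repository does not state them):

`project/build.properties`:
```
sbt.version=1.2.8
```

`project/assembly.sbt`:
```scala
// sbt-assembly plugin; the version must be compatible with the sbt version above
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
```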