https://github.com/aamend/hadoop-primitive-clustering
Hadoop implementation of Canopy Clustering using Levenshtein distance
- Host: GitHub
- URL: https://github.com/aamend/hadoop-primitive-clustering
- Owner: aamend
- Created: 2014-05-05T20:05:49.000Z (about 12 years ago)
- Default Branch: master
- Last Pushed: 2017-07-22T15:00:02.000Z (almost 9 years ago)
- Last Synced: 2025-04-14T07:47:08.829Z (about 1 year ago)
- Language: Java
- Homepage:
- Size: 136 KB
- Stars: 5
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Hadoop Primitive Array Clustering
==============
Hadoop implementation of Canopy Clustering using Levenshtein distance algorithm.
Difference with Mahout
----
One of the major limitations of Mahout is that its clustering algorithms (K-Means or Canopy clustering) use a Euclidean distance in order to compute cluster centers. Each time a new point is added to a cluster, the Mahout framework recomputes the cluster's center as the average of its data points.
```
NewCenter[i] = Sum(Vectors)[i] / observations
```
But...
- What if your data set is composed of non-mathematical primitive data points (**char**, **boolean**)?
- What if an average of points does not make any sense for your business?
- Or simply, what if you wish to use a non (or less) mathematical distance measure?
Motivations
----
I had to create canopies for sequences of IDs (Integer). Let's take the following example with 2 vectors V1 and V2.
```
V1={0:123, 1:23, 2:55, 3:141, 4:22}
V2={0:23, 1:55, 2:141, 3:22}
```
These vectors are totally different according to most of the standard distance measures Mahout provides (e.g. *Euclidean*). I could still change the way my vectors are created, but none of the solutions I tried treated my arrays as a **sequence of IDs**, let alone a sequence of IDs **where the order matters**. The *Levenshtein* metric (usually used for fuzzy string matching) is a perfect match, as it compares sequences of IDs rather than IDs as plain numbers.
- I had to create a new set of *DistanceMeasure* implementations taking arrays as input parameters.
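To make the sequence comparison concrete, here is a minimal, self-contained sketch of Levenshtein distance over integer arrays. It is not the project's *LevenshteinDistance* class; the class name `LevenshteinSketch` and its `distance` method are hypothetical and only illustrate the idea on V1 and V2.

```java
// Illustrative sketch only: LevenshteinSketch is a hypothetical name,
// not a class shipped by hadoop-primitive-clustering.
final class LevenshteinSketch {

    /** Edit distance between two ID sequences (insert, delete, substitute each cost 1). */
    static int distance(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // building b[0..j) from an empty sequence takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // reducing a[0..i) to an empty sequence takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        int[] v1 = {123, 23, 55, 141, 22};
        int[] v2 = {23, 55, 141, 22};
        // Dropping the leading 123 turns V1 into V2, so the distance is 1.
        System.out.println(distance(v1, v2));
    }
}
```

A positional measure sees every index of V1 and V2 as different, whereas the edit distance sees a single deletion.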
Besides, assuming V1 and V2 belong to the same cluster, does a new cluster center V' (computed as an average of the points in V1 and V2) make sense for sequence analysis?
```
V'={0:(23+123)/2, 1:(55+23)/2, 2:(141+55)/2, 3:(22+141)/2, 4:(0+22)/1}
```
- I had to find a way to override Mahout's cluster center computation. Instead of computing an average of data points, I find the point Pi that minimizes the distance to all of the cluster's data points.
Pseudo code:
```
Point min_point = none
float min_dist = Infinity
For each point Pi
    float total = 0
    For each point Pj
        Compute distance Pi->Pj
        total = total + distance
    If total < min_dist
        min_dist = total
        min_point = Pi
Center = min_point
```
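The center selection described above can be sketched in plain Java as follows. This is an in-memory illustration, not the project's MapReduce implementation: `MedoidSketch`, `center`, and the toy `mismatches` distance are hypothetical names, and summing the distances is an assumption about what "minimizes the distance" means here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only: all names here are hypothetical.
final class MedoidSketch {

    /** Returns the point whose summed distance to every point of the cluster is smallest. */
    static int[] center(List<int[]> points, ToDoubleBiFunction<int[], int[]> distance) {
        int[] best = null;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (int[] candidate : points) {
            double total = 0;
            for (int[] other : points) {
                total += distance.applyAsDouble(candidate, other);
            }
            if (total < bestTotal) { // strict comparison keeps the first point on ties
                bestTotal = total;
                best = candidate;
            }
        }
        return best; // null when the cluster is empty
    }

    /** Toy distance for the demo: positions that differ, plus the length gap. */
    static double mismatches(int[] a, int[] b) {
        int shared = Math.min(a.length, b.length);
        int diff = Math.abs(a.length - b.length);
        for (int i = 0; i < shared; i++) {
            if (a[i] != b[i]) {
                diff++;
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        List<int[]> cluster = Arrays.asList(
                new int[]{1, 2, 3}, new int[]{1, 2, 4}, new int[]{1, 2, 3, 5});
        // Summed distances are 2, 3 and 3, so the first point is chosen.
        System.out.println(Arrays.toString(center(cluster, MedoidSketch::mismatches))); // [1, 2, 3]
    }
}
```

Because the center is always an actual member of the cluster, it remains a valid sequence of IDs, which an averaged vector would not be.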
Distance Measures
----
Supported distance measures are
- com.aamend.hadoop.clustering.distance.LevenshteinDistance measure
- com.aamend.hadoop.clustering.distance.TanimotoDistance measure (a.k.a Jaccard Coefficient).
- Any DistanceMeasure implementing com.aamend.hadoop.clustering.distance.DistanceMeasure
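For comparison with the order-sensitive Levenshtein measure, here is a minimal sketch of a Tanimoto (Jaccard) distance over integer arrays. It treats each array as a set of IDs, which is an assumption; the project's *TanimotoDistance* may be implemented differently, and `TanimotoSketch` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: TanimotoSketch is a hypothetical name.
final class TanimotoSketch {

    /** 1 - |intersection| / |union|, ignoring order and repeated IDs. */
    static double distance(int[] a, int[] b) {
        Set<Integer> left = new HashSet<>();
        for (int id : a) {
            left.add(id);
        }
        Set<Integer> right = new HashSet<>();
        for (int id : b) {
            right.add(id);
        }
        Set<Integer> union = new HashSet<>(left);
        union.addAll(right);
        if (union.isEmpty()) {
            return 0.0; // two empty arrays are treated as identical
        }
        left.retainAll(right); // left now holds the intersection
        return 1.0 - (double) left.size() / union.size();
    }

    public static void main(String[] args) {
        int[] v1 = {123, 23, 55, 141, 22};
        int[] v2 = {23, 55, 141, 22};
        int[] v2Reversed = {22, 141, 55, 23};
        // 4 shared IDs out of 5 distinct ones: roughly 0.2 in both cases,
        // because a set-based measure does not see the ordering.
        System.out.println(distance(v1, v2));
        System.out.println(distance(v1, v2Reversed));
    }
}
```

A set-based measure like this one suits data where only membership matters; when the order of IDs matters, an edit distance is the better fit.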
Primitive Arrays
----
Only **Integer.class** is supported in version 1.0. Support for any of the Java primitive arrays (**boolean**[], **char**[], **int**[], **double**[], **long**[], **float**[]) is planned, however. I invite you to actively contribute to this project.
Dependencies
----
Even though the project was directly inspired by Mahout canopy clustering, it does not depend on any Mahout library. Instead of Mahout's *Vector*, I use arrays of Integer, and instead of Mahout's *VectorWritable*, I use Hadoop's *ArrayPrimitiveWritable*. Simply add the Maven dependency to your project. Release versions should be available on Maven Central (synced from Sonatype). The project depends on Hadoop libraries and was built against the Hadoop CDH4 distribution, but this can easily be overridden on the client side with the Maven "exclusion" tag in order to use any other Hadoop version or distribution.
```
<dependency>
    <groupId>com.aamend.hadoop</groupId>
    <artifactId>hadoop-primitive-clustering</artifactId>
    <version>1.0</version>
</dependency>
```
Usage
----
### Create canopies
Use *buildClusters* static method from *com.aamend.hadoop.clustering.job.CanopyDriver* class
```
/**
 * @param conf     the Hadoop Configuration
 * @param input    the Path containing input ArrayPrimitiveWritable
 * @param output   the final Path where clusters / data will be written to
 * @param reducers the number of reducers to use (at least 1)
 * @param measure  the DistanceMeasure
 * @param t1       the float CLUSTER_T1 distance metric
 * @param t2       the float CLUSTER_T2 distance metric
 * @param cf       the minimum observations per cluster
 * @return the number of created canopies
 */
public static long buildClusters(Configuration conf, Path input,
                                 Path output, int reducers,
                                 DistanceMeasure measure,
                                 float t1, float t2, long cf)
This builds canopies using several MapReduce jobs (at least 2, depending on the initial number of reducers), for two reasons. First, each observed point must be tracked per cluster in order to minimize the intra-cluster distance of data points, and this obviously cannot fit in memory. Second, the measure used here might be fairly inefficient in a single map job (*Levenshtein* complexity is O(n\*m)). To allow a smooth run without any hot spot, the number of reducers is halved at each iteration (until it reaches 1) while the {T1,T2} parameters get slightly larger (starting at half of the required size). The clustering algorithm is driven by the supplied *DistanceMeasure* (this can be a custom measure implementing DistanceMeasure, provided it is available on the Hadoop classpath).
The **input** data should be in SequenceFile format, using any key class (implementing the *WritableComparable* interface) and *ArrayPrimitiveWritable* as value (serializing an integer array).
The **output** will be in SequenceFile format, using the cluster Id as key (*IntWritable*) and *com.aamend.hadoop.clustering.clusterCanopyWritable* as value.
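To clarify the role of the T1 and T2 thresholds, here is a single-machine sketch of the textbook canopy procedure. It is not the project's MapReduce implementation: it ignores the reducer halving, the threshold growth, and the `cf` filter described above, and `CanopySketch` and its `canopies` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only: textbook canopy clustering in memory, hypothetical names.
final class CanopySketch {

    /** Returns canopies as lists of points; the first element of each list is its center. */
    static List<List<int[]>> canopies(List<int[]> points,
                                      ToDoubleBiFunction<int[], int[]> distance,
                                      double t1, double t2) {
        if (t1 < t2) {
            throw new IllegalArgumentException("t1 (loose) must be >= t2 (tight)");
        }
        List<int[]> remaining = new ArrayList<>(points);
        List<List<int[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            int[] center = remaining.remove(0); // next unassigned point seeds a canopy
            List<int[]> canopy = new ArrayList<>();
            canopy.add(center);
            List<int[]> stillCandidates = new ArrayList<>();
            for (int[] p : remaining) {
                double d = distance.applyAsDouble(center, p);
                if (d <= t1) {
                    canopy.add(p); // within the loose threshold: member of this canopy
                }
                if (d > t2) {
                    stillCandidates.add(p); // outside the tight threshold: may seed or join others
                }
            }
            remaining = stillCandidates;
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> points = new ArrayList<>();
        for (int v : new int[]{1, 2, 6, 7, 20}) {
            points.add(new int[]{v});
        }
        // Toy distance: absolute difference of the first element.
        List<List<int[]>> out = canopies(points, (a, b) -> Math.abs(a[0] - b[0]), 3.0, 1.5);
        System.out.println(out.size()); // 3 canopies: {1,2}, {6,7}, {20}
    }
}
```

A larger T1 makes canopies overlap more, while a larger T2 removes more points from consideration and therefore produces fewer canopies.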
### Cluster input data
Once canopies are created, use static *clusterData* method from *com.aamend.hadoop.clustering.job.CanopyDriver* class
```
/**
 * @param conf          the Configuration
 * @param inputData     the Path containing input arrays
 * @param dataPath      the final Path where data will be written to
 * @param clusterPath   the path where clusters have been written
 * @param measure       the DistanceMeasure
 * @param minSimilarity the minimum similarity to cluster data
 * @param reducers      the number of reducers to use (at least 1)
 */
public static void clusterData(Configuration conf, Path inputData,
                               Path dataPath, Path clusterPath,
                               DistanceMeasure measure,
                               float minSimilarity, int reducers)
This retrieves the most probable cluster each point should belong to. If a point is not 100% identical to a cluster's center, it is still clustered when its similarity is greater than X% (minSimilarity). Canopies (created in the previous step) are added to the distributed cache.
The **output** will be in SequenceFile format, using the cluster Id as key (*IntWritable*) and *ObjectWritable* as value (an object pointing to your initial *WritableComparable* key, so that you can keep track of which point belongs to which cluster).
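The assignment rule can be sketched in memory as follows. This is a sketch under stated assumptions: similarity is taken to be `1 - distance` for a distance normalized to [0, 1], which the README does not confirm, and `AssignSketch`, `assign`, and the toy `positional` distance are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only: hypothetical names, similarity assumed to be 1 - distance.
final class AssignSketch {

    /** Returns the index of the most similar center, or -1 if none exceeds minSimilarity. */
    static int assign(int[] point, List<int[]> centers,
                      ToDoubleBiFunction<int[], int[]> normalizedDistance,
                      double minSimilarity) {
        int best = -1;
        double bestSimilarity = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double similarity = 1.0 - normalizedDistance.applyAsDouble(point, centers.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = i;
            }
        }
        return bestSimilarity > minSimilarity ? best : -1;
    }

    /** Toy normalized distance: share of positions that differ, relative to the longer array. */
    static double positional(int[] a, int[] b) {
        int longest = Math.max(a.length, b.length);
        if (longest == 0) {
            return 0.0;
        }
        int shared = Math.min(a.length, b.length);
        int diff = longest - shared;
        for (int i = 0; i < shared; i++) {
            if (a[i] != b[i]) {
                diff++;
            }
        }
        return (double) diff / longest;
    }

    public static void main(String[] args) {
        List<int[]> centers = Arrays.asList(new int[]{1, 2, 3, 4}, new int[]{9, 9, 9, 9});
        int[] point = {1, 2, 3, 5}; // similarity 0.75 to the first center, 0.0 to the second
        System.out.println(assign(point, centers, AssignSketch::positional, 0.7)); // 0
        System.out.println(assign(point, centers, AssignSketch::positional, 0.8)); // -1
    }
}
```

Returning -1 for points below the threshold leaves them unclustered rather than forcing them into a poorly matching canopy.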
License
----
Apache License, Version 2.0
Author
----
Antoine Amend