https://github.com/aamend/hadoop-primitive-clustering

Hadoop implementation of Canopy Clustering using Levenshtein distance

Host: GitHub
URL: https://github.com/aamend/hadoop-primitive-clustering
Owner: aamend
Created: 2014-05-05T20:05:49.000Z (about 12 years ago)
Default Branch: master
Last Pushed: 2017-07-22T15:00:02.000Z (almost 9 years ago)
Last Synced: 2025-04-14T07:47:08.829Z (about 1 year ago)
Language: Java
Homepage:
Size: 136 KB
Stars: 5
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

Hadoop Primitive Array Clustering
==============

Hadoop implementation of Canopy Clustering using the Levenshtein distance algorithm.

Difference with Mahout
----

One of the major limitations of Mahout is that its clustering algorithms (K-Means or Canopy clustering) use a Euclidean distance in order to compute cluster centers. Each time a new point is added to a cluster, the Mahout framework recomputes the cluster's center as the average of its data points.

```
NewCenter[i] = Sum(Vectors)[i] / observations
```

But...

- What if your data set is composed of non-mathematical primitive data points (**char**, **boolean**)?

- What if an average of points does not make any sense for your business?

- Or simply, what if you wish to use a non (or less) mathematical distance measure?

Motivations
----

I had to create canopies for sequences of IDs (Integer). Let's take the following example with 2 vectors V1 and V2.

```
V1={0:123, 1:23, 2:55,  3:141, 4:22}
V2={0:23,  1:55, 2:141, 3:22}
```

These vectors are totally different according to most of the standard distance measures Mahout provides (e.g. *Euclidean*). I could still change the way my vectors are created, but none of the solutions I tried treated my arrays as a **sequence of IDs**, let alone a sequence of IDs **where the order matters**. The *Levenshtein* metric (usually used for fuzzy string matching) is a perfect match, as it compares sequences of IDs rather than IDs as mere numbers: here V2 is simply V1 with its first element removed, a single edit.

- I had to create a new set of *DistanceMeasure* implementations taking arrays as input parameters.
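To make the idea concrete, here is a minimal, self-contained sketch of an edit distance computed over `int[]` sequences instead of strings. It is not the project's *LevenshteinDistance* class: the class name `LevenshteinSketch`, the method names and the normalization by the longer sequence are all assumptions made for illustration.

```java
class LevenshteinSketch {

    /** Edit distance between two ID sequences (insert, delete, substitute each cost 1). */
    static int distance(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty sequence takes i deletes
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    /** ASSUMPTION: divide by the longer length to obtain a value in [0, 1]. */
    static float normalized(int[] a, int[] b) {
        int longer = Math.max(a.length, b.length);
        return longer == 0 ? 0f : (float) distance(a, b) / longer;
    }

    public static void main(String[] args) {
        int[] v1 = {123, 23, 55, 141, 22};
        int[] v2 = {23, 55, 141, 22};
        System.out.println(distance(v1, v2));   // prints 1 (drop the leading 123)
        System.out.println(normalized(v1, v2)); // prints 0.2
    }
}
```

With such a measure V1 and V2 are close neighbours, whereas a position-by-position numeric comparison sees five unrelated coordinates.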

Besides, assuming both V1 and V2 belong to the same cluster, does a new cluster center V' (computed as an average of the points in V1 and V2) make sense for sequence analysis?

```
V'={0:(23+123)/2, 1:(55+23)/2, 2:(141+55)/2, 3:(22+141)/2, 4:(0+22)/1}
```

- I had to find a way to override Mahout's cluster center computation. Instead of computing an average of data points, I pick the point Pi that minimizes the total distance to all the other data points of the cluster.

Pseudo code:

```
Point min_point = null
float min_dist  = Infinity
For each point Pi
  float total = 0
  For each point Pj
     total += distance Pi->Pj
  If total < min_dist
     min_dist  = total
     min_point = Pi
Center = min_point
```
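The pseudo code above can be sketched in plain Java as follows. This is only an in-memory illustration under stated assumptions, not the project's MapReduce code: `MedoidSketch`, `medoid` and the toy `positionalMismatch` measure are hypothetical names, and any distance function over `int[]` could be plugged in instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class MedoidSketch {

    /** Returns the point whose summed distance to all points of the cluster is smallest. */
    static int[] medoid(List<int[]> points, ToDoubleBiFunction<int[], int[]> measure) {
        int[] best = null;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (int[] pi : points) {
            double total = 0;
            for (int[] pj : points) {
                total += measure.applyAsDouble(pi, pj);
            }
            if (total < bestTotal) { // keep the candidate with the lowest total so far
                bestTotal = total;
                best = pi;
            }
        }
        return best; // null only when the cluster is empty
    }

    /** Toy measure for the demo only: differing positions plus the length gap. */
    static double positionalMismatch(int[] a, int[] b) {
        int common = Math.min(a.length, b.length);
        int diff = Math.abs(a.length - b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                diff++;
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        List<int[]> cluster = Arrays.asList(
                new int[]{1, 2, 3}, new int[]{1, 2, 4}, new int[]{7, 2, 4});
        int[] center = medoid(cluster, MedoidSketch::positionalMismatch);
        System.out.println(Arrays.toString(center)); // prints [1, 2, 4]
    }
}
```

The center is always one of the observed sequences, so it stays a valid sequence of IDs. The price is a quadratic number of distance computations per cluster.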

Distance Measures
----

Supported distance measures are:

- com.aamend.hadoop.clustering.distance.LevenshteinDistance measure

- com.aamend.hadoop.clustering.distance.TanimotoDistance measure (a.k.a. Jaccard coefficient)

- Any custom measure implementing com.aamend.hadoop.clustering.distance.DistanceMeasure
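For comparison with the Levenshtein measure, the sketch below shows one common set-based definition of the Tanimoto (Jaccard) distance, `1 - |A ∩ B| / |A ∪ B|`, applied to arrays of IDs. It is an assumption that arrays are treated as sets here. How the project's *TanimotoDistance* handles duplicates or ordering is not stated in this README, and `TanimotoSketch` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

class TanimotoSketch {

    /** Set-based Tanimoto distance: 0 for identical ID sets, 1 for disjoint ones. */
    static float distance(int[] a, int[] b) {
        Set<Integer> setA = toSet(a);
        Set<Integer> setB = toSet(b);

        Set<Integer> union = new HashSet<>(setA);
        union.addAll(setB);
        if (union.isEmpty()) {
            return 0f; // ASSUMPTION: two empty sequences count as identical
        }
        Set<Integer> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        return 1f - (float) intersection.size() / union.size();
    }

    private static Set<Integer> toSet(int[] ids) {
        Set<Integer> set = new HashSet<>();
        for (int id : ids) {
            set.add(id);
        }
        return set;
    }

    public static void main(String[] args) {
        int[] v1 = {123, 23, 55, 141, 22};
        int[] v2 = {23, 55, 141, 22};
        int[] v2Reversed = {22, 141, 55, 23};
        // 4 shared IDs out of 5 distinct ones: roughly 0.2 in both cases
        System.out.println(distance(v1, v2));
        System.out.println(distance(v1, v2Reversed));
    }
}
```

A set-based measure like this one ignores ordering, so reversing V2 does not change the result. When the order of IDs matters, the Levenshtein measure is the better fit.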

Primitive Arrays
----

Only **Integer.class** is supported in version 1.0. Support for any of the Java primitive arrays (**boolean**[], **char**[], **int**[], **double**[], **long**[], **float**[]) is planned, however. I invite you to actively contribute to this project.

Dependencies
----

Even though the project has been directly inspired by Mahout canopy clustering, it does not depend on any of the Mahout libraries. Instead of Mahout's *Vector*, I use arrays of Integer, and instead of Mahout's *VectorWritable*, I use Hadoop's *ArrayPrimitiveWritable*. Simply add the Maven dependency to your project. Release versions should be available on Maven Central (synced from Sonatype). This project depends on the Hadoop libraries and has been built against the Hadoop CDH4 distribution, but this can easily be overridden on the client side with the Maven "exclusion" tag in order to use any other Hadoop version or distribution.

```
<dependency>
    <groupId>com.aamend.hadoop</groupId>
    <artifactId>hadoop-primitive-clustering</artifactId>
    <version>1.0</version>
</dependency>
```

Usage
----

### Create canopies

Use *buildClusters* static method from *com.aamend.hadoop.clustering.job.CanopyDriver* class

```
    /**
     * @param conf     the Hadoop Configuration
     * @param input    the Path containing input ArrayPrimitiveWritable
     * @param output   the final Path where clusters / data will be written to
     * @param reducers the number of reducers to use (at least 1)
     * @param measure  the DistanceMeasure
     * @param t1       the float CLUSTER_T1 distance metric
     * @param t2       the float CLUSTER_T2 distance metric
     * @param cf       the minimum observations per cluster
     * @return the number of created canopies
     */
    public static long buildClusters(Configuration conf, Path input,
                                     Path output, int reducers,
                                     DistanceMeasure measure,
                                     float t1, float t2, long cf)
```

This will build canopies using several MapReduce jobs (at least 2, driven by the initial number of reducers), for two reasons. Firstly, we need to keep track of every point observed per cluster in order to minimize the intra-cluster distance of data points, and these obviously cannot all fit in memory. Secondly, the measure used here might be fairly inefficient in a single map job (*Levenshtein* complexity is O(n\*m) for sequences of lengths n and m). In order to allow a smooth run without any hot spot, the number of reducers is halved at each iteration (until it reaches 1) while the {T1,T2} thresholds get slightly larger (starting at half of the requested values). The clustering algorithm is driven by the supplied *DistanceMeasure* (which can be a custom measure implementing DistanceMeasure, assuming it is available on the Hadoop classpath).
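To make the roles of T1 and T2 concrete, the following single-machine sketch follows the textbook streaming form of canopy clustering. A point joins every canopy whose center is closer than T1, and it seeds a new canopy unless some center is closer than T2 (with T1 > T2). It deliberately ignores the multi-job schedule, the center recomputation and the minimum-observations filter described above, and `CanopySketch` and its members are hypothetical names rather than the project's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class CanopySketch {

    /** A canopy with a fixed center (the first point that seeded it) and its members. */
    static final class Canopy {
        final int[] center;
        final List<int[]> members = new ArrayList<>();

        Canopy(int[] center) {
            this.center = center;
            members.add(center);
        }
    }

    static List<Canopy> build(List<int[]> points,
                              ToDoubleBiFunction<int[], int[]> measure,
                              double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (int[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = measure.applyAsDouble(c.center, p);
                if (d < t1) {
                    c.members.add(p);     // loose threshold: p belongs to this canopy
                }
                if (d < t2) {
                    stronglyBound = true; // tight threshold: p must not seed a new canopy
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(p));
            }
        }
        return canopies;
    }

    public static void main(String[] args) {
        // Toy data: one-element sequences compared by absolute difference.
        List<int[]> points = Arrays.asList(
                new int[]{1}, new int[]{2}, new int[]{10}, new int[]{11}, new int[]{4});
        List<Canopy> canopies = build(points, (a, b) -> Math.abs(a[0] - b[0]), 4, 2);
        System.out.println(canopies.size()); // prints 3 (centers 1, 10 and 4)
    }
}
```

In the toy run the point 4 lies within T1 of the first center but outside T2, so it is recorded as a member of that canopy and also seeds its own. Canopies are allowed to overlap.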

The **input** data should be in SequenceFile format, using any key class (implementing the *WritableComparable* interface) and *ArrayPrimitiveWritable* values (serializing an integer array).

The **output** will be in SequenceFile format, using the cluster ID as key (*IntWritable*) and *com.aamend.hadoop.clustering.clusterCanopyWritable* as value.

### Cluster input data

Once canopies are created, use static *clusterData* method from *com.aamend.hadoop.clustering.job.CanopyDriver* class

```
    /**
     * @param conf          the Configuration
     * @param inputData     the Path containing input arrays
     * @param dataPath      the final Path where data will be written to
     * @param clusterPath   the path where clusters have been written
     * @param measure       the DistanceMeasure
     * @param minSimilarity the minimum similarity to cluster data
     * @param reducers      the number of reducers to use (at least 1)
     */
    public static void clusterData(Configuration conf, Path inputData,
                                   Path dataPath, Path clusterPath,
                                   DistanceMeasure measure,
                                   float minSimilarity, int reducers)
```

This will retrieve the most probable cluster each point should belong to. A point that is not 100% identical to a cluster's center is still assigned to that cluster if its similarity is greater than X% (minSimilarity). Canopies (created at the previous step) are added to the Distributed Cache.
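The assignment rule can be pictured with the in-memory sketch below. It assumes the distance measure is normalized to [0, 1] and that similarity is defined as `1 - distance`. Neither detail is spelled out in this README, and `AssignSketch`, `assign` and the toy `mismatchFraction` measure are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class AssignSketch {

    /** Returns the index of the most similar center, or -1 when none reaches minSimilarity. */
    static int assign(int[] point, List<int[]> centers,
                      ToDoubleBiFunction<int[], int[]> measure, double minSimilarity) {
        int best = -1;
        double bestSimilarity = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            // ASSUMPTION: measure returns a distance in [0, 1], similarity = 1 - distance
            double similarity = 1.0 - measure.applyAsDouble(point, centers.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = i;
            }
        }
        // ASSUMPTION: a similarity exactly equal to the threshold is accepted
        return bestSimilarity >= minSimilarity ? best : -1;
    }

    /** Toy normalized measure for the demo: share of positions that differ. */
    static double mismatchFraction(int[] a, int[] b) {
        int longer = Math.max(a.length, b.length);
        if (longer == 0) {
            return 0.0;
        }
        int shorter = Math.min(a.length, b.length);
        int mismatches = longer - shorter;
        for (int i = 0; i < shorter; i++) {
            if (a[i] != b[i]) {
                mismatches++;
            }
        }
        return (double) mismatches / longer;
    }

    public static void main(String[] args) {
        List<int[]> centers = Arrays.asList(new int[]{1, 2, 3, 4}, new int[]{9, 9, 9, 9});
        int[] point = {1, 2, 3, 5}; // 75% similar to the first center
        System.out.println(assign(point, centers, AssignSketch::mismatchFraction, 0.7)); // prints 0
        System.out.println(assign(point, centers, AssignSketch::mismatchFraction, 0.8)); // prints -1
    }
}
```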

The **output** will be in SequenceFile format, using the cluster ID as key (*IntWritable*) and *ObjectWritable* as value (an object pointing to your initial *WritableComparable* key, so that you can keep track of which point belongs to which cluster).

License
----

Apache License, Version 2.0

Author
----

Antoine Amend
