https://github.com/irvingc/dbscan-on-spark

An implementation of DBSCAN running on top of Apache Spark

# DBSCAN on Spark

### Overview

This is an implementation of the [DBSCAN clustering algorithm](http://en.wikipedia.org/wiki/DBSCAN)
on top of [Apache Spark](http://spark.apache.org/). It is loosely based on the paper by He, Yaobin, et al.
["MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data"](http://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).

I have also created a [visual guide](http://www.irvingc.com/visualizing-dbscan) that explains how the algorithm works.

### Current version of DBSCAN is dbscan-on-spark_2.10:0.2.0-SNAPSHOT

Be aware that the current version of DBSCAN in this repo is:

- groupId: com.irvingc.spark
- artifactId: **dbscan-on-spark_2.10**
- version: **0.2.0-SNAPSHOT**

It is not published to any official repository, so to use it you need to build it yourself.

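
As a rough sketch, a locally built snapshot could be referenced from build.sbt as shown here. This assumes your build installed the artifact into the local Maven repository (for example with `mvn install`; the build command is an assumption, since this README does not state it). The coordinates are the ones listed above.

```scala
// Assumption: the snapshot was installed into the local Maven repository (~/.m2).
resolvers += Resolver.mavenLocal

// Single % because the Scala version suffix (_2.10) is already part of the artifact name.
libraryDependencies += "com.irvingc.spark" % "dbscan-on-spark_2.10" % "0.2.0-SNAPSHOT"
```
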
### Getting DBSCAN on Spark

Version 0.1.0 of DBSCAN on Spark is published to [bintray](https://bintray.com/). If you use SBT you
can include DBSCAN on Spark in your application by adding the following to your build.sbt:

```
resolvers += "bintray/irvingc" at "http://dl.bintray.com/irvingc/maven"

libraryDependencies += "com.irvingc.spark" %% "dbscan" % "0.1.0"
```

If you use Maven or Ivy you can use a similar resolver; you just
need to account for the Scala version (the example is for Scala 2.10):

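```
...
<repositories>
  <repository>
    <id>dbscan-on-spark-repo</id>
    <name>Repo for DBSCAN on Spark</name>
    <url>http://dl.bintray.com/irvingc/maven</url>
  </repository>
</repositories>
...

<dependency>
  <groupId>com.irvingc.spark</groupId>
  <artifactId>dbscan_2.10</artifactId>
  <version>0.1.0</version>
</dependency>
```
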
DBSCAN on Spark is built against Scala 2.10.

### Example usage

I have created a [sample project](https://github.com/irvingc/dbscan-on-spark-example)
showing how DBSCAN on Spark can be used. The following, however, should give you a
good idea of how to include it in your application.

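```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.dbscan.DBSCAN
import org.apache.spark.mllib.linalg.Vectors
import org.slf4j.LoggerFactory

object DBSCANSample {

  // Assumption: SLF4J (a Spark dependency) provides the logger used below.
  private val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {

    // Assumption: the inputs are passed on the command line in this order;
    // the original snippet leaves these values undefined.
    val src = args(0)                          // input path, one "x,y" point per line
    val dest = args(1)                         // output path for the labeled points
    val eps = args(2).toDouble                 // neighborhood radius
    val minPoints = args(3).toInt              // density threshold
    val maxPointsPerPartition = args(4).toInt  // upper bound on points per partition

    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    val data = sc.textFile(src)

    // Parse each comma-separated line into a dense vector and cache the result.
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

    log.info(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    // Write one "x,y,cluster" line per point.
    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)

    sc.stop()
  }
}
```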
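
To give a feel for what `eps` and `minPoints` control, here is a minimal, Spark-free sketch of the textbook DBSCAN core-point test on 2-D points. The names `CorePointSketch`, `distance`, and `isCore` are hypothetical and are not part of this library. The sketch counts the point itself as one of its own neighbors, which is the usual convention; whether this implementation does the same has not been checked here.

```scala
object CorePointSketch {
  type Point = (Double, Double)

  // Euclidean distance between two 2-D points.
  def distance(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Textbook definition: p is a core point when at least minPoints points
  // (p itself included) lie within distance eps of p.
  def isCore(p: Point, points: Seq[Point], eps: Double, minPoints: Int): Boolean =
    points.count(q => distance(p, q) <= eps) >= minPoints

  def main(args: Array[String]): Unit = {
    // Three points close together and one far away.
    val points = Seq((0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0))
    points.foreach { p =>
      println(s"$p core=${isCore(p, points, eps = 0.3, minPoints = 3)}")
    }
  }
}
```

With these values the three nearby points pass the test and the distant point does not. A larger `eps` or a smaller `minPoints` makes the test easier to pass, so clusters tend to grow and merge.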

### License

DBSCAN on Spark is available under the Apache 2.0 license.
See the [LICENSE](LICENSE) file for details.

### Credits

DBSCAN on Spark is maintained by Irving Cordova.