https://github.com/irvingc/dbscan-on-spark

An implementation of DBSCAN running on top of Apache Spark

Host: GitHub
URL: https://github.com/irvingc/dbscan-on-spark
Owner: irvingc
License: apache-2.0
Created: 2015-03-15T00:45:16.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2018-01-10T01:29:42.000Z (over 7 years ago)
Last Synced: 2024-08-03T23:04:42.251Z (12 months ago)
Language: Scala
Size: 111 KB
Stars: 183
Watchers: 19
Forks: 58
Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# DBSCAN on Spark

### Overview

This is an implementation of the [DBSCAN clustering algorithm](http://en.wikipedia.org/wiki/DBSCAN)
on top of [Apache Spark](http://spark.apache.org/). It is loosely based on the paper by He, Yaobin, et al.,
["MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data"](http://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).

I have also created a [visual guide](http://www.irvingc.com/visualizing-dbscan) that explains how the algorithm works.
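
For readers new to DBSCAN, the following single-machine sketch illustrates what the `eps` and `minPoints` parameters mean. It is plain Scala, not this library's code, and it does not reflect how DBSCAN on Spark partitions or merges data. The object name, the `-1` noise label, and the convention that a point counts as its own neighbour are assumptions made for this illustration.

```scala
import scala.collection.mutable

// Illustrative, non-distributed DBSCAN on 2D points (not the library's implementation).
object LocalDbscanSketch {
  type Point = (Double, Double)

  val Noise = -1     // label chosen for this sketch only
  val Unvisited = 0

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Indices of all points within eps of points(i), including i itself.
  private def neighbours(points: IndexedSeq[Point], i: Int, eps: Double): Seq[Int] =
    points.indices.filter(j => dist(points(i), points(j)) <= eps)

  /** Returns one label per input point: 1, 2, ... for clusters, Noise otherwise. */
  def cluster(points: IndexedSeq[Point], eps: Double, minPoints: Int): Array[Int] = {
    val labels = Array.fill(points.length)(Unvisited)
    var current = 0
    for (i <- points.indices) {
      if (labels(i) == Unvisited) {
        val seeds = neighbours(points, i, eps)
        if (seeds.size < minPoints) {
          labels(i) = Noise // may be relabelled later as a border point
        } else {
          // points(i) is a core point: start a new cluster and expand it.
          current += 1
          labels(i) = current
          val queue = mutable.Queue(seeds: _*)
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) {
              labels(j) = current // border point reached from a core point
            } else if (labels(j) == Unvisited) {
              labels(j) = current
              val more = neighbours(points, j, eps)
              if (more.size >= minPoints) queue ++= more // j is also a core point
            }
          }
        }
      }
    }
    labels
  }
}
```

A larger `eps` merges more points into the same cluster, while a larger `minPoints` makes it harder for a point to qualify as a core point, so more points end up labelled as noise.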

### Current version of DBSCAN on Spark is dbscan-on-spark_2.10:0.2.0-SNAPSHOT

Be aware that the current version of DBSCAN on Spark in this repository has the following Maven coordinates:

	<groupId>com.irvingc.spark</groupId>
	<artifactId>dbscan-on-spark_2.10</artifactId>
	<version>0.2.0-SNAPSHOT</version>

It is not published to any official repository, so to use it you need to build it yourself.
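
If the snapshot has been built and installed into your local Maven repository, an SBT project could pick it up roughly as follows. This is a sketch under assumptions: it presumes the artifact ended up in the local repository under the coordinates listed above and that your own project uses Scala 2.10. The build command for this repository is not shown here.

```scala
// build.sbt (sketch; assumes the snapshot was installed to the local Maven repository)
resolvers += Resolver.mavenLocal

// Plain % is used because the artifact id already carries the _2.10 suffix.
libraryDependencies += "com.irvingc.spark" % "dbscan-on-spark_2.10" % "0.2.0-SNAPSHOT"
```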

### Getting DBSCAN on Spark

Version 0.1.0 of DBSCAN on Spark is published to [bintray](https://bintray.com/). If you use SBT you
can include DBSCAN on Spark in your application by adding the following to your build.sbt:

```
resolvers += "bintray/irvingc" at "http://dl.bintray.com/irvingc/maven"

libraryDependencies += "com.irvingc.spark" %% "dbscan" % "0.1.0"
```

If you use Maven or Ivy you can use a similar resolver, but you
need to account for the Scala version (the example is for Scala 2.10):

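```
...
	<repositories>
		<repository>
			<id>dbscan-on-spark-repo</id>
			<name>Repo for DBSCAN on Spark</name>
			<url>http://dl.bintray.com/irvingc/maven</url>
		</repository>
	</repositories>
...
	<dependency>
		<groupId>com.irvingc.spark</groupId>
		<artifactId>dbscan_2.10</artifactId>
		<version>0.1.0</version>
	</dependency>
```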

DBSCAN on Spark is built against Scala 2.10.

### Example usage 

I have created a [sample project](https://github.com/irvingc/dbscan-on-spark-example)
showing how DBSCAN on Spark can be used. The following, however, should give you a
good idea of how to include it in your application.

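```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.dbscan.DBSCAN
import org.apache.spark.mllib.linalg.Vectors
import org.slf4j.LoggerFactory

object DBSCANSample {

  val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    // Illustrative argument order (an assumption, not prescribed by the library):
    // input path, output path, maxPointsPerPartition, eps, minPoints.
    val (src, dest, maxPointsPerPartition, eps, minPoints) =
      (args(0), args(1), args(2).toInt, args(3).toFloat, args(4).toInt)

    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    // Each input line is a comma-separated list of coordinates.
    val data = sc.textFile(src)
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

    log.info(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    // Write one "x,y,cluster" line per labeled point.
    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)

    sc.stop()
  }
}
```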
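
Once the points are labeled, a common next step is to check how many points landed in each cluster. The helper below is a minimal sketch of that idea, not part of DBSCAN on Spark: `ClusterSummary` and `clusterSizes` are hypothetical names, and the helper works on any RDD of cluster ids using Spark's `countByValue`.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the library.
object ClusterSummary {
  // Counts how many elements carry each cluster id.
  // countByValue returns the counts to the driver, so this suits
  // cases where the number of distinct cluster ids is small.
  def clusterSizes[K](clusterIds: RDD[K]): Map[K, Long] =
    clusterIds.countByValue().toMap
}
```

With the `model` from the example above, it could be called as `ClusterSummary.clusterSizes(model.labeledPoints.map(_.cluster))`.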

### License

DBSCAN on Spark is available under the Apache 2.0 license.
See the [LICENSE](LICENSE) file for details.

### Credits

DBSCAN on Spark is maintained by Irving Cordova.
