https://github.com/irvingc/dbscan-on-spark
An implementation of DBSCAN running on top of Apache Spark
- Host: GitHub
- URL: https://github.com/irvingc/dbscan-on-spark
- Owner: irvingc
- License: apache-2.0
- Created: 2015-03-15T00:45:16.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2018-01-10T01:29:42.000Z (almost 7 years ago)
- Last Synced: 2024-04-12T17:18:00.734Z (7 months ago)
- Language: Scala
- Size: 111 KB
- Stars: 182
- Watchers: 19
- Forks: 57
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DBSCAN on Spark
### Overview
This is an implementation of the [DBSCAN clustering algorithm](http://en.wikipedia.org/wiki/DBSCAN)
on top of [Apache Spark](http://spark.apache.org/). It is loosely based on the paper by He, Yaobin, et al.,
["MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data"](http://www.researchgate.net/profile/Yaobin_He/publication/260523383_MR-DBSCAN_a_scalable_MapReduce-based_DBSCAN_algorithm_for_heavily_skewed_data/links/0046353a1763ee2bdf000000.pdf).

I have also created a [visual guide](http://www.irvingc.com/visualizing-dbscan) that explains how the algorithm works.
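For readers new to DBSCAN, the following is a minimal single-machine sketch of the textbook algorithm in plain Scala, with no Spark involved. It is not this library's implementation and does not reflect how the work is partitioned across a cluster. The names `DbscanSketch`, `cluster`, `Point`, and the choice of `0` as the noise label are assumptions made for illustration only. The neighbour search is a brute-force scan, so it is only suitable for small inputs.

```scala
import scala.collection.mutable

// Hypothetical illustration only: not part of DBSCAN on Spark.
object DbscanSketch {
  type Point = (Double, Double)

  val Noise = 0                  // assumed label for points in no cluster
  private val Unvisited = -1     // internal marker, never returned

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  /** Indices of all points within `eps` of point `i` (including `i` itself). */
  private def neighbours(points: IndexedSeq[Point], i: Int, eps: Double): Seq[Int] =
    points.indices.filter(j => dist(points(i), points(j)) <= eps)

  /** Returns one label per point: Noise (0) or a cluster id starting at 1. */
  def cluster(points: IndexedSeq[Point], eps: Double, minPoints: Int): Array[Int] = {
    val labels = Array.fill(points.length)(Unvisited)
    var clusterId = 0

    for (i <- points.indices) {
      if (labels(i) == Unvisited) {
        val seeds = neighbours(points, i, eps)
        if (seeds.size < minPoints) {
          // Not a core point; may be relabelled later as a border point.
          labels(i) = Noise
        } else {
          // Core point: start a new cluster and expand it breadth-first.
          clusterId += 1
          labels(i) = clusterId
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) {
              labels(j) = clusterId // border point reached from a core point
            } else if (labels(j) == Unvisited) {
              labels(j) = clusterId
              val next = neighbours(points, j, eps)
              // Only core points keep the expansion going.
              if (next.size >= minPoints) queue ++= next
            }
          }
        }
      }
    }
    labels
  }
}
```

With two tight groups of three points each and one far-away point, calling `DbscanSketch.cluster(points, eps = 1.5, minPoints = 3)` labels the groups `1` and `2` and the isolated point `0`.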
### Current version of DBSCAN is dbscan-on-spark_2.10:0.2.0-SNAPSHOT
Be aware that the current version of DBSCAN in this repo is:

- group: com.irvingc.spark
- artifact: **dbscan-on-spark_2.10**
- version: **0.2.0-SNAPSHOT**

It is not published to any official repository, so to use it you need to build it yourself.
### Getting DBSCAN on Spark

Version 0.1.0 of DBSCAN on Spark is published to [bintray](https://bintray.com/). If you use SBT you
can include DBSCAN on Spark in your application by adding the following to your build.sbt:

```
resolvers += "bintray/irvingc" at "http://dl.bintray.com/irvingc/maven"

libraryDependencies += "com.irvingc.spark" %% "dbscan" % "0.1.0"
```

If you use Maven or Ivy you can use a similar resolver, but you
need to account for the Scala version (the example is for Scala 2.10):

```
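...
<repositories>
  <repository>
    <id>dbscan-on-spark-repo</id>
    <name>Repo for DBSCAN on Spark</name>
    <url>http://dl.bintray.com/irvingc/maven</url>
  </repository>
</repositories>
...
<dependency>
  <groupId>com.irvingc.spark</groupId>
  <artifactId>dbscan_2.10</artifactId>
  <version>0.1.0</version>
</dependency>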
```
DBSCAN on Spark is built against Scala 2.10.

### Example usage

I have created a [sample project](https://github.com/irvingc/dbscan-on-spark-example)
showing how DBSCAN on Spark can be used. The following, however, should give you a
good idea of how it should be included in your application.

```scala
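import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.dbscan.DBSCAN
import org.apache.spark.mllib.linalg.Vectors

// src, dest, eps, minPoints, maxPointsPerPartition and log are not defined in
// this snippet; supply them yourself (for example from args and a logger).
object DBSCANSample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DBSCAN Sample")
    val sc = new SparkContext(conf)

    // Parse each comma-separated input line into a dense vector.
    val data = sc.textFile(src)
    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()
    log.info(s"EPS: $eps minPoints: $minPoints")

    val model = DBSCAN.train(
      parsedData,
      eps = eps,
      minPoints = minPoints,
      maxPointsPerPartition = maxPointsPerPartition)

    // Write each point with its cluster label as "x,y,cluster".
    model.labeledPoints.map(p => s"${p.x},${p.y},${p.cluster}").saveAsTextFile(dest)
    sc.stop()
  }
}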
```

### License
DBSCAN on Spark is available under the Apache 2.0 license.
See the [LICENSE](LICENSE) file for details.

### Credits
DBSCAN on Spark is maintained by Irving Cordova.