https://github.com/fvictorio/spark-nnd
An efficient implementation of the NND algorithm on Spark
- Host: GitHub
- URL: https://github.com/fvictorio/spark-nnd
- Owner: fvictorio
- Created: 2019-03-05T17:13:01.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-03-06T00:01:10.000Z (about 6 years ago)
- Last Synced: 2025-04-01T12:23:31.605Z (2 months ago)
- Topics: apache-spark, knn
- Language: Scala
- Homepage:
- Size: 6.84 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# spark-nnd
An efficient implementation of the Nearest Neighbor Descent algorithm on
Apache Spark.

---
This is a Spark implementation of the Nearest Neighbor Descent algorithm for
building a K-nearest neighbor graph (K-NNG). The code is based on [Efficient
K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data
Sets](https://www.researchgate.net/publication/285839354_Efficient_K-Nearest_Neighbor_Graph_Construction_Using_MapReduce_for_Large-Scale_Data_Sets),
a master's thesis by Tomohiro Warashina. The [original paper describing
NND](https://dl.acm.org/citation.cfm?id=1963487) mentions a naive implementation
that could be used in a MapReduce environment, but this implementation suffers
from poor data transmission efficiency. The thesis proposes an improved
implementation on Hadoop MapReduce; this repository contains an adaptation of
that algorithm for Apache Spark.

## Usage
The package exposes an `NND` object with a single method, `buildGraph`. This
method receives an `RDD[(Long, Node)]`, where the first element of each tuple is
the id of the node and `Node` is:

```scala
case class Node(features: Vector, label: Option[Long], partition: Long = 0, finished: Boolean = false)
```

It returns an `RDD[(Long, NodeWithNeighbors)]`, where:
```scala
case class NodeWithNeighbors(features: Vector, label: Option[Long], neighbors: Seq[(Long, Double)], partition: Long = 0, finished: Boolean = false)
```

_(You can ignore the `partition` and `finished` fields; they are used in the [package](https://github.com/fvictorio/spark-rgt) from which this implementation was extracted.)_
The new `neighbors` field is a sequence of `(id, similarity)` pairs, one for each of the node's K nearest neighbors.
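As an illustration, the input RDD can be assembled from raw feature vectors along these lines. This is a sketch, not part of the package: the `sc` SparkContext, the sample data, and the assumption that `Vector` is Spark ML's `org.apache.spark.ml.linalg.Vector` are all mine.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import com.github.fvictorio.nnd.Node

// Hypothetical raw data: one dense feature vector per element.
val raw = Seq(
  Array(0.0, 1.0),
  Array(1.0, 0.0),
  Array(0.9, 0.1)
)

// Pair each vector with a unique Long id and wrap it in a Node.
// `label` is optional metadata and can be left as None.
val input: RDD[(Long, Node)] =
  sc.parallelize(raw)
    .zipWithIndex()
    .map { case (features, id) => (id, Node(Vectors.dense(features), None)) }
```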
## Example
```scala
import com.github.fvictorio.nnd.{NND, Node}
...
// parameters
val K = 10
val maxIterations = 5
val earlyTermination = 0.01
val sampleRate = 1.0
val bucketsPerInstance = 4

// get your input data
val rdd: RDD[(Long, Node)] = ???

// build the graph
val result = NND.buildGraph(rdd, K, maxIterations, earlyTermination, sampleRate, bucketsPerInstance)
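// Each value in `result` now carries its approximate K nearest neighbors
// as (id, similarity) pairs. (This inspection step is an illustrative
// sketch added here, not part of the original example.)
result.take(3).foreach { case (id, node) =>
  println(s"$id -> ${node.neighbors}")
}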
```

## Installation
In your `build.sbt` add:
```scala
resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.fvictorio" % "spark-nnd" % "master-SNAPSHOT"
```

_Disclaimer: I use [jitpack](https://jitpack.io) because I have no idea how to publish a package in Sonatype._
## Comparison
An implementation of the naive approach can be found
[here](https://github.com/tdebatty/spark-knn-graphs). The following table
compares that implementation with the one in this repository. The tests were
run on a Google Cloud cluster using different subsets of the EMNIST dataset.
The columns show how long each implementation took to build the graph, and the
maximum shuffle size (the amount of data moved between nodes in a stage).
| Number of elements | Time (s), compared impl. | Time (s), this impl. | Max shuffle (MB), compared impl. | Max shuffle (MB), this impl. |
|---|---|---|---|---|
| 2K | 170 | 149 | 366 | 39 |
| 4K | 139 | 132 | 727 | 76 |
| 8K | 724 | 229 | 1462 | 147 |
| 16K | 1411 | 701 | 2900 | 290 |