https://github.com/fvictorio/spark-nnd

An efficient implementation of the NND algorithm on Spark

Host: GitHub
URL: https://github.com/fvictorio/spark-nnd
Owner: fvictorio
Created: 2019-03-05T17:13:01.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2019-03-06T00:01:10.000Z (about 6 years ago)
Last Synced: 2025-04-01T12:23:31.605Z (2 months ago)
Topics: apache-spark, knn
Language: Scala
Homepage:
Size: 6.84 KB
Stars: 4
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

# spark-nnd

An efficient implementation of the Nearest Neighbor Descent algorithm on
Apache Spark.

---

This is a Spark implementation of the Nearest Neighbor Descent (NND) algorithm
for building a K-nearest neighbor graph (K-NNG). The code is based on [Efficient
K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data
Sets](https://www.researchgate.net/publication/285839354_Efficient_K-Nearest_Neighbor_Graph_Construction_Using_MapReduce_for_Large-Scale_Data_Sets),
a master's thesis by Tomohiro Warashina. The [original paper describing
NND](https://dl.acm.org/citation.cfm?id=1963487) mentions a naive implementation
that could be used in a MapReduce environment, but that implementation suffers
from low data transmission efficiency. The thesis proposes an improved
implementation on Hadoop MapReduce, and this repository adapts that algorithm
to Apache Spark.
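
For readers unfamiliar with NND, the core idea ("a neighbor of a neighbor is likely to be a neighbor") can be sketched on a single machine. The following is only an illustration of the basic algorithm, not the code in this repository and not the optimized MapReduce variant from the thesis. The object name `NNDescentSketch`, the Euclidean `distance` function, the `seed` parameter and the use of plain Scala collections are all assumptions made for this sketch.

```scala
import scala.util.Random

object NNDescentSketch {
  type Point = Array[Double]
  type Graph = Map[Int, Seq[(Int, Double)]]

  // Euclidean distance: an assumed metric for this sketch (lower means closer).
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Returns, for each point index, k approximate nearest neighbors
  // as (index, distance) pairs sorted from closest to farthest.
  def buildGraph(points: IndexedSeq[Point], k: Int, maxIterations: Int,
                 earlyTermination: Double, seed: Long = 42L): Graph = {
    val n = points.length
    require(k > 0 && k < n, "k must be between 1 and n - 1")
    val rng = new Random(seed)

    // Start from k random neighbors per point.
    var graph: Graph = (0 until n).map { v =>
      val sample = rng.shuffle((0 until n).filter(_ != v).toList).take(k)
      v -> sample.map(u => (u, distance(points(v), points(u)))).sortBy(_._2)
    }.toMap

    var iteration = 0
    var converged = false
    while (iteration < maxIterations && !converged) {
      // Reverse neighbors: u is in reverse(v) when v is a neighbor of u.
      val reverse: Map[Int, Seq[Int]] = graph.toSeq
        .flatMap { case (u, ns) => ns.map { case (v, _) => (v, u) } }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2) }
      // General neighbors: forward neighbors plus reverse neighbors.
      val general: Map[Int, Seq[Int]] = graph.map { case (v, ns) =>
        v -> (ns.map(_._1) ++ reverse.getOrElse(v, Seq.empty[Int])).distinct
      }
      // Candidates for v are the general neighbors of its general neighbors.
      val next: Graph = graph.map { case (v, current) =>
        val known = current.map(_._1).toSet
        val candidates = general(v).flatMap(u => general(u)).distinct
          .filter(u => u != v && !known.contains(u))
          .map(u => (u, distance(points(v), points(u))))
        v -> (current ++ candidates).sortBy(_._2).take(k)
      }
      // Count how many neighbor entries changed in this iteration.
      val updates = next.iterator.map { case (v, ns) =>
        val before = graph(v).map(_._1).toSet
        ns.count { case (u, _) => !before.contains(u) }
      }.sum
      graph = next
      // Stop early once the number of updates is at most earlyTermination * n * k.
      converged = updates <= earlyTermination * n * k
      iteration += 1
    }
    graph
  }
}
```

The result is approximate: each point always ends up with exactly `k` distinct neighbors, but they are not guaranteed to be the true nearest ones. In a distributed setting the expensive part is exchanging the candidate lists between nodes, which is the cost the thesis sets out to reduce.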

## Usage

The package exposes an `NND` object with a single method, `buildGraph`. This
method receives an `RDD[(Long, Node)]`, where the first element of each tuple is
the id of the element and `Node` is:

```scala
case class Node(features: Vector, label: Option[Long], partition: Long = 0, finished: Boolean = false)
```

and returns an `RDD[(Long, NodeWithNeighbors)]` where:

```scala
case class NodeWithNeighbors(features: Vector, label: Option[Long], neighbors: Seq[(Long, Double)], partition: Long = 0, finished: Boolean = false)
```

_(You can ignore the `partition` and `finished` fields; they are used in the [package](https://github.com/fvictorio/spark-rgt) from which this implementation was extracted.)_

The new `neighbors` field is a sequence with the id and the similarity of each neighbor.
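
As an illustration of how the `neighbors` field might be consumed, the sketch below flattens a graph into a list of directed edges. To stay runnable without Spark it works on a plain `Seq` and uses a simplified stand-in class: `SimpleNodeWithNeighbors`, `NeighborsSketch` and `toEdges` are hypothetical names, `features` is an `Array[Double]` instead of the library's `Vector`, and the `partition` and `finished` fields are omitted.

```scala
object NeighborsSketch {
  // Simplified stand-in for NodeWithNeighbors (hypothetical, for illustration only).
  case class SimpleNodeWithNeighbors(
    features: Array[Double],
    label: Option[Long],
    neighbors: Seq[(Long, Double)]
  )

  // Flatten (id, node) pairs into directed edges:
  // (source id, neighbor id, similarity of that neighbor).
  def toEdges(graph: Seq[(Long, SimpleNodeWithNeighbors)]): Seq[(Long, Long, Double)] =
    graph.flatMap { case (id, node) =>
      node.neighbors.map { case (neighborId, similarity) => (id, neighborId, similarity) }
    }
}
```

Since `RDD` also provides `flatMap`, the same function body should carry over to the `RDD[(Long, NodeWithNeighbors)]` returned by `buildGraph`, though that has not been checked against this package.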

## Example

```scala
import org.apache.spark.rdd.RDD
import com.github.fvictorio.nnd.{NND, Node}

...

// parameters
val K = 10
val maxIterations = 5
val earlyTermination = 0.01
val sampleRate = 1.0
val bucketsPerInstance = 4

// get your input data
val rdd: RDD[(Long, Node)] = ???

// build the graph
val result = NND.buildGraph(rdd, K, maxIterations, earlyTermination, sampleRate, bucketsPerInstance)
```

## Installation

In your `build.sbt` add:

```scala
resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.fvictorio" % "spark-nnd" % "master-SNAPSHOT"
```

_Disclaimer: I use [jitpack](https://jitpack.io) because I have no idea how to publish a package in Sonatype._

## Comparison

An implementation of the naive approach can be found
[here](https://github.com/tdebatty/spark-knn-graphs). The following table
compares that implementation with the one in this repository. The tests were
run on a Google Cloud cluster using different subsets of the EMNIST dataset.
The columns show how long each implementation took to build the graph and the
maximum shuffle size (the amount of data moved between nodes in a stage).

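| Number of elements | Time (seconds), compared implementation | Time (seconds), this implementation | Max shuffle size (MB), compared implementation | Max shuffle size (MB), this implementation |
| --- | --- | --- | --- | --- |
| 2K | 170 | 149 | 366 | 39 |
| 4K | 139 | 132 | 727 | 76 |
| 8K | 724 | 229 | 1462 | 147 |
| 16K | 1411 | 701 | 2900 | 290 |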
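
To make the comparison easier to read, the ratios between the two implementations can be derived from the figures above. This is a small sketch using only the numbers in the table; the names `ComparisonRatios`, `Row`, `speedup` and `shuffleReduction` are introduced here for illustration.

```scala
object ComparisonRatios {
  // One row of the comparison table: times in seconds, shuffle sizes in MB.
  case class Row(elements: String, timeCompared: Int, timeThis: Int,
                 shuffleCompared: Int, shuffleThis: Int)

  val rows: Seq[Row] = Seq(
    Row("2K", 170, 149, 366, 39),
    Row("4K", 139, 132, 727, 76),
    Row("8K", 724, 229, 1462, 147),
    Row("16K", 1411, 701, 2900, 290)
  )

  // How many times faster this implementation was than the compared one.
  def speedup(r: Row): Double = r.timeCompared.toDouble / r.timeThis

  // How many times smaller the maximum shuffle size was.
  def shuffleReduction(r: Row): Double = r.shuffleCompared.toDouble / r.shuffleThis
}
```

With these figures the maximum shuffle size is roughly 9 to 10 times smaller at every dataset size, while the time ratio ranges from about 1.05 (4K elements) to about 3.2 (8K elements).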
