https://github.com/searchivarius/nndes

Automatically exported from code.google.com/p/nndes

Host: GitHub
URL: https://github.com/searchivarius/nndes
Owner: searchivarius
License: bsd-2-clause
Created: 2015-08-13T06:43:53.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2015-08-13T06:57:22.000Z (over 9 years ago)
Last Synced: 2025-03-17T18:19:57.211Z (about 2 months ago)
Language: C++
Size: 152 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README
- License: LICENSE

        NNDES: A Library for Efficient K-NN Graph Construction

Wei Dong

Copyright (C) 2010, 2011 Wei Dong.

0. IMPORTANT INFORMATION FOR USERS OF AN EARLIER RELEASE

The meaning of the parameter S of the NNDescent constructor
has been changed from an absolute number to a sample rate to
be consistent with the standalone program.

1. Background

This package implements the NN-Descent algorithm [1].  Given a set of
objects and a similarity measure, the algorithm simultaneously finds
(approximately) the k objects most similar to each object.

The standalone program supports dense vectors with L2 distance only.
The library supports arbitrary similarity measures.

THIS SOFTWARE DOES NOT HAVE ANY GUARANTEE ON THE ACCURACY OF THE
RESULT.  USE THE CONTROL (SEE SECTION 3) TO ESTIMATE THE EMPIRICAL ACCURACY.

2. Building

This package is coded in C++ and depends on the following library:

    * Boost > 1.41 (assert, random, progress, program_options)

To build the package, edit Makefile to reflect your environment and type
"make".

3. Using the Program

Usage: nndes {-D DIMENSION | --lshkit} [OTHER OPTIONS]... INPUT [OUTPUT]

General options:
  -h [ --help ]         produce help message.
  --input arg           input path
  --output arg          output path
  -K [ -- ] arg (=20)   number of nearest neighbor
  --fast                use fast configuration (less accurate)
  --control arg (=0)    number of control points
  -D [ --dim ] arg      dimension, see format
  --skip arg (=0)       see format
  --gap arg (=0)        see format
  --lshkit              use LSHKIT data format

Input Format:

  The INPUT file is parsed as an architecture-dependent binary file.  The
  initial "skip" bytes are skipped.  After that, every D*sizeof(float)
  bytes are read as a D-dimensional vector.  There may be an optional
  "gap"-byte gap between consecutive vectors.  Therefore, the program
  expects the file to contain [size(INPUT)-skip]/[D*sizeof(float)+gap]
  vectors.

  If the option "--lshkit" is specified, the initial 3*sizeof(int)
  bytes are interpreted as three 32-bit integers: sizeof(float), the number
  of vectors in the file, and the dimension.  The program then sets D =
  dimension, skip = 3 * sizeof(int) and gap = 0.
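
  The layout arithmetic above can be sketched in C++ as follows.  This is
  only an illustration: the names expected_vector_count and LshkitHeader
  are hypothetical and are not part of nndes, and the header struct simply
  mirrors the three integers in the order listed above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Number of vectors implied by the raw layout:
//   [size(INPUT) - skip] / [D * sizeof(float) + gap]
// Hypothetical helper for illustration; not part of nndes.
inline std::size_t expected_vector_count(std::size_t file_size, std::size_t skip,
                                         std::size_t dim, std::size_t gap) {
    if (dim == 0) throw std::invalid_argument("dimension must be positive");
    if (file_size < skip) throw std::invalid_argument("file is smaller than skip");
    const std::size_t record = dim * sizeof(float) + gap;  // bytes per vector plus gap
    return (file_size - skip) / record;
}

// The three leading 32-bit integers of an LSHKIT-format file, in the
// order described in the text.  Hypothetical struct for illustration.
struct LshkitHeader {
    std::int32_t element_size;  // expected to equal sizeof(float)
    std::int32_t count;         // number of vectors in the file
    std::int32_t dim;           // dimension of each vector
};
```

  For example, a file with a 12-byte header and ten 4-dimensional vectors
  with no gap yields a count of 10.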

Output Format:

  Each input vector is assigned a serial ID (0, 1, ...) according to
  the order in which it appears in the input.  Each output line contains
  the ID of a point followed by the IDs of its K nearest neighbors.

Control:

  To measure the accuracy of the algorithm, the number of points given by
  "--control" are randomly sampled, and their k-nearest neighbors are found
  with brute-force search.  The control is then used to measure the recall,
  i.e., the average fraction of true k-nearest neighbors found for each
  sampled point.
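
  A minimal sketch of the recall computation described above, assuming the
  true and approximate neighbor lists are available as vectors of IDs.  The
  function names point_recall and average_recall are hypothetical and do
  not come from nndes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of one point's true neighbors that also appear in its
// approximate neighbor list.
inline double point_recall(const std::vector<int>& truth,
                           const std::vector<int>& approx) {
    if (truth.empty()) return 0.0;
    std::size_t hits = 0;
    for (int id : truth) {
        if (std::find(approx.begin(), approx.end(), id) != approx.end()) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(truth.size());
}

// Average of point_recall over the control points.  Only the points
// present in both inputs are compared; returns 0 if there are none.
inline double average_recall(const std::vector<std::vector<int>>& truth,
                             const std::vector<std::vector<int>>& approx) {
    const std::size_t n = std::min(truth.size(), approx.size());
    if (n == 0) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += point_recall(truth[i], approx[i]);
    return sum / static_cast<double>(n);
}
```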

Progress Report:

  The following parameters are reported after each iteration:

  update: update rate of the K * N result entries.
  recall: estimated recall, or 0 if no control is specified.
  cost: number of similarity evaluations / [N*(N-1)/2, the brute force cost].
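
  The update and cost figures are simple ratios.  The sketch below shows
  one way to compute them; it assumes "update rate" means the number of
  improved entries divided by K * N, and the function names are
  hypothetical, not taken from nndes.

```cpp
// Share of the K * N result entries that were improved in one iteration.
// Assumes the update rate is (improved entries) / (K * N).
inline double update_rate(long long updated, long long n, long long k) {
    if (n <= 0 || k <= 0) return 0.0;
    return static_cast<double>(updated) / static_cast<double>(n * k);
}

// Similarity evaluations relative to the brute-force cost N*(N-1)/2.
inline double relative_cost(long long evaluations, long long n) {
    if (n < 2) return 0.0;  // brute-force cost is zero for fewer than two objects
    const double brute_force = static_cast<double>(n) * static_cast<double>(n - 1) / 2.0;
    return static_cast<double>(evaluations) / brute_force;
}
```

  A relative cost of 1.0 means as many similarity evaluations as a full
  brute-force comparison of all pairs.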

The following hidden options are also supported for experts to
fine-tune the algorithm:

-I [ --iteration ] arg (=100)   Maximal number of iterations allowed.
-S [ --rho ] arg (=1.0)         Sample rate.  Use 1.0 for slow & accurate
                                configuration and 0.5 for fast & less-accurate
                                configuration.  See Section 2.5 of the paper
                                for more information.  The effect of the
                                "--fast" option is to change the default
                                value of "-S" to 0.5.
-T [ --delta ] arg (=0.001)     Termination threshold T. See Section 2.6 of
                                the paper for more information.
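
The sketch below illustrates how an iteration cap and a termination
threshold could interact.  It is an assumption about the control flow, not
the program's actual code: the stopping test (updates < N * K * delta) is
the one used in the library example in Section 4, and the function name
run_until_converged and its parameters are hypothetical.

```cpp
// Calls step() until it reports fewer than n * k * delta updated entries
// or until max_iterations calls have been made.  Returns the number of
// iterations actually run.  step is any callable returning an int, e.g.
// a wrapper around one iteration of the algorithm.
template <typename Step>
int run_until_converged(Step step, int n, int k, float delta, int max_iterations) {
    int iterations = 0;
    while (iterations < max_iterations) {
        const int updated = step();
        ++iterations;
        if (updated < n * k * delta) break;  // too few improvements: stop
    }
    return iterations;
}
```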

4. Using the Library

To use the library, copy all the header files to where your C++ compiler
can find them and include the library header in your source code.  The
library is header only and nothing extra needs to be linked against your
application.

The main interface is the class nndes::NNDescent, which is explained
below.

namespace nndes {

    template <typename ORACLE>
    class NNDescent {
        ...
    public:
        NNDescent (int N, int K, float S, ORACLE const &oracle_);
        // The constructor.
        //      N:      number of objects,
        //      K:      # nearest neighbors to find
        //      S:      sample rate.  See Section 2.5 of the paper
        //      oracle: the similarity oracle.  For any 0 <= i, j < N,
        //              (float)oracle(i,j) should return the similarity between
        //              the i-th and j-th objects.  Smaller value means higher
        //              similarity.

        int iterate ();
        // Run one iteration.
        // There are K * N entries in the results, and the return value
        // is how many of these entries are improved.

        const vector<KNN> &getNN () const;
        // Get the current results.  The vector has N entries, one for each
        // object.

        long long int getCost () const;
        // Get the current cost, i.e., the number of invocations of the
        // similarity oracle since the NNDescent object was created.
    };

    typedef vector<KNNEntry> KNN;

    struct KNNEntry {
        int key;        // from 0 to N - 1
        float dist;     // similarity value returned by the oracle
    };

}

Following is how the class NNDescent is typically used:

    // declare oracle class
    class MyOracle {
        vector<Object> objects;     // Object stands for your own object type
    public:
        MyOracle (...) {
            ... // load objects
        }
        float operator () (int i, int j) const {
            return distance between objects[i] and objects[j];
        }
    };

    MyOracle oracle(...);
    NNDescent<MyOracle> nndes(N, K, S, oracle);

    for (;;) {
        int up = nndes.iterate();
        if (up < N * K * delta) break;
        // probably do something with the partial result nndes.getNN()
        // or report cost nndes.getCost()
    }

    vector<KNN> const &result = nndes.getNN();
    ... // do something with the final result.
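
As a concrete illustration of the oracle requirement, here is a minimal
sketch of an oracle over dense float vectors using L2 distance.  The class
name L2Oracle and its size() method are hypothetical and are not shipped
with the library; the sketch assumes all vectors have the same dimension
and only relies on the documented contract that oracle(i, j) returns a
float where a smaller value means higher similarity.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical oracle: L2 (Euclidean) distance between stored vectors.
// Smaller return values mean more similar objects.
class L2Oracle {
    std::vector<std::vector<float>> objects;
public:
    explicit L2Oracle(std::vector<std::vector<float>> data)
        : objects(std::move(data)) {}

    // Number of stored objects (the N passed to the constructor of NNDescent).
    int size() const { return static_cast<int>(objects.size()); }

    // Distance between the i-th and j-th objects.
    float operator()(int i, int j) const {
        const std::vector<float>& a = objects[i];
        const std::vector<float>& b = objects[j];
        float sum = 0.0f;
        for (std::size_t d = 0; d < a.size(); ++d) {
            const float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return std::sqrt(sum);
    }
};
```

An object of this kind could be passed as the oracle argument in the
usage pattern shown above, with N taken from size().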

5. Sample Datasets

Sample datasets in LSHKIT format can be downloaded at

    http://www.cs.princeton.edu/~wdong/data/

See the paper for details about the datasets.

6. Known Bugs

7. Contact

Please contact the author about problems and bug reports.

8. Reference

[1] Wei Dong, Moses Charikar, and Kai Li. 2011. Efficient k-nearest
neighbor graph construction for generic similarity measures. In
Proceedings of the 20th International Conference on World Wide Web (WWW
'11). ACM, New York, NY, USA, 577-586.
