https://github.com/whitead/vdict

Host: GitHub
URL: https://github.com/whitead/vdict
Owner: whitead
License: mit
Created: 2022-12-03T00:06:13.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-01-08T07:30:34.000Z (almost 2 years ago)
Last Synced: 2024-11-06T02:06:45.353Z (about 2 months ago)
Language: Python
Size: 11.7 KB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# vdict

[![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/ur-whitelab/vdict)

[![tests](https://github.com/ur-whitelab/vdict/actions/workflows/tests.yml/badge.svg)](https://github.com/ur-whitelab/vdict)

[![PyPI version](https://badge.fury.io/py/vdict.svg)](https://badge.fury.io/py/vdict)

[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)

This is a very thin wrapper around [hnswlib](https://github.com/nmslib/hnswlib) that makes it look like a Python dictionary whose keys are numpy arrays. Install with `pip install vdict`.

```python
import numpy as np
from vdict import vdict

data = vdict()

# keys are numpy vectors, values can be any Python object
v1 = np.random.rand(32)
v2 = np.random.rand(32)
data[v1] = 'hello'
data[v2] = 32

assert data[v1] == 'hello'
```

You can have it raise an `IndexError` when you access a key that doesn't exist by passing a small `tol`:

```python
import numpy as np
from vdict import vdict

data = vdict(tol=0.001)

v1 = np.random.rand(32)
v2 = np.random.rand(32)
data[v1] = 'hello'

# this will raise an IndexError because v2 has not been added yet
print(data[v2])
```

The default tolerance is `1`, which generally does not raise errors. Set it to a smaller value to make matching stricter.
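To make the tolerance idea concrete, here is a minimal brute-force sketch in plain Python. It is not how `vdict` is implemented (that goes through hnswlib's approximate index); the class name `BruteForceVDict` is hypothetical, and it assumes the tolerance is compared against the Euclidean distance to the nearest stored key, which may differ from the exact distance `vdict` uses.

```python
import math


class BruteForceVDict:
    """Exact nearest-key lookup with a tolerance.

    Hypothetical stand-in for illustration only; not vdict's code.
    """

    def __init__(self, tol=1.0):
        self.tol = tol
        self._items = {}  # maps key tuple -> value

    def __setitem__(self, key, value):
        # Store the vector as a tuple so it can be used as a dict key.
        self._items[tuple(key)] = value

    def __getitem__(self, key):
        if not self._items:
            raise IndexError("no keys stored")
        query = tuple(key)
        # Linear scan for the stored key closest to the query.
        nearest = min(self._items, key=lambda k: math.dist(query, k))
        # Assumption: tol bounds the Euclidean distance to the nearest key.
        if math.dist(query, nearest) > self.tol:
            raise IndexError("no key within tolerance")
        return self._items[nearest]


data = BruteForceVDict(tol=0.001)
data[(0.0, 0.0)] = 'hello'

# A query within the tolerance resolves to the stored key.
assert data[(0.0, 0.0005)] == 'hello'
```

A query far from every stored key, such as `data[(1.0, 1.0)]`, raises `IndexError` in this sketch. With a loose tolerance, the same query would instead return the value of whichever key happens to be nearest.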

## Details

* All vectors must be the same length

* Accessing with a vector returns the value stored under the closest key vector

* The algorithm is *approximate* nearest neighbor search. You can tune the accuracy (see below)

* You can have millions of vectors in the dictionary

* If you know the approximate size, pass `est_nelements` to `vdict()` to reduce how often the index is resized

## Usage

The `vdict` class has reasonable defaults, but you may need to tune them for your use case. The parameters are set in the constructor, and you can read about them in the [hnswlib](https://github.com/nmslib/hnswlib) documentation. Briefly, the most important ones are:

* `M` - the number of neighbors to consider when building the graph (higher `M` means more accurate, but more memory). 12-48 is typical.

* `space` - the distance metric to use. The default is `l2`, but you can also use `cosine` or `ip` (inner product).

* `ef_construction` - controls the speed/accuracy trade-off during index construction. 50-200 is typical.

```python
import numpy as np
from vdict import vdict

data = vdict(M=16, space='cosine', ef_construction=100)

# add some vectors
data[np.random.rand(32)] = 'hello'
data[np.random.rand(32)] = 'world'
```
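For reference, the three `space` options correspond to the distance definitions given in the hnswlib documentation. The plain-Python sketch below writes them out. The function names are illustrative and not part of `vdict` or hnswlib. Note that hnswlib's `l2` is the squared Euclidean distance, and that `ip` and `cosine` are expressed as one minus a similarity, so smaller always means closer.

```python
import math


def l2(a, b):
    # 'l2' in hnswlib: squared Euclidean distance (no square root).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip(a, b):
    # 'ip' in hnswlib: one minus the inner product.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # 'cosine' in hnswlib: one minus the cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm


# Vectors pointing the same way are at cosine distance ~0 regardless of length,
# while their l2 distance still reflects the difference in magnitude.
a, b = (1.0, 1.0), (2.0, 2.0)
assert abs(cosine(a, b)) < 1e-12
assert l2(a, b) == 2.0
```

Which metric fits depends on the vectors: `cosine` ignores magnitude, so it is a common choice for embeddings, while `l2` treats vectors of different lengths as different keys. How the tolerance interacts with each metric's scale is not spelled out here, so it is worth checking empirically for your data.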

## License

MIT

## Author

Andrew White