https://github.com/ashvardanian/usearch-binary

Binary vector search example using Unum's USearch engine and pre-computed Wikipedia embeddings from Co:here and MixedBread

binary-vector bitset vector-database vector-search

Host: GitHub
URL: https://github.com/ashvardanian/usearch-binary
Owner: ashvardanian
License: apache-2.0
Created: 2024-03-26T03:47:01.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-04-09T06:19:08.000Z (about 2 years ago)
Last Synced: 2025-04-11T03:02:23.851Z (about 1 year ago)
Topics: binary-vector, bitset, vector-database, vector-search
Language: Jupyter Notebook
Homepage: https://github.com/unum-cloud/usearch
Size: 66.4 KB
Stars: 18
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Binary Vector Search Examples for USearch

This repository contains examples of constructing binary vector-search indices for Wikipedia embeddings available on the Hugging Face Hub:

- [Co:here](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3)

- [MixedBread.ai](https://huggingface.co/datasets/mixedbread-ai/wikipedia-embed-en-2023-11)

## Running Examples

To view the results, open the [`bench.ipynb`](bench.ipynb) notebook.

To replicate the results, first, download the data:

```sh
$ pip install -r requirements.txt
$ python download.py
$ ls -alh mixedbread | head -n 1
> total 15G
$ ls -alh cohere | head -n 1
> total 15G
```

In both cases, the embeddings have 1024 dimensions, each represented with a single bit, packed into 128-byte vectors.

At least 32 GB of RAM is recommended to run the scripts.
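
To make the 1024-bit, 128-byte layout concrete, here is a minimal sketch of packing one embedding into bytes using only the Python standard library. The function name `pack_bits`, the sign-based threshold, and the most-significant-bit-first order are assumptions made for illustration; the datasets ship pre-computed embeddings and may use a different binarization rule or bit order.

```python
from typing import Sequence

DIMENSIONS = 1024                   # bits per embedding, as stated above
BYTES_PER_VECTOR = DIMENSIONS // 8  # 128 bytes


def pack_bits(values: Sequence[float]) -> bytes:
    """Binarize `values` by sign and pack them 8 per byte, most significant bit first.

    Assumption: a component maps to 1 when it is strictly positive.
    """
    if len(values) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} values, got {len(values)}")
    packed = bytearray(BYTES_PER_VECTOR)
    for index, value in enumerate(values):
        if value > 0:
            # Set the bit for this dimension inside its byte.
            packed[index // 8] |= 0x80 >> (index % 8)
    return bytes(packed)
```

Whatever the actual rule is, the size arithmetic stays the same: 1024 one-bit dimensions divided by 8 bits per byte gives 128 bytes per vector.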

## Optimizations

Knowing the length of the embeddings ahead of time enables several optimizations.

If the embeddings are only 1024 bits long, two 512-bit ZMM registers are enough to hold an entire vector.

We don't need any `for`-loops, so the entire operation can be unrolled and inlined.

```c
#include <immintrin.h> // AVX-512 intrinsics; `_mm512_popcnt_epi64` needs AVX512VPOPCNTDQ
#include <stdint.h>

// Hamming distance between two 1024-bit (128-byte) vectors, fully unrolled.
static inline uint64_t hamming_distance(uint8_t const* first_vector, uint8_t const* second_vector) {
    // Load each vector as two 512-bit halves; unaligned addresses are allowed.
    __m512i const first_start = _mm512_loadu_si512((__m512i const*)(first_vector));
    __m512i const first_end = _mm512_loadu_si512((__m512i const*)(first_vector + 64));
    __m512i const second_start = _mm512_loadu_si512((__m512i const*)(second_vector));
    __m512i const second_end = _mm512_loadu_si512((__m512i const*)(second_vector + 64));
    // XOR marks the differing bits; popcount counts them in each 64-bit lane.
    __m512i const differences_start = _mm512_xor_epi64(first_start, second_start);
    __m512i const differences_end = _mm512_xor_epi64(first_end, second_end);
    __m512i const population_start = _mm512_popcnt_epi64(differences_start);
    __m512i const population_end = _mm512_popcnt_epi64(differences_end);
    // Add the per-lane counts of both halves, then reduce them to one scalar.
    __m512i const population = _mm512_add_epi64(population_start, population_end);
    return _mm512_reduce_add_epi64(population);
}
```
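
When checking a kernel like the one above, a slow but obviously correct reference helps. The sketch below mirrors the same XOR-then-popcount idea in plain Python. The name `hamming_distance_reference` is hypothetical and is not part of this repository or of USearch.

```python
VECTOR_BYTES = 128  # 1024 bits, matching the embeddings described above


def hamming_distance_reference(first: bytes, second: bytes) -> int:
    """Count differing bits between two 128-byte vectors: XOR, then popcount."""
    if len(first) != VECTOR_BYTES or len(second) != VECTOR_BYTES:
        raise ValueError(f"both vectors must be {VECTOR_BYTES} bytes long")
    # Byte order does not change the bit count, as long as both sides agree.
    differences = int.from_bytes(first, "little") ^ int.from_bytes(second, "little")
    return bin(differences).count("1")
```

Comparing its output against the SIMD kernel on random 128-byte inputs is one simple way to catch mistakes in load offsets or lane reductions.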

To run the kernel benchmarks, use the following command:

```sh
$ python kernel.py
```

To run benchmarks over real data:

```sh
$ python kernels.py --dir cohere --limit 1e6
```
