{"id":16976164,"url":"https://github.com/ashvardanian/usearch-binary","last_synced_at":"2025-04-12T00:32:25.406Z","repository":{"id":229790771,"uuid":"777547736","full_name":"ashvardanian/usearch-binary","owner":"ashvardanian","description":"Binary vector search example using Unum's USearch engine and pre-computed Wikipedia embeddings from Co:here and MixedBread","archived":false,"fork":false,"pushed_at":"2024-04-09T06:19:08.000Z","size":68,"stargazers_count":18,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-11T03:02:23.851Z","etag":null,"topics":["binary-vector","bitset","vector-database","vector-search"],"latest_commit_sha":null,"homepage":"https://github.com/unum-cloud/usearch","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-03-26T03:47:01.000Z","updated_at":"2024-12-30T22:29:25.000Z","dependencies_parsed_at":"2024-04-09T07:39:58.489Z","dependency_job_id":null,"html_url":"https://github.com/ashvardanian/usearch-binary","commit_stats":null,"previous_names":["ashvardanian/usearch-binary"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-binary","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-binary/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-binary/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-binary/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/usearch-binary/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248501427,"owners_count":21114674,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-vector","bitset","vector-database","vector-search"],"created_at":"2024-10-14T01:25:10.251Z","updated_at":"2025-04-12T00:32:25.136Z","avatar_url":"https://github.com/ashvardanian.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# Binary Vector Search Examples for USearch\n\nThis repository contains examples for constructing binary vector-search indices for Wikipedia embeddings available on the HuggingFace portal:\n\n- [Co:here](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3)\n- [MixedBread.ai](https://huggingface.co/datasets/mixedbread-ai/wikipedia-embed-en-2023-11)\n\n## Running Examples\n\nTo view the results, check out [`bench.ipynb`](bench.ipynb).\nTo replicate the results, first download the data:\n\n```sh\n$ pip install -r requirements.txt\n$ python download.py\n$ ls -alh mixedbread | head -n 1\n\u003e total 15G\n$ ls -alh cohere | head -n 1\n\u003e total 15G\n```\n\nIn both cases, the embeddings have 1024 dimensions, each represented with a single bit, packed into 128-byte vectors.\n32 GB of RAM is recommended to run the scripts.\n\n## Optimizations\n\nKnowing the length of the embeddings in advance is very handy for optimizations.\nIf the embeddings are only 1024 bits long, we only need 2 ZMM registers (512 bits each) to store the entire vector.\nWe don't need any `for`-loops; the entire operation can be unrolled and inlined.\n\n```c\n#include \u003cimmintrin.h\u003e // AVX-512 intrinsics\n#include \u003cstdint.h\u003e    // uint8_t, uint64_t\n\n// Hamming distance between two 1024-bit (128-byte) vectors.\n// Requires AVX-512F and AVX-512 VPOPCNTDQ (e.g. `-mavx512f -mavx512vpopcntdq`).\ninline uint64_t hamming_distance(uint8_t const* first_vector, uint8_t const* second_vector) {\n    __m512i const first_start = _mm512_loadu_si512((__m512i const*)(first_vector));\n    __m512i const first_end = _mm512_loadu_si512((__m512i const*)(first_vector + 64));\n    __m512i const second_start = _mm512_loadu_si512((__m512i const*)(second_vector));\n    __m512i const second_end = _mm512_loadu_si512((__m512i const*)(second_vector + 64));\n    __m512i const differences_start = _mm512_xor_epi64(first_start, second_start);\n    __m512i const differences_end = _mm512_xor_epi64(first_end, second_end);\n    __m512i const population_start = _mm512_popcnt_epi64(differences_start);\n    __m512i const population_end = _mm512_popcnt_epi64(differences_end);\n    __m512i const population = _mm512_add_epi64(population_start, population_end);\n    return _mm512_reduce_add_epi64(population);\n}\n```\n\nTo run the kernel benchmarks, use the following command:\n\n```sh\n$ python kernel.py\n```\n\nTo run benchmarks over real data:\n\n```sh\n$ python kernels.py --dir cohere --limit 1e6\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fusearch-binary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fusearch-binary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fusearch-binary/lists"}