{"id":13532146,"url":"https://github.com/puffinn/puffinn","last_synced_at":"2025-04-01T20:31:27.961Z","repository":{"id":43864915,"uuid":"194040273","full_name":"puffinn/puffinn","owner":"puffinn","description":"Parameterless and Universal FInding of Nearest Neighbors","archived":false,"fork":false,"pushed_at":"2025-03-06T14:09:21.000Z","size":1176,"stargazers_count":59,"open_issues_count":7,"forks_count":10,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-06T15:24:28.479Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/puffinn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-27T06:59:21.000Z","updated_at":"2025-03-06T14:09:24.000Z","dependencies_parsed_at":"2024-02-14T13:39:19.928Z","dependency_job_id":"11519e1f-9bdf-4f43-afc5-6fc4eecc2681","html_url":"https://github.com/puffinn/puffinn","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/puffinn%2Fpuffinn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/puffinn%2Fpuffinn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/puffinn%2Fpuffinn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/puffinn%2Fpuffinn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/puffinn","download_url":"https://codeload.github.com/puffinn/puffinn/tar.gz/ref
s/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246709923,"owners_count":20821297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T07:01:08.536Z","updated_at":"2025-04-01T20:31:22.951Z","avatar_url":"https://github.com/puffinn.png","language":"C++","funding_links":[],"categories":["SDKs \u0026 Libraries","Awesome Vector Search Engine"],"sub_categories":["Library"],"readme":"[![Build Status](https://travis-ci.com/puffinn/puffinn.svg?branch=master)](https://travis-ci.com/puffinn/puffinn)\n\n# PUFFINN - Parameterless and Universal Fast FInding of Nearest Neighbors\nPUFFINN is an easily configurable library for finding the approximate nearest neighbors of arbitrary points.\nIt also supports the identification of the closest pairs in the dataset.\nThe only necessary parameters are the allowed space usage and the recall.\nEach near neighbor is guaranteed to be found with the probability given by the recall, regardless of the difficulty of the query. \n\nUnder the hood PUFFINN uses Locality Sensitive Hashing with an adaptive query mechanism.\nThis means that the algorithm works for any similarity measure where a Locality Sensitive Hash family exists.\nCurrently Cosine similarity is supported using SimHash or cross-polytope LSH and Jaccard similarity is supported using MinHash.\n\n# Usage\nPUFFINN is implemented in C++ with Python bindings available. All features are available in both languages. 
\nTo get started quickly, see the examples below, as well as those in the /examples directory.\nMore details are available in the [documentation](https://puffinn.readthedocs.io/en/latest/).\n\n## C++\nPUFFINN is a header-only library. In most cases, including `puffinn.hpp` is sufficient.\nTo use the library, call the `insert`, `rebuild` and `search` methods on `puffinn::Index` as shown in the example below. \nNote that points inserted after the last call to `rebuild` cannot be found.\n\n```cpp\n#include \u003ccstdint\u003e\n#include \u003cutility\u003e\n#include \u003cvector\u003e\n\n#include \"puffinn.hpp\"\n\nint main() {\n    std::vector\u003cstd::vector\u003cfloat\u003e\u003e dataset = ...;\n    int dimensions = ...;\n    \n    // Construct the index using the cosine similarity measure,\n    // the default hash functions and 4 GB of memory.\n    // The ULL suffix keeps the byte count from overflowing a 32-bit int.\n    puffinn::Index\u003cpuffinn::CosineSimilarity\u003e index(dimensions, 4ULL*1024*1024*1024);\n    for (auto\u0026 v : dataset) { index.insert(v); }\n    index.rebuild();\n    \n    std::vector\u003cfloat\u003e query = ...;\n    \n    // Find the approximate 10 nearest neighbors.\n    // Each of the true 10 nearest neighbors has at least an 80% chance of being found.\n    std::vector\u003cuint32_t\u003e result = index.search(query, 10, 0.8); \n    \n    // Find the approximate 10 closest pairs in the dataset.\n    // Each of the true 10 closest pairs has at least an 80% chance of being found. \n    std::vector\u003cstd::pair\u003cuint32_t, uint32_t\u003e\u003e pairs = index.closest_pairs(10, 0.8); \n}\n```\n\n## Python\nTo build the library locally using setuptools, run `python3 setup.py build`. \n\nThe API of the Python wrapper does not differ significantly from the C++ API, except that arguments are passed slightly differently. 
The Python equivalent to the above example is shown below.\nSee the [documentation](https://puffinn.readthedocs.io/en/latest/) for more details.\n\n```python\nimport puffinn\n\ndataset = ...\ndimensions = ...\n\n# Construct the index using the cosine similarity measure,\n# the default hash functions and 4 GB of memory.\nindex = puffinn.Index('angular', dimensions, 4*1024**3)\nfor v in dataset:\n    index.insert(v)\nindex.rebuild()\n\nquery = ...\n\n# Find the approximate 10 nearest neighbors.\n# Each of the true 10 nearest neighbors has at least an 80% chance of being found.\nresult = index.search(query, 10, 0.8) \n\n# Find the approximate 10 closest pairs in the dataset.\n# Each of the true 10 closest pairs has at least an 80% chance of being found.\nclosest_pairs = index.closest_pairs(10, 0.8)\n```\n\n# Benchmark\n\nPUFFINN provides fast query times with considerable space usage. It's reliable (see bottom right plot) and doesn't require parameter tuning. \n![Benchmark](https://user-images.githubusercontent.com/6311646/61288829-40903080-a7c8-11e9-9eb0-effc6beb808e.png)\n\nThe following benchmark summarizes running times for finding the (globally) $k$-closest pairs in the dataset. \n\n![Closest Pairs Benchmark](https://github.com/Cecca/puffinn/assets/6311646/b9d96135-0d55-4c01-b00b-60d702312fc3)\n\n# Authors\n\nPUFFINN is mainly developed by Michael Vesterli. It grew out of a research project with Martin Aumüller, Tobias Christiani, and Rasmus Pagh. If you want to cite PUFFINN in your publication, please use the following reference.\n\n\u003e PUFFINN: Parameterless and Universal Fast FInding of Nearest Neighbors, M. Aumüller, T. Christiani, R. Pagh, and M. Vesterli. ESA 2019.\n\nAn extended version of the paper is available at https://arxiv.org/abs/1906.12211.\n\nThe closest pair functionality was developed by Martin Aumüller and Matteo Ceccarello. 
Details of the method are available in the following publication:\n\n\u003e Solving $k$-Closest Pairs in High-Dimensional Data, M. Aumüller, M. Ceccarello, SISAP 2023. [Link](https://link.springer.com/chapter/10.1007/978-3-031-46994-7_17)\n\nThe experimental setup to reproduce the results from the paper is available in the following repository: \u003chttps://github.com/Cecca/puffinn\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpuffinn%2Fpuffinn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpuffinn%2Fpuffinn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpuffinn%2Fpuffinn/lists"}