{"id":13436242,"url":"https://github.com/hdidx/hdidx","last_synced_at":"2025-03-18T20:31:05.715Z","repository":{"id":22974715,"uuid":"26324806","full_name":"hdidx/hdidx","owner":"hdidx","description":"Approximate Nearest Neighbor (ANN) search for high-dimensional data.","archived":false,"fork":false,"pushed_at":"2018-09-03T00:56:59.000Z","size":171,"stargazers_count":92,"open_issues_count":5,"forks_count":26,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-10-27T20:18:49.640Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://wanji.me/hdidx","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hdidx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-07T15:15:23.000Z","updated_at":"2024-08-18T15:28:38.000Z","dependencies_parsed_at":"2022-08-21T17:10:32.022Z","dependency_job_id":null,"html_url":"https://github.com/hdidx/hdidx","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hdidx%2Fhdidx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hdidx%2Fhdidx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hdidx%2Fhdidx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hdidx%2Fhdidx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hdidx","download_url":"https://codeload.github.com/hdidx/hdidx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244301364,"owners_count":
20430929,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:00:45.810Z","updated_at":"2025-03-18T20:31:05.372Z","avatar_url":"https://github.com/hdidx.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# **HDIdx**: Indexing High-Dimensional Data\n\n[![pypi](https://img.shields.io/pypi/v/hdidx.svg?style=flat-square)](https://pypi.python.org/pypi/hdidx/)\n[![downloads_month](https://img.shields.io/pypi/dm/hdidx.svg?style=flat-square)](https://pypi.python.org/pypi/hdidx/)\n[![license](https://img.shields.io/pypi/l/hdidx.svg?style=flat-square)](https://raw.githubusercontent.com/wanji/hdidx/master/LICENSE.md)\n\n## What is **HDIdx**?\n\n**HDIdx** is a Python package for approximate nearest neighbor (ANN) search. Nearest neighbor (NN) search is very challenging in high-dimensional space because of the [*Curse of Dimensionality*](https://en.wikipedia.org/wiki/Curse_of_dimensionality) problem. The basic idea of **HDIdx** is to compress the original feature vectors into compact binary codes and to perform approximate NN search instead of exact NN search. 
This greatly reduces the storage requirements and significantly speeds up the search.\n\n## Architecture\n\n![Framework](https://raw.githubusercontent.com/wanji/hdidx/master/doc/framework.png)\n\n**HDIdx** has three main modules: 1) `Encoder`, which compresses the original feature vectors into compact binary hash codes, 2) `Indexer`, which indexes the database items and searches for the approximate nearest neighbors of a given query item, and 3) `Storage`, which encapsulates the underlying data storage for the `Indexer`; the storage can be memory or a NoSQL database such as LMDB.\n\nThe current version implements the following feature compression algorithms:\n\n- `Product Quantization`[1].\n- `Spectral Hashing`[2].\n\nTo use HDIdx, first learn an `Encoder` from some learning vectors.\nThen map the base vectors into hash codes using the learned `Encoder` and build indexes over these hash codes with an `Indexer`, which writes the indexes to the specified storage medium.\nWhen a query vector arrives, it is mapped to hash codes by the same `Encoder`, and the `Indexer` finds the items most similar to this query vector.\n\n\n## Installation\n\n**HDIdx** can be installed by `pip`:\n\n```bash\n[sudo] pip install cython\n[sudo] pip install hdidx\n```\n\nBy default, **HDIdx** uses the k-means algorithm provided by [*SciPy*](http://www.scipy.org/). To be more efficient, you can install the Python extensions of [*OpenCV*](http://opencv.org/), which can be installed via `apt-get` on Ubuntu. For other Linux distributions, e.g. 
CentOS, you need to compile it from source.\n\n```bash\n[sudo] apt-get install python-opencv\n```\n\n**HDIdx** will use [*OpenCV*](http://opencv.org/) automatically if it is available.\n\n### Windows Guide\n\nGeneral dependencies:\n\n- [Anaconda](https://store.continuum.io/cshop/anaconda/)\n- [Microsoft Visual C++ Compiler for Python](http://www.microsoft.com/en-us/download/details.aspx?id=44266)\n\nAfter installing the above-mentioned software, download [`stdint.h`](http://msinttypes.googlecode.com/svn/trunk/stdint.h) and put it under the `include` folder of Visual C++, e.g. `C:\\Users\\xxx\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\include`. Then hdidx can be installed by `pip` from the *Anaconda Command Prompt*.\n\n## Example\n\nHere is a simple example. See this [notebook](http://nbviewer.ipython.org/gist/wanji/c08693f06ef744feef50) for more examples.\n\n```python\n# import necessary packages\n\nimport hdidx\nimport numpy as np\n\n# generate sample data\nndim = 16      # dimension of features\nndb = 10000    # number of database items\nnqry = 10      # number of queries\n\nX_db = np.random.random((ndb, ndim))\nX_qry = np.random.random((nqry, ndim))\n\n# create Product Quantization Indexer\nidx = hdidx.indexer.IVFPQIndexer()\n# build indexer\nidx.build({'vals': X_db, 'nsubq': 8})\n# add database items to the indexer\nidx.add(X_db)\n# search the database and return the top-10 items for each query\nids, dis = idx.search(X_qry, 10)\nprint(ids)\nprint(dis)\n```\n\n## Citation\n\nPlease cite the following paper if you use this library:\n\n```\n@article{wan2015hdidx,\n  title={HDIdx: High-Dimensional Indexing for Efficient Approximate Nearest Neighbor Search},\n  author={Wan, Ji and Tang, Sheng and Zhang, Yongdong and Li, Jintao and Wu, Pengcheng and Hoi, Steven CH},\n  journal={Neurocomputing},\n  year={2016}\n}\n```\n\n## Reference\n```\n[1] Jegou, Herve, Matthijs Douze, and Cordelia Schmid.\n    \"Product quantization for nearest 
neighbor search.\"\n    Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.1 (2011): 117-128.\n[2] Weiss, Yair, Antonio Torralba, and Rob Fergus.\n    \"Spectral hashing.\"\n    In Advances in neural information processing systems, pp. 1753-1760. 2009.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhdidx%2Fhdidx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhdidx%2Fhdidx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhdidx%2Fhdidx/lists"}