{"id":13436224,"url":"https://github.com/norouzi/mih","last_synced_at":"2025-03-18T20:31:07.867Z","repository":{"id":3542697,"uuid":"4602791","full_name":"norouzi/mih","owner":"norouzi","description":"Fast exact nearest neighbor search in Hamming distance on binary codes with Multi-index hashing","archived":false,"fork":false,"pushed_at":"2015-11-02T15:42:53.000Z","size":7730,"stargazers_count":285,"open_issues_count":4,"forks_count":69,"subscribers_count":19,"default_branch":"master","last_synced_at":"2024-10-27T20:18:49.619Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://www.cs.toronto.edu/~norouzi/research/mih/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"Thomvis/BrightFutures","license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/norouzi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"license.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-06-08T22:02:47.000Z","updated_at":"2024-10-05T15:27:20.000Z","dependencies_parsed_at":"2022-09-05T15:11:10.978Z","dependency_job_id":null,"html_url":"https://github.com/norouzi/mih","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/norouzi%2Fmih","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/norouzi%2Fmih/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/norouzi%2Fmih/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/norouzi%2Fmih/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/norouzi","download_url":"https://codeload.github.com/norouzi/mih/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"htt
ps://github.com","kind":"github","repositories_count":244301364,"owners_count":20430929,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:00:45.659Z","updated_at":"2025-03-18T20:31:05.526Z","avatar_url":"https://github.com/norouzi.png","language":"C++","funding_links":[],"categories":["Uncategorized","C++"],"sub_categories":["Uncategorized"],"readme":"Multi Index Hashing (MIH)\n=======\n\nAn implementation of *\"Fast Exact Search in Hamming Space with\nMulti-Index Hashing, M. Norouzi, A. Punjani, D. J. Fleet, IEEE TPAMI\n2014\"*. See http://www.cs.toronto.edu/~norouzi/research/mih/.\n\nThis algorithm performs fast exact nearest neighbor search in Hamming\ndistance on binary codes. Using this code, one can re-run the\nexperiments described in the paper. For best results, consider using\n*libhugetlbfs* with multi-index hashing.\n\n### Compilation\n\nYou need make, cmake, hdf5 library, hdf5-dev package to build this\nproject. To compile, create a folder called `build`, and run:\n\n```\ncd build\nrm * -rf\ncmake ..\nmake\n```\nThis should generate two binary files: `mih` and `linscan`\n\n### Datasets\n\nAn example binary code dataset with 1 million 64-bit codes from SIFT\nis stored in the data folder. To generate larger binary code datasets,\none should download raw data which can be converted to binary codes\nusing hashing techniques (e.g., LSH or MLH).  For example, download\nthe INRIA bigann dataset (1 billion SIFT features) from\nhttp://corpus-texmex.irisa.fr/ and store it under data/inria/.  
You\ncan also download the Tiny images dataset (80 million GIST\ndescriptors) from http://horatio.cs.nyu.edu/mit/tiny/data/index.html\nand store it under data/tiny.\n\nBy running create_lsh_codes.m (a MATLAB script), one can generate\nbinary codes from the above datasets using random projections (LSH,\n\"Similarity estimation techniques from rounding algorithms,\nM. Charikar, STOC 2002\"). By changing the first few lines of\ncreate_lsh_codes.m, you can control the parameters of the\nscript. The output is in MATLAB (version 7.3) format, which is\nessentially HDF5 format. Hence, we use the HDF5 library to read the binary\ncode datasets.\n\n### Usage\n\n`RUN.sh` is a bash script showing an example run of the program on 64-bit\ncodes. Set the parameters `nb`, `HUGE`, `hashfunc`, etc. to change the\nsettings. `RUN.sh` includes suggested values for `m`, the number of hash\ntables.\n\n##### Linear scan\n`linscan` provides an efficient implementation of exhaustive linear scan for\nkNN in Hamming distance on binary codes. This serves as a good baseline.\n\n```\nlinscan \u003cinfile\u003e \u003coutfile\u003e [options]\nOptions:\n  -N \u003cnumber\u003e       Set the number of binary codes from the beginning of the dataset file to be used\n  -B \u003cnumber\u003e       Set the number of bits per code, default autodetect\n  -Q \u003cnumber\u003e       Set the number of query points to use from \u003cinfile\u003e, default all\n  -K \u003cnumber\u003e       Set number of nearest neighbors to be retrieved\n```\n\nExamples:\n```\n./build/linscan data/lsh_64_sift_1M.mat linscan_64_1M.h5 -N 100000  -B 64 -Q 1000 -K 100\n./build/linscan data/lsh_64_sift_1M.mat linscan_64_1M.h5 -N 1000000 -B 64 -Q 1000 -K 100\n```\n\nAssuming that a dataset of 64-bit binary codes is stored at\n`data/lsh_64_sift_1M.mat`, running the above lines will create an\noutput file `linscan_64_1M.h5`, which stores the results and timings\nfor 100-NN search on 100K and 1M binary codes. 
If the output file does\nnot exist (on the first run), it is created. If it already\nexists (on later runs), the new results are appended to\nit.\n\n##### Multi Index Hashing\n`mih` provides an implementation of multi-index hashing for fast exact kNN in\nHamming distance on binary codes.\n\n```\nmih \u003cinfile\u003e \u003coutfile\u003e [options]\nOptions:\n  -N \u003cnumber\u003e          Set the number of binary codes from the beginning of the dataset file to be used\n  -B \u003cnumber\u003e          Set the number of bits per code, default autodetect\n  -Q \u003cnumber\u003e          Set the number of query points to use from \u003cinfile\u003e, default all\n  -m \u003cnumber\u003e          Set the number of chunks to use, default 1\n  -K \u003cnumber\u003e          Set number of nearest neighbors to be retrieved\n  -R \u003cnumber\u003e          Set the number of codes (in Millions) to use in computing the optimal bit reordering, default OFF (0)\n```\n\nExamples:\n```\n./build/mih data/lsh_64_sift_1M.mat mih_64_1M.h5 -N 100000 -B 64 -m 5 -Q 10000 -K 100\n./build/mih data/lsh_64_sift_1M.mat mih_64_1M.h5 -N 1000000 -B 64 -m 4 -Q 10000 -K 100\n```\n\nThe options of `mih` are very similar to those of `linscan`. It has an additional\nargument (-m) that sets the number of hash tables. It also has an\noption (-R) that controls whether the assignment of bits to the\nsubstrings should be optimized.\n\n### FAQs\n\nQ: I have tried your code with some of my datasets. It works well on\nsmall datasets, but it does not perform well on large\ndatasets.\n\nA: Did you try decreasing the number of hash tables (by the -m switch) as\nyou increased the size of the database? My experience is that with the\ncorrect choice of m, the speedup on larger datasets should be much\nbetter. 
The RUN.sh file contains suggested values\nof m for different numbers of codes in the dataset.\n\n### License\n\nCopyright (c) 2012, Mohammad Norouzi [\u003cmohammad.n@gmail.com\u003e] and Ali Punjani\n[\u003calipunjani@cs.toronto.edu\u003e]. This is free software; for license information,\nplease refer to the license.txt file.\n\n### TODO\n\n- Automatic suggestion for the m parameter.\n- Multi-core functionality.\n- Improve SparseHashtable insertion speed. It is currently very slow,\nbut can be improved in different ways.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnorouzi%2Fmih","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnorouzi%2Fmih","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnorouzi%2Fmih/lists"}