{"id":22712722,"url":"https://github.com/oertl/probminhash","last_synced_at":"2025-09-23T14:54:57.795Z","repository":{"id":86419895,"uuid":"218876177","full_name":"oertl/probminhash","owner":"oertl","description":"ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity","archived":false,"fork":false,"pushed_at":"2020-10-26T11:00:57.000Z","size":6566,"stargazers_count":42,"open_issues_count":1,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-07-02T11:52:46.073Z","etag":null,"topics":["jaccard-similarity","jaccard-similarity-estimation","locality-sensitive-hashing","lsh-algorithm","minhash","minhash-sketches","similarity","sketch"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oertl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-10-31T23:09:59.000Z","updated_at":"2024-09-05T14:05:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"6e70e13b-fd03-4d46-b868-8c5a451cdc02","html_url":"https://github.com/oertl/probminhash","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/oertl/probminhash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oertl%2Fprobminhash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oertl%2Fprobminhash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oertl%2Fprobminhash/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/oertl%2Fprobminhash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oertl","download_url":"https://codeload.github.com/oertl/probminhash/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oertl%2Fprobminhash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276595480,"owners_count":25670167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-23T02:00:09.130Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jaccard-similarity","jaccard-similarity-estimation","locality-sensitive-hashing","lsh-algorithm","minhash","minhash-sketches","similarity","sketch"],"created_at":"2024-12-10T13:12:53.299Z","updated_at":"2025-09-23T14:54:57.790Z","avatar_url":"https://github.com/oertl.png","language":"C++","readme":"# ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity\n\nThe revision with tag [results-published-in-tkde-paper](https://github.com/oertl/probminhash/tree/results-published-in-tkde-paper) was used to generate the results presented in the final paper, which is available at https://doi.ieeecomputersociety.org/10.1109/TKDE.2020.3021176 or as arXiv-preprint at https://arxiv.org/abs/1911.00675.\n\nIn addition to the algorithms presented in the paper, 
[minhash.hpp](https://github.com/oertl/probminhash/blob/master/c%2B%2B/minhash.hpp) contains the algorithms `NonStreamingProbMinHash2` and `NonStreamingProbMinHash4`, which are equivalent non-streaming variants of `ProbMinHash2` and `ProbMinHash4`. In a first pass they calculate the sum of all weights, which determines the distribution of the final stop limit. This makes it possible to estimate an appropriate stop limit upfront. For example, if the stop limit is initialized to the 90th percentile of this distribution, processing can be stopped early even for the first elements, for which the stop limit would otherwise be infinite. However, there is a 10% probability that the stop limit was chosen too small, in which case the algorithm fails and has to be restarted with a larger stop limit. Nevertheless, the [performance results](https://github.com/oertl/probminhash/blob/master/paper/speed_charts.pdf) show that this approach can reduce the expected calculation time, provided that multiple passes over the data are allowed.\n\nProbMinHash is a locality-sensitive hash algorithm for the [probability Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index#Probability_Jaccard_similarity_and_distance). If a hash algorithm for the [weighted Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index#Weighted_Jaccard_similarity_and_distance) is needed, we recommend the use of [TreeMinHash](https://github.com/oertl/treeminhash) or [BagMinHash](https://github.com/oertl/bagminhash).\n\n\n\n\n\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foertl%2Fprobminhash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foertl%2Fprobminhash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foertl%2Fprobminhash/lists"}