{"id":13485653,"url":"https://github.com/LiveRamp/HyperMinHash-java","last_synced_at":"2025-03-27T19:31:32.361Z","repository":{"id":42950892,"uuid":"132077074","full_name":"LiveRamp/HyperMinHash-java","owner":"LiveRamp","description":"Union, intersection, and set cardinality in loglog space","archived":false,"fork":false,"pushed_at":"2023-06-21T00:41:02.000Z","size":586,"stargazers_count":54,"open_issues_count":4,"forks_count":13,"subscribers_count":124,"default_branch":"master","last_synced_at":"2024-10-30T20:45:24.081Z","etag":null,"topics":["cardinality","cardinality-estimation","hyperloglog","hyperloglog-sketches","java","loglog","loglog-beta","minhash"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LiveRamp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-05-04T02:48:32.000Z","updated_at":"2024-07-28T02:32:13.000Z","dependencies_parsed_at":"2024-01-03T01:20:46.334Z","dependency_job_id":"62b3fa39-57b1-4042-a05c-3a4474250bf6","html_url":"https://github.com/LiveRamp/HyperMinHash-java","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveRamp%2FHyperMinHash-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveRamp%2FHyperMinHash-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveRamp%2FHyperMinHash-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveRamp%2FHyp
erMinHash-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LiveRamp","download_url":"https://codeload.github.com/LiveRamp/HyperMinHash-java/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245910833,"owners_count":20692507,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cardinality","cardinality-estimation","hyperloglog","hyperloglog-sketches","java","loglog","loglog-beta","minhash"],"created_at":"2024-07-31T18:00:28.862Z","updated_at":"2025-03-27T19:31:32.012Z","avatar_url":"https://github.com/LiveRamp.png","language":"Java","readme":"[![Build Status](https://travis-ci.com/LiveRamp/HyperMinHash-java.svg?branch=master)](https://travis-ci.com/LiveRamp/HyperMinHash-java)\n\n# HyperMinHash-java\nA Java implementation of the HyperMinHash algorithm, presented by\n[Yu and Weber](https://arxiv.org/pdf/1710.08436.pdf).\nHyperMinHash allows approximating set unions, intersections, Jaccard Indices,\nand cardinalities of very large sets with high accuracy using only loglog space.\nIt also supports streaming updates and merging sketches, just the same\nas HyperLogLog.\n\nThis repo implements two flavors of HyperMinHash:\n1) **HyperMinHash**: An implementation based on HyperLogLog with the\naddition of the bias correction seen in HyperLogLog++.\n2) **BetaMinHash**: An implementation which uses [LogLog-Beta](https://arxiv.org/abs/1612.02284)\nfor the underlying LogLog implementation. 
LogLog-Beta is almost identical in\naccuracy to HyperLogLog++, except it performs better on cardinality\nestimations for small datasets (n \u003c= 80k), holding memory fixed. Since we use LogLog-Beta,\nwe refer to our implementation as BetaMinHash. However, our implementation\ncurrently only supports a fixed precision `p=14`.\n\nIf you expect to be dealing with low cardinality datasets (\u003c= 80,000 unique elements),\nwe recommend using BetaMinHash as it has a smaller memory footprint and is more accurate\nthan HyperLogLog in the range from 20,000-80,000, holding memory fixed. However, note that\ndifferent sketch types are not interchangeable, i.e., obtaining the intersection of an\nHMH and a BMH is not currently supported.\n\nBoth implementations are equipped with serialization/deserialization\ncapabilities out of the box for sending sketches over the wire or\npersisting them to disk.\n\n## Usage\n\n### Importing via Maven\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.liveramp\u003c/groupId\u003e\n  \u003cartifactId\u003ehyperminhash\u003c/artifactId\u003e\n  \u003cversion\u003e0.2\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Cardinality estimation\n```java\nSet\u003cbyte[]\u003e mySet = getMySet();\nBetaMinHash sketch = new BetaMinHash();\nfor (byte[] element : mySet){\n    sketch.add(element);\n}\n\nlong estimatedCardinality = sketch.cardinality();\n```\n\n### Merging (unioning) sketches\n```java\nCollection\u003cBetaMinHash\u003e sketches = getSketches();\nSketchCombiner\u003cBetaMinHash\u003e combiner = BetaMinHashCombiner.getInstance();\nBetaMinHash combined = combiner.union(sketches);\n\n// to get cardinality of the union\nlong unionCardinality = combined.cardinality();\n\n// using HyperMinHash instead of BetaMinHash\nCollection\u003cHyperMinHash\u003e hmhSketches = getSketches();\nSketchCombiner\u003cHyperMinHash\u003e hmhCombiner = HyperMinHashCombiner.getInstance();\nHyperMinHash hmhCombined = hmhCombiner.union(hmhSketches);\n```\n\n### Cardinality of unions\n```java\nBetaMinHash combined = combiner.union(sketches);\nlong estimatedCardinality = combined.cardinality();\n```\n\n### Cardinality of intersection\n```java\nCollection\u003cBetaMinHash\u003e sketches = getSketches();\nSketchCombiner\u003cBetaMinHash\u003e combiner = BetaMinHashCombiner.getInstance();\nlong intersectionCardinality = combiner.intersectionCardinality(sketches);\n```\n\n### Serializing a sketch\nTo get a byte[] representation of a sketch, use the `IntersectionSketch.SerDe` interface:\n```java\nHyperMinHash sketch = getSketch();\nHyperMinHashSerde serde = new HyperMinHashSerde();\n\nbyte[] serialized = serde.toBytes(sketch);\nHyperMinHash deserialized = serde.fromBytes(serialized);\n\nint sizeInBytes = serde.sizeInBytes(sketch);\n```\n\n## Maintainers\n\nCommit authorship was lost when merging code. The maintainers of the library, in alphabetical order, are:\n\n1) Christian Hansen (github.com/ChristianHansen)\n2) Harry Rackmil (github.com/harryrackmil)\n3) Sherif Nada (github.com/sherifnada)\n\n## Acknowledgements\n\nThanks to Seif Lotfy for implementing a\n[Golang version of HyperMinHash](http://github.com/axiomhq/hyperminhash).\nWe use some of his tests in our library, and our BetaMinHash implementation\nreferences his implementation.\n","funding_links":[],"categories":["Projects","项目"],"sub_categories":["Data Structures","数据结构"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLiveRamp%2FHyperMinHash-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLiveRamp%2FHyperMinHash-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLiveRamp%2FHyperMinHash-java/lists"}