{"id":22712100,"url":"https://github.com/dynatrace-oss/hash4j","last_synced_at":"2025-04-13T16:13:20.661Z","repository":{"id":38085331,"uuid":"452337765","full_name":"dynatrace-oss/hash4j","owner":"dynatrace-oss","description":"Dynatrace hash library for Java","archived":false,"fork":false,"pushed_at":"2025-04-09T07:37:49.000Z","size":38865,"stargazers_count":102,"open_issues_count":3,"forks_count":11,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-13T16:13:11.088Z","etag":null,"topics":["cardinality-estimation","consistent-hashing","data-sketches","farmhash","hash","hash-algorithm","hash-functions","hashing-algorithm","hyperloglog","imohash","java","jumphash","minhash","murmur3","non-cryptographic-hash-functions","simhash","streaming-algorithms","superminhash","wyhash","xxh3"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dynatrace-oss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-01-26T15:50:55.000Z","updated_at":"2025-04-12T11:00:51.000Z","dependencies_parsed_at":"2022-07-10T21:17:04.294Z","dependency_job_id":"4670eeec-a52c-4436-a162-d658481b09a2","html_url":"https://github.com/dynatrace-oss/hash4j","commit_stats":null,"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fhash4j","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fhash4j/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/dynatrace-oss%2Fhash4j/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dynatrace-oss%2Fhash4j/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dynatrace-oss","download_url":"https://codeload.github.com/dynatrace-oss/hash4j/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248741193,"owners_count":21154255,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cardinality-estimation","consistent-hashing","data-sketches","farmhash","hash","hash-algorithm","hash-functions","hashing-algorithm","hyperloglog","imohash","java","jumphash","minhash","murmur3","non-cryptographic-hash-functions","simhash","streaming-algorithms","superminhash","wyhash","xxh3"],"created_at":"2024-12-10T13:09:23.194Z","updated_at":"2025-04-13T16:13:20.654Z","avatar_url":"https://github.com/dynatrace-oss.png","language":"Java","funding_links":[],"categories":["安全"],"sub_categories":[],"readme":"# ![hash4j logo](doc/images/logo/hash4j-logo-small.png) hash4j\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Maven 
Central](https://img.shields.io/maven-central/v/com.dynatrace.hash4j/hash4j.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:%22com.dynatrace.hash4j%22%20AND%20a:%22hash4j%22)\n[![javadoc](https://javadoc.io/badge2/com.dynatrace.hash4j/hash4j/javadoc.svg)](https://javadoc.io/doc/com.dynatrace.hash4j/hash4j)\n![CodeQL](https://github.com/dynatrace-oss/hash4j/actions/workflows/codeql-analysis.yml/badge.svg)\n[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=dynatrace-oss_hash4j\u0026metric=alert_status)](https://sonarcloud.io/summary/new_code?id=dynatrace-oss_hash4j)\n[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=dynatrace-oss_hash4j\u0026metric=coverage)](https://sonarcloud.io/summary/new_code?id=dynatrace-oss_hash4j)\n[![Java 11 or higher](https://img.shields.io/badge/JDK-11%2B-007396)](https://docs.oracle.com/javase/11/)\n\nhash4j is a Java library by Dynatrace that includes various non-cryptographic hash algorithms and data structures that are based on high-quality hash functions.\n\n## Content\n- [First steps](#first-steps)\n- [Hash algorithms](#hash-algorithms)\n- [Similarity hashing](#similarity-hashing)\n- [Approximate distinct counting](#approximate-distinct-counting)\n- [File hashing](#file-hashing)\n- [Consistent hashing](#consistent-hashing)\n- [Benchmark results](#benchmark-results)\n- [Contribution FAQ](#contribution-faq)\n\n## First steps\nTo add a dependency on hash4j using Maven, use the following:\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.dynatrace.hash4j\u003c/groupId\u003e\n  \u003cartifactId\u003ehash4j\u003c/artifactId\u003e\n  \u003cversion\u003e0.21.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\nTo add a dependency using Gradle:\n```gradle\nimplementation 'com.dynatrace.hash4j:hash4j:0.21.0'\n```\n\n## Hash algorithms\nhash4j currently implements the following hash algorithms:\n* 
[Murmur3](https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp)\n  * 32-bit\n  * 128-bit\n* [Wyhash](https://github.com/wangyi-fudan/wyhash)\n  * [final version 3](https://github.com/wangyi-fudan/wyhash/releases/tag/wyhash)\n  * [final version 4](https://github.com/wangyi-fudan/wyhash/releases/tag/wyhash_final4)\n* [Komihash](https://github.com/avaneev/komihash)\n  * version [4.3](https://github.com/avaneev/komihash/releases/tag/4.3) (compatible with [4.7](https://github.com/avaneev/komihash/releases/tag/4.7))\n  * version [5.0](https://github.com/avaneev/komihash/releases/tag/5.0) (compatible with [5.10](https://github.com/avaneev/komihash/releases/tag/5.10), [5.18](https://github.com/avaneev/komihash/releases/tag/5.18), and  [5.19](https://github.com/avaneev/komihash/releases/tag/5.19))\n* [FarmHash](https://github.com/google/farmhash)\n  * farmhashna\n  * farmhashuo\n* [PolymurHash 2.0](https://github.com/orlp/polymur-hash)\n* [XXH3](https://github.com/Cyan4973/xxHash)\n  * 64-bit\n  * 128-bit\n\nAll hash functions are thoroughly tested against the native reference implementations and also other libraries like [Guava Hashing](https://javadoc.io/doc/com.google.guava/guava/latest/com/google/common/hash/package-summary.html), [Zero-Allocation Hashing](https://github.com/OpenHFT/Zero-Allocation-Hashing), [Apache Commons Codec](https://commons.apache.org/proper/commons-codec/apidocs/index.html), or [crypto](https://github.com/appmattus/crypto) (see [CrossCheckTest.java](src/test/java/com/dynatrace/hash4j/hashing/CrossCheckTest.java)).\n \n### Usage\nThe interface allows direct hashing of Java objects in a streaming fashion without first mapping them to byte arrays. 
This minimizes memory allocations and keeps the memory footprint of the hash algorithm constant regardless of the object size.\n```java\nclass TestClass { \n    int a = 42;\n    long b = 1234567890L;\n    String c = \"Hello world!\";\n}\n\nTestClass obj = new TestClass(); // create an instance of some test class\n    \nHasher64 hasher = Hashing.komihash5_0(); // create a hasher instance\n\n// variant 1: hash object by passing data into a hash stream\nlong hash1 = hasher.hashStream().putInt(obj.a).putLong(obj.b).putString(obj.c).getAsLong(); // gives 0x90553fd9c675dfb2L\n\n// variant 2: hash object by defining a funnel\nHashFunnel\u003cTestClass\u003e funnel = (o, sink) -\u003e sink.putInt(o.a).putLong(o.b).putString(o.c);\nlong hash2 = hasher.hashToLong(obj, funnel); // gives 0x90553fd9c675dfb2L\n```\nMore examples can be found in [HashingDemo.java](src/test/java/com/dynatrace/hash4j/hashing/HashingDemo.java).\n\n## Similarity hashing\nSimilarity hashing algorithms compute hash signatures of sets that allow set similarity to be estimated without access to the original sets. 
The following algorithms are currently available:\n* [MinHash](https://en.wikipedia.org/wiki/MinHash)\n* [SuperMinHash](https://arxiv.org/abs/1706.05698)\n* [SimHash](https://en.wikipedia.org/wiki/SimHash)\n* FastSimHash: A fast implementation of SimHash using a bit hack (see [this blog post](https://medium.com/dynatrace-engineering/speeding-up-simhash-by-10x-using-a-bit-hack-e7b69e701624))\n\n### Usage\n\n```java\nToLongFunction\u003cString\u003e stringHashFunc = s -\u003e Hashing.komihash5_0().hashCharsToLong(s);\n\nSet\u003cString\u003e setA = IntStream.range(0, 90000).mapToObj(Integer::toString).collect(toSet());\nSet\u003cString\u003e setB = IntStream.range(10000, 100000).mapToObj(Integer::toString).collect(toSet());\n// intersection size = 80000, union size = 100000\n// =\u003e exact Jaccard similarity of sets A and B is J = 80000 / 100000 = 0.8\n\nint numberOfComponents = 1024;\nint bitsPerComponent = 1;\n// =\u003e each signature will take 1 * 1024 bits = 128 bytes\n\nSimilarityHashPolicy policy =\n    SimilarityHashing.superMinHash(numberOfComponents, bitsPerComponent);\nSimilarityHasher simHasher = policy.createHasher();\n\nbyte[] signatureA = simHasher.compute(ElementHashProvider.ofCollection(setA, stringHashFunc));\nbyte[] signatureB = simHasher.compute(ElementHashProvider.ofCollection(setB, stringHashFunc));\n\ndouble fractionOfEqualComponents = policy.getFractionOfEqualComponents(signatureA, signatureB);\n\n// this formula estimates the Jaccard similarity from the fraction of equal components\ndouble estimatedJaccardSimilarity =\n    (fractionOfEqualComponents - Math.pow(2., -bitsPerComponent))\n        / (1. - Math.pow(2., -bitsPerComponent)); // gives a value close to 0.8\n```\n\nSee also [SimilarityHashingDemo.java](src/test/java/com/dynatrace/hash4j/similarity/SimilarityHashingDemo.java).\n\n## Approximate distinct counting\nCounting the number of distinct elements exactly requires space that grows linearly with the count. 
\nHowever, there are algorithms that require much less space by counting just approximately.\nThe space-efficiency of those algorithms can be compared by means of the storage factor which is defined as \nthe state size in bits multiplied by the squared relative standard error of the estimator\n\n$\\text{storage factor} := (\\text{relative standard error})^2 \\times (\\text{state size})$.\n\nThis library implements two algorithms for approximate distinct counting:\n* [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog): This implementation uses [6-bit registers](https://doi.org/10.1145/2452376.2452456). \nThe default estimator, which is an [improved version of the original estimator](https://arxiv.org/abs/1702.01284), leads to an \nasymptotic storage factor of $18 \\ln 2 - 6 = 6.477$. Using the definition of the storage factor, the corresponding relative standard error is\nroughly $\\sqrt{\\frac{6.477}{6 m}} = \\frac{1.039}{\\sqrt{m}}$. The state size is $6m = 6\\cdot 2^p$ bits,\nwhere the precision parameter $p$ also defines the number of registers as $m = 2^p$.\nAlternatively, the maximum-likelihood estimator can be used,\nwhich achieves a slightly smaller asymptotic storage factor of $6\\ln(2)/(\\frac{\\pi^2}{6}-1)\\approx 6.449$\ncorresponding to a relative error of $\\frac{1.037}{\\sqrt{m}}$, but has a worse worst-case runtime performance.\nIn case of non-distributed data streams, the [martingale estimator](src/main/java/com/dynatrace/hash4j/distinctcount/MartingaleEstimator.java)\ncan be used, which gives slightly better estimation results as the asymptotic storage factor is $6\\ln 2 = 4.159$.\nThis gives a relative standard error of $\\sqrt{\\frac{6\\ln 2}{6m}} = \\frac{0.833}{\\sqrt{m}}$.\nThe theoretically predicted estimation errors  have been empirically confirmed by [simulation results](doc/hyperloglog-estimation-error.md).\n* UltraLogLog: This algorithm is described in detail in this [paper](https://doi.org/10.14778/3654621.3654632).\nLike for 
HyperLogLog, a precision parameter $p$ defines the number of registers $m = 2^p$.\nHowever, since UltraLogLog uses 8-bit registers to enable fast random accesses and updates of the registers, \n$m$ is also the state size in bytes.\nThe default estimator leads to an asymptotic storage factor of 4.895,\nwhich corresponds to a 24% reduction compared to HyperLogLog and a\nrelative standard error of $\\frac{0.782}{\\sqrt{m}}$.\nAlternatively, if performance is not an issue, the slower maximum-likelihood estimator can be used to obtain\na storage factor of $8\\ln(2)/\\zeta(2,\\frac{5}{4}) \\approx 4.631$ corresponding to a 28% reduction and a relative error of $\\frac{0.761}{\\sqrt{m}}$.\nIf the martingale estimator can \nbe used, the storage factor will be just $5 \\ln 2 = 3.466$ yielding an asymptotic relative standard error of\n$\\frac{0.658}{\\sqrt{m}}$. These theoretical formulas again agree well with the [simulation results](doc/ultraloglog-estimation-error.md).\n\nBoth algorithms share the following properties:\n* Constant-time add-operations\n* Allocation-free updates\n* Idempotency, adding items already inserted before will never change the internal state\n* Mergeability, even for data structures initialized with different precision parameters   \n* Final state is independent of order of add- and merge-operations\n* Fast estimation algorithm that is fully backed by theory and does not rely on magic constants\n\n### Usage\n```java\nHasher64 hasher = Hashing.komihash5_0(); // create a hasher instance\n\nUltraLogLog sketch = UltraLogLog.create(12); // corresponds to a standard error of 1.2% and requires 4kB\n\nsketch.add(hasher.hashCharsToLong(\"foo\"));\nsketch.add(hasher.hashCharsToLong(\"bar\"));\nsketch.add(hasher.hashCharsToLong(\"foo\"));\n\ndouble distinctCountEstimate = sketch.getDistinctCountEstimate(); // gives a value close to 2\n```\nSee also [UltraLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/UltraLogLogDemo.java) and 
[HyperLogLogDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/HyperLogLogDemo.java).\n\n### Compatibility\nHyperLogLog and UltraLogLog sketches can be reduced to corresponding sketches with a smaller precision parameter `p` using `sketch.downsize(p)`. UltraLogLog sketches can also be transformed into HyperLogLog sketches with the same precision parameter using `HyperLogLog hyperLogLog = HyperLogLog.create(ultraLogLog);` as demonstrated in [ConversionDemo.java](src/test/java/com/dynatrace/hash4j/distinctcount/ConversionDemo.java).\nHyperLogLog can be made compatible with implementations of other libraries which also use a single 64-bit hash value as input. The implementations usually differ only in which bits of the hash value are used for the register index and which bits are used to determine the number of leading (or trailing) zeros.\nTherefore, if the bits of the hash value are permuted accordingly, compatibility can be achieved.\n\n## File hashing\nThis library contains an implementation of [Imohash](https://github.com/kalafut/imohash) that\nallows fast hashing of files.\nIt is based on the idea of hashing only the beginning,\na middle part, and the end of large files,\nwhich is usually sufficient to distinguish files.\nUnlike cryptographic hashing algorithms, this method is not suitable for verifying the integrity of files.\nHowever, this algorithm can be useful for file indexes, for example, to find identical files.\n\n### Usage\n```java\n// create some file in the given path\nFile file = path.resolve(\"test.txt\").toFile();\ntry (FileWriter fileWriter = new FileWriter(file, StandardCharsets.UTF_8)) {\n    fileWriter.write(\"this is the file content\");\n}\n\n// use ImoHash to hash that file\nHashValue128 hash = FileHashing.imohash1_0_2().hashFileTo128Bits(file);\n// returns 0xd317f2dad6ea7ae56ff7fdb517e33918\n```\nSee also [FileHashingDemo.java](src/test/java/com/dynatrace/hash4j/file/FileHashingDemo.java).\n\n## Consistent hashing\nThis library contains 
various algorithms for the distributed agreement on the assignment of hash values to a given number of buckets.\nIn the naive approach, the hash values are assigned to the buckets with the modulo operation according to\n`bucketIdx = abs(hash) % numBuckets`.\nIf the number of buckets is changed, the bucket index will change for most hash values.\nWith a consistent hash algorithm, the above expression can be replaced by\n`bucketIdx = consistentBucketHasher.getBucket(hash, numBuckets)`\nto minimize the number of reassignments while still ensuring a fair distribution across all buckets.\n\nThe following consistent hashing algorithms are available:\n* [JumpHash](https://arxiv.org/abs/1406.2294): This algorithm has a calculation time that scales logarithmically with the number of buckets.\n* [Improved Consistent Weighted Sampling](https://doi.org/10.1109/ICDM.2010.80): This algorithm is based on improved\nconsistent weighted sampling with a constant computation time independent of the number of buckets. This algorithm is faster than\nJumpHash for a large number of buckets.\n* [JumpBackHash](https://doi.org/10.1002/spe.3385): In contrast to JumpHash, which traverses \"active indices\" (see [here](https://doi.org/10.1109/ICDM.2010.80) for a definition)\nin ascending order, JumpBackHash does this in the opposite direction. In this way, floating-point operations can be completely avoided.\nFurther optimizations minimize the number of random values that need to be generated to reach\nthe largest \"active index\" within the given bucket range in amortized constant time. The largest \"active index\"\ndefines the bucket assignment of the given hash value. 
In the worst case,\nthis algorithm consumes an average of 5/3 (about 1.667) 64-bit random values.\n  \n### Usage\n```java\n// create a consistent bucket hasher\nConsistentBucketHasher consistentBucketHasher =\n    ConsistentHashing.jumpBackHash(PseudoRandomGeneratorProvider.splitMix64_V1());\n\nlong[] hashValues = {9184114998275508886L, 7090183756869893925L, -8795772374088297157L};\n\n// determine assignment of hash values to 2 buckets\nMap\u003cInteger, List\u003cLong\u003e\u003e assignment2Buckets =\n    LongStream.of(hashValues)\n        .boxed()\n        .collect(groupingBy(hash -\u003e consistentBucketHasher.getBucket(hash, 2)));\n// gives {0=[7090183756869893925, -8795772374088297157], 1=[9184114998275508886]}\n\n// determine assignment of hash values to 3 buckets\nMap\u003cInteger, List\u003cLong\u003e\u003e assignment3Buckets =\n    LongStream.of(hashValues)\n        .boxed()\n        .collect(groupingBy(hash -\u003e consistentBucketHasher.getBucket(hash, 3)));\n// gives {0=[7090183756869893925], 1=[9184114998275508886], 2=[-8795772374088297157]}\n// hash value -8795772374088297157 got reassigned from bucket 0 to bucket 2\n// probability of reassignment is equal to 1/3\n```\nSee also [ConsistentHashingDemo.java](src/test/java/com/dynatrace/hash4j/consistent/ConsistentHashingDemo.java).\n\n## Benchmark results\nBenchmark results for different revisions can be found [here](https://github.com/dynatrace-oss/hash4j-benchmarks).\n\n## Contribution FAQ\n\n### Coding style\n\nTo ensure that your contribution adheres to our coding style, run the `spotlessApply` Gradle task.\n\n### Python\n\nThis project contains Python code. We recommend using a Python virtual environment in a `.venv` directory. 
If you are new, please follow the steps outlined\nin the [official Python documentation](https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments) for creation and activation.\nTo install the required dependencies including black, please execute `pip install -r requirements.txt`.\n\n### Reference implementations\n\nReference implementations of hash algorithms are included as git submodules within the `reference-implementations` directory and can be fetched using \n`git submodule update --init --recursive`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdynatrace-oss%2Fhash4j","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdynatrace-oss%2Fhash4j","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdynatrace-oss%2Fhash4j/lists"}