{"id":13677336,"url":"https://github.com/orlp/polymur-hash","last_synced_at":"2025-04-07T16:18:22.191Z","repository":{"id":189480431,"uuid":"656238548","full_name":"orlp/polymur-hash","owner":"orlp","description":"The PolymurHash universal hash function.","archived":false,"fork":false,"pushed_at":"2023-06-23T18:22:07.000Z","size":40,"stargazers_count":346,"open_issues_count":2,"forks_count":7,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-31T14:12:57.292Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"zlib","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/orlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-06-20T14:29:56.000Z","updated_at":"2025-03-17T10:00:38.000Z","dependencies_parsed_at":"2023-08-20T11:16:07.843Z","dependency_job_id":null,"html_url":"https://github.com/orlp/polymur-hash","commit_stats":null,"previous_names":["orlp/polymur-hash"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orlp%2Fpolymur-hash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orlp%2Fpolymur-hash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orlp%2Fpolymur-hash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/orlp%2Fpolymur-hash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/orlp","download_url":"https://codeload.github.com/orlp/polymur-hash/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_coun
t":247685628,"owners_count":20979085,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:40.640Z","updated_at":"2025-04-07T16:18:22.172Z","avatar_url":"https://github.com/orlp.png","language":"C","funding_links":[],"categories":["C"],"sub_categories":[],"readme":"# PolymurHash\n\nPolymurHash is a 64-bit [universal hash\nfunction](https://en.wikipedia.org/wiki/Universal_hashing) designed for use\nin hash tables. It has a couple desirable properties:\n\n - It is **mathematically proven** to have a statistically low collision rate.\n   When initialized with an independently chosen random seed, for any distinct\n   pair of inputs `m` and `m'` of up to `n` bytes the probability that `h(m) =\n   h(m')` is at most `n * 2^-60.2`. This is known as an almost-universal hash\n   function. In fact PolymurHash has a stronger property: it is almost pairwise\n   independent. For any two distinct inputs `m` and `m'` the probability they\n   hash to the pair of specific 64-bit hash values `(x, y)` is at most `n *\n   2^-124.2`.\n \n - It is very fast for short inputs, while being no slouch for longer inputs. On\n   an Apple M1 machine it can hash any input \u003c= 49 bytes in 21 cycles, and\n   processes 33.3 GiB/sec (11.6 bytes / cycle) for long inputs.\n   \n - It is cross-platform, using no extended instruction sets such as\n   CLMUL or AES-NI. For good speed it only requires native 64 x 64 -\u003e 128 bit\n   multiplication, which almost all 64-bit processors have.\n\n - It is small in code size and space. 
Ignoring cross-platform compatibility\n   definitions, the hash function and initialization procedure is just over 100\n   lines of C code combined. The initialized hash function uses 32 bytes of\n   memory to store its parameters, and it uses only a small constant amount of\n   stack memory when computing a hash.\n   \nTo my knowledge PolymurHash is the first hash function to have all those\nproperties. There are already very fast universal hash functions, such as\n[CLHASH](https://github.com/lemire/clhash),\n[umash](https://github.com/backtrace-labs/umash),\n[HalftimeHash](https://github.com/jbapple/HalftimeHash), etc., but they all\nrequire a large amount of space (1KiB+) to store the hash function parameters,\nare not optimized for hashing small strings, and/or require specific instruction\nsets such as CLMUL. There are also very fast cross-platform hashes such as\n[xxHash3](https://github.com/Cyan4973/xxHash),\n[komihash](https://github.com/avaneev/komihash) or\n[wyhash](https://github.com/wangyi-fudan/wyhash), but they do not come with\nproofs of universality. [SipHash](https://en.wikipedia.org/wiki/SipHash) claims\nto be cryptographically secure, but is relatively slow, leading people to use\nreduced-round variants with unknown cryptanalysis, or to use a fast but insecure\nhash altogether.\n\nNeedless to say, PolymurHash passes the full [SMHasher\nsuite](https://github.com/rurban/smhasher/) without any failures[*](https://github.com/rurban/smhasher/issues/114#issuecomment-1587631635). For the proof\nof the collision rate, see\n[`extras/universality-proof.md`](extras/universality-proof.md).\n\n\n## How to use it\n\nPolymurHash is provided as a header-only C library `polymur-hash.h`. Simply\ninclude it and you are good to go. 
First the hash function must have its\n`PolymurHashParams` initialized from a seed, for which there are two functions.\nPolymurHash uses two 64-bit secrets, but provides a convenient function to\nexpand a single 64-bit seed to that:\n\n```c\nvoid polymur_init_params(PolymurHashParams* p, uint64_t k, uint64_t s);\nvoid polymur_init_params_from_seed(PolymurHashParams* p, uint64_t seed);\n```\n\nThe proof of almost universality applies to both, but for the proof of almost\npairwise independence to hold you must provide 128 bits of entropy. Once\ninitialization is complete, you can compute as many hashes as you want with it:\n\n```c\n// Computes the full hash of buf. The tweak is added to the hash before final\n// mixing, allowing different outputs much faster than re-seeding. No claims are\n// made about the collision probability between hashes with different tweaks.\nuint64_t polymur_hash(const uint8_t* buf, size_t len, const PolymurHashParams* p, uint64_t tweak);\n```\n\nThe tweak allows you to have a different hash function for each hash table\nwithout having to calculate new parameters from another seed. This can prevent\ncertain worst-case problems when inserting into a second hash table while\niterating over another.\n\n### License\n\nPolymurHash is available under the zlib license, included in `polymur-hash.h`.\n\n\n## How it works and why it's fast\n\nAt its core, PolymurHash is a Carter-Wegman style polynomial hash that treats\nthe input as a series of coefficients for a polynomial, and then evaluates that\npolynomial at a secret key `k` modulo some prime `p`. The result of this\nuniversal hash is then fed into a Murmur-style permutation followed by the\naddition of a second secret key `s`. The polynomial part of the hash provides\nthe universality guarantee, and the Murmur-style bit mixing part provides good\nbit avalanching and uniformity over the full 64 bit output. 
The final addition\nof `s` provides pairwise independence and resistance against cryptanalysis by\nmaking the otherwise trivially invertible permutation a lot harder to invert.\n\nNow to make the above fast, a couple tricks are employed. The prime used is\n`p = 2^61 - 1`, a Mersenne prime. By expressing any number `x` as `2^61 a + b`\nwhere `b \u003c 2^61`, we notice that mod `p` this is equal to just `a + b`. With\nthis we can do efficient reduction:\n\n    reduce(x) = (x \u003e\u003e 61) + (x \u0026 P611)\n    \nThis allows us to keep the numbers small very efficiently. Furthermore we also\nlimit `k` during initialization in such a way that we overflow 64/128 bit\nnumbers less often and thus need to perform the above reduction less often.\n\nWe also use a trick similar to the one found in \"Polynomial evaluation and\nmessage authentication\" by Daniel J. Bernstein, where we forego computing an\nexact polynomial `m[0]k + m[1]k^2 + m[2]k^3 + ...` and instead compute any\npolynomial that is injective. That is, we're good as long as each input maps to\na distinct polynomial. Then we can use the 'pseudo-dot-product' to compute\n\n    (k + m[0])*(k^2 + m[1]) + (k^3 + m[2])*(k^4 + m[3]) + ...\n\nwhich allows us to process twice as much data per multiplication. It also\nallows us to pre-compute a couple powers of `k` to then use instruction-level\nparallelism to further increase throughput. The loop used for large inputs\n\n    m[i] = loadword(buf + 7*i) \u0026 ((1 \u003c\u003c 56) - 1)\n    f =  (f   + m[6]) * k^7\n    f += (k   + m[0]) * (k^6 + m[1])\n    f += (k^2 + m[2]) * (k^5 + m[3])\n    f += (k^3 + m[4]) * (k^4 + m[5])\n    f = reduce(f)\n\nprocesses 49 bytes of input using seven parallel 64-bit additions and binary\nANDs, four parallel 64 x 64 -\u003e 128 bit multiplications, three 128-bit additions\nand one 128-bit to 64-bit modular reduction. 
For small inputs we use custom\ninjective polynomials that are fast to evaluate, filled with input from\npotentially overlapping reads to avoid branches on the input length.\n\n\n## Resistance against cryptanalysis\n\nPolymurHash has strong collision guarantees for input chosen independently from\nits random seed. An interactive attacker however can craft its input based on\nearlier seen hashes. PolymurHash is **not** a cryptographically secure\ncollision-resistant hash if the attacker can see (or worse, request) hash values\nat will. This is not a failure of the underlying hash, for example the\nwell-known Poly1305 hash used to secure TLS suffers from the same problem. It\nsolves this by hiding its hash values from attackers by adding an encrypted\nnonce acting as a one-time-pad.\n\nPolymurHash is not intended to be used in a context where an attacker can see\nthe hash values. Its main intended use case is as a hash function for\nDoS-resistant hash tables. However, a clever attacker might still acquire\n*some* information about the hash values by using side-channels such as\nhash table iteration order, or timing attacks on collisions.\n\nTo protect against this PolymurHash is structured in a way so that it is not\ntrivial to invert the hash, nor to set up controlled informative experiments\nthrough side-channels. Roughly speaking, every bit of input is first mixed with\nthe secret key `k` by modular multiplication and addition modulo a prime. This\nis a linear process, so to hide the linear structure of the underlying hash we\npass the resulting value through a Murmur-style bit-mixing permutation which is\nhighly non-linear. Finally, to ensure the permutation is not trivially\ninvertible, we add the second secret key `s`.\n\nThe above process is by no means a strong multi-round cipher and would likely\nnot hold up to proper cryptanalysis in a standard context. But the underlying\nstructure is sound (e.g. 
the ChaCha cipher has the form `mix(secret + input) +\nsecret` where `mix` is an unkeyed permutation), and I believe that extrapolating\nwhat little information you can get from side-channels to recover `k, s` to\nexecute a HashDoS attack is difficult.\n\n\n## No weak keys\n\nPolynomial hashing has one potential issue: weak keys. The multiplicative group\nof numbers mod `p = 2^61-1` has subgroups of (much) smaller size. This means\nthat some keys satisfy `k^(i + d) mod p == k^i mod p` for a small constant `d`. In\nother words, you can swap the 7-byte block at index `i` with the one at `i + d`\nwithout changing the hash value for such a weak key.\n\nWhat is the probability you choose such a weak key if you pick a key at random\nfrom the 2^61 - 2 possible keys? If `d` is a divisor of `p - 1`, then there are\n`totient(d)` such keys with the above property. And of course, for the attack to\nwork we need `d \u003c= n / 7`. Here is a small table showing the probabilities:\n\n    Max length of input   Divisors   Weak key probability\n    64 bytes              8          2^-56.5\n    1 kilobyte            49         2^-50.5\n    1 megabyte            811        2^-37.3\n    1 gigabyte            3420       2^-26.1\n    \nThese probabilities are very small. Nevertheless, it didn't sit well with me\nthat the possibility existed of picking a key that has such a flaw. So, the\nseeding algorithm for PolymurHash makes sure to select a key that is a\n*generator* of the multiplicative group, that is, the smallest positive solution to\n`k^(i + d) mod p == k^i mod p` is `d = p - 1 = 2^61 - 2`.\n\nIn other words, PolymurHash does not suffer from weak keys. As a trade-off our\nkey space is slightly more limited: in combination with the fact that we want `k^7 \u003c\n2^60 - 2^56` for efficiency reasons we get a total key space of `totient(2^61 -\n2) * (2^60 - 2^56) / (2^61 - 1) ~= 2^57.4`. Initialization is also\nslower than simply selecting a random key: on an Apple M1 it takes ~300 cycles\non average. 
If you feel the need to seed many different hashes, consider looking\nat the `tweak` parameter instead to see if it fits your criteria.\n\n\n## Acknowledgements\n\nI am standing on the shoulders of giants, and in the well-researched field of\n(universal) hash functions there are a lot of them. J. Lawrence Carter, Mark N.\nWegman, Ted Krovetz, Phillip Rogaway, Mikkel Thorup, Daniel J. Bernstein, Daniel\nLemire, Martin Dietzfelbinger, Austin Appleby: many names come to mind. I have\nread many publications by them, and borrowed ideas from all of them.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Forlp%2Fpolymur-hash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Forlp%2Fpolymur-hash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Forlp%2Fpolymur-hash/lists"}