{"id":22059455,"url":"https://github.com/frozen-beak/hash_table","last_synced_at":"2025-03-23T17:14:01.498Z","repository":{"id":256139957,"uuid":"854404996","full_name":"frozen-beak/hash_table","owner":"frozen-beak","description":"Hash Table in rust from scratch with djb2 algorithm","archived":false,"fork":false,"pushed_at":"2024-09-09T06:32:34.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-28T23:11:23.525Z","etag":null,"topics":["djb2","hashtable","rust","tutorial"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/frozen-beak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-09T05:39:53.000Z","updated_at":"2024-09-09T06:32:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"bd412c6d-91b1-4fc3-83e7-523c8bcdf73f","html_url":"https://github.com/frozen-beak/hash_table","commit_stats":null,"previous_names":["adityamotale/hash_table","frozen-beak/hash_table"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/frozen-beak%2Fhash_table","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/frozen-beak%2Fhash_table/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/frozen-beak%2Fhash_table/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/frozen-beak%2Fhash_table/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/frozen-beak","download_url":"https://codeload.github.com/frozen-beak/hash_table/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245136405,"owners_count":20566588,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["djb2","hashtable","rust","tutorial"],"created_at":"2024-11-30T17:28:54.874Z","updated_at":"2025-03-23T17:14:01.447Z","avatar_url":"https://github.com/frozen-beak.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# :hash: Hash Table\n\nA HashTable (or HashMap) is a key-value data structure. It works by mapping keys to values\nusing a hash function. HashMaps have an average time complexity of `O(1)` for *insertion*,\n*deletion* and *fetching*.\n\nUnder the hood they use arrays to store the values. 
The only difference is that we access values by\ntheir keys rather than by an index, so a lookup does not have to search through the array.\n\nThe index at which an item (both key \u0026 value) is stored is calculated by a hash function; this index\nis also referred to as the hash code.\n\nTable of contents:\n\n- [Hash Function](#hash-function)\n- [Collisions](#collisions)\n- [Creating a Hash Function](#creating-a-hash-function)\n- [Load Factor](#load-factor)\n- [Clustering](#clustering)\n- [Overflow](#overflow-in-rust)\n\n## Hash Function\n\nA hash function is a *function* that takes the key as input and outputs a number.\nThe output is our hash code, i.e. the index at which our item will be stored in the HashTable.\n\nHash functions are **deterministic** in nature, which simply means that when provided with the same input\nmultiple times, they produce the same output!\n\nThe output of a hash function should always fall in a fixed range, typically from `0` up to (but not\nincluding) the size of the underlying array.\n\n## Collisions\n\nBecause the key can be any string and a hash function returns a value in a specific range,\nthere is a possibility that the hash function will return the same output for different\ninputs. These are called collisions.\n\nFor example, if we write a hash function that returns an output between *0* and *7*, and we pass it\n*9* unique inputs, we’re guaranteed at least *1 collision*. In the example below, `café` and `coffee`\nboth hash to `0`.\n\n```txt\nhash(\"to\")         == 3\nhash(\"the\")        == 2\nhash(\"café\")       == 0\nhash(\"de\")         == 6\nhash(\"versailles\") == 4\nhash(\"for\")        == 5\nhash(\"coffee\")     == 0\nhash(\"we\")         == 7\nhash(\"had\")        == 1\n```\n\nWe have two ways to deal with collisions:\n\n1. Separate Chaining\n\n    In this we use a linked list to handle the collision. If keys collide at a\n    certain index, we store a linked list at that index containing the colliding elements.\n\n2. Open Addressing\n\n    In this, if a collision occurs, we store the incoming item at the next free slot (index + 1,\n    then index + 2, and so on) to avoid the\n    collision. 
In open addressing every item lives in the underlying array itself, so we have to keep\n    resizing that array to make room for upcoming items.\n\n## Creating a Hash Function\n\nTo create our hash function we can use the `djb2` algorithm created by\n[Daniel J. Bernstein](https://en.wikipedia.org/wiki/Daniel_J._Bernstein). More details\n[here](https://theartincode.stanis.me/008-djb2/).\n\n```c\nunsigned long\nhash(unsigned char *str)\n{\n    unsigned long hash = 5381;\n    int c;\n\n    while ((c = *str++)) /* loop until the terminating NUL byte */\n        hash = ((hash \u003c\u003c 5) + hash) + c; /* hash * 33 + c */\n\n    return hash;\n}\n```\n\nThe algorithm above in detail:\n\n1. The *hash* value is initialized with `5381`. This number was chosen by the creator\nafter some empirical testing, but it’s not necessarily special beyond serving as a good\nstarting point. As the rule says, *“If something works, don’t change it!”*\n2. Then we loop through every character of the input string `str`.\n3. Inside the loop, on every iteration we update our *hash* value with the following formula:\n\n    $$\n    \\mathit{hash} = ((\\mathit{hash} \\ll 5) + \\mathit{hash}) + c\n    $$\n\n    - We first left shift the current *hash* value by **5 bits**, which is equivalent\n    to multiplying the value by 32, i.e. 2^5.\n\n        For example, `5` is represented as `101` in binary. Left shifting it\n        by *5 bits* (which basically means appending 5 zeros at the end) gives us `10100000`,\n        which is binary for `160`. If we do `5 * 32` we get exactly the same value.\n\n    - After shifting, we add the original hash value back to the shifted value,\n    which is equivalent to `hash * 33`.\n\n    - Finally the ASCII value of the current character is added, and the calculated value is\n    assigned back to `hash`.\n\n\u003e [!IMPORTANT]\n\u003e In early computing, multiplication was generally more computationally expensive\n\u003e than bitwise operations. Bitwise shifts and additions were faster and required\n\u003e less hardware. 
So directly multiplying by 33 was slower than a left shift followed by an addition.\n\n\u003e [!TIP]\n\u003e Modern compilers, however, are very good at optimizing code. They may well recognize\n\u003e that `hash * 33` can be computed with a left shift and an addition, so there's\n\u003e often no performance difference on modern systems.\n\nFor example, if our key is `ab`, the loop does the following.\n\nThe ASCII value of *a* is 97 and that of *b* is 98.\n\n\u003e [!NOTE]\n\u003e ASCII, or **American Standard Code for Information Interchange**, is a character encoding\n\u003e standard used for electronic communication. It assigns numeric values to letters, numerals,\n\u003e punctuation marks, and other characters, allowing computers to represent and manipulate text\n\u003e data effectively.\n\n- For `a`, we left shift our current hash value, which is `5381`, i.e. `5381 \u003c\u003c 5 = 172192`,\nthen we add the original hash, which was `5381`, and then we add `97` (ASCII code for char a).\nWe get `172192 + 5381 + 97 = 177670`.\n- For `b`, we left shift our current hash value, which is `177670`, i.e. `177670 \u003c\u003c 5 = 5685440`,\nthen we add the original hash, which was `177670`, and then we add `98` (ASCII code for char b).\nWe get `5685440 + 177670 + 98 = 5863208`.\n\n## Load Factor\n\nWhen initializing a *HashTable*, we start with a fixed size. To manage collisions and maintain efficient\nperformance, we use a load factor to determine when to resize the table. The load factor is the number of\nstored items divided by the number of slots; we extend the size of our hash table once it fills to a\nchosen fraction of its capacity.\n\nFor example, with a load factor of 75%, we extend the size of our table once\n75% of its slots are in use.\n\n\u003e [!IMPORTANT]\n\u003e But why does it matter?\n\u003e\n\u003e If our table fills beyond the load factor (generally 75%), collisions become more likely,\n\u003e degrading its performance. 
By resizing our hash table we reduce the probability of collisions and maintain\n\u003e efficient operation speed.\n\n## Clustering\n\nBecause we are using *Open Addressing* with *linear probing* to resolve collisions,\n*clusters* may form in our HashTable.\n\nClustering in a HashTable refers to the situation where multiple keys hash to nearby slots, causing\nconsecutive elements to form a *cluster* in the table.\n\n![Visualization of clustering in a HashTable, created by Sam Rose](./images/clustering.png)\n\nAbove is the visualization of clustering in a HashTable created by the amazing [Sam Rose](https://samwho.dev/).\n\nThis makes it harder to find empty slots for new insertions and increases the time taken\nby the insertion operation.\n\nImagine a hash table where the following elements are inserted:\n\n- Insert key A at index 2.\n- Insert key B, which hashes to index 2 (collision), so it's placed at index 3.\n- Insert key C, which hashes to index 3 (another collision), so it's placed at index 4.\n\nNow, we have a small \"cluster\" from index 2 to 4. Every new insertion or search that hashes near\nthis range will likely end up in this cluster, causing more probing and slowing down operations.\n\nTo avoid clustering we can use:\n\n1. Quadratic Probing\n\n    In this, if a collision occurs at `index`, then instead of stepping through consecutive slots\n    (`index + 1`, `index + 2`, and so on), the probing distance increases quadratically (e.g., 1, 4, 9, etc.),\n    i.e. we try `index + 1`, `index + 4`, `index + 9`, etc. This reduces the likelihood of clustering.\n\n2. Double Hashing\n\n    In this we use a second hash function to determine the distance (step size) used to calculate the new\n    index for the element. Colliding keys then follow different probe sequences, which makes it less likely\n    for values to land close together and form a cluster.\n\n## Overflow In Rust\n\nIn Rust, **overflow** occurs when an arithmetic operation produces a value that is too large to fit into\nthe type being used to store it. 
This happens when the result exceeds the maximum value that the type can represent.\n\n```rust\nlet mut a: u8 = 250;\na = a + 10; // 260 does not fit in a u8\n```\n\nHere the type `u8` can store values in the range `0` to `2^8 - 1`, i.e. 0 to 255, so if we add *10* to *250*,\nthe result should be 260. However, since 260 is outside the range of `u8`, our operation overflows\nand results in a panic.\n\n\u003e [!NOTE]\n\u003e Our program will only panic in debug mode, to alert us to this bug. In release mode our program will\n\u003e automatically wrap the value, so `a` ends up as `4`. How? `(250 + 10) % 256 = 4`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrozen-beak%2Fhash_table","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffrozen-beak%2Fhash_table","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrozen-beak%2Fhash_table/lists"}