https://github.com/frozen-beak/hash_table

Hash Table in rust from scratch with djb2 algorithm
https://github.com/frozen-beak/hash_table
djb2 hashtable rust tutorial
Last synced: 11 months ago
JSON representation
Hash Table in rust from scratch with djb2 algorithm
Host: GitHub
URL: https://github.com/frozen-beak/hash_table
Owner: frozen-beak
Created: 2024-09-09T05:39:53.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-09-09T06:32:34.000Z (over 1 year ago)
Last Synced: 2025-01-28T23:11:23.525Z (about 1 year ago)
Topics: djb2, hashtable, rust, tutorial
Language: Rust
Homepage:
Size: 17.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # :hash: Hash Table

A HashTable or a HashMap is a key-value based data structure. It works by mapping keys to values

using a hash function. HashMap’s have on average time complexity of `O(1)` for *insertion*,

*deletion* and *fetching*.

Under the hood they use arrays to store the values. The only difference is we access values based

on the key’s and not the index, which makes them faster.

The index at the item (both key & value) is stored at is calculated by a hash function and the index

is also referred to as hash code.

Table of contents:

- [Hash Function](#hash-function)

- [Collisions](#collisions)

- [Creating a Hash Function](#creating-a-hash-function)

- [Load Factor](#load-factor)

- [Clustering](#clustering)

- [Overflow](#overflow-in-rust)

## Hash Function

A hash function is a *function* that takes the key as an input and outputs the number.

The output is our hash code or the index at our item will be stored in the HashTable.

Hash function are **deterministic** in nature, which simply implies when provided with the same input

multiple times, they produced the same output!

Output generated by hash functions should always be in a certain range, typically in between the size

of the underline array.

## Collisions

Because the key can be any string and hash functions returns a value in specific range,

there is a possibility that the hash function will return similar output for different

inputs, these are called collisions.

For e.g. If we wrote a hash function to return output between *0* and *7*, and we pass it

*9* unique inputs, we’re guaranteed at least *1 collision*.

```txt

hash("to")         == 3

hash("the")        == 2

hash("café")       == 0

hash("de")         == 6

hash("versailles") == 4

hash("for")        == 5

hash("coffee")     == 0

hash("we")         == 7

hash("had")        == 1

```

We have two ways to deal with collisions —

1. Separate Chaining

    In this we use linked list to address the loading. If the key’s are colliding at a

    certain index, we store a linked list at that index including the colliding elements

2. Open Addressing

    In this, if collision occurs, we store the incoming item at index + 1 to avoid the

    collision. In open addressing, we have to keep resizing the underline array to store

    upcoming items.

## Creating a Hash Function

To create our hash function we can use `djb2` algorithm created by the

[Daniel J. Bernstein](https://en.wikipedia.org/wiki/Daniel_J._Bernstein). More details

[here](https://theartincode.stanis.me/008-djb2/).

```c

unsigned long

hash(unsigned char *str)

{

    unsigned long hash = 5381;

    int c;

    while (c = *str++)

        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;

}

```

Above algorithm in details —

1. The *hash* value is initialized with `5381` . This number was chosen by the creator

upon some empirical testing, but it’s not necessarily special beyond serving a good

starting point. As the rule says, *”If something works, don’t change it!”.*

2. Then we loop through every character from the `input` string.

3. Inside the loop, on every iteration we update our *hash* value w/ fallowing formula,

    $$

    hash = ((hash << 5) + hash) + c;

    $$

    - We first left shift the current *hash* value by **5 bits**. Which is equivalent

    of multiplying the value by 32, i.e. 2^5.

        For e.g. `5`, it is represent as `101` in binary. If we left shift this

        by *5 bits* (which basically means adding 5 zeros at end) gives us `10100000`

        which is binary for `160`. If we do `5 * 32` we get exactly the same value.

    - After shifting, we then add original hash value back to the shifted value.

    Which is equivalent of `hash * 33`

    - Finally ASCII value of current character is added, and calculated value is then

    interchanged with `hash`.

> [!IMPORTANT]

> In early computing, multiplication was generally more computationally expensive

> than bitwise operations. Bitwise shifts and additions were faster and required

> less hardware. So directly multiplying by 33 is slower then left shift and addition.

> [!TIP]

> But modern compilers are very good at optimizing code. They might even recognize

> that `hash * 33` can be optimized using a left shift and addition, so there's

> often no performance difference in modern systems.

For e.g. If our key is “ab”, in our loop we do fallowing —

ASCII value of *a* is 97 and for *b* is 98.

> [!NOTE]

> ASCII, or **American Standard Code for Information Interchange**, is a character encoding

> standard used for electronic communication. It assigns numeric values to letters, numerals,

> punctuation marks, and other characters, allowing computers to represent and manipulate text

> data effectively.

- For `a` , we left shift our current hash value which is `5381` , i.e. `5381 << 5 = 172192`,

then we add the original hash, which was `5381` and then we add `97` (ASCII code for char a).

We get `177670` .

- For `b` , we left shift our current hash value which is `177670` , i.e. `177670 << 5 = 5685440`,

then we add the original hash, which was `177670` and then we add `98` (ASCII code for char a).We get `5863488` .

## Load Factor

When initializing a *HashTable*, we start with a fixed size. To manage collisions and maintain efficient

performance, we use a load factor to determine when to resize the table. Specifically, we extend the size

of our hash table when it reaches a certain capacity.

For e.g. in our hash table if we have load factor of 75%, then we will extend the size of our table if it

reaches 75% of its capacity.

> [!IMPORTANT]

> But why does it matter?

>

> If our table exceeds the load factor (generally 75%), it increases the likelihood of collisions in our table,

> degrading its performance. By resizing our hash table we reduce the probability of collisions and maintain

> efficient operations speed.

## Clustering

As we are using *Open Addressing* with *linear probing* as our solution to avoid collisions,

it may create *clusters* in our HashTable.

Clustering in HashTable refers to situation where multiple keys hash to nearby slots, causing

consecutive elements to form a *cluster* in the table.

![Visualization of clustering in HashTable created by the amazing [Sam Rose](https://samwho.dev/)](./images/clustering.png)

Above is the visualization of clustering in HashTable created by the amazing [Sam Rose](https://samwho.dev/)

This makes it harder to find empty slots for newer insertions and increase the time taken

for insertion operation.

Imagine a hash table where the following elements are inserted —

- Insert key A at index 2.

- Insert key B, which hashes to index 2 (collision), so it's placed at index 3.

- Insert key C, which hashes to index 3 (another collision), so it's placed at index 4.

Now, we have a small "cluster" from index 2 to 4. Every new insertion or search that hashes near

this range will likely end up in this cluster, causing more probing and slowing down operations.

To avoid clustering we can —

1. Quadratic Probing

    In this, if collision occurs at `index`  then, instead of looking at the next slot (index +1),

    the probing distance increases quadratically (e.g., 1, 4, 9, etc.), i.e. `index + 4`, `index + 9`,

    etc. This reduces the likelihood of clustering.

2. Double Hashing

    In this we use a second hash function to determine the distance (step size) to calculate new

    index for the element. This makes it hard to land values close together to make it less likely

    to form a cluster.

## Overflow In Rust

In Rust, **overflow** occurs when an arithmetic operation produces a value that is too large to fit into

the type being used to store it. This happens when the result exceeds the maximum value that the type can represent.

```rust

let mut a: u8 = 250;

a = a + 10;

```

Here the type `u8` can store values in range `0` to `2^8 - 1` i.e. (0 - 255), so if we add *10* in *250*,

the result should be 260. However, since 260 is out of range of `u8`'s capacity, our operation overflow‘s

and results in panic.

> [!NOTE]

> Our program will only panic in debug mode to make us alert of this bug. In release mode our program will

> automatically wrap the value, resulting `a`'s value to be `4`. How? — `(250 + 10) % 256 = 4`
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/frozen-beak/hash_table

Awesome Lists containing this project

README