An open API service indexing awesome lists of open source software.

https://github.com/frozen-beak/hash_table

Hash Table in rust from scratch with djb2 algorithm
https://github.com/frozen-beak/hash_table

djb2 hashtable rust tutorial

Last synced: 11 months ago
JSON representation

Hash Table in rust from scratch with djb2 algorithm

Awesome Lists containing this project

README

          

# :hash: Hash Table

A HashTable or a HashMap is a key-value based data structure. It works by mapping keys to values
using a hash function. HashMap’s have on average time complexity of `O(1)` for *insertion*,
*deletion* and *fetching*.

Under the hood they use arrays to store the values. The only difference is we access values based
on the key’s and not the index, which makes them faster.

The index at the item (both key & value) is stored at is calculated by a hash function and the index
is also referred to as hash code.

Table of contents:

- [Hash Function](#hash-function)
- [Collisions](#collisions)
- [Creating a Hash Function](#creating-a-hash-function)
- [Load Factor](#load-factor)
- [Clustering](#clustering)
- [Overflow](#overflow-in-rust)

## Hash Function

A hash function is a *function* that takes the key as an input and outputs the number.
The output is our hash code or the index at our item will be stored in the HashTable.

Hash function are **deterministic** in nature, which simply implies when provided with the same input
multiple times, they produced the same output!

Output generated by hash functions should always be in a certain range, typically in between the size
of the underline array.

## Collisions

Because the key can be any string and hash functions returns a value in specific range,
there is a possibility that the hash function will return similar output for different
inputs, these are called collisions.

For e.g. If we wrote a hash function to return output between *0* and *7*, and we pass it
*9* unique inputs, we’re guaranteed at least *1 collision*.

```txt
hash("to") == 3
hash("the") == 2
hash("café") == 0
hash("de") == 6
hash("versailles") == 4
hash("for") == 5
hash("coffee") == 0
hash("we") == 7
hash("had") == 1
```

We have two ways to deal with collisions —

1. Separate Chaining

In this we use linked list to address the loading. If the key’s are colliding at a
certain index, we store a linked list at that index including the colliding elements

2. Open Addressing

In this, if collision occurs, we store the incoming item at index + 1 to avoid the
collision. In open addressing, we have to keep resizing the underline array to store
upcoming items.

## Creating a Hash Function

To create our hash function we can use `djb2` algorithm created by the
[Daniel J. Bernstein](https://en.wikipedia.org/wiki/Daniel_J._Bernstein). More details
[here](https://theartincode.stanis.me/008-djb2/).

```c
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}
```

Above algorithm in details —

1. The *hash* value is initialized with `5381` . This number was chosen by the creator
upon some empirical testing, but it’s not necessarily special beyond serving a good
starting point. As the rule says, *”If something works, don’t change it!”.*
2. Then we loop through every character from the `input` string.
3. Inside the loop, on every iteration we update our *hash* value w/ fallowing formula,

$$
hash = ((hash << 5) + hash) + c;
$$

- We first left shift the current *hash* value by **5 bits**. Which is equivalent
of multiplying the value by 32, i.e. 2^5.

For e.g. `5`, it is represent as `101` in binary. If we left shift this
by *5 bits* (which basically means adding 5 zeros at end) gives us `10100000`
which is binary for `160`. If we do `5 * 32` we get exactly the same value.

- After shifting, we then add original hash value back to the shifted value.
Which is equivalent of `hash * 33`

- Finally ASCII value of current character is added, and calculated value is then
interchanged with `hash`.

> [!IMPORTANT]
> In early computing, multiplication was generally more computationally expensive
> than bitwise operations. Bitwise shifts and additions were faster and required
> less hardware. So directly multiplying by 33 is slower then left shift and addition.

> [!TIP]
> But modern compilers are very good at optimizing code. They might even recognize
> that `hash * 33` can be optimized using a left shift and addition, so there's
> often no performance difference in modern systems.

For e.g. If our key is “ab”, in our loop we do fallowing —

ASCII value of *a* is 97 and for *b* is 98.

> [!NOTE]
> ASCII, or **American Standard Code for Information Interchange**, is a character encoding
> standard used for electronic communication. It assigns numeric values to letters, numerals,
> punctuation marks, and other characters, allowing computers to represent and manipulate text
> data effectively.

- For `a` , we left shift our current hash value which is `5381` , i.e. `5381 << 5 = 172192`,
then we add the original hash, which was `5381` and then we add `97` (ASCII code for char a).
We get `177670` .
- For `b` , we left shift our current hash value which is `177670` , i.e. `177670 << 5 = 5685440`,
then we add the original hash, which was `177670` and then we add `98` (ASCII code for char a).We get `5863488` .

## Load Factor

When initializing a *HashTable*, we start with a fixed size. To manage collisions and maintain efficient
performance, we use a load factor to determine when to resize the table. Specifically, we extend the size
of our hash table when it reaches a certain capacity.

For e.g. in our hash table if we have load factor of 75%, then we will extend the size of our table if it
reaches 75% of its capacity.

> [!IMPORTANT]
> But why does it matter?
>
> If our table exceeds the load factor (generally 75%), it increases the likelihood of collisions in our table,
> degrading its performance. By resizing our hash table we reduce the probability of collisions and maintain
> efficient operations speed.

## Clustering

As we are using *Open Addressing* with *linear probing* as our solution to avoid collisions,
it may create *clusters* in our HashTable.

Clustering in HashTable refers to situation where multiple keys hash to nearby slots, causing
consecutive elements to form a *cluster* in the table.

![Visualization of clustering in HashTable created by the amazing [Sam Rose](https://samwho.dev/)](./images/clustering.png)

Above is the visualization of clustering in HashTable created by the amazing [Sam Rose](https://samwho.dev/)

This makes it harder to find empty slots for newer insertions and increase the time taken
for insertion operation.

Imagine a hash table where the following elements are inserted —

- Insert key A at index 2.
- Insert key B, which hashes to index 2 (collision), so it's placed at index 3.
- Insert key C, which hashes to index 3 (another collision), so it's placed at index 4.

Now, we have a small "cluster" from index 2 to 4. Every new insertion or search that hashes near
this range will likely end up in this cluster, causing more probing and slowing down operations.

To avoid clustering we can —

1. Quadratic Probing

In this, if collision occurs at `index` then, instead of looking at the next slot (index +1),
the probing distance increases quadratically (e.g., 1, 4, 9, etc.), i.e. `index + 4`, `index + 9`,
etc. This reduces the likelihood of clustering.

2. Double Hashing

In this we use a second hash function to determine the distance (step size) to calculate new
index for the element. This makes it hard to land values close together to make it less likely
to form a cluster.

## Overflow In Rust

In Rust, **overflow** occurs when an arithmetic operation produces a value that is too large to fit into
the type being used to store it. This happens when the result exceeds the maximum value that the type can represent.

```rust
let mut a: u8 = 250;
a = a + 10;
```

Here the type `u8` can store values in range `0` to `2^8 - 1` i.e. (0 - 255), so if we add *10* in *250*,
the result should be 260. However, since 260 is out of range of `u8`'s capacity, our operation overflow‘s
and results in panic.

> [!NOTE]
> Our program will only panic in debug mode to make us alert of this bug. In release mode our program will
> automatically wrap the value, resulting `a`'s value to be `4`. How? — `(250 + 10) % 256 = 4`