https://github.com/rwillians/trie-autocomplete
Implementation of autocompletion using Trie data structure
- Host: GitHub
- URL: https://github.com/rwillians/trie-autocomplete
- Owner: rwillians
- License: mit
- Created: 2023-07-08T06:14:38.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-13T20:26:50.000Z (almost 2 years ago)
- Last Synced: 2025-05-07T21:05:38.556Z (15 days ago)
- Language: Elixir
- Size: 522 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Autocompletion using Trie data structure
This project uses a Trie data structure as a dictionary against which a prefix can be queried to get completions for matching words.
There are 3 implementations, [Trie.A](https://github.com/rwillians/trie-autocomplete/blob/main/lib/trie_a.ex) (most performant), [Trie.B](https://github.com/rwillians/trie-autocomplete/blob/main/lib/trie_b.ex) and [Trie.C](https://github.com/rwillians/trie-autocomplete/blob/main/lib/trie_c.ex), each one a variation on how to represent the data structure.

Code examples in their functions' documentation are actually tests, using ExUnit's doctest.
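For reference, a doctest is an `iex>` example embedded in a function's `@doc` that ExUnit executes as a test. A minimal sketch (an illustrative module, not one from this repo) looks roughly like this:

```elixir
defmodule Greeter do
  @doc """
  Greets someone by name.

      iex> Greeter.hello("world")
      "hello, world"
  """
  def hello(name), do: "hello, " <> name
end

# In a test file, a single `doctest` declaration runs every `iex>` example above:
#
#     defmodule GreeterTest do
#       use ExUnit.Case, async: true
#       doctest Greeter
#     end
```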
## TL;DR:
[Run on Docker](#run-on-docker).
## How does it work?
The idea behind using a Trie for autocompletion is to divide the problem into 2 major steps:
1. follow the prefix typed as deep as we can into our Trie dictionary;
2. compute the possible (if any) completions we can provide.

The first step assumes we already have our dictionary of words as a Trie, so I'll skip the explanation of how to compute it.
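(Purely for illustration, here's a minimal sketch of what building such a dictionary could look like, assuming a hypothetical `TrieNode` with a `children` map keyed by letter and an `end_of_word?` flag; the repo's three implementations each represent this differently.)

```elixir
defmodule TrieSketch do
  # Hypothetical node: a value, a map of children keyed by the next letter,
  # and a flag marking whether a word ends at this node.
  defmodule TrieNode do
    defstruct value: nil, children: %{}, end_of_word?: false
  end

  # Insert a word one letter at a time, creating child nodes as needed.
  def insert(%TrieNode{} = node, <<letter::binary-size(1), rest::binary>>) do
    child =
      node.children
      |> Map.get(letter, %TrieNode{value: letter})
      |> insert(rest)

    %{node | children: Map.put(node.children, letter, child)}
  end

  # No letters left: mark the current node as the end of a word.
  def insert(%TrieNode{} = node, ""), do: %{node | end_of_word?: true}
end

dict =
  ["bird", "bite", "bites", "cat", "dog"]
  |> Enum.reduce(%TrieSketch.TrieNode{}, &TrieSketch.insert(&2, &1))
```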
Given we have a dictionary with the words "bird", "bite", "bites", "cat" and "dog", our Trie dictionary would look like:
```
===========================================
L0   L1   L2   L3   L4   L5   end of word?
---- ---- ---- ---- ---- ---- -------------
root                          false
     b                        false
          i                   false
               r              false
                    d         true
               t              false
                    e         true
                         s    true
     c                        false
          a                   false
               t              true
     d                        false
          o                   false
               g              true
===========================================
```

Let's imagine the user typed only the letter "d" so far. Just like in a real dictionary, ours is sorted alphabetically, meaning that we don't need to go through all the nodes to find node `d`; we can go straight to it and retrieve it -- `O(1)`. That's possible by having a hash table mapping all child nodes that the `root` node has -- and the same is true for any node in our Trie.
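To make that concrete, here's a rough sketch of the prefix-walking step, again using the hypothetical `TrieNode` from earlier (the repo's actual modules may do this differently):

```elixir
defmodule TrieSketch.Walk do
  # Walk the typed prefix letter by letter; each step is a single map lookup
  # on `children`. Returns the node the prefix leads to, or nil if the prefix
  # isn't in the dictionary.
  def follow(node, ""), do: node

  def follow(node, <<letter::binary-size(1), rest::binary>>) do
    case Map.fetch(node.children, letter) do
      {:ok, child} -> follow(child, rest)
      :error -> nil
    end
  end
end

# With the `dict` built earlier:
#   TrieSketch.Walk.follow(dict, "d")   #=> the `d` node shown below
```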
The `d` node we retrieved looks like this:
```
=============================
L0   L1   L2   end of word?
---- ---- ---- --------------
d              false
     o         false
          g    true
=============================
```

Now it's time to move to the second and final step towards the solution: calculating the possible completions.
For this example, it's as simple as concatenating the value of each child node from `d` until the last nodes:

```
=============================
L0   L1   L2   end of word?
---- ---- ---- --------------
d    ┌─┐       false
     │o│  ┌─┐  false
     └─┘  │g│  true
          └─┘
=============================
```

`"o" + "g" = "og"`
Our only possible completion is "og".
But what if the user had typed "bi" instead?
In that case we'd retrieve the node `root.b`, which looks like:
```
======================================
L0   L1   L2   L3   L4   end of word?
---- ---- ---- ---- ---- -------------
b                        false
     i                   false
          r              false
               d         true
          t              false
               e         true
                    s    true
======================================
```

We still have one more letter typed by the user, the letter "i", so we repeat the process.
We retrieve the node `i` from node `b`. It looks like this:

```
=================================
L0   L1   L2   L3   end of word?
---- ---- ---- ---- -------------
i                   false
     r              false
          d         true
     t              false
          e         true
               s    true
=================================
```

Now we're as deep as the letters typed by the user can get us, so it's time to calculate the possible completions.
It's a bit trickier this time because there are 3 known words we can complete: "ird" (for "bird"), "ite" (for "bite") and "ites" (for "bites"). So it's not as simple as just concatenating the child nodes -- but it's still somewhat simple -- and we'll make use of that "end of word?" flag. The trickiest thing about this is that we'll use recursion.
For each child of `i` (nodes `r` and `t`), we'll assign to it an empty accumulator `acc1 = ""` (empty string), forming a tuple containing the node itself and the accumulator. There's also going to be a second accumulator containing the completions we come up with: `acc2 = []`.
Then, we'll iterate over the tuples:
* we'll concatenate the value of the node with the first accumulator: `acc1 += "r"` (e.g: `acc1 = "" + "r"`);
* if the node is flagged as `end of word?`, then we add its concatenated value (e.g: `"r"`) to the completions accumulator (not the case for `"r"`, but it will be the case for `"rd"`, `"ite"` and `"ites"`);
* for each child node that the current node has (e.g: `r` has nodes `[d]`, `e` has nodes `[s]` and `s` has `[]`) we assign an accumulator containing the concatenation so far (`acc1` -- e.g `"r"`) and append it to the list of remaining nodes we need to iterate over;
* repeat until there are no more nodes to iterate over -- `O(n)` where `n` is the number of child nodes used to calculate completions.

Recursion is one of those things that's harder to explain than to do. Here's one way the code could look:
```elixir
def calculate_words([{node, wip_completion} = _head | rest_of_nodes], completions) do
  #                         ^ acc1                                     ^ acc2
  wip_completion = wip_completion <> node.value

  children_to_be_calculated =
    for child <- node.children,
        do: {child, wip_completion}

  maybe_updated_completions =
    if node.end_of_word?,
      do: completions ++ [wip_completion],
      else: completions

  calculate_words(children_to_be_calculated ++ rest_of_nodes, maybe_updated_completions)
end

def calculate_words([], completions), do: completions
#                   ^ no more nodes to iterate over

completions = calculate_words([{node_b_i_r, ""}, {node_b_i_t, ""}], [])
#             ^ completions = ["ird", "ite", "ites"]
```

You can see the actual implementation of that in the file [trie_a.ex:207](https://github.com/rwillians/trie-autocomplete/blob/main/lib/trie_a.ex#L207).
## Do you want to take it for a spin?
### Run on Docker
> **Note**
> I have only tested the docker image against arm64; not sure if it works for amd64 yet.

The image is preloaded with a set of words from [@dolph/dictionary](https://github.com/dolph/dictionary/tree/master) containing 170k+ words.
Get completions:
```bash
docker run ghcr.io/rwillians/trie-autocomplete:latest trie.completions "ba"
```

Run benchmark:
```bash
docker run ghcr.io/rwillians/trie-autocomplete:latest trie.completions --benchmark "ba"
```

### Running locally
This project uses [asdf](https://asdf-vm.com/#/core-manage-asdf-vm) to manage the Elixir and Erlang versions. If you have `asdf` installed, you can run `asdf install` to install the correct versions of Elixir and Erlang.
Install dependencies:
```bash
mix deps.get
```

Compile:
```bash
mix compile
```

#### Interactive shell
```bash
iex -S mix
```

Load a set of words (a text file with one word per line):
```elixir
dict = File.stream!("priv/assets/popular.txt") |> Enum.into(%Trie.A{})
```

Get completions:
```elixir
Trie.completions(dict, "appl")
```

#### Mix commands
Completions:
```bash
mix trie.completions "ag"
```

Benchmark:
```bash
mix trie.completions --benchmark "ag"
```

## Benchmark
Benchmark done using [Benchee](https://github.com/bencheeorg/benchee).
> **Note** the dataset of 25k words contains only 329 words that start with "ba", while the dataset of 170k words contains 1800.
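For context, a Benchee comparison like the one below could be set up roughly like this (a hedged sketch: the dictionary variable, the input file and the job name are assumptions, not the repo's actual benchmark task):

```elixir
# Hypothetical setup; warmup/time/memory_time mirror the configuration shown
# in the output below. Assumes popular.txt is the smaller (25k) word list.
dict_25k = File.stream!("priv/assets/popular.txt") |> Enum.into(%Trie.A{})

Benchee.run(
  %{
    "Trie.A 25k" => fn -> Trie.completions(dict_25k, "ba") end
  },
  warmup: 5,
  time: 20,
  memory_time: 10
)
```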
```
$> mix trie.completions --benchmark "ba"

Loading dictionaries...
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.15.2
Erlang 26.0.2

Benchmark suite executing with the following configuration:
warmup: 5 s
time: 20 s
memory time: 10 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 3.50 min

Measured function call overhead as: 0 ns
Benchmarking Trie.A 170k ...
Benchmarking Trie.A 25k ...
Benchmarking Trie.B 170k ...
Benchmarking Trie.B 25k ...
Benchmarking Trie.C 170k ...
Benchmarking Trie.C 25k ...

Name                  ips        average  deviation         median         99th %
Trie.A 25k        15.47 K       64.64 μs    ±56.92%       51.71 μs      228.50 μs
Trie.C 25k         6.75 K      148.14 μs    ±69.04%       96.96 μs      456.75 μs
Trie.B 25k         6.49 K      154.15 μs    ±66.55%      102.88 μs      459.71 μs
Trie.A 170k        3.45 K      290.24 μs    ±74.26%      244.38 μs     1196.96 μs
Trie.C 170k        1.66 K      602.67 μs    ±68.16%      416.05 μs     1771.33 μs
Trie.B 170k        1.60 K      623.29 μs    ±67.56%      434.46 μs     1827.27 μs

Comparison:
Trie.A 25k        15.47 K
Trie.C 25k         6.75 K - 2.29x slower +83.50 μs
Trie.B 25k         6.49 K - 2.38x slower +89.51 μs
Trie.A 170k        3.45 K - 4.49x slower +225.60 μs
Trie.C 170k        1.66 K - 9.32x slower +538.04 μs
Trie.B 170k        1.60 K - 9.64x slower +558.65 μs

Memory usage statistics:
Name          Memory usage
Trie.A 25k       110.77 KB
Trie.C 25k       176.36 KB - 1.59x memory usage +65.59 KB
Trie.B 25k       176.40 KB - 1.59x memory usage +65.63 KB
Trie.A 170k      531.53 KB - 4.80x memory usage +420.77 KB
Trie.C 170k      843.94 KB - 7.62x memory usage +733.17 KB
Trie.B 170k      843.98 KB - 7.62x memory usage +733.21 KB
```

### Observations
* Using structs to represent nodes has consistently shown to be the most effective
  approach for both performance and memory usage -- `Trie.A`.
* To my surprise, representing a char by its unicode integer code severely
  hurt performance and memory consumption -- `Trie.B` -- in comparison to
  storing chars as Erlang binary strings -- `Trie.A` (a rough sketch of all
  three representations follows this list).
* Using tuples (triples, actually) to represent a node wasn't beneficial in any
  way -- `Trie.C`. It hurt performance and memory consumption, and decreased
  readability of the code.
* The memory consumed to get completions seems ok-ish for now, but I wonder if
  I can decrease the amount of memory consumed by the dictionary. Needs further
  profiling.
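For illustration only, here's a guess at what the three node representations could look like, based on the descriptions above (these are not the repo's actual definitions; see the linked source files for those):

```elixir
# Trie.A-style (guess): struct, char stored as a binary (UTF-8 string).
defmodule NodeA do
  defstruct value: nil, children: %{}, end_of_word?: false
end

node_a = %NodeA{value: "b", children: %{}, end_of_word?: false}

# Trie.B-style (guess): struct, char stored as its unicode integer codepoint.
defmodule NodeB do
  defstruct value: nil, children: %{}, end_of_word?: false
end

node_b = %NodeB{value: ?b, children: %{}, end_of_word?: false}

# Trie.C-style (guess): a plain triple of {value, children, end_of_word?}.
node_c = {"b", %{}, false}
```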