https://github.com/manishrwt15/triebenchmark

A benchmark study of Trie data structure performance on real-world datasets
benchmark data-structures java performance-analysis trie

Host: GitHub
URL: https://github.com/manishrwt15/triebenchmark
Owner: Manishrwt15
Created: 2025-05-16T04:24:36.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-18T06:07:14.000Z (about 1 year ago)
Last Synced: 2025-06-02T01:47:59.460Z (about 1 year ago)
Topics: benchmark, data-structures, java, performance-analysis, trie
Language: Python
Homepage:
Size: 2.8 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# TrieBenchmark

A high-performance Trie data structure implementation in Java, benchmarked with a large real-world English word dataset. This project evaluates the insert and search efficiency of a Trie at multiple dataset sizes, showcasing its suitability for search-heavy applications such as autocomplete and dictionary lookup.

---

## Features

- Efficient insertion of large datasets (up to 370,105 words)
- Fast search operations whose cost depends on the length of the word rather than on the number of stored words
- Benchmarking across multiple dataset sizes (1K, 10K, 100K, full dataset)
- Detailed performance metrics: insert time, search time, average search time per word
- Simple and extensible Java implementation
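The insert and search operations listed above can be illustrated with a minimal sketch. This is not the repository's `Trie.java`, which is not reproduced here; the class name `SketchTrie`, the `HashMap`-based child storage, and the method names are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie for illustration; the repository's Trie.java may differ.
class SketchTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if an inserted word ends at this node
    }

    private final Node root = new Node();

    // Walks one node per character, creating missing nodes: O(word length).
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Returns true only if this exact word was inserted, not merely a prefix of one.
    boolean search(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        SketchTrie trie = new SketchTrie();
        trie.insert("car");
        trie.insert("card");
        System.out.println(trie.search("car")); // true
        System.out.println(trie.search("ca"));  // false: "ca" is only a prefix
    }
}
```

Both operations visit at most one node per character, so their cost is bounded by the word length and does not grow with the number of words already stored.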

---

## Dataset

The benchmarking uses a real-world English word list (`words.txt`) containing **370,105** words sourced from an open dataset.
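As a rough sketch of how the word list might be loaded and cut down to the benchmarked sizes, the helper below reads a file with one word per line and takes the first `n` entries. The class name `DatasetSketch`, both method names, and the choice of taking the first `n` words (rather than, say, a random sample) are assumptions; the repository may select its subsets differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical loader; assumes words.txt holds one word per line.
class DatasetSketch {
    // Reads all lines, trims whitespace, and drops empty lines.
    static List<String> loadWords(Path file) throws IOException {
        return Files.readAllLines(file).stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    // Returns a view of the first n words, or the whole list if it is shorter.
    static List<String> firstN(List<String> words, int n) {
        return words.subList(0, Math.min(n, words.size()));
    }

    public static void main(String[] args) throws IOException {
        List<String> all = loadWords(Path.of("words.txt"));
        for (int size : new int[] {1_000, 10_000, 100_000, all.size()}) {
            System.out.println(size + " requested, " + firstN(all, size).size() + " used");
        }
    }
}
```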

---

## How to Run

### Prerequisites

- Java JDK 11 or higher installed
- (Optional) Python 3 environment with `matplotlib` and `pandas` for graph plotting

### Compile and Run Benchmark

```bash
javac Trie.java TrieBenchmark.java
java TrieBenchmark
```

## Results Summary

| Dataset Size | Insert Time (ms) | Search Time (ms) | Avg Search Time per Word (ms) |
|--------------|------------------|------------------|-------------------------------|
| 1,000 | 1 | 0.246334 | 0.000246 |
| 10,000 | 15 | 0.151625 | 0.000152 |
| 100,000 | 26 | 1.475125 | 0.001475 |
| 370,105 | 93 | 0.659375 | 0.000659 |
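The columns above can be related to each other with a small timing sketch. It is not the repository's `TrieBenchmark.java`; the class `TimingSketch` and its methods are hypothetical, and a `HashSet` stands in for the Trie so the example is self-contained. Note that in every row of the table the average column equals the search time divided by 1,000, which the demo below mirrors by dividing by the number of lookups performed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical timing helpers built on System.nanoTime().
class TimingSketch {
    static volatile int sink; // receives the hit count so the search loop is not optimized away

    // Elapsed milliseconds for passing every word to the insert operation.
    static double timeInsertMs(List<String> words, Consumer<String> insert) {
        long start = System.nanoTime();
        for (String w : words) {
            insert.accept(w);
        }
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    // Elapsed milliseconds for looking up every query.
    static double timeSearchMs(List<String> queries, Predicate<String> search) {
        int found = 0;
        long start = System.nanoTime();
        for (String q : queries) {
            if (search.test(q)) {
                found++;
            }
        }
        long elapsed = System.nanoTime() - start;
        sink = found;
        return elapsed / 1_000_000.0;
    }

    // Average per lookup: total search time divided by the number of lookups.
    static double avgPerWordMs(double totalMs, int lookups) {
        return totalMs / lookups;
    }

    public static void main(String[] args) {
        Set<String> standIn = new HashSet<>(); // stand-in for a Trie, for illustration only
        List<String> words = List.of("apple", "apply", "banana");
        double insertMs = timeInsertMs(words, standIn::add);
        double searchMs = timeSearchMs(words, standIn::contains);
        System.out.printf("insert=%.6f ms, search=%.6f ms, avg=%.6f ms%n",
                insertMs, searchMs, avgPerWordMs(searchMs, words.size()));
    }
}
```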

## Analysis

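- **Insert time** grows with the size of the dataset, from 1 ms at 1,000 words to 93 ms at 370,105 words. The growth is not strictly linear in these measurements: going from 10,000 to 100,000 words raises insert time only from 15 ms to 26 ms, which may reflect JIT warm-up during the smaller runs and node sharing among words with common prefixes.
- **Search time** remains low across all dataset sizes, staying below 1.5 ms in total for every run.
- The average search time per word ranges from roughly 0.15 to 1.5 microseconds (~0.0007 ms for the largest dataset). A Trie lookup takes time proportional to the length of the word, not to the number of stored words, which is why the per-word cost does not grow steadily with dataset size. In every row, the average equals the total search time divided by 1,000.
- Search time does not increase monotonically with dataset size: the 10,000-word run was faster than the 1,000-word run, and the full dataset was faster than the 100,000-word run. At sub-millisecond totals this is likely measurement noise from JIT compilation, caching effects, or garbage collection rather than a property of the Trie.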
- Overall, the Trie data structure is highly suitable for applications requiring fast and frequent searches, such as autocomplete, spell-checking, and dictionary implementations.
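Sub-millisecond timings on the JVM are sensitive to JIT warm-up, so one way to make figures like those above more stable is to run the measured task a few times untimed and then report the median of several timed runs. The helper below is a sketch under that assumption; `WarmupSketch`, `medianNanos`, and the warm-up and run counts are hypothetical and are not part of the repository.

```java
import java.util.Arrays;

// Hypothetical helper: untimed warm-up rounds followed by a median of timed runs.
class WarmupSketch {
    // Median of the samples; for an even count this returns the upper of the two middle values.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    // Runs the task 'warmups' times untimed, then 'runs' times timed, and returns the median in ns.
    static long medianNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real benchmark would run the Trie searches here.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 10_000) {
                throw new IllegalStateException("unexpected length");
            }
        };
        System.out.println("median ns: " + medianNanos(work, 5, 11));
    }
}
```

For more rigorous measurements, a dedicated harness such as JMH handles warm-up, forking, and dead-code elimination more thoroughly than a hand-rolled loop.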

## Plotting Graphs

The benchmark results were visualized using Python's `matplotlib` and `pandas` libraries to better understand the Trie performance trends.

## Visualizations

The following graphs illustrate the performance of the Trie across various dataset sizes:

### Insert Time vs Dataset Size
Shows how the time taken to insert words increases with dataset size.

![Insert Time vs Dataset Size](results/insert_time_vs_dataset_size.png)

---

### Search Time vs Dataset Size
Illustrates how the total search time changes with different dataset sizes.

![Search Time vs Dataset Size](results/search_time_vs_dataset_size.png)

---

### Avg Search Time per Word vs Dataset Size
Demonstrates the near-constant time performance of Trie searches.

![Avg Search Time per Word vs Dataset Size](results/avg_time_per_word_vs_dataset_size.png)

---

> All graphs are auto-generated using `matplotlib` and `pandas` from the benchmarking results. See the [How to Generate Graphs](#how-to-generate-graphs) section below for steps to regenerate them.

### How to Generate Graphs:
1. Make sure Python 3 is installed on your system.
2. Create and activate a virtual environment to avoid permission issues:
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```
3. Install required Python libraries:
   ```bash
   pip install matplotlib pandas
   ```
4. Run the graph plotting script:
   ```bash
   python3 plot_graphs.py
   ```
5. The graphs will be saved in the `results/` folder as image files (e.g., PNG).

> **Note:** If you encounter errors while installing packages system-wide, using a virtual environment is highly recommended to keep dependencies isolated and manageable.

## Author

**Manish Rawat**

- GitHub: [https://github.com/Manishrwt15](https://github.com/Manishrwt15)
- Email: manishrwat15@gmail.com
- LinkedIn: [https://www.linkedin.com/in/manish-rawat-b1b61b269/](https://www.linkedin.com/in/manish-rawat-b1b61b269/)
