https://github.com/manishrwt15/triebenchmark
A benchmark study of Trie data structure performance on real-world datasets
- Host: GitHub
- URL: https://github.com/manishrwt15/triebenchmark
- Owner: Manishrwt15
- Created: 2025-05-16T04:24:36.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-05-18T06:07:14.000Z (11 months ago)
- Last Synced: 2025-06-02T01:47:59.460Z (10 months ago)
- Topics: benchmark, data-structures, java, performance-analysis, trie
- Language: Python
- Homepage:
- Size: 2.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# TrieBenchmark
A high-performance Trie data structure implementation in Java, benchmarked with a large real-world English word dataset. This project evaluates the insert and search efficiency of a Trie at multiple dataset sizes, showcasing its suitability for search-heavy applications such as autocomplete and dictionary lookup.
---
## Features
- Efficient insertion of large datasets (up to 370,105 words)
- Fast search operations whose cost depends on word length rather than on the number of stored words
- Benchmarking across multiple dataset sizes (1K, 10K, 100K, full dataset)
- Detailed performance metrics: insert time, search time, average search time per word
- Simple and extensible Java implementation
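
To make the insert and search operations listed above concrete, here is a minimal sketch of a Trie in Java. It is not the repository's `Trie.java`: the class name `TrieSketch`, the `HashMap`-based child storage, and the `startsWith` helper are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the repository's Trie.java.
class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character not yet on the path. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** Returns true only if this exact word was inserted earlier. */
    boolean search(String word) {
        Node node = walk(word);
        return node != null && node.isWord;
    }

    /** Returns true if at least one stored word begins with the prefix. */
    boolean startsWith(String prefix) {
        return walk(prefix) != null;
    }

    // Follows the characters of s from the root; returns null if the path breaks.
    private Node walk(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

Both `insert` and `search` visit one node per character, so their cost is bounded by the length of the word and does not grow with the number of words already stored.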
---
## Dataset
The benchmarking uses a real-world English word list (`words.txt`) containing **370,105** words sourced from an open dataset.
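
As a rough sketch of how a word list like this could be loaded and cut down to the smaller benchmark sizes, the Java below reads one word per line and takes the first `n` entries. The class name `WordListSketch`, the one-word-per-line layout, and the choice of the first `n` words (rather than a random sample) are assumptions; the README does not say how the subsets are selected.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; the file layout and subset strategy are assumptions.
class WordListSketch {
    /** Reads non-empty, trimmed lines from the given file. */
    static List<String> load(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.map(String::trim)
                        .filter(s -> !s.isEmpty())
                        .collect(Collectors.toList());
        }
    }

    /** Returns a view of the first n words, or the whole list if it is shorter. */
    static List<String> firstN(List<String> words, int n) {
        return words.subList(0, Math.min(n, words.size()));
    }

    public static void main(String[] args) throws IOException {
        List<String> all = load(Path.of("words.txt"));
        for (int size : new int[] {1_000, 10_000, 100_000, all.size()}) {
            System.out.println(size + " requested, " + firstN(all, size).size() + " used");
        }
    }
}
```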
---
## How to Run
### Prerequisites
- Java JDK 11 or higher installed
- (Optional) Python 3 environment with `matplotlib` and `pandas` for graph plotting
### Compile and Run Benchmark
```bash
javac Trie.java TrieBenchmark.java
java TrieBenchmark
```
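
For readers who want to see how the reported metrics could be measured, the following is a minimal timing sketch based on `System.nanoTime()`. It is not the repository's `TrieBenchmark.java`: the class `TimingSketch`, its `Result` type, and the use of a `HashSet` as a stand-in store in `main` are all assumptions made so the example is self-contained.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical timing harness; not the repository's TrieBenchmark.java.
class TimingSketch {
    /** Elapsed times in milliseconds for one benchmark run. */
    static final class Result {
        final double insertMs;
        final double searchMs;
        final double avgSearchMs; // searchMs divided by the number of queries
        final int hits;           // queries that were found

        Result(double insertMs, double searchMs, int queries, int hits) {
            this.insertMs = insertMs;
            this.searchMs = searchMs;
            this.avgSearchMs = queries == 0 ? 0.0 : searchMs / queries;
            this.hits = hits;
        }
    }

    /** Times all inserts, then all searches, against any insert/search pair. */
    static Result run(List<String> words, List<String> queries,
                      Consumer<String> insert, Predicate<String> search) {
        long t0 = System.nanoTime();
        for (String w : words) {
            insert.accept(w);
        }
        long t1 = System.nanoTime();

        int hits = 0; // returned in Result so the search loop has an observable effect
        for (String q : queries) {
            if (search.test(q)) {
                hits++;
            }
        }
        long t2 = System.nanoTime();

        return new Result((t1 - t0) / 1e6, (t2 - t1) / 1e6, queries.size(), hits);
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "apply", "banana");
        Set<String> store = new HashSet<>(); // stand-in for a Trie
        Result r = run(words, words, store::add, store::contains);
        System.out.printf("insert=%.6f ms, search=%.6f ms, avg=%.6f ms, hits=%d%n",
                r.insertMs, r.searchMs, r.avgSearchMs, r.hits);
    }
}
```

A single timed pass like this is sensitive to JVM warm-up and garbage collection. Repeating each measurement several times, or using a harness such as JMH, would give more stable numbers.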
## Results Summary
| Dataset Size | Insert Time (ms) | Search Time (ms) | Avg Search Time per Word (ms) |
|--------------|------------------|------------------|-------------------------------|
| 1,000 | 1 | 0.246334 | 0.000246 |
| 10,000 | 15 | 0.151625 | 0.000152 |
| 100,000 | 26 | 1.475125 | 0.001475 |
| 370,105 | 93 | 0.659375 | 0.000659 |
## Analysis
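- **Insert time** grows with dataset size, from 1 ms for 1,000 words to 93 ms for 370,105 words. The growth is not strictly linear in these runs: the per-word cost falls from roughly 1 to 1.5 µs at the two smaller sizes to about 0.25 µs at the two larger ones, which may reflect JVM warm-up during the early runs.
- **Search time** stays below 1.5 ms in every run, regardless of how many words the Trie holds.
- The average search time per word is on the order of a microsecond or less (about 0.00066 ms, or 0.66 µs, for the largest dataset). This is consistent with Trie lookup cost depending on the length of the word rather than on the number of stored words, which is what **near constant-time searches** means here.
- The ordering is not monotonic: the 10,000-word run searched slightly faster than the 1,000-word run, and the 100,000-word run was slower than the full dataset. Caching effects, JIT compilation, or other system noise are likely causes.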
- Overall, the Trie data structure is highly suitable for applications requiring fast and frequent searches, such as autocomplete, spell-checking, and dictionary implementations.
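
The per-word figures behind this analysis can be rechecked directly from the Results Summary table. The sketch below is only arithmetic on the published numbers; the class and method names are hypothetical. Dividing each total search time by its reported average gives about 1,000 in every row, which suggests (as an inference from the table, not something the README states) that each run searched roughly 1,000 words.

```java
// Hypothetical helper that recomputes derived figures from the results table.
class ResultsArithmetic {
    /** Converts a total time in milliseconds into microseconds per word. */
    static double microsPerWord(double totalMs, int words) {
        return totalMs * 1000.0 / words;
    }

    /** Number of queries implied by a total time and a per-query average. */
    static double impliedQueries(double totalMs, double avgMs) {
        return totalMs / avgMs;
    }

    public static void main(String[] args) {
        int[] sizes = {1_000, 10_000, 100_000, 370_105};
        double[] insertMs = {1, 15, 26, 93};
        double[] searchMs = {0.246334, 0.151625, 1.475125, 0.659375};
        double[] avgSearchMs = {0.000246, 0.000152, 0.001475, 0.000659};

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%,d words: insert %.3f us/word, implied queries %.0f%n",
                    sizes[i],
                    microsPerWord(insertMs[i], sizes[i]),
                    impliedQueries(searchMs[i], avgSearchMs[i]));
        }
    }
}
```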
## Plotting Graphs
The benchmark results were visualized using Python's `matplotlib` and `pandas` libraries to better understand the Trie performance trends.
## Visualizations
The following graphs illustrate the performance of the Trie across various dataset sizes:
### Insert Time vs Dataset Size
Shows how the time taken to insert words increases with dataset size.

---
### Search Time vs Dataset Size
Illustrates how the total search time changes with different dataset sizes.

---
### Avg Search Time per Word vs Dataset Size
Demonstrates the near-constant time performance of Trie searches.

---
> All graphs are auto-generated using `matplotlib` and `pandas` from the benchmarking results. See [How to Generate Graphs](#how-to-generate-graphs) below for steps to regenerate them.
### How to Generate Graphs
1. Make sure Python 3 is installed on your system.
2. Create and activate a virtual environment to avoid permission issues:
```bash
python3 -m venv venv
source venv/bin/activate
```
3. Install required Python libraries:
```bash
pip install matplotlib pandas
```
4. Run the graph plotting script:
```bash
python3 plot_graphs.py
```
5. The graphs will be saved in the `results/` folder as image files (e.g., PNG).

> **Note:** If you encounter errors while installing packages system-wide, using a virtual environment is highly recommended to keep dependencies isolated and manageable.
## Author
**Manish Rawat**
- GitHub: [https://github.com/Manishrwt15](https://github.com/Manishrwt15)
- Email: manishrwat15@gmail.com
- LinkedIn: [https://www.linkedin.com/in/manish-rawat-b1b61b269/](https://www.linkedin.com/in/manish-rawat-b1b61b269/)