https://github.com/zefirchiky/spelright
A simple spell checker written in rust. Includes CLI and lib.
- Host: GitHub
- URL: https://github.com/zefirchiky/spelright
- Owner: Zefirchiky
- License: gpl-3.0
- Created: 2025-09-17T20:01:38.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-11-13T13:18:14.000Z (7 months ago)
- Last Synced: 2025-11-13T14:30:40.057Z (7 months ago)
- Topics: speed, spellcheck, spelling, spelling-correction
- Language: Rust
- Homepage:
- Size: 3.83 MB
- Stars: 21
- Watchers: 1
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SpelRight
Yes, it is intentional.
A simple Spell Checker written in Rust. Includes CLI and lib.
Also available on [crates.io](https://crates.io/crates/mangahub-spellchecker)!
Supports any UTF-8 (kinda, WIP), as long as the input file is in the right format (see [Dataset Fixer](https://github.com/Zefirchiky/SpelRight/blob/49247d1db4ad47746484e1cdd809b7bdec336ffe/dataset_fixer/src/main.rs) or [load_words_dict](https://github.com/Zefirchiky/SpelRight/blob/49247d1db4ad47746484e1cdd809b7bdec336ffe/src/load_dict.rs)).
It was primarily written for the [MangaHub](https://github.com/Zefirchiky/MangaHub) project's Novel ecosystem, and to learn Rust :D
> [!WARNING]
>
> For now, only byte-level processing is supported (WIP).
## Some benchmarks
Measured on my i5-12450H laptop with VS Code open, using an English dictionary.
Loads and parses a 4 MB file with 370105 words in under ~2 ms.
Spell-checks ~50,000,000 words/s when all words are correct (worst-case scenario, `batch_par_check`).
Produces sorted suggestions for 1000 incorrect words in ~63 ms (~15800 words/s, worst-case scenario, `batch_par_suggest`).
Memory usage is minimal: a few big strings of all words without delimiters, plus a small vec of metadata.
In total: dict size + ~200 bytes (depending on the longest word's length) + the additional cost of some operations.
## CLI
Put `spell.exe` in `%PATH%` and `words.txt` in the same folder.
```shell
> spell funny wrd sjdkfhsdjfh
✅ funny
❓ wrd => wro wry word wad rd wird ord urd ward wd
❌ Wrong word 'sjdkfhsdjfh', no suggestions
```
## Breakthroughs that led to this
### Storing blobs of words, and their metadata
Storing words of each length in immutable (optional) blobs, sorted by bytes.
Store info about those blobs: len and/or count.
Pros:
- Incredibly easy to iterate over
- SIMD compatible
- Highly parallelizable
- Great cache locality (a shit ton of cache hits)
- Search words with binary search `O(log n)`
- Working with bytes instead of chars
- Support any language
- Other that I forgor
Cons:
- Needs precise dataset
- Adding words is pretty difficult without moving the whole Vec
Pros totally outweigh the Cons!
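To make the layout concrete, here is a minimal sketch of one blob per word length with a binary-search lookup. `LenGroup` is the name used in this README, but its fields and the helpers `build_groups` and `check` are assumptions made for illustration, not the crate's actual API.

```rust
use std::cmp::Ordering;

/// All dictionary words of one byte length, concatenated without delimiters
/// and sorted by bytes. The fields are assumed for this sketch.
struct LenGroup {
    word_len: usize,
    blob: Vec<u8>,
}

impl LenGroup {
    /// Number of words stored; a zero-length group holds none.
    fn count(&self) -> usize {
        self.blob.len().checked_div(self.word_len).unwrap_or(0)
    }

    /// The `i`-th word is a fixed-width slice, so no delimiter scan is needed.
    fn word(&self, i: usize) -> &[u8] {
        &self.blob[i * self.word_len..(i + 1) * self.word_len]
    }

    /// Binary search over fixed-width slots: O(log n) byte comparisons.
    fn contains(&self, word: &[u8]) -> bool {
        if word.len() != self.word_len {
            return false;
        }
        let (mut lo, mut hi) = (0, self.count());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            match self.word(mid).cmp(word) {
                Ordering::Equal => return true,
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
            }
        }
        false
    }
}

/// Builds one group per byte length; index `n` holds the words of length `n`.
fn build_groups(words: &[&str]) -> Vec<LenGroup> {
    let max_len = words.iter().map(|w| w.len()).max().unwrap_or(0);
    let mut groups: Vec<LenGroup> = (0..=max_len)
        .map(|n| LenGroup { word_len: n, blob: Vec::new() })
        .collect();
    let mut sorted: Vec<&str> = words.to_vec();
    sorted.sort_unstable(); // `str` ordering is byte-wise
    sorted.dedup();
    for w in sorted {
        groups[w.len()].blob.extend_from_slice(w.as_bytes());
    }
    groups
}

/// A word is correct if the group for its length contains it.
fn check(groups: &[LenGroup], word: &str) -> bool {
    groups
        .get(word.len())
        .map_or(false, |g| g.contains(word.as_bytes()))
}
```

With groups built from `["funny", "word", "ward", "wry"]`, `check(&groups, "funny")` is true and `check(&groups, "wrd")` is false. The sketch also shows the con listed above: adding a word means inserting into the middle of a `Vec<u8>`.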
### Specialized matching algorithm
When iterating over each `LenGroup`, we can calculate the maximum number of `deletions`, `insertions` and `substitutions` from the `max difference`.
As an example, with a `max difference` of 2:
Checking `nothng` (group 6) against group 7, the difference between them is 1 `insertion` and 1 (optional) `substitution`.
With one `insertion`, `nothng` moves into group 7, and with the optional `substitution` it can match other words there.
The three budgets always add up to exactly `max_dif`: `max_delete + max_insert + max_substitution`.
This is **multiple times** faster than any other distance-finding algorithm.
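The budget idea can be sketched as follows. This is one reading of the description above, not the crate's implementation: `Budget`, `budget_for` and `matches_within` are hypothetical names, and the matcher is a plain recursive check rather than an optimized one.

```rust
/// Edit budgets for comparing one word against one length group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Budget {
    deletions: usize,
    insertions: usize,
    substitutions: usize,
}

/// Splits `max_dif` into budgets. The length gap fixes how many insertions or
/// deletions are needed; whatever is left can be spent on substitutions.
/// Returns `None` when the group is too far away to ever match.
fn budget_for(word_len: usize, group_len: usize, max_dif: usize) -> Option<Budget> {
    let gap = word_len.abs_diff(group_len);
    if gap > max_dif {
        return None;
    }
    let (deletions, insertions) = if word_len > group_len { (gap, 0) } else { (0, gap) };
    Some(Budget { deletions, insertions, substitutions: max_dif - gap })
}

/// True if `word` can be turned into `cand` without exceeding any budget.
fn matches_within(word: &[u8], cand: &[u8], b: Budget) -> bool {
    match (word.split_first(), cand.split_first()) {
        (None, None) => true,
        // Word exhausted: every remaining candidate byte must be inserted.
        (None, Some(_)) => cand.len() <= b.insertions,
        // Candidate exhausted: every remaining word byte must be deleted.
        (Some(_), None) => word.len() <= b.deletions,
        (Some((w, wrest)), Some((c, crest))) => {
            // Equal bytes cost nothing.
            if w == c && matches_within(wrest, crest, b) {
                return true;
            }
            (w != c
                && b.substitutions > 0
                && matches_within(wrest, crest, Budget { substitutions: b.substitutions - 1, ..b }))
                || (b.insertions > 0
                    && matches_within(word, crest, Budget { insertions: b.insertions - 1, ..b }))
                || (b.deletions > 0
                    && matches_within(wrest, cand, Budget { deletions: b.deletions - 1, ..b }))
        }
    }
}
```

For `nothng` against group 7 with `max_dif` of 2, `budget_for(6, 7, 2)` gives 0 deletions, 1 insertion and 1 substitution, and `matches_within(b"nothng", b"nothing", budget)` is true. Because the budgets are known before the scan starts, a candidate can be rejected as soon as one of them runs out.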
## Goals
- [x] Checking word correctness
- [x] Suggesting similar words
- [ ] Adding new words
- [x] Support different languages
- [ ] Full languages support
- [x] Full ascii support
- [ ] Full UTF-8 support
- [ ] Normalize some languages
- [ ] Divide languages into words with pure ascii, with possible normalization, and with present UTF-8
- [ ] Plugin
- [ ] For everything
- [ ] Default plugins
- [ ] For especially complex languages
- [ ] Make good CLI
- [ ] Long running Server
- [ ] Config
- [ ] Make it fast
Suggestions (12500 words/s)
- [x] 100 words/s
- [x] 250 words/s
- [x] 1000 words/s
- [x] 2500 words/s
- [x] 10000 words/s
- [ ] 25000 words/s
- [ ] 100000 words/s
Loading (2.2 ms)
- [x] <200 ms
- [x] <100 ms
- [x] <50 ms
- [x] <20 ms
- [x] <10 ms
- [x] <5 ms
- [x] <3 ms
- [x] <2 ms (read_to_string is more than 2 ms, not sure if even possible (nvm, after rebooting the PC, it's less than 2 ms))
- [ ] <1 ms (No idea how the fuck this could be possible, but hey, goals!)
## Possible Optimizations
### Hardware
- [x] Cache locality (dense blob of words)
- [ ] SIMDeez nuts
- [x] Distance matching
- [ ] Binary search (might be optimized by the compiler)
- [ ] Parallelism
- [ ] Rayon
- [x] Test with and without
- [ ] Auto deciding between parallel and normal
- [ ] Manual
- [ ] GPU Acceleration
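As a rough illustration of the "auto deciding between parallel and normal" item, the sketch below switches on the batch size. It uses `std::thread::scope` instead of Rayon so that it stays dependency-free. `check_all`, `par_threshold` and the `is_correct` callback are hypothetical names, and a useful threshold would have to come from benchmarks.

```rust
use std::thread;

/// Checks every word, going parallel only when the batch is large enough
/// to pay for spawning threads. `par_threshold` is an assumed tuning knob.
fn check_all<F>(words: &[&str], par_threshold: usize, is_correct: F) -> Vec<bool>
where
    F: Fn(&str) -> bool + Sync,
{
    let threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    if words.len() < par_threshold || threads < 2 {
        // Small batch or single core: a plain sequential pass.
        return words.iter().map(|&w| is_correct(w)).collect();
    }
    // One contiguous chunk per thread; results keep the input order.
    let chunk_len = ((words.len() + threads - 1) / threads).max(1);
    let is_correct = &is_correct;
    thread::scope(|s| {
        let handles: Vec<_> = words
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&w| is_correct(w)).collect::<Vec<bool>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<bool>>()
    })
}
```

Both paths return the same result for the same input, so the switch only affects speed. With Rayon the parallel branch would be shorter, but the decision logic would look the same.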
### Memory usage
- [x] Blobs of words with no other symbol (aka. no `\n`)
- [x] Storing minimal metadata about each word length
- [ ] Storing first letter offsets, size depends on the language, but minimal overall
Total memory usage is pretty much minimal.
### Reduce amount of words checked
- [x] Word length groups (depend on dataset)
- [ ] For lengths that are at max distance from a word (no char changes allowed, only deletions)
- [ ] Tracking first letter offsets, use only the ones whose first letter is the same
- [x] For lengths that are the same as a word's (no char deletions or insertions, only changes)
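The two pruning ideas above could look roughly like this. All names (`candidate_lengths`, `first_byte_offsets`, `words_starting_with`) are hypothetical, and the 257-entry offset table is just one possible layout for a byte-sorted, fixed-width blob.

```rust
use std::ops::RangeInclusive;

/// Only length groups within `max_dif` of the word's length can match.
fn candidate_lengths(word_len: usize, max_dif: usize) -> RangeInclusive<usize> {
    word_len.saturating_sub(max_dif).max(1)..=word_len + max_dif
}

/// For a blob of `word_len`-byte words sorted by bytes, `offsets[b]` is the
/// index of the first word whose first byte is `b`, and `offsets[b + 1]` is
/// one past the last such word. `word_len` must be non-zero.
fn first_byte_offsets(blob: &[u8], word_len: usize) -> [usize; 257] {
    let mut offsets = [0usize; 257];
    // Count the words per first byte, shifted by one slot.
    for word in blob.chunks_exact(word_len) {
        offsets[word[0] as usize + 1] += 1;
    }
    // A prefix sum turns the counts into start indices.
    for i in 1..257 {
        offsets[i] += offsets[i - 1];
    }
    offsets
}

/// The sub-blob of words that start with `first`, found without scanning.
fn words_starting_with<'a>(
    blob: &'a [u8],
    word_len: usize,
    offsets: &[usize; 257],
    first: u8,
) -> &'a [u8] {
    let start = offsets[first as usize] * word_len;
    let end = offsets[first as usize + 1] * word_len;
    &blob[start..end]
}
```

For the blob `antbatbedcat` with 3-byte words, asking for `b'b'` returns `batbed`. A real table would probably be smaller than 257 entries per group, since the needed size depends on the language.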
### Caching
- [ ] Frequent mistakes
### Loading
> [!NOTE]
> `read_to_string` of 370000 words (~4 MB) takes about 2 ms.
>
> **on my machine.**
- [x] Reduce parsing by pre-parsing the dataset, see `Better dataset`
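To show why a pre-parsed dataset makes loading cheap, here is a sketch that slices a file into per-length blobs using only a header of counts. The layout is hypothetical and is NOT the crate's real format (see `load_dict.rs` for that). `parse_dataset` is an illustrative name.

```rust
/// Splits a pre-parsed dataset into one blob per word length.
/// Hypothetical layout (not the crate's real format):
///   line 1: comma-separated word counts, field `i` = words of length `i + 1`
///   rest:   all blobs concatenated, with no `\n` between words
fn parse_dataset(data: &str) -> Option<Vec<&[u8]>> {
    let (header, body) = data.split_once('\n')?;
    let body = body.as_bytes();
    let mut groups = Vec::new();
    let mut start = 0usize;
    for (i, field) in header.split(',').enumerate() {
        let count: usize = field.trim().parse().ok()?;
        // Each group spans `count * word_len` bytes; no per-word parsing needed.
        let end = start.checked_add(count.checked_mul(i + 1)?)?;
        groups.push(body.get(start..end)?);
        start = end;
    }
    // Reject files whose body does not match the header exactly.
    if start == body.len() {
        Some(groups)
    } else {
        None
    }
}
```

For the input `"1,2,1\naatofcat"` this yields the blobs `a`, `atof` and `cat`. The work is proportional to the number of length groups rather than the number of words, which is the point of storing offsets instead of newlines.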
### Better dataset
- [ ] Reduce the number of words; most words are never used in an average text
- [x] Store offsets, no unnecessary `\n`
- [ ] Store first letters offsets
> [!NOTE]
> Made it harder to work manually with dataset.
### Better algorithms
- [x] Custom
- [x] See Breakthrough