{"id":34250371,"url":"https://github.com/zefirchiky/spelright","last_synced_at":"2025-12-16T09:18:45.295Z","repository":{"id":315307964,"uuid":"1058971882","full_name":"Zefirchiky/SpelRight","owner":"Zefirchiky","description":"A simple spell checker written in rust. Includes CLI and lib.","archived":false,"fork":false,"pushed_at":"2025-11-13T13:18:14.000Z","size":4013,"stargazers_count":21,"open_issues_count":7,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-13T14:30:40.057Z","etag":null,"topics":["speed","spellcheck","spelling","spelling-correction"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Zefirchiky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-17T20:01:38.000Z","updated_at":"2025-11-13T13:18:18.000Z","dependencies_parsed_at":"2025-09-17T22:22:00.905Z","dependency_job_id":"315a2efb-159a-4a47-acb4-3298aa6932e5","html_url":"https://github.com/Zefirchiky/SpelRight","commit_stats":null,"previous_names":["zefirchiky/easy-spell-checker","zefirchiky/spelright"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Zefirchiky/SpelRight","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zefirchiky%2FSpelRight","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zefirchiky%2FSpelRight/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/Zefirchiky%2FSpelRight/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zefirchiky%2FSpelRight/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Zefirchiky","download_url":"https://codeload.github.com/Zefirchiky/SpelRight/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Zefirchiky%2FSpelRight/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27761836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-16T02:00:10.477Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["speed","spellcheck","spelling","spelling-correction"],"created_at":"2025-12-16T09:18:42.448Z","updated_at":"2025-12-16T09:18:45.281Z","avatar_url":"https://github.com/Zefirchiky.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SpelRight\n\nYes, it is intentional.\n\nA simple Spell Checker written in Rust. 
Includes CLI and lib.\n\nAlso available on [crates.io](https://crates.io/crates/mangahub-spellchecker)!\n\nSupports any UTF-8 (kinda, WIP), as long as the input file is in the right format (see [Dataset Fixer](https://github.com/Zefirchiky/SpelRight/blob/49247d1db4ad47746484e1cdd809b7bdec336ffe/dataset_fixer/src/main.rs) or [load_words_dict](https://github.com/Zefirchiky/SpelRight/blob/49247d1db4ad47746484e1cdd809b7bdec336ffe/src/load_dict.rs)).\n\nWas primarily written for the [MangaHub](https://github.com/Zefirchiky/MangaHub) project's Novel ecosystem. And to learn Rust :D\n\n\u003e [!WARN]\n\u003e\n\u003e For now, only byte processing is supported, WIP\n\n## Some benchmarks\n\nOn my i5-12450H laptop with VS Code open.\n\nEnglish.\n\nLoad and parse a 4 MB file with 370105 words in ~\u003c2ms.\n\nSpell checking runs at ~50,000,000 words/s when all words are correct (worst case scenario, `batch_par_check`).\n\nSorted suggestions for 1000 incorrect words in ~63ms (~15800 words/s, worst case scenario, `batch_par_suggest`).\n\nMemory usage is minimal: a few big strings of all words without delimiters + a small vec of information.\nTotaling dict size + ~200 bytes (depending on the biggest word's length) + additional cost of some operations.\n\n## CLI\n\n`spell.exe` in %PATH%. 
`words.txt` in the same folder.\n\n```shell\n\u003e spell funny wrd sjdkfhsdjfh\n✅ funny\n❓ wrd =\u003e wro wry word wad rd wird ord urd ward wd\n❌ Wrong word 'sjdkfhsdjfh', no suggestions\n```\n\n## Breakthroughs that led to this\n\n### Storing blobs of words, and their metadata\n\nStoring words of each length in immutable (optional) blobs, sorted by bytes.\n\nStore info about those blobs: len and/or count.\n\nPros:\n\n- Incredibly easy to iterate over\n- SIMD compatible\n- Highly parallelizable\n- Great cache locality (a shit ton of cache hits)\n- Search words with binary search `O(log n)`\n- Working with bytes instead of chars\n  - Support any language\n- Others that I forgor\n\nCons:\n\n- Needs a precise dataset\n- Adding words is pretty difficult without moving the whole Vec\n\nPros totally outweigh the Cons!\n\n### Specialized matching algorithm\n\nWhen iterating over each `LenGroup`, based on `max difference`, we can calculate the maximum number of `deletions`, `insertions` and `substitutions`.\n\nAs an example:\n\nWhen checking `nothng` (group 6) against group 7, the allowed difference is 1 `insertion` and 1 (optional) `substitution`.\n\nWith one insertion, `nothng` will become group 7, and with the optional `substitution` it can match other words.\n\nThe budget always adds up exactly: `max_delete + max_insert + max_substitution` equals `max_dif`.\n\nThis is **multiple times** faster than any other distance finding algorithm.\n\n## Goals\n\n- [x] Checking word correctness\n- [x] Suggesting similar words\n- [ ] Adding new words\n- [x] Support different languages\n- [ ] Full languages support\n  - [x] Full ascii support\n  - [ ] Full UTF-8 support\n    - [ ] Normalize some languages\n    - [ ] Divide languages into words with pure ascii, with possible normalization, and with present UTF-8\n  - [ ] Plugin\n    - [ ] For everything\n      - [ ] Default plugins\n    - [ ] For especially complex languages\n- [ ] Make good CLI\n  - [ ] Long running Server\n  - [ ] Config\n- [ ] Make 
it fast\n\n  Suggestions (12500 words/s)\n  - [x] 100 words/s\n  - [x] 250 words/s\n  - [x] 1000 words/s\n  - [x] 2500 words/s\n  - [x] 10000 words/s\n  - [ ] 25000 words/s\n  - [ ] 100000 words/s\n\n  Loading (2.2 ms)\n  - [x] \u003c200 ms\n  - [x] \u003c100 ms\n  - [x] \u003c50 ms\n  - [x] \u003c20 ms\n  - [x] \u003c10 ms\n  - [x] \u003c5 ms\n  - [x] \u003c3 ms\n  - [x] \u003c2 ms (read_to_string is more than 2 ms, not sure if even possible (nvm, after reloading pc, it's less than 2 ms))\n  - [ ] \u003c1 ms (No idea how the fuck this could be possible, but hey, goals!)\n\n## Possible Optimizations\n\n### Hardware\n\n- [x] Cache locality (dense blob of words)\n- [ ] SIMDeez nuts\n  - [x] Distance matching\n  - [ ] Binary search (might be optimized by the compiler)\n- [ ] Parallelism\n  - [ ] Rayon\n    - [x] Test with and without\n    - [ ] Auto deciding between parallel and normal\n  - [ ] Manual\n- [ ] GPU Acceleration\n\n### Memory usage\n\n- [x] Blobs of words with no other symbol (aka. 
no `\\n`)\n- [x] Storing minimal metadata about each word length\n- [ ] Storing first letter offsets, size depends on the language, but minimal overall\n\nTotal memory usage is pretty much minimal.\n\n### Reduce amount of words checked\n\n- [x] Word length groups (depend on dataset)\n- [ ] For lengths that are max distance from a word's length (no char changes allowed, only deletions)\n  - [ ] Tracking first letter offsets, use only the ones whose first letter is the same\n- [x] For lengths that are the same as a word's (no char deletions or insertions, only changes)\n\n### Caching\n\n- [ ] Frequent mistakes\n\n### Loading\n\n\u003e [!NOTE]\n\u003e read_to_string of 370000 words (~4 MB) is about 2 ms.\n\u003e\n\u003e **on my machine.**\n\n- [x] Reduce parsing by pre-parsing the dataset, see `Better dataset`\n\n### Better dataset\n\n- [ ] Reduce the number of words, most words are never used in an average text\n- [x] Store offsets, no unnecessary `\\n`\n- [ ] Store first letter offsets\n\n\u003e [!NOTE]\n\u003e This made the dataset harder to work with manually.\n\n### Better algorithms\n\n- [x] Custom\n  - [x] See Breakthrough\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzefirchiky%2Fspelright","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzefirchiky%2Fspelright","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzefirchiky%2Fspelright/lists"}