{"id":51176534,"url":"https://github.com/sysaulab/fuzzy_dict","last_synced_at":"2026-06-27T04:03:57.778Z","repository":{"id":366030468,"uuid":"1273756073","full_name":"sysaulab/fuzzy_dict","owner":"sysaulab","description":"Dictionary based fuzzy filter","archived":false,"fork":false,"pushed_at":"2026-06-19T23:30:56.000Z","size":1464,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-20T01:12:24.411Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sysaulab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-18T20:59:24.000Z","updated_at":"2026-06-19T23:30:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/sysaulab/fuzzy_dict","commit_stats":null,"previous_names":["sysaulab/fuzzy_dict"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/sysaulab/fuzzy_dict","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sysaulab%2Ffuzzy_dict","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sysaulab%2Ffuzzy_dict/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sysaulab%2Ffuzzy_dict/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sysaulab%2Ffuzzy_dict/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/sysaulab","download_url":"https://codeload.github.com/sysaulab/fuzzy_dict/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sysaulab%2Ffuzzy_dict/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34840912,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-27T02:00:06.362Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-27T04:03:56.362Z","updated_at":"2026-06-27T04:03:57.766Z","avatar_url":"https://github.com/sysaulab.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# fuzzy_dict\n\n**fuzzy_dict** is a high‑performance fuzzy string matching library for Rust. It uses a character‑presence bitmask filter to quickly narrow down the search space before applying a similarity scorer. The filter is designed to be **conservative** (no false negatives) and limits scoring to a small pool of the most likely candidates, which makes it well suited to large dictionaries.\n\n\u003e This library is a weekend project, but it’s open source and ready to use. 
If you find it useful, feel free to polish and upload it to crates.io!\n\n---\n\n## 🌍 Global Coverage\n\nThe library ships with **34 alphabet definitions** covering the majority of the world’s **non‑logographic scripts**. Together, these scripts are the primary writing system for roughly **65% of the world’s population** (~5.4 billion people).\n\nIncluded scripts (organised by region):\n\n- **Europe**: Latin, Cyrillic, Greek\n- **Middle East**: Arabic, Hebrew\n- **Caucasus**: Armenian, Georgian\n- **South Asia**: Devanagari, Bengali, Gurmukhi, Gujarati, Telugu, Tamil, Kannada, Malayalam, Odia, Sinhala, Ol Chiki\n- **Southeast Asia**: Thai, Lao, Khmer, Burmese, Javanese, Baybayin\n- **Africa**: Ethiopic, Tifinagh, N’Ko, Coptic, Adlam\n- **Americas**: Cherokee, Osage, Canadian Aboriginal Syllabics\n- **Central / East Asia**: Tibetan, Mongolian\n\n\u003e **For CJK (Chinese, Japanese, Korean)**: These writing systems use thousands of distinct characters and cannot be directly represented in a 64‑bit mask. However, you can **romanise** the text (e.g., pinyin for Chinese, romaji for Japanese, Revised Romanization for Korean) and then pass the romanised strings to the library. The romanisation step is **not** included in this library – it is the responsibility of the consuming application. 
This allows you to benefit from the same fast filtering for CJK content when paired with a suitable transliteration pipeline.\n\n---\n\n## Features\n\n- **Blazing fast** – O(1) bucket lookup for exact masks, with optional expansion to 1‑ and 2‑bit flips.\n- **Multilingual** – Supports 34 scripts out of the box, with custom alphabet support for any other script.\n- **Accent‑ and case‑insensitive** – Character classes group accented variants and both cases together.\n- **Conservative filter** – Never misses a potential match (no false negatives).\n- **Custom alphabets** – Easily add your own scripts via simple text files (line‑wise character classes).\n- **Lightweight** – Only ~8 bytes per dictionary entry overhead.\n\n## How It Works\n\n1. **Alphabet definition**: Each character is assigned to a bit position (1–63) based on the line it appears on in the alphabet files. Bit 0 is reserved for unknown characters.\n2. **Mask computation**: For every word, a 64‑bit mask is computed where each bit indicates the presence of a character class.\n3. **Bucketing**: Words are stored in a `HashMap\u003cu64, Vec\u003cString\u003e\u003e` keyed by their mask.\n4. 
**Search**: Given a query:\n   - Compute its mask.\n   - Look up the exact mask bucket.\n   - If not enough good candidates are found, also inspect buckets whose masks differ from the query mask in exactly 1 or 2 bits (only for bits that exist in the dictionary).\n   - Score candidates using a fast inline scorer (longest common prefix + suffix normalised by max length).\n   - Return the top `limit` results sorted by score.\n\nThe search stops early once the sum of scores of collected candidates exceeds `SCORE_SUM_THRESHOLD` (default 15.0), so that only the most promising candidates are scored.\n\n## Usage\n\nAdd this to your `Cargo.toml` (until it's on crates.io, use the git repository):\n\n```toml\n[dependencies]\nfuzzy_dict = { git = \"https://github.com/sysaulab/fuzzy_dict\" }\n```\n\nThen in your code:\n\n```rust\nuse fuzzy_dict::{Alphabet, FuzzyDict};\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    // Load the default alphabet (includes Latin, Cyrillic, Greek, Arabic, Hebrew,\n    // Armenian, Georgian, Thai, and Devanagari).\n    let alphabet = Alphabet::default();\n    let mut dict = FuzzyDict::with_alphabet(alphabet);\n\n    // Add words from a file (one word per line, '#' for comments)\n    dict.add_file(\"dictionary.txt\")?;\n\n    // Or add words manually\n    dict.add_word(\"hello\");\n    dict.add_word(\"world\");\n\n    // Search, returning at most 10 results\n    let results = dict.search_limit(\"helo\", 10);\n    // Or search without limit\n    // let all = dict.search(\"helo\");\n\n    for (word, score) in results {\n        println!(\"{} -\u003e {:.3}\", word, score);\n    }\n\n    Ok(())\n}\n```\n\n### Using Additional Scripts\n\nThe library includes alphabet definition files for **34 scripts** in the `assets/` directory. 
To load a custom set:\n\n```rust\n// Load a single custom alphabet\nlet ethiopic = Alphabet::from_file(\"assets/ethiopic.txt\")?;\n\n// Or merge several alphabet files together (line‑wise)\nlet indian_scripts = Alphabet::from_files(\u0026[\n    \"assets/devanagari.txt\",\n    \"assets/bengali.txt\",\n    \"assets/gurmukhi.txt\",\n    // ...\n])?;\n```\n\nYou can also define your own alphabet files – see the [Custom Alphabets](#custom-alphabets) section below.\n\n### Command‑Line Demo\n\nThe repository includes a `demo.rs` that shows how to load a dictionary and query it from the command line:\n\n```bash\ncargo run --example demo words.txt cafe 15 0.7\n```\n\nArguments: `\u003cdictionary_file\u003e \u003cquery\u003e [limit] [score_threshold]`\n\nThe demo prints loading statistics, search time, and the results.\n\n## Alphabet Files\n\nAlphabet files are plain text files where each line defines a character class. For example:\n\n```\naAáÀâÄãÃ\nbB\ncCçÇ\n```\n\nAll characters on the same line share the same bit. The library includes predefined alphabets for the supported scripts. 
You can also create your own and load them using `Alphabet::from_file()` or `Alphabet::from_files()`.\n\nTo load only specific standard alphabets:\n\n```rust\nlet alphabet = Alphabet::load_named(\u0026[\"latin\", \"cyrillic\"]);\n```\n\nThe full list of supported scripts and their file names can be found in the `assets/` directory.\n\n## Performance\n\nOn a dictionary of ~500,000 words, the filter achieves:\n\n- **100,000+ queries per second** on modest hardware.\n- **Candidate reduction of 80‑95%** before scoring.\n- **Memory overhead**: ~8 bytes per word for the mask, plus bucket storage.\n\nThe search algorithm is constant time for the exact mask, and the expansion to 1‑ and 2‑bit flips is bounded by the number of effective bits (≤63), making it scale well even for large dictionaries.\n\n## Customisation\n\nYou can tweak the internal score threshold by modifying the constant `SCORE_SUM_THRESHOLD` in the source (currently 15.0). This controls how many candidates are collected before sorting. A higher value may yield more accurate results at the cost of slightly more scoring.\n\nThe scorer itself is a simple inline function; you can replace it with a more sophisticated metric like Jaro‑Winkler if needed, but that will increase query time.\n\n## Limitations\n\n- **Order‑insensitive**: The filter ignores character order, so \"abc\" and \"cba\" produce the same mask. This is a trade‑off for speed; for order‑sensitive matching consider using n‑gram masks.\n- **CJK and logographic scripts**: With thousands of characters, the 64‑bit mask is insufficient. The library does **not** perform romanisation; it is up to the consumer application to convert CJK text to a Latin‑based transcription (e.g., pinyin, romaji) before feeding it to `fuzzy_dict`.\n- **Dynamic alphabets**: The alphabet must be defined before building the dictionary. 
Changing it later requires rebuilding all masks.\n\n## Acknowledgements\n\nInspired by similar techniques used in fuzzy finders like fuzzysort and FlashFuzzy. The bitmask idea is simple yet effective.\n\nFor a detailed explanation, see the [PAPER.md](PAPER.md) in the repository.\n\n---\n\n**Contributions and improvements are welcome!** If you polish the code, feel free to upload it to crates.io – just keep the original author credit.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsysaulab%2Ffuzzy_dict","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsysaulab%2Ffuzzy_dict","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsysaulab%2Ffuzzy_dict/lists"}