{"id":16012781,"url":"https://github.com/alexpovel/b4s","last_synced_at":"2025-03-16T07:31:37.831Z","repository":{"id":173632631,"uuid":"647281917","full_name":"alexpovel/b4s","owner":"alexpovel","description":"Perform binary search on a single, delimited string slice of sorted but unevenly sized substrings.","archived":false,"fork":false,"pushed_at":"2024-06-01T19:16:15.000Z","size":7132,"stargazers_count":5,"open_issues_count":8,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-15T00:35:55.387Z","etag":null,"topics":["binary-search","fuzz-tested","rust","string","unevenly-spaced"],"latest_commit_sha":null,"homepage":"https://docs.rs/b4s/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alexpovel.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-30T12:54:15.000Z","updated_at":"2024-09-28T15:53:10.000Z","dependencies_parsed_at":"2024-01-22T18:40:56.007Z","dependency_job_id":"f07bcc6e-ccb2-4ce7-ad7b-1fc76e1ace8d","html_url":"https://github.com/alexpovel/b4s","commit_stats":null,"previous_names":["alexpovel/b4s"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexpovel%2Fb4s","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexpovel%2Fb4s/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexpovel%2Fb4s/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alexpovel%2Fb4s/manifests","owner_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alexpovel","download_url":"https://codeload.github.com/alexpovel/b4s/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243806032,"owners_count":20350773,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-search","fuzz-tested","rust","string","unevenly-spaced"],"created_at":"2024-10-08T14:21:07.615Z","updated_at":"2025-03-16T07:31:36.386Z","avatar_url":"https://github.com/alexpovel.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!-- markdownlint-disable MD013 --\u003e\n\n# b4s\n\nBinary Search Single Sorted String: Perform binary search on a single, delimited string\nslice of sorted but unevenly sized substrings. This crate might be useful if you...\n\n- have a pre-sorted (`build.rs`, ...) list of words, and\n- need to check if a word is contained,\n- with low overhead, no allocations, minimal extra dependencies.\n\nIn a drop-in manner, this crate replaces linear with binary search for corresponding\nspeed-ups in suitable cases, delivering faster runtime at virtually no extra cost\n(writing this yourself is easy enough, but this crate is tested and fuzzed :-) ). 
See\nbelow for [more](#motivation), and [benchmarks](#benchmarks).\n\nThe docs are best viewed via [docs.rs](https://docs.rs/b4s).\n\n[![codecov](https://codecov.io/github/alexpovel/b4s/branch/main/graph/badge.svg?token=jCISYOujgB)](https://codecov.io/github/alexpovel/b4s)[![crates](https://img.shields.io/crates/v/b4s.svg)](https://crates.io/crates/b4s)\n\n## Usage\n\nThere are generally two ways to set up this crate: at compile time, or at runtime. The\nmain (only...) method of interest is [`SortedString::binary_search()`]. View its\ndocumentation for detailed context.\n\n### Runtime\n\n```rust\nuse b4s::{AsciiChar, SortedString};\n\nfn main() {\n    match SortedString::new_checked(\"abc,def,ghi,jkl,mno,pqr,stu,vwx,yz\", AsciiChar::Comma) {\n        Ok(ss) =\u003e {\n            match ss.binary_search(\"ghi\") {\n                Ok(r) =\u003e println!(\"Found at range: {:?}\", r),\n                Err(r) =\u003e println!(\"Not found, last looked at range: {:?}\", r),\n            }\n        }\n        Err(e) =\u003e println!(\"Error: {:?}\", e),\n    }\n}\n```\n\n### Compile-time\n\nFor convenience, there's also a `const fn`, usable statically. As a tradeoff, instance\ncreation will not perform correctness checks. An unsorted string will result in binary\nsearch misbehaving. Though no panics occur, you will be handed back an `Error`. 
See the\ndocumentation of [`SortedString::new_unchecked()`] for details.\n\n```rust\nuse b4s::{AsciiChar, SortedString};\n\nstatic SS: SortedString =\n    SortedString::new_unchecked(\"abc,def,ghi,jkl,mno,pqr,stu,vwx,yz\", AsciiChar::Comma);\n\nfn main() {\n    match SS.binary_search(\"ghi\") {\n        Ok(r) =\u003e println!(\"Found at range: {:?}\", r),\n        Err(r) =\u003e println!(\"Not found, last looked at range: {:?}\", r),\n    }\n}\n```\n\nThe source for the input string can be anything, for example a file prepared at compile\ntime:\n\n```rust,ignore\nstatic SS: SortedString =\n    SortedString::new_unchecked(include_str!(\"path/to/file\"), AsciiChar::LineFeed);\n```\n\nThis is convenient if a delimited (`\\n`, ...) file is already at hand. It only needs to\nbe sorted once previously, and is then available for string containment checks at good,\nalbeit not perfect, runtime performance, at essentially no startup cost.\n\n## Motivation\n\nThe itch to be scratched is the following:\n\n- there's an array of strings to do lookup in, for example a word list\n- the lookup is a simple containment check, with no modification\n- the word list is available and prepared (sorted) at compile-time (e.g. in\n  [`build.rs`](https://doc.rust-lang.org/cargo/reference/build-scripts.html))\n- the word list is large (potentially much larger than the code itself); think 5MB or\n  more\n- the list is to be distributed as part of the binary\n\nA couple possible approaches come to mind. The summary table, where `n` is the number of\nwords in the dictionary and `k` the number of characters in a word to look up, is (for\nmore context, see the individual sections below):\n\n| Approach             | Pre-compile preprocessing[^1] | Compile time prepr. 
| Runtime lookup                | Binary size              |\n| -------------------- | ----------------------------- | ------------------- | ----------------------------- | ------------------------ |\n| `b4s`                | [`O(n log n)`][slice-sort]    | Single ref: `O(1)`  | [`O(log n)`][b4s-lib]         | `O(n)`                   |\n| [`fst`][fst-repo]    | [`O(n log n)`][fst-build][^2] | Single ref: `O(1)`  | [`O(k)`][fst-lookup]          | [`\u003c O(n)`][fst-size][^3] |\n| [slice][slice]       | [`O(n log n)`][slice-sort]    | Many refs: `O(n)`   | [`O(log n)`][slice-binsearch] | `~ O(3n)`                |\n| [`phf`][phf-repo]    | None                          | Many refs: `O(n)`   | Hash: `O(1)`                  | `~ O(3n)`                |\n| [`HashSet`][hashset] | None                          | Many refs: `O(n)`   | Hash: `O(1)`                  | `~ O(3n)`                |\n| padded `\u0026str`        | [`~ O(n log n)`][pad-file]    | Single ref: `O(1)`  | Bin. search: `O(log n)`       | `~ O(n)`                 |\n\nThis crate is an attempt to provide a solution with:\n\n1. **good, not perfect runtime performance**,\n2. very little, [one-time](https://doc.rust-lang.org/cargo/reference/build-scripts.html#rerun-if-changed) compile-time preprocessing needed (just sorting),\n3. **essentially no additional startup cost** (unlike, say, constructing a `HashSet` at\n  runtime)[^4],\n4. **binary sizes as small as possible**,\n5. **compile times as fast as possible**.\n\nIt was found that approaches using slices and hash sets (via `phf`) absolutely tanked\ndeveloper experience, with compile times north of 20 minutes (!) for 30 MB word lists\n(even on [fast hardware](#note)), large binaries, and\n[`clippy`](https://github.com/rust-lang/rust-clippy) imploding, taking the IDE with it.\nThis crate was born as a solution. Its main downside is **suboptimal runtime\nperformance**. If that is your primary goal, opt for `phf` or similar crates. 
This crate\nis not suitable for long-running applications, where the initial cost of, e.g., `HashSet`\ncreation is only a small fraction of overall runtime costs.\n\n## Alternative approaches\n\nThe following alternatives might be considered, but were found unsuitable for one reason\nor another. See [this\nthread](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040)\nfor more discussion.\n\n### Slices\n\nA simple slice is an obvious choice, and can be generated in a build script.\n\n```rust\nstatic WORDS: \u0026[\u0026str] = \u0026[\"abc\", \"def\", \"ghi\", \"jkl\"];\n\nassert_eq!(WORDS.binary_search(\u0026\"ghi\").unwrap(), 2);\n```\n\nThere are two large pains in this approach:\n\n1. compile times become very slow (in the rough ballpark of 1 minute per 100,000 words,\n   YMMV considerably)\n2. binary size becomes large.\n\n   The words are *much* shorter than the `\u0026str` referencing them. On 64-bit\n   hardware, [a `\u0026str` is 16\n   bytes](https://doc.rust-lang.org/std/primitive.str.html#representation), with a\n   `usize` address pointer and [a `usize`\n   length](https://doc.rust-lang.org/book/ch15-00-smart-pointers.html). For large word\n   lists, this leads to incredible bloat for the resulting binary.\n\n### Hash Set\n\nRegular [`HashSet`s][hashset] are not available at compile time. Crates like\n[`phf`][phf-repo] change that:\n\n```rust\nuse phf::{phf_set, Set};\n\nstatic WORDS: Set\u003c\u0026'static str\u003e = phf_set! {\n    \"abc\",\n    \"def\",\n    \"ghi\",\n    \"jkl\"\n};\n\nassert!(WORDS.contains(\u0026\"ghi\"))\n```\n\nSimilar downsides as for the slices case apply: very long compile times, and\nconsiderable binary bloat from fat pointers. 
A hash set ultimately is a slice with\ncomputed indices, so this is expected.\n\n### Finite State Transducer/Acceptor (Automaton)\n\nThe [`fst`][fst-repo] crate is a fantastic candidate, [brought\nup](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/7?u=alexpovel)\nby its author (of [`ripgrep`](https://github.com/BurntSushi/ripgrep) and\n[`regex`](https://github.com/rust-lang/regex) fame):\n\n```rust\nuse fst::Set; // Don't need FST, just FSA here\n\nstatic WORDS: \u0026[\u0026str] = \u0026[\"abc\", \"def\", \"ghi\", \"jkl\"];\n\nlet set = Set::from_iter(WORDS.into_iter()).unwrap();\nassert!(set.contains(\"ghi\"));\n```\n\nIt offers:\n\n- [almost free (in time and space)\n  deserialization](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/9?u=alexpovel):\n  its serialization format is identical to its in-memory representation, unlike [other\n  solutions](#higher-order-data-structures), facilitating startup performance\n- compression[^3] (important for\n  [publishing](https://github.com/rust-lang/crates.io/issues/195)), making it the only\n  candidate in this comparison natively leading to *smaller* size than the original word\n  list\n- extension points (fuzzy and case-insensitive searching, bring-your-own-automaton etc.)\n- [faster lookups than this crate](#benchmarks), by a factor of about 2\n\nFor all intents and purposes, **`fst` is likely the best solution** for\nthe niche use case mentioned above.\n\nFor faster lookups than `fst` (closing the gap towards hash sets), [but giving up\ncompression](https://users.rust-lang.org/t/fast-string-lookup-in-a-single-str-containing-millions-of-unevenly-sized-substrings/98040/13?u=alexpovel)\n([TANSTAAFL](https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch)!), try an\nautomaton 
from\n[`regex-automata`](https://docs.rs/regex-automata/latest/regex_automata/dfa/index.html#example-deserialize-a-dfa).\nNote that should your use case involve an initial decompression step, the slower runtime\nlookups but built-in compression of `fst` might still come out ahead in combination.\n\n### Single, sorted and padded string\n\nAnother approach could be to use a single string (saving pointer bloat), but pad all\nwords to the longest occurring length, facilitating easy binary search (and increasing\nbloat to some extent):\n\n```rust\nstatic WORDS: \u0026str = \"abc␣␣def␣␣ghi␣␣jklmn\";\n\n// Perform binary search...\n```\n\nThe binary search implementation is then straightforward, as the elements are of known,\nfixed lengths (in this case, 5). This approach was [found to not perform\nwell](#benchmarks). Find its (bare-bones) implementation in the\n[benchmarks](./benches/main.rs).\n\n### Higher-order data structures\n\nIn certain scenarios, one might reach for more sophisticated approaches, such as\n[tries](https://en.wikipedia.org/wiki/Trie). This is not a case this crate is designed\nfor. Such a structure would have to be either:\n\n- [built at runtime](https://docs.rs/trie-rs/0.1.1/trie_rs/index.html#usage-overview),\n  for example as\n\n  ```rust\n  use trie_rs::TrieBuilder;\n\n  let mut builder = TrieBuilder::new();\n  builder.push(\"abc\");\n  builder.push(\"def\");\n  builder.push(\"ghi\");\n  builder.push(\"jkl\");\n  let trie = builder.build(); // Takes time\n\n  assert!(trie.exact_match(\"def\"));\n  ```\n\n  or alternatively\n- [deserialized from a pre-built structure](https://serde.rs/).\n\nWhile tools like [bincode](https://docs.rs/bincode/latest/bincode/) are fantastic, the\nlatter approach is still numbingly slow at application startup, compared to the (much\nmore ham-fisted) approach the crate at hand takes.\n\n### Linear search\n\nThis is only included here and in the benchmarks as a sanity check and baseline. 
Linear\nsearch like\n\n```rust\nstatic WORDS: \u0026[\u0026str] = \u0026[\"abc\", \"def\", \"ghi\", \"jkl\"];\nassert!(WORDS.contains(\u0026\"ghi\"));\n```\n\nis $O(n)$, and [slower by a couple of orders of magnitude for large\nlists](#linear-search-performance). If your current implementation relies on linear\nsearch, this crate might offer an almost drop-in replacement with a significant\nperformance improvement.\n\n## Benchmarks\n\nThe benchmarks below show a performance comparison. The benchmarks run a search for\nrepresentative words (start, middle, end, shortest and longest words found in the\npre-sorted input list), on various input word list lengths.\n\nSets are unsurprisingly fastest, but naive binary search (the built-in one) seems\nincredibly optimized and just as fast. `b4s` is slower by a factor of 5 to 10. The\n\"padded string\" variant is slowest. One can observe how, as the input lists get longer\n(\"within *X* entries\"), `b4s` becomes slower.\n\nIn the context of this crate's purpose, the slowness might not be an issue: if\napplication startup is measured in milliseconds, and lookups in nanoseconds (!), one can\nperform in the rough ballpark of, say, 100,000 lookups before the tradeoff of this crate\n(fast startup) becomes a problem (this crate would be terrible for a web server).\n\n![benchmark results violin plot](https://raw.githubusercontent.com/alexpovel/b4s/main/assets/benchmark.png)\n\n### Linear search performance\n\nThe [benchmark plot](./assets/benchmarks-with-linear-search.png) including [linear\nsearch](#linear-search) is largely illegible, as the linear horizontal axis scaling\ndwarfs all other search methods. 
It is therefore linked separately, but paints a clear\npicture.\n\n### Note\n\nThe benchmarks were run on a machine with the following specs:\n\n- AMD Ryzen 7 5800X3D; DDR4 @ 3600MHz; NVMe SSD\n- Debian 12 inside WSL 2 on Windows 10 21H2\n- libraries with versions as of commit 9e2f11c39342f1ea3460dda810a92b225ee9d4b8 (refer\n  to its `Cargo.toml`)\n\nThe benchmarks are not terribly scientific (low sample sizes etc.), but serve as a rough\nguideline and sanity check. Run them yourself from the repository root with `cargo\ninstall just \u0026\u0026 just bench`.\n\n## Note on name\n\nThe 3-letter name is neat. Should you have a more meaningful, larger project that could\nmake better use of it, let me know. I might move this crate to a different name.\n\n[^1]: Note that pre-compile preprocessing is ordinarily performed only **a single\n    time**, unless the word list itself changes. This column might be moot, and\n    considered essentially zero-cost. This viewpoint benefits this crate.\n[^2]: Building itself is `O(n)`, but the raw input might be unsorted (as is assumed for\n    all other approaches as well). Sorting is `O(n log n)`, so building the automaton\n    collapses to `O(n + n log n)` = `O(n log n)`.\n[^3]: As an automaton, the finite state transducer (in this case, finite state acceptor)\n    compresses all common prefixes, like a [trie](https://en.wikipedia.org/wiki/Trie),\n    **but also all suffixes**, unlike a prefix tree. That's a massive advantage should\n    compression be of concern. Languages like German benefit greatly. Take the example\n    of `übersehen`: the countless\n    [conjugations](https://www.duden.de/konjugation/uebersehen_uebersehen) are shared\n    among *all* words, so are only encoded once in the entire automaton. The prefix\n    `über` is also shared among many words, and is also only encoded once. 
Compression\n    is built-in.\n[^4]: The [program this crate was initially designed\n    for](https://github.com/alexpovel/betterletters) is sensitive to startup-time, as\n    the program's main processing is *rapid*. Even just 50ms of startup time would be\n    very noticeable, slowing down a program run by a factor of about 10.\n\n[slice-sort]: https://doc.rust-lang.org/std/primitive.slice.html#method.sort\n[fst-repo]: https://github.com/BurntSushi/fst\n[fst-build]: https://docs.rs/fst/0.4.7/fst/struct.SetBuilder.html\n[slice-binsearch]: https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search\n[phf-repo]: https://github.com/rust-phf/rust-phf\n[hashset]: https://doc.rust-lang.org/std/collections/struct.HashSet.html\n[pad-file]: ./benches/main.rs\n[slice]: https://doc.rust-lang.org/std/primitive.slice.html\n[b4s-lib]: ./src/lib.rs\n[fst-lookup]: https://blog.burntsushi.net/transducers/#ordered-sets\n[fst-size]: https://blog.burntsushi.net/transducers/#the-dictionary\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexpovel%2Fb4s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falexpovel%2Fb4s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falexpovel%2Fb4s/lists"}