{"id":16976161,"url":"https://github.com/ashvardanian/stringzilla-benchmarks-rs","last_synced_at":"2025-03-22T14:31:48.021Z","repository":{"id":224163012,"uuid":"762493674","full_name":"ashvardanian/stringzilla-benchmarks-rs","owner":"ashvardanian","description":"Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, and Levenshtein edit-distance calculations","archived":false,"fork":false,"pushed_at":"2025-03-16T09:19:02.000Z","size":111,"stargazers_count":45,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-16T10:24:26.266Z","etag":null,"topics":["benchmark","libc","memchr","string","string-search","strstr","substring-search"],"latest_commit_sha":null,"homepage":"https://github.com/ashvardanian/stringzilla","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-23T22:31:57.000Z","updated_at":"2025-02-01T19:01:23.000Z","dependencies_parsed_at":"2024-02-24T07:29:00.343Z","dependency_job_id":"5f9f26c1-91a7-40f4-a33f-5373c1aa43c5","html_url":"https://github.com/ashvardanian/stringzilla-benchmarks-rs","commit_stats":null,"previous_names":["ashvardanian/memchr_vs_stringzilla","ashvardanian/stringzilla-benchmarks-rs"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fstringzilla-benchmarks-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/ashvardanian%2Fstringzilla-benchmarks-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fstringzilla-benchmarks-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fstringzilla-benchmarks-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/stringzilla-benchmarks-rs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244972259,"owners_count":20540948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","libc","memchr","string","string-search","strstr","substring-search"],"created_at":"2024-10-14T01:25:09.944Z","updated_at":"2025-03-22T14:31:47.693Z","avatar_url":"https://github.com/ashvardanian.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [`memchr`](https://github.com/BurntSushi/memchr) vs [`stringzilla`](https://github.com/ashvardanian/StringZilla)\n\n## Rust Substring Search Benchmarks\n\nSubstring search is one of the most common operations in text processing, and one of the slowest.\nStringZilla was designed to supersede LibC and implement those core operations in a CPU-friendly manner, using branchless operations, SWAR, and SIMD assembly instructions.\nNotably, Rust has a `memchr` crate that provides similar functionality and is used in many popular libraries.\nThis repository provides basic benchmarking scripts for comparing the throughput of 
[`stringzilla`](https://github.com/ashvardanian/StringZilla) and [`memchr`](https://github.com/BurntSushi/memchr).\nFor forward and reverse search over ASCII and UTF8 input data, the following throughput can be expected.\n\n|               |         ASCII ⏩ |         ASCII ⏪ |         UTF8 ⏩ |          UTF8 ⏪ |\n| ------------- | --------------: | --------------: | -------------: | --------------: |\n| Intel:        |                 |                 |                |                 |\n| `memchr`      |       5.89 GB/s |       1.08 GB/s |      8.73 GB/s |       3.35 GB/s |\n| `stringzilla` |   __8.37__ GB/s |   __8.21__ GB/s | __11.21__ GB/s |  __11.20__ GB/s |\n| Arm:          |                 |                 |                |                 |\n| `memchr`      |       6.38 GB/s |       1.12 GB/s | __13.20__ GB/s |       3.56 GB/s |\n| `stringzilla` |   __6.56__ GB/s |   __5.56__ GB/s |      9.41 GB/s |   __8.17__ GB/s |\n|               |                 |                 |                |                 |\n| Average       | __1.2x__ faster | __6.2x__ faster |              - | __2.8x__ faster |\n\n\n\u003e For Intel, the benchmark was run on AWS `r7iz` instances with Sapphire Rapids cores.\n\u003e For Arm, the benchmark was run on AWS `r7g` instances with Graviton 3 cores.\n\u003e The ⏩ signifies forward search, and ⏪ signifies reverse search.\n\u003e At the time of writing, the latest versions of `memchr` and `stringzilla` were used: 2.7.1 and 3.3.0, respectively.\n\n## Replicating the Results\n\nBefore running the benchmarks, you can test your Rust environment by running:\n\n```bash\ncargo install cargo-criterion --locked\nHAYSTACK_PATH=README.md cargo criterion --jobs 8\n```\n\nOn Windows, using PowerShell, you need to set the environment variable differently:\n\n```powershell\n$env:HAYSTACK_PATH=\"README.md\"\ncargo criterion --jobs 8\n```\n\nAs part of the benchmark, the input \"haystack\" file is whitespace-tokenized into an array of 
strings.\nIn every benchmark iteration, a new \"needle\" is taken from that array of tokens.\nAll occurrences of that token in the haystack are counted, and the throughput is calculated.\nThis generally yields stable and predictable measurements.\nThe benchmark also includes a warm-up, to ensure that the CPU caches are filled and the results are not affected by cold start or SIMD-related frequency scaling.\n\n### ASCII Corpus\n\nFor benchmarks on ASCII data, I've used the English Leipzig Corpora Collection.\nIt's 124 MB in size, 1'000'000 lines long, and contains 8'388'608 tokens of mean length 5.\n\n```bash\nwget --no-clobber -O leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt\nHAYSTACK_PATH=leipzig1M.txt cargo criterion --jobs 8\n```\n\n### UTF8 Corpus\n\nFor richer, mixed UTF8 data, I've used the XL Sum dataset for multilingual extractive summarization.\nIt's 4.7 GB in size (1.7 GB compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8.\nTo download, unpack, and run the benchmarks, execute the following bash script in your terminal:\n\n```bash\nwget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz\ngzip -d xlsum.csv.gz\nHAYSTACK_PATH=xlsum.csv cargo criterion --jobs 8\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fstringzilla-benchmarks-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fstringzilla-benchmarks-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fstringzilla-benchmarks-rs/lists"}