{"id":31680979,"url":"https://github.com/ashvardanian/StringWars","last_synced_at":"2025-10-08T07:05:57.497Z","repository":{"id":224163012,"uuid":"762493674","full_name":"ashvardanian/StringWars","owner":"ashvardanian","description":"Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, edit-distances, sketching, and sorting across CPUs and GPUs in Rust 🦀 and Python 🐍","archived":false,"fork":false,"pushed_at":"2025-10-03T18:11:54.000Z","size":677,"stargazers_count":91,"open_issues_count":1,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-10-05T01:33:55.252Z","etag":null,"topics":["benchmark","bioinformatics","database","dataframe","levenshtein-distance","libc","memchr","polars","rapids","string","string-search","strstr","substring-search"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/stringwars-on-gpus/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-02-23T22:31:57.000Z","updated_at":"2025-10-03T18:11:23.000Z","dependencies_parsed_at":"2024-02-24T07:29:00.343Z","dependency_job_id":"5f9f26c1-91a7-40f4-a33f-5373c1aa43c5","html_url":"https://github.com/ashvardanian/StringWars","commit_stats":null,"previous_names":["ashvardanian/memchr_vs_stringzilla","ashvardanian/stringzilla-benchmarks-rs","ashvardanian/stringwa.rs","ashvardanian/stringwars
"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/ashvardanian/StringWars","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/StringWars/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278903036,"owners_count":26065786,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","bioinformatics","database","dataframe","levenshtein-distance","libc","memchr","polars","rapids","string","string-search","strstr","substring-search"],"created_at":"2025-10-08T07:02:03.495Z","updated_at":"2025-10-08T07:05:57.491Z","avatar_url":"https://github.com/ashvardanian.png","language":"Rust","funding_links":[],"cat
egories":["Rust"],"sub_categories":[],"readme":"# StringWars\n\n## Text Processing on CPUs \u0026 GPUs, in Python 🐍 \u0026 Rust 🦀\n\n![StringWars Thumbnail](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringWa.rs.jpg?raw=true)\n\nThere are many __great__ libraries for string processing!\nMostly, of course, written in Assembly, C, and C++, but some in Rust as well. 😅\n\nWhere Rust decimates C and C++ is the __simplicity__ of dependency management, making it great for benchmarking \"Systems Software\" and lining up apples-to-apples comparisons across native crates and their Python bindings.\nSo, to accelerate the development of the [`StringZilla`](https://github.com/ashvardanian/StringZilla) C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my and the community's most beloved Rust projects, like:\n\n- [`memchr`](https://github.com/BurntSushi/memchr) for substring search.\n- [`rapidfuzz`](https://github.com/rapidfuzz/rapidfuzz-rs) for edit distances.\n- [`aHash`](https://github.com/tkaitchuck/aHash) and [`crc32fast`](https://github.com/srijs/rust-crc32fast) for hashing.\n- [`aho_corasick`](https://github.com/BurntSushi/aho-corasick) and [`regex`](https://github.com/rust-lang/regex) for multi-search.\n- [`arrow`](https://github.com/apache/arrow-rs) and [`polars`](https://github.com/pola-rs/polars) for collections.\n\nOf course, the functionality of the projects is different, as are the APIs and the usage patterns.\nSo, I focus on the workloads for which StringZilla was designed and compare the throughput of the core operations.\nNotably, I also favor modern hardware with support for a wider range of SIMD instructions, like mask-equipped AVX-512 on x86 starting from the 2017 Intel Skylake-X CPUs or more recent predicated variable-length SVE and SVE2 on Arm, that aren't often supported by existing libraries and tooling.\n\n\u003e [!IMPORTANT]  \n\u003e The numbers in the tables below are 
provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method.\n\u003e Most of them were obtained on Intel Sapphire Rapids __(SPR)__ and Granite Rapids __(GNR)__ CPUs and Nvidia Hopper-based __H100__ and Blackwell-based __RTX 6000__ Pro GPUs, using Rust with the `-C target-cpu=native` optimization flag.\n\u003e To replicate the results, please refer to the [Replicating the Results](#replicating-the-results) section below.\n\n## Hash\n\nMany great hashing libraries exist in Rust, C, and C++.\nTypical top choices are `aHash`, `xxHash`, `blake3`, `gxhash`, `CityHash`, `MurmurHash`, `crc32fast`, or the native `std::hash`.\nMany of them have similar pitfalls:\n\n- They are not always documented to have a certain reproducible output and are recommended for use only for local in-memory construction of hash tables, not for serialization or network communication.\n- They don't always support streaming and require the whole input to be available in memory at once.\n- They don't always pass the SMHasher test suite, especially with `--extra` checks enabled.\n- They generally don't have a dynamic dispatch mechanism to simplify shipping of precompiled software.\n- They are rarely available for multiple programming languages.\n\nStringZilla addresses those issues and seems to provide competitive performance.\nOn an Intel Sapphire Rapids CPU, on the `xlsum.csv` dataset, the following numbers can be expected for hashing individual whitespace-delimited words and newline-delimited lines:\n\n| Library               | Bits  | Ports ¹ | Arm ² |    Short Words |      Long Lines |\n| --------------------- | :---: | :-----: | :---: | -------------: | --------------: |\n| Rust 🦀                |       |         |       |                |                 |\n| `std::hash`           |  64   |    ❌    |   ✅   |     0.43 GiB/s |      3.74 GiB/s |\n| `crc32fast::hash`     |  32   |    ✅    |   ✅   |     0.49 GiB/s |      8.45 GiB/s |\n| `xxh3::xxh3_64`       |  64 
  |    ✅    |   ✅   |     1.08 GiB/s |      9.48 GiB/s |\n| `aHash::hash_one`     |  64   |    ❌    |   ✅   |     1.23 GiB/s |      8.61 GiB/s |\n| `foldhash::hash_one`  |  64   |    ❌    |   ✅   |     1.02 GiB/s |      8.24 GiB/s |\n| `gxhash::gxhash64`    |  64   |    ❌    |   ❌   |     2.68 GiB/s |      9.19 GiB/s |\n| `stringzilla::hash`   |  64   |    ✅    |   ✅   | __1.84 GiB/s__ | __11.38 GiB/s__ |\n|                       |       |         |       |                |                 |\n| Python 🐍              |       |         |       |                |                 |\n| `hash`                | 32/64 |    ❌    |   ✅   |     0.13 GiB/s |      4.27 GiB/s |\n| `xxhash.xxh3_64`      |  64   |    ✅    |   ✅   |     0.04 GiB/s |      6.38 GiB/s |\n| `google_crc32c.value` |  32   |    ✅    |   ✅   |     0.04 GiB/s |      5.96 GiB/s |\n| `mmh3.hash32`         |  32   |    ✅    |   ✅   |     0.05 GiB/s |      2.65 GiB/s |\n| `mmh3.hash64`         |  64   |    ✅    |   ✅   |     0.03 GiB/s |      4.45 GiB/s |\n| `cityhash.CityHash64` |  64   |    ✅    |   ❌   |     0.06 GiB/s |      4.87 GiB/s |\n| `stringzilla.hash`    |  64   |    ✅    |   ✅   | __0.14 GiB/s__ |  __9.19 GiB/s__ |\n\n\n\u003e ¹ Portability means availability in multiple other programming languages, like C, C++, Python, Java, Go, JavaScript, etc.\n\u003e ² Most hash functions work on both x86 and Arm, as well as many other CPU architectures, but gxHash and many MurmurHash and CityHash implementations don't.\n\nIn larger systems, however, we often need the ability to incrementally hash the data.\nThis is especially important in distributed systems, where the data is too large to fit into memory at once.\n\n| Library                    | Bits  | Ports ¹ |    Short Words |      Long Lines |\n| -------------------------- | :---: | :-----: | -------------: | --------------: |\n| Rust 🦀                     |       |         |                |                 |\n| `std::hash::DefaultHasher` |  64   |    ❌    |     0.51 GiB/s |      3.92 
GiB/s |\n| `aHash::AHasher`           |  64   |    ❌    | __1.30 GiB/s__ |      8.56 GiB/s |\n| `foldhash::FoldHasher`     |  64   |    ❌    |     1.27 GiB/s |      8.18 GiB/s |\n| `crc32fast::Hasher`        |  32   |    ✅    |     0.37 GiB/s |      8.39 GiB/s |\n| `stringzilla::Hasher`      |  64   |    ✅    |     0.89 GiB/s | __11.03 GiB/s__ |\n|                            |       |         |                |                 |\n| Python 🐍                   |       |         |                |                 |\n| `xxhash.xxh3_64`           |  64   |    ✅    |     0.09 GiB/s |       7.09 GB/s |\n| `google_crc32c.Checksum`   |  32   |    ✅    |     0.04 GiB/s |      5.96 GiB/s |\n| `stringzilla.Hasher`       |  64   |    ✅    | __0.35 GiB/s__ |   __6.04 GB/s__ |\n\nFor reference, one may want to put those numbers next to check-sum calculation speeds on one end of complexity and cryptographic hashing speeds on the other end.\n\n| Library                | Bits  | Ports ¹ | Short Words |  Long Lines |\n| ---------------------- | :---: | :-----: | ----------: | ----------: |\n| Rust 🦀                 |       |         |             |             |\n| `stringzilla::bytesum` |  64   |    ✅    |  2.16 GiB/s | 11.65 GiB/s |\n| `blake3::hash`         |  256  |    ✅    |  0.10 GiB/s |  1.97 GiB/s |\n|                        |       |         |             |             |\n| Python 🐍               |       |         |             |             |\n| `stringzilla.bytesum`  |  64   |    ✅    |  0.16 GiB/s |  8.62 GiB/s |\n| `blake3.digest`        |  256  |    ✅    |  0.02 GiB/s |  1.82 GiB/s |\n\n\n## Substring Search\n\nSubstring search is one of the most common operations in text processing, and one of the slowest.\nMost of the time, programmers don't think about replacing the `str::find` method, as it's already expected to be optimized.\nIn many languages it's offloaded to the C standard library [`memmem`](https://man7.org/linux/man-pages/man3/memmem.3.html) or 
[`strstr`](https://en.cppreference.com/w/c/string/byte/strstr) for `NULL`-terminated strings.\nThe C standard library is, however, also implemented by humans, and a better solution can be created.\n\n| Library             | Short Word Queries | Long Line Queries |\n| ------------------- | -----------------: | ----------------: |\n| Rust 🦀              |                    |                   |\n| `std::str::find`    |         9.45 GiB/s |       10.88 GiB/s |\n| `memmem::find`      |         9.48 GiB/s |       10.83 GiB/s |\n| `memmem::Finder`    |         9.51 GiB/s |   __10.99 GiB/s__ |\n| `stringzilla::find` |    __10.51 GiB/s__ |       10.82 GiB/s |\n|                     |                    |                   |\n| Python 🐍            |                    |                   |\n| `str.find`          |         1.05 GiB/s |        1.23 GiB/s |\n| `stringzilla.find`  |    __10.82 GiB/s__ |   __11.79 GiB/s__ |\n\nInterestingly, the reverse order search is almost never implemented in SIMD, assuming fewer people ever need it.\nStill, those are provided by StringZilla mostly for parsing tasks and feature parity.\n\n| Library              | Short Word Queries | Long Line Queries |\n| -------------------- | -----------------: | ----------------: |\n| Rust 🦀               |                    |                   |\n| `std::str::rfind`    |         2.72 GiB/s |        5.94 GiB/s |\n| `memmem::rfind`      |         2.70 GiB/s |        5.90 GiB/s |\n| `memmem::FinderRev`  |         2.79 GiB/s |        5.81 GiB/s |\n| `stringzilla::rfind` |    __10.34 GiB/s__ |   __10.66 GiB/s__ |\n|                      |                    |                   |\n| Python 🐍             |                    |                   |\n| `str.rfind`          |         1.54 GiB/s |        3.84 GiB/s |\n| `stringzilla.rfind`  |     __7.15 GiB/s__ |   __11.56 GiB/s__ |\n\n\n## Byte-Set Search\n\nStringWars takes a few representative examples of various character sets that appear in real parsing or 
string validation tasks:\n\n- tabulation characters, like `\\n\\r\\v\\f`;\n- HTML and XML markup characters, like `\u003c/\u003e\u0026'\\\"=[]`;\n- numeric characters, like `0123456789`.\n\nIt's common in such cases to pre-construct some library-specific filter-object or Finite State Machine (FSM) to search for a set of characters.\nOnce that object is constructed, all occurrences of the set's characters in each token (word or line) are counted.\nCurrent numbers should look like this:\n\n| Library                         |    Short Words |     Long Lines |\n| ------------------------------- | -------------: | -------------: |\n| Rust 🦀                          |                |                |\n| `bstr::iter`                    |     0.26 GiB/s |     0.25 GiB/s |\n| `regex::find_iter`              |     0.23 GiB/s |     5.22 GiB/s |\n| `aho_corasick::find_iter`       |     0.41 GiB/s |     0.50 GiB/s |\n| `stringzilla::find_byteset`     | __1.61 GiB/s__ | __8.17 GiB/s__ |\n|                                 |                |                |\n| Python 🐍                        |                |                |\n| `re.finditer`                   |     0.04 GiB/s |     0.19 GiB/s |\n| `stringzilla.Str.find_first_of` | __0.11 GiB/s__ | __8.79 GiB/s__ |\n\n## Sequence Operations\n\nRust has several Dataframe libraries, DBMS and Search engines that heavily rely on string sorting and intersections.\nThose operations are mostly implemented using conventional algorithms:\n\n- Comparison-based Quicksort or Mergesort for sorting.\n- Hash-based or Tree-based algorithms for intersections.\n\nAssuming both the comparisons and the hash functions can be accelerated with SIMD, StringZilla could already provide a performance boost in such applications, but starting with v4 it also provides specialized algorithms for sorting and intersections.\nThose are directly compatible with arbitrary string-comparable collection types that support indexed access to the elements.\n\n| Library       
                              |               Short Words |              Long Lines |\n| ------------------------------------------- | ------------------------: | ----------------------: |\n| Rust 🦀                                      |                           |                         |\n| `std::sort_unstable_by_key`                 |        54.35 M compares/s |      57.70 M compares/s |\n| `rayon::par_sort_unstable_by_key` on 1x SPR |        47.08 M compares/s |      50.35 M compares/s |\n| `polars::Series::sort`                      |       200.34 M compares/s |      65.44 M compares/s |\n| `polars::Series::arg_sort`                  |        25.01 M compares/s |      14.05 M compares/s |\n| `arrow::lexsort_to_indices`                 |       122.20 M compares/s |  __84.73 M compares/s__ |\n| `stringzilla::argsort_permutation`          |   __213.73 M compares/s__ |      74.64 M compares/s |\n|                                             |                           |                         |\n| Python 🐍                                    |                           |                         |\n| `list.sort` on 1x SPR                       |        47.06 M compares/s |      22.36 M compares/s |\n| `pandas.Series.sort_values` on 1x SPR       |         9.39 M compares/s |      11.93 M compares/s |\n| `pyarrow.compute.sort_indices` on 1x SPR    |        62.17 M compares/s |       5.53 M compares/s |\n| `polars.Series.sort` on 1x SPR              |       223.38 M compares/s | __181.60 M compares/s__ |\n| `cudf.Series.sort_values` on H100           | __9'463.59 M compares/s__ |      66.44 M compares/s |\n| `stringzilla.Strs.sorted` on 1x SPR         |       171.13 M compares/s |      77.88 M compares/s |\n\n## Random Generation \u0026 Lookup Tables\n\nSome of the most common operations in data processing are random generation and lookup tables.\nThat's true not only for strings but for any data type, and StringZilla has been extensively used in Image Processing and 
Bioinformatics for those purposes.\nGenerating random byte-streams:\n\n| Library                        |    Short Words |      Long Lines |\n| ------------------------------ | -------------: | --------------: |\n| Rust 🦀                         |                |                 |\n| `getrandom::fill`              |     0.18 GiB/s |      0.45 GiB/s |\n| `rand_chacha::ChaCha20Rng`     |     0.62 GiB/s |      1.85 GiB/s |\n| `rand_xoshiro::Xoshiro128Plus` |     0.83 GiB/s |      3.85 GiB/s |\n| `zeroize::zeroize`             |     0.66 GiB/s |      4.73 GiB/s |\n| `stringzilla::fill_random`     | __2.47 GiB/s__ | __10.57 GiB/s__ |\n|                                |                |                 |\n| Python 🐍                       |                |                 |\n| `numpy.PCG64`                  |     0.01 GiB/s |      1.28 GiB/s |\n| `numpy.Philox`                 |     0.01 GiB/s |      1.59 GiB/s |\n| `pycryptodome.AES-CTR`         |     0.01 GiB/s |     13.16 GiB/s |\n| `stringzilla.random`           | __0.11 GiB/s__ | __20.37 GiB/s__ |\n\nPerforming in-place lookups in a precomputed table of 256 bytes:\n\n| Library                         |    Short Words |     Long Lines |\n| ------------------------------- | -------------: | -------------: |\n| Rust 🦀                          |                |                |\n| serial code                     | __0.61 GiB/s__ |     1.49 GiB/s |\n| `stringzilla::lookup_inplace`   |     0.54 GiB/s | __9.90 GiB/s__ |\n|                                 |                |                |\n| Python 🐍                        |                |                |\n| `bytes.translate`               |     0.05 GiB/s |     1.92 GiB/s |\n| `numpy.take`                    |     0.01 GiB/s |     0.85 GiB/s |\n| `opencv.LUT`                    |     0.01 GiB/s |     1.95 GiB/s |\n| `opencv.LUT` inplace            |     0.01 GiB/s |     2.16 GiB/s |\n| `stringzilla.translate`         |     0.07 GiB/s |     7.92 GiB/s |\n| 
`stringzilla.translate` inplace | __0.06 GiB/s__ | __8.14 GiB/s__ |\n\n\n## Similarities Scoring\n\nEdit Distance calculation is a common component of Search Engines, Data Cleaning, and Natural Language Processing, as well as Bioinformatics.\nIt's a computationally expensive operation, generally implemented using dynamic programming, with a quadratic time complexity upper bound.\nFor biological sequences, the Needleman-Wunsch and Smith-Waterman algorithms are more appropriate, as they allow overriding the default substitution costs.\nEach of those has two flavors - with linear gap penalties and with affine gap penalties, the latter also known as the \"Gotoh\" variation.\n\n- byte-level and unicode [Levenshtein](#levenshtein) distance;\n- [Needleman-Wunsch](#needleman-wunsch), [Needleman-Wunsch-Gotoh](#needleman-wunsch-gotoh);\n- [Smith-Waterman](#smith-waterman), [Smith-Waterman-Gotoh](#smith-waterman-gotoh).\n\n### Levenshtein\n\n| Library                                              | ≅ 100 bytes lines | ≅ 1'000 bytes lines |\n| ---------------------------------------------------- | ----------------: | ------------------: |\n| Rust 🦀                                               |                   |                     |\n| `bio::levenshtein` on 1x SPR                         |         428 MCUPS |           823 MCUPS |\n| `rapidfuzz::levenshtein\u003cBytes\u003e` on 1x SPR            |       4'633 MCUPS |        14'316 MCUPS |\n| `rapidfuzz::levenshtein\u003cChars\u003e` on 1x SPR            |       3'877 MCUPS |        13'179 MCUPS |\n| `stringzillas::LevenshteinDistances` on 1x SPR       |       3'315 MCUPS |        13'084 MCUPS |\n| `stringzillas::LevenshteinDistancesUtf8` on 1x SPR   |       3'283 MCUPS |        11'690 MCUPS |\n| `stringzillas::LevenshteinDistances` on 16x SPR      |      29'430 MCUPS |       105'400 MCUPS |\n| `stringzillas::LevenshteinDistancesUtf8` on 16x SPR  |      38'954 MCUPS |       103'500 MCUPS |\n| `stringzillas::LevenshteinDistances` on RTX6000      |  __32'030 MCUPS__ |   
__901'990 MCUPS__ |\n| `stringzillas::LevenshteinDistances` on H100         |  __31'913 MCUPS__ |   __925'890 MCUPS__ |\n| `stringzillas::LevenshteinDistances` on 384x GNR     | __114'190 MCUPS__ | __3'084'270 MCUPS__ |\n| `stringzillas::LevenshteinDistancesUtf8` on 384x GNR | __103'590 MCUPS__ | __2'938'320 MCUPS__ |\n|                                                      |                   |                     |\n| Python 🐍                                             |                   |                     |\n| `nltk.edit_distance`                                 |           2 MCUPS |             2 MCUPS |\n| `jellyfish.levenshtein_distance`                     |          81 MCUPS |           228 MCUPS |\n| `rapidfuzz.Levenshtein.distance`                     |         108 MCUPS |         9'272 MCUPS |\n| `editdistance.eval`                                  |          89 MCUPS |           660 MCUPS |\n| `edlib.align`                                        |          82 MCUPS |         7'262 MCUPS |\n| `polyleven.levenshtein`                              |          89 MCUPS |         3'887 MCUPS |\n| `stringzillas.LevenshteinDistances` on 1x SPR        |          53 MCUPS |         3'407 MCUPS |\n| `stringzillas.LevenshteinDistancesUTF8` on 1x SPR    |          57 MCUPS |         3'693 MCUPS |\n| `cudf.edit_distance` batch on H100                   |      24'754 MCUPS |         6'976 MCUPS |\n| `stringzillas.LevenshteinDistances` batch on 1x SPR  |       2'343 MCUPS |        12'141 MCUPS |\n| `stringzillas.LevenshteinDistances` batch on 16x SPR |       3'762 MCUPS |       119'261 MCUPS |\n| `stringzillas.LevenshteinDistances` batch on H100    |  __18'081 MCUPS__ |   __320'109 MCUPS__ |\n\n### Needleman-Wunsch\n\n| Library                                               | ≅ 100 bytes lines | ≅ 1'000 bytes lines |\n| ----------------------------------------------------- | ----------------: | ------------------: |\n| Rust 🦀                                           
     |                   |                     |\n| `bio::pairwise::global` on 1x SPR                     |          51 MCUPS |            57 MCUPS |\n| `stringzillas::NeedlemanWunschScores` on 1x SPR       |         278 MCUPS |           612 MCUPS |\n| `stringzillas::NeedlemanWunschScores` on 16x SPR      |       4'057 MCUPS |         8'492 MCUPS |\n| `stringzillas::NeedlemanWunschScores` on 384x GNR     |  __64'290 MCUPS__ |   __331'340 MCUPS__ |\n| `stringzillas::NeedlemanWunschScores` on H100         |         131 MCUPS |    __12'113 MCUPS__ |\n|                                                       |                   |                     |\n| Python 🐍                                              |                   |                     |\n| `biopython.PairwiseAligner.score` on 1x SPR           |          95 MCUPS |           557 MCUPS |\n| `stringzillas.NeedlemanWunschScores` on 1x SPR        |          30 MCUPS |           481 MCUPS |\n| `stringzillas.NeedlemanWunschScores` batch on 1x SPR  |         246 MCUPS |           570 MCUPS |\n| `stringzillas.NeedlemanWunschScores` batch on 16x SPR |       3'103 MCUPS |         9'208 MCUPS |\n| `stringzillas.NeedlemanWunschScores` batch on H100    |         127 MCUPS |        12'246 MCUPS |\n\n### Smith-Waterman\n\n| Library                                             | ≅ 100 bytes lines | ≅ 1'000 bytes lines |\n| --------------------------------------------------- | ----------------: | ------------------: |\n| Rust 🦀                                              |                   |                     |\n| `bio::pairwise::local` on 1x SPR                    |          49 MCUPS |            50 MCUPS |\n| `stringzillas::SmithWatermanScores` on 1x SPR       |         263 MCUPS |           552 MCUPS |\n| `stringzillas::SmithWatermanScores` on 16x SPR      |       3'883 MCUPS |         8'011 MCUPS |\n| `stringzillas::SmithWatermanScores` on 384x GNR     |  __58'880 MCUPS__ |   __285'480 MCUPS__ |\n| 
`stringzillas::SmithWatermanScores` on H100         |         143 MCUPS |    __12'921 MCUPS__ |\n|                                                     |                   |                     |\n| Python 🐍                                            |                   |                     |\n| `biopython.PairwiseAligner.score` on 1x SPR         |          95 MCUPS |           557 MCUPS |\n| `stringzillas.SmithWatermanScores` on 1x SPR        |          28 MCUPS |           440 MCUPS |\n| `stringzillas.SmithWatermanScores` batch on 1x SPR  |         255 MCUPS |           582 MCUPS |\n| `stringzillas.SmithWatermanScores` batch on 16x SPR |   __3'535 MCUPS__ |         8'235 MCUPS |\n| `stringzillas.SmithWatermanScores` batch on H100    |         130 MCUPS |    __12'702 MCUPS__ |\n\n### Needleman-Wunsch-Gotoh\n\n| Library                                           | ≅ 100 bytes lines | ≅ 1'000 bytes lines |\n| ------------------------------------------------- | ----------------: | ------------------: |\n| Rust 🦀                                            |                   |                     |\n| `stringzillas::NeedlemanWunschScores` on 1x SPR   |          83 MCUPS |           354 MCUPS |\n| `stringzillas::NeedlemanWunschScores` on 16x SPR  |       1'267 MCUPS |         4'694 MCUPS |\n| `stringzillas::NeedlemanWunschScores` on 384x GNR |  __42'050 MCUPS__ |   __155'920 MCUPS__ |\n| `stringzillas::NeedlemanWunschScores` on H100     |         128 MCUPS |    __13'799 MCUPS__ |\n\n### Smith-Waterman-Gotoh\n\n| Library                                         | ≅ 100 bytes lines | ≅ 1'000 bytes lines |\n| ----------------------------------------------- | ----------------: | ------------------: |\n| Rust 🦀                                          |                   |                     |\n| `stringzillas::SmithWatermanScores` on 1x SPR   |          79 MCUPS |           284 MCUPS |\n| `stringzillas::SmithWatermanScores` on 16x SPR  |       1'026 MCUPS |         3'776 
 MCUPS |\n| `stringzillas::SmithWatermanScores` on 384x GNR |  __38'430 MCUPS__ |   __129'140 MCUPS__ |\n| `stringzillas::SmithWatermanScores` on H100     |         127 MCUPS |    __13'205 MCUPS__ |\n\n## Byte-level Fingerprinting \u0026 Sketching Benchmarks\n\nIn large-scale Retrieval workloads a common technique is to convert variable-length messy strings into some fixed-length representations.\nThose are often called \"fingerprints\" or \"sketches\", like \"Min-Hashing\" or \"Count-Min-Sketching\".\nThere are a million variations of those algorithms, all resulting in different speed-vs-accuracy tradeoffs.\nTwo of the approximations worth considering are the number of collisions among the individual hashes produced within fingerprints, and the bit-distribution entropy of the produced fingerprints.\nAdjusting all implementations to the same tokenization scheme, one may expect the following numbers:\n\n| Library                                    | ≅ 100 bytes lines | ≅ 1'000 bytes lines |\n| ------------------------------------------ | ----------------: | ------------------: |\n| serial `\u003cByteGrams\u003e` on 1x SPR 🦀           |        0.44 MiB/s |          0.47 MiB/s |\n|                                            | 92.81% collisions |   94.58% collisions |\n|                                            |    0.8528 entropy |      0.7979 entropy |\n|                                            |                   |                     |\n| `pc::MinHash\u003cByteGrams\u003e` on 1x SPR 🦀       |        2.41 MiB/s |          3.16 MiB/s |\n|                                            | 91.80% collisions |   93.17% collisions |\n|                                            |    0.9343 entropy |      0.8779 entropy |\n|                                            |                   |                     |\n| `stringzillas::Fingerprints` on 1x SPR 🦀   |        0.56 MiB/s |          0.51 MiB/s |\n| `stringzillas::Fingerprints` on 16x SPR 🦀  |        6.62 MiB/s |          8.03 
MiB/s |\n| `stringzillas::Fingerprints` on 384x GNR 🦀 |  __231.13 MiB/s__ |    __302.30 MiB/s__ |\n| `stringzillas::Fingerprints` on RTX6000 🦀  |     __138 MiB/s__ |        162.99 MiB/s |\n| `stringzillas::Fingerprints` on H100 🦀     |      102.07 MiB/s |    __392.37 MiB/s__ |\n|                                            | 86.80% collisions |   93.21% collisions |\n|                                            |    0.9992 entropy |      0.9967 entropy |\n\n## Replicating the Results\n\n### Replicating the Results in Rust 🦀\n\nBefore running benchmarks, you can test your Rust environment running:\n\n```bash\ncargo install cargo-criterion --locked\n```\n\nTo pull and compile all the dependencies, you can call:\n\n```bash\ncargo build --all-features                  # to compile everything\ncargo check --all-features --all-targets    # to fail on warnings\n```\n\nBy default StringWars links `stringzilla` in CPU mode.\nIf the machine has an NVIDIA GPU with CUDA installed, enable the CUDA kernels explicitly when running benches, for example:\n\n```bash\nRUSTFLAGS=\"-C target-cpu=native\" \\\n    STRINGWARS_DATASET=README.md \\\n    STRINGWARS_TOKENS=lines \\\n    STRINGWARS_FILTER=GPU \\\n    cargo criterion --features \"cuda bench_similarities\" bench_similarities --jobs 1\n```\n\nWars always take long, and so do these benchmarks.\nEvery one of them includes a few seconds of a warm-up phase to ensure that the CPU caches are filled and the results are not affected by cold start or SIMD-related frequency scaling.\nEach of them accepts a few environment variables to control the dataset, the tokenization, and the error bounds.\nYou can log those by printing file-level documentation using `awk` on Linux:\n\n```bash\nawk '/^\\/\\/!/ { print } !/^\\/\\/!/ { exit }' bench_find.rs\n```\n\nCommonly used environment variables are:\n\n- `STRINGWARS_DATASET` - the path to the textual dataset file.\n- `STRINGWARS_TOKENS` - the tokenization mode: `file`, `lines`, or `words`.\n- 
`STRINGWARS_ERROR_BOUND` - the maximum allowed error in the Levenshtein distance.\n\nHere is an example of a common benchmark run on a Unix-like system:\n\n```bash\nRUSTFLAGS=\"-C target-cpu=native\" \\\n    STRINGWARS_DATASET=README.md \\\n    STRINGWARS_TOKENS=lines \\\n    cargo criterion --features bench_hash bench_hash --jobs $(nproc)\n```\n\nOn Windows, using PowerShell, you'd need to set the environment variable differently, and since `nproc` is not available there, the core count is taken from `NUMBER_OF_PROCESSORS`:\n\n```powershell\n$env:STRINGWARS_DATASET=\"README.md\"\ncargo criterion --jobs $env:NUMBER_OF_PROCESSORS\n```\n\n### Replicating the Results in Python 🐍\n\nIt's recommended to use `uv` for Python dependency management and running the benchmarks.\nTo install all dependencies for all benchmarks:\n\n```sh\nuv venv --python 3.12\nuv pip install -r requirements.txt -r requirements-cuda.txt\nuv pip install --only-binary=:all: -r requirements.txt -r requirements-cuda.txt\n```\n\nTo install dependencies for individual benchmarks:\n\n```sh\nPIP_EXTRA_INDEX_URL=https://pypi.nvidia.com \\\nuv pip install '.[find,hash,sequence,fingerprints,similarities]'\n```\n\nTo run individual benchmarks, you can call:\n\n```sh\nuv run --no-project python bench_hash.py --help\nuv run --no-project python bench_find.py --help\nuv run --no-project python bench_memory.py --help\nuv run --no-project python bench_sequence.py --help\nuv run --no-project python bench_similarities.py --help\nuv run --no-project python bench_fingerprints.py --help 🔜\n```\n\n## Datasets\n\n### ASCII Corpus\n\nFor benchmarks on ASCII data I've used the English Leipzig Corpora Collection.\nIt's 124 MB in size, 1'000'000 lines long, and contains 8'388'608 tokens of mean length 5.\n\n```bash\nwget --no-clobber -O leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt\nSTRINGWARS_DATASET=leipzig1M.txt cargo criterion --jobs $(nproc)\n```\n\n### UTF8 Corpus\n\nFor richer mixed UTF-8 data, I've used the XL Sum dataset for multilingual extractive summarization.\nIt's 4.7 GB in size (1.7 GB 
compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8.\nTo download, unpack, and run the benchmarks, execute the following bash script in your terminal:\n\n```bash\nwget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz\ngzip -d xlsum.csv.gz\nSTRINGWARS_DATASET=xlsum.csv cargo criterion --jobs $(nproc)\n```\n\n### DNA Corpus\n\nFor bioinformatics workloads, I use the following datasets with increasing string lengths:\n\n```bash\nwget --no-clobber -O acgt_100.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_100.txt?download=true\nwget --no-clobber -O acgt_1k.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1k.txt?download=true\nwget --no-clobber -O acgt_10k.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10k.txt?download=true\nwget --no-clobber -O acgt_100k.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_100k.txt?download=true\nwget --no-clobber -O acgt_1m.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1m.txt?download=true\nwget --no-clobber -O acgt_10m.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10m.txt?download=true\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2FStringWars","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2FStringWars","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2FStringWars/lists"}