# [`memchr`](https://github.com/BurntSushi/memchr) vs [`stringzilla`](https://github.com/ashvardanian/StringZilla)
## Rust Substring Search Benchmarks
Substring search is one of the most common operations in text processing, and one of the slowest.
StringZilla was designed to supersede LibC and implement those core operations in a CPU-friendly manner, using branchless operations, SWAR, and SIMD assembly instructions.
Notably, Rust has a `memchr` crate that provides similar functionality, and it's used in many popular libraries.
This repository provides basic benchmarking scripts for comparing the throughput of [`stringzilla`](https://github.com/ashvardanian/StringZilla) and [`memchr`](https://github.com/BurntSushi/memchr).
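
Both crates expose simple forward and reverse search calls. Below is a minimal usage sketch: `memchr::memmem::find`/`rfind` are the `memchr` crate's substring-search entry points, while the `sz::find`/`sz::rfind` calls follow the `stringzilla` crate's Rust bindings; treat the exact module paths and signatures as assumptions and check each crate's docs for the version you install.

```rust
// Minimal usage sketch comparing the two crates' search calls.
// Requires `memchr` and `stringzilla` in Cargo.toml; the `stringzilla`
// paths below follow its Rust bindings but should be verified against
// the installed version.
use memchr::memmem;
use stringzilla::sz;

fn main() {
    let haystack = b"the quick brown fox jumps over the lazy dog";
    let needle = b"the";

    // memchr: forward and reverse substring search, returning byte offsets.
    assert_eq!(memmem::find(haystack, needle), Some(0));
    assert_eq!(memmem::rfind(haystack, needle), Some(31));

    // stringzilla: equivalent forward and reverse search.
    assert_eq!(sz::find(haystack, needle), Some(0));
    assert_eq!(sz::rfind(haystack, needle), Some(31));
}
```
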
For normal order and reverse order search, over ASCII and UTF8 input data, the following numbers can be expected.

| | ASCII ⏩ | ASCII ⏪ | UTF8 ⏩ | UTF8 ⏪ |
| ------------- | --------------: | --------------: | -------------: | --------------: |
| Intel: | | | | |
| `memchr` | 5.89 GB/s | 1.08 GB/s | 8.73 GB/s | 3.35 GB/s |
| `stringzilla` | __8.37__ GB/s | __8.21__ GB/s | __11.21__ GB/s | __11.20__ GB/s |
| Arm: | | | | |
| `memchr` | 6.38 GB/s | 1.12 GB/s | __13.20__ GB/s | 3.56 GB/s |
| `stringzilla` | __6.56__ GB/s | __5.56__ GB/s | 9.41 GB/s | __8.17__ GB/s |
| | | | | |
| Average | __1.2x__ faster | __6.2x__ faster | - | __2.8x__ faster |

> For Intel the benchmark was run on AWS `r7iz` instances with Sapphire Rapids cores.
> For Arm the benchmark was run on AWS `r7g` instances with Graviton 3 cores.
> The ⏩ signifies forward search, and ⏪ signifies reverse order search.
> At the time of writing, the latest versions of `memchr` and `stringzilla` were used - 2.7.1 and 3.3.0, respectively.

## Replicating the Results
Before running benchmarks, you can test your Rust environment by running:
```bash
cargo install cargo-criterion --locked
HAYSTACK_PATH=README.md cargo criterion --jobs 8
```

On Windows using PowerShell you'd need to set the environment variable differently:
```powershell
$env:HAYSTACK_PATH="README.md"
cargo criterion --jobs 8
```

As part of the benchmark, the input "haystack" file is whitespace-tokenized into an array of strings.
In every benchmark iteration, a new "needle" is taken from that array of tokens.
All inclusions of that token in the haystack are counted, and the throughput is calculated.
This generally results in very stable and predictable results.
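
To make the methodology concrete, here is a minimal, self-contained sketch of that counting loop. It is not the repository's actual `criterion` harness; it simply uses `memchr::memmem::find_iter` to count matches and a plain timer to show how the GB/s figure falls out.

```rust
// Illustrative sketch of the counting loop, not the actual criterion benchmark:
// whitespace-tokenize the haystack, count every inclusion of each needle,
// and derive an approximate GB/s figure. Requires the `memchr` crate.
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // The real benchmarks read the haystack path from the same env variable.
    let path = std::env::var("HAYSTACK_PATH").unwrap_or_else(|_| "README.md".into());
    let haystack = std::fs::read(&path)?;

    // Whitespace-tokenize the haystack into an array of candidate needles.
    let tokens: Vec<&[u8]> = haystack
        .split(|b| b.is_ascii_whitespace())
        .filter(|t| !t.is_empty())
        .collect();

    let start = Instant::now();
    let mut bytes_scanned = 0u64;
    for needle in tokens.iter().take(1_000) {
        // Count all inclusions of this token in the haystack.
        let _matches = memchr::memmem::find_iter(&haystack, needle).count();
        bytes_scanned += haystack.len() as u64;
    }
    let secs = start.elapsed().as_secs_f64();
    println!("~{:.2} GB/s", bytes_scanned as f64 / secs / 1e9);
    Ok(())
}
```
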
The benchmark also includes a warm-up to ensure that the CPU caches are filled and the results are not affected by a cold start or SIMD-related frequency scaling.

### ASCII Corpus
For benchmarks on ASCII data I've used the English Leipzig Corpora Collection.
It's 124 MB in size, 1'000'000 lines long, and contains 8'388'608 tokens of mean length 5.

```bash
wget --no-clobber -O leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt
HAYSTACK_PATH=leipzig1M.txt cargo criterion --jobs 8
```

### UTF8 Corpus
For richer mixed UTF8 data, I've used the XL-Sum dataset for multilingual extractive summarization.
It's 4.7 GB in size (1.7 GB compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8.
To download, unpack, and run the benchmarks, execute the following bash script in your terminal:

```bash
wget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz
gzip -d xlsum.csv.gz
HAYSTACK_PATH=xlsum.csv cargo criterion --jobs 8
```