Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/ashvardanian/StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖

beautifulsoup common-crawl csv dataset html information-retrieval json laion ndjson parser pattern-recognition simd sorting-algorithms string string-manipulation string-matching string-parsing string-search substring


# StringZilla 🦖

![StringZilla banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringZilla.png?raw=true)

The world wastes a minimum of $100M annually due to inefficient string operations.
A typical codebase processes strings character by character, resulting in too many branches and data dependencies, and neglecting 90% of a modern CPU's potential.
LibC is different.
It attempts to leverage SIMD instructions to boost some operations, and is often used by higher-level languages, runtimes, and databases.
But it isn't perfect.
1️⃣ First, even on common hardware, including over a billion 64-bit ARM CPUs, common functions like `strstr` and `memmem` only achieve 1/3 of the CPU's throughput.
2️⃣ Second, SIMD coverage is inconsistent: acceleration in forward scans does not guarantee speed in the reverse-order search.
3️⃣ Third, high-level languages can't always use LibC, as their strings are often not NULL-terminated or may contain the Unicode "Zero" character in the middle of the string.
That's why StringZilla was created: to provide predictably high performance, portable to any modern platform, operating system, and programming language.

[![StringZilla Python installs](https://static.pepy.tech/personalized-badge/stringzilla?period=total&units=abbreviation&left_color=black&right_color=blue&left_text=StringZilla%20Python%20installs)](https://github.com/ashvardanian/stringzilla)
[![StringZilla Rust installs](https://img.shields.io/crates/d/stringzilla?logo=rust&label=Rust%20installs)](https://crates.io/crates/stringzilla)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/ashvardanian/StringZilla/release.yml?branch=main&label=Ubuntu)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/ashvardanian/StringZilla/release.yml?branch=main&label=Windows)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/ashvardanian/StringZilla/release.yml?branch=main&label=MacOS)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/ashvardanian/StringZilla/release.yml?branch=main&label=Alpine%20Linux)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
![StringZilla code size](https://img.shields.io/github/languages/code-size/ashvardanian/stringzilla)

StringZilla is the GodZilla of string libraries, using [SIMD][faq-simd] and [SWAR][faq-swar] to accelerate string operations on modern CPUs.
It is up to __10x faster than the default and even other SIMD-accelerated string libraries__ in C, C++, Python, and other languages, while covering broad functionality.
It __accelerates exact and fuzzy string matching, edit distance computations, sorting, lazily-evaluated ranges to avoid memory allocations, and even random-string generators__.

[faq-simd]: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
[faq-swar]: https://en.wikipedia.org/wiki/SWAR
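
To make the SWAR idea concrete, here is a minimal pure-Python sketch of searching for a single byte eight bytes at a time. It is not StringZilla's implementation: the function name `swar_find_byte` is hypothetical, and the code relies on the classic "zero byte in a word" bit trick applied to a little-endian 64-bit load.

```python
import struct

ONES = 0x0101010101010101   # 0x01 repeated in every byte of a 64-bit word
HIGHS = 0x8080808080808080  # 0x80 repeated in every byte of a 64-bit word


def swar_find_byte(haystack: bytes, needle: int) -> int:
    """Return the index of the first byte equal to `needle` (0..255), or -1."""
    pattern = ONES * needle  # broadcast the needle into all 8 byte lanes
    aligned_end = len(haystack) - len(haystack) % 8
    for offset in range(0, aligned_end, 8):
        (word,) = struct.unpack_from("<Q", haystack, offset)  # little-endian load
        x = word ^ pattern  # lanes that match the needle become 0x00
        # Zero-byte test: the lowest set bit marks the first matching lane.
        match = (x - ONES) & ~x & HIGHS
        if match:
            lowest = match & -match
            return offset + (lowest.bit_length() - 1) // 8
    # Fewer than 8 bytes remain: fall back to a plain scan of the tail.
    return haystack.find(bytes([needle]), aligned_end)
```

One comparison now covers eight bytes and a single branch, which is the effect SWAR and SIMD backends aim for with much wider registers.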

- 🐂 __[C](#Basic-Usage-with-C-99-and-Newer) :__ Upgrade LibC's `<string.h>` to `<stringzilla.h>` in C 99
- 🐉 __[C++](#basic-usage-with-c-11-and-newer):__ Upgrade STL's `<string>` to `<sz::string>` in C++ 11
- 🐍 __[Python](#quick-start-python-🐍):__ Upgrade your `str` to faster `Str`
- 🍎 __[Swift](#quick-start-swift-🍏):__ Use the `String+StringZilla` extension
- 🦀 __[Rust](#quick-start-rust-🦀):__ Use the `StringZilla` traits crate
- 🐚 __[Shell][faq-shell]__: Accelerate common CLI tools with `sz_` prefix
- 📚 Researcher? Jump to [Algorithms & Design Decisions](#algorithms--design-decisions-📚)
- 💡 Thinking of contributing? Look for ["good first issues"][first-issues]
- 🤝 And check the [guide](https://github.com/ashvardanian/StringZilla/blob/main/CONTRIBUTING.md) to set up the environment
- Want more bindings or features? Let [me](https://github.com/ashvardanian) know!

[faq-shell]: https://github.com/ashvardanian/StringZilla/blob/main/cli/README.md
[first-issues]: https://github.com/ashvardanian/StringZilla/issues

__Who is this for?__

- For data-engineers parsing large datasets, like the [CommonCrawl](https://commoncrawl.org/), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), or [LAION](https://laion.ai/blog/laion-5b/).
- For software engineers optimizing strings in their apps and services.
- For bioinformaticians and search engineers looking for edit-distances for [USearch](https://github.com/unum-cloud/usearch).
- For [DBMS][faq-dbms] devs, optimizing `LIKE`, `ORDER BY`, and `GROUP BY` operations.
- For hardware designers, needing a SWAR baseline for strings-processing functionality.
- For students studying SIMD/SWAR applications to non-data-parallel operations.

[faq-dbms]: https://en.wikipedia.org/wiki/Database

## Performance








| Task | C | C++ | Python | StringZilla |
| :--- | :--- | :--- | :--- | :--- |
| find the first occurrence of a random word from text, ≅ 5 bytes long | `strstr` ¹<br/>x86: 7.4 · arm: 2.0 GB/s | `.find`<br/>x86: 2.9 · arm: 1.6 GB/s | `.find`<br/>x86: 1.1 · arm: 0.6 GB/s | `sz_find`<br/>x86: 10.6 · arm: 7.1 GB/s |
| find the last occurrence of a random word from text, ≅ 5 bytes long | | `.rfind`<br/>x86: 0.5 · arm: 0.4 GB/s | `.rfind`<br/>x86: 0.9 · arm: 0.5 GB/s | `sz_rfind`<br/>x86: 10.8 · arm: 6.7 GB/s |
| split lines separated by `\n` or `\r` ² | `strcspn` ¹<br/>x86: 5.42 · arm: 2.19 GB/s | `.find_first_of`<br/>x86: 0.59 · arm: 0.46 GB/s | `re.finditer`<br/>x86: 0.06 · arm: 0.02 GB/s | `sz_find_charset`<br/>x86: 4.08 · arm: 3.22 GB/s |
| find the last occurrence of any of 6 whitespaces ² | | `.find_last_of`<br/>x86: 0.25 · arm: 0.25 GB/s | | `sz_rfind_charset`<br/>x86: 0.43 · arm: 0.23 GB/s |
| Random string from a given alphabet, 20 bytes long ⁵ | `rand() % n`<br/>x86: 18.0 · arm: 9.4 MB/s | `std::uniform_int_distribution`<br/>x86: 47.2 · arm: 20.4 MB/s | `join(random.choices(...))`<br/>x86: 13.3 · arm: 5.9 MB/s | `sz_generate`<br/>x86: 56.2 · arm: 25.8 MB/s |
| Mapping Characters with Look-Up Table Transforms | | `std::transform`<br/>x86: 3.81 · arm: 2.65 GB/s | `str.translate`<br/>x86: 260.0 · arm: 140.0 MB/s | `sz_look_up_transform`<br/>x86: 21.2 · arm: 8.5 GB/s |
| Get sorted order, ≅ 8 million English words ⁶ | `qsort_r`<br/>x86: 3.55 · arm: 5.77 s | `std::sort`<br/>x86: 2.79 · arm: 4.02 s | `numpy.argsort`<br/>x86: 7.58 · arm: 13.00 s | `sz_sort`<br/>x86: 1.91 · arm: 2.37 s |
| Levenshtein edit distance, ≅ 5 bytes long | | | via `jellyfish` ³<br/>x86: 1,550 · arm: 2,220 ns | `sz_edit_distance`<br/>x86: 99 · arm: 180 ns |
| Needleman-Wunsch alignment scores, ≅ 10 K aminoacids long | | | via `biopython` ⁴<br/>x86: 257 · arm: 367 ms | `sz_alignment_score`<br/>x86: 73 · arm: 177 ms |

StringZilla has a lot of functionality, most of which is covered by benchmarks across C, C++, Python and other languages.
You can find those in the `./scripts` directory, with usage notes listed in the [`CONTRIBUTING.md`](CONTRIBUTING.md) file.
Notably, if the CPU supports misaligned loads, even the 64-bit SWAR backends are faster than either standard library.

> Most benchmarks were conducted on a 1 GB English text corpus, with an average word length of 6 characters.
> The code was compiled with GCC 12, using `glibc` v2.35.
> The benchmarks were performed on Arm-based Graviton3 AWS `c7g` instances and `r7iz` Intel Sapphire Rapids.
> Most modern Arm-based 64-bit CPUs will have similar relative speedups.
> Variance within x86 CPUs will be larger.
>
> ¹ Unlike other libraries, LibC requires strings to be NULL-terminated.
>
> ² Six whitespaces in the ASCII set are: ` \t\n\v\f\r`. Python's and other standard libraries have specialized functions for those.
>
> ³ Most Python libraries for strings are also implemented in C.
>
> ⁴ Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
>
> ⁵ All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
> The C++ STL and StringZilla benchmarks used a 64-bit [Mersenne Twister][faq-mersenne-twister] as the generator.
> For C, C++, and StringZilla, an in-place update of the string was used.
> In Python every string had to be allocated as a new object, which makes it less fair.
>
> ⁶ Contrary to popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
> That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation.
> The latter is very common in database engines and is most similar to `numpy.argsort`.
> The current StringZilla solution can be at least 4x faster without loss of generality.

[faq-mersenne-twister]: https://en.wikipedia.org/wiki/Mersenne_Twister

## Functionality

StringZilla is compatible with most modern CPUs, and provides a broad range of functionality.

- [x] works on both Little-Endian and Big-Endian architectures.
- [x] works on 32-bit and 64-bit hardware architectures.
- [x] compatible with ASCII and UTF-8 encoding.

Not all features are available across all bindings.
Consider contributing, if you need a feature that's not yet implemented.

| | Maturity | C 99 | C++ 11 | Python | Swift | Rust |
| :----------------------------- | :------: | :---: | :----: | :----: | :---: | :---: |
| Substring Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Character Set Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Edit Distances | 🧐 | ✅ | ✅ | ✅ | ✅ | ⚪ |
| Small String Class | 🧐 | ✅ | ✅ | ❌ | ❌ | ⚪ |
| Sorting & Sequence Operations | 🚧 | ✅ | ✅ | ✅ | ⚪ | ⚪ |
| Lazy Ranges, Compressed Arrays | 🧐 | ⚪ | ✅ | ✅ | ⚪ | ⚪ |
| Hashes & Fingerprints | 🚧 | ✅ | ✅ | ⚪ | ⚪ | ⚪ |

> 🌳 parts are used in production.
> 🧐 parts are in beta.
> 🚧 parts are under active development, and are likely to break in subsequent releases.
> ✅ are implemented.
> ⚪ are considered.
> ❌ are not intended.

## Quick Start: Python 🐍

Python bindings are available on PyPI, and can be installed with `pip`.
You can immediately check the installed version and the detected hardware capabilities with the following commands:

```bash
pip install stringzilla
python -c "import stringzilla; print(stringzilla.__version__)"
python -c "import stringzilla; print(stringzilla.__capabilities__)"
```

### Basic Usage

If you've ever used Python's `str`, `bytes`, `bytearray`, or `memoryview` classes, you'll know what to expect.
StringZilla's `Str` class is a hybrid of those, providing a `str`-like interface to byte arrays.

```python
from stringzilla import Str, File

text_from_str = Str('some-string') # no copies, just a view
text_from_bytes = Str(b'some-array') # no copies, just a view
text_from_file = Str(File('some-file.txt')) # memory-mapped file

import numpy as np
alphabet_array = np.arange(ord("a"), ord("z") + 1, dtype=np.uint8) # `arange` excludes the stop value, so add 1 to include "z"
text_from_array = Str(memoryview(alphabet_array))
```

The `File` class memory-maps a file from persistent storage without loading a copy of it into RAM.
The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
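
The same idea can be sketched with the standard library's `mmap` module. This is only an illustration of read-only memory mapping, not of how `File` is implemented; the temporary file and its contents are stand-ins for a real dataset.

```python
import mmap
import os
import tempfile

# Create a small stand-in file; a real dataset would already be on disk.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(b"first line\nsecond line\nthird line\n")
    path = handle.name

with open(path, "rb") as file:
    # ACCESS_READ yields an immutable mapping backed by the OS page cache,
    # so several processes mapping the same file share the same pages.
    with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        offset = mapped.find(b"second")        # search the mapping directly
        line_end = mapped.find(b"\n", offset)
        second_line = bytes(mapped[offset:line_end])  # copy only this slice

os.remove(path)
print(offset, second_line)
```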

### Basic Operations

- Length: `len(text) -> int`
- Indexing: `text[42] -> str`
- Slicing: `text[42:46] -> Str`
- Substring check: `'substring' in text -> bool`
- Hashing: `hash(text) -> int`
- String conversion: `str(text) -> str`

### Advanced Operations

```py
import sys

x: bool = text.contains('substring', start=0, end=sys.maxsize)
x: int = text.find('substring', start=0, end=sys.maxsize)
x: int = text.count('substring', start=0, end=sys.maxsize, allowoverlap=False)
x: str = text.decode(encoding='utf-8', errors='strict')
x: Strs = text.split(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.splitlines(keeplinebreaks=False, maxsplit=sys.maxsize)
```

Note that the behavior of the last function differs slightly from Python's `str.splitlines`.
The [native version][faq-splitlines] matches `\n`, `\r`, `\v` or `\x0b`, `\f` or `\x0c`, `\x1c`, `\x1d`, `\x1e`, `\x85`, `\r\n`, `\u2028`, `\u2029`, including 3x two-bytes-long runes.
The StringZilla version matches only `\n`, `\v`, `\f`, `\r`, `\x1c`, `\x1d`, `\x1e`, `\x85`, avoiding two-byte-long runes.

[faq-splitlines]: https://docs.python.org/3/library/stdtypes.html#str.splitlines

### Character Set Operations

Python strings don't natively support character set operations.
This forces people to use regular expressions, which are slow and hard to read.
To avoid the need for `re.finditer`, StringZilla provides the following interfaces:

```py
x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
x: Strs = text.split_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
```
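
For readers unfamiliar with these names, which mirror the C++ `std::string` methods, the intended semantics can be sketched in plain Python. The two helpers below are hypothetical reference implementations, and returning `-1` when nothing matches is an assumption borrowed from `str.find`.

```python
def find_first_of(text: str, chars: str) -> int:
    """Index of the first character of `text` that belongs to `chars`, or -1."""
    charset = set(chars)
    for index, character in enumerate(text):
        if character in charset:
            return index
    return -1


def find_last_not_of(text: str, chars: str) -> int:
    """Index of the last character of `text` that is NOT in `chars`, or -1."""
    charset = set(chars)
    for index in range(len(text) - 1, -1, -1):
        if text[index] not in charset:
            return index
    return -1


print(find_first_of("hello world", " \t\n"))   # position of the first whitespace
print(find_last_not_of("trailing   ", " "))    # position of the last non-space
```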

You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
This would result in a copy - `str` for `str` inputs and `bytes` for other types.

```py
x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)
```

For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary.
This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing.
Here is an example:

```py
import stringzilla as sz
look_up_table = bytes(range(256)) # Identity LUT
image = open("/image/path.jpeg", "rb").read()
sz.translate(image, look_up_table, inplace=True)
```
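
As a point of comparison, the standard library's `bytes.translate` accepts the same kind of 256-byte table. The sketch below builds a non-trivial LUT for complementing DNA bases; the `complement` helper is hypothetical and uses only built-in types.

```python
# Start from the identity table, then override the four nucleotide letters.
table = bytearray(range(256))
for source, target in zip(b"ACGT", b"TGCA"):
    table[source] = target
lut = bytes(table)  # exactly 256 entries, one per possible byte value


def complement(sequence: bytes) -> bytes:
    """Map every byte through the look-up table, returning a new object."""
    return sequence.translate(lut)


print(complement(b"GATTACA"))
```

Bytes that are not overridden, such as `N`, pass through unchanged because the table starts as the identity mapping.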

### Collection-Level Operations

Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices, with minimum memory footprint.
If all the chunks are located in consecutive memory regions, the memory overhead can be as low as 4 bytes per chunk.

```python
lines: Strs = text.split(separator='\n') # 4 bytes per line overhead for under 4 GB of text
batch: Strs = lines.sample(seed=42) # 10x faster than `random.choices`
lines.shuffle(seed=42) # or shuffle all lines in place and shard with slices
# WIP: lines.sort() # explodes to 16 bytes per line overhead for any length text
# WIP: sorted_order: tuple = lines.argsort() # similar to `numpy.argsort`
```
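
To see where a "few bytes per chunk" figure can come from, consider storing only the start offset of each line next to a single shared buffer. The following is a rough sketch of that layout using the standard `array` module, not the actual `Strs` internals; `index_lines` and `get_line` are hypothetical names, and the `"I"` type code is typically, but not always, 4 bytes wide.

```python
from array import array


def index_lines(buffer: bytes) -> array:
    """Record where each line starts, one unsigned integer per line."""
    starts = array("I", [0])
    position = buffer.find(b"\n")
    while position != -1:
        starts.append(position + 1)
        position = buffer.find(b"\n", position + 1)
    return starts


def get_line(buffer: bytes, starts: array, index: int) -> memoryview:
    """Return a zero-copy view of line `index`, without its trailing newline."""
    begin = starts[index]
    end = starts[index + 1] - 1 if index + 1 < len(starts) else len(buffer)
    return memoryview(buffer)[begin:end]


text = b"alpha\nbeta\ngamma"
starts = index_lines(text)
print(starts.itemsize, [bytes(get_line(text, starts, i)) for i in range(len(starts))])
```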

Working on [RedPajama][redpajama], addressing 20 billion annotated English documents, one will need only 160 GB of RAM instead of terabytes.
Once loaded, the data will be memory-mapped, and can be reused between multiple Python processes without copies.
And of course, you can use slices to navigate the dataset and shard it between multiple workers.

```python
lines[::3] # every third line
lines[1::2] # every odd line
lines[:-101:-1] # last 100 lines in reverse order
```

[redpajama]: https://github.com/togethercomputer/RedPajama-Data
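
A strided slice is also a simple way to shard work. The sketch below uses a plain Python list as a stand-in for a `Strs` collection and assumes its slicing follows the usual sequence semantics; the worker count is arbitrary.

```python
lines = [f"line-{i}" for i in range(10)]  # stand-in for a `Strs` collection
workers = 3

# Worker `w` takes lines w, w + workers, w + 2 * workers, and so on.
shards = [lines[w::workers] for w in range(workers)]

# Every line lands in exactly one shard.
assert sum(len(shard) for shard in shards) == len(lines)
print(shards[0])
```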

### Iterators and Memory Efficiency

Python's operations like `split()` and `readlines()` immediately materialize a `list` of copied parts.
This can be very memory-inefficient for large datasets.
StringZilla saves a lot of memory by viewing existing memory regions as substrings, but even more memory can be saved by using lazily evaluated iterators.

```py
x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.split_charset_iter(separator='chars', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_charset_iter(separator='chars', keepseparator=False)
```

StringZilla can easily be 10x more memory efficient than native Python classes for tokenization.
With lazy operations, it practically becomes free.

```py
import stringzilla as sz
%load_ext memory_profiler

text = open("enwik9.txt", "r").read() # 1 GB, mean word length 7.73 bytes
%memit text.split() # increment: 8670.12 MiB (152 ms)
%memit sz.split(text) # increment: 530.75 MiB (25 ms)
%memit sum(1 for _ in sz.split_iter(text)) # increment: 0.00 MiB
```
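
The principle behind lazy splitting can be shown with a plain generator. This hypothetical `split_iter` is a sketch only: unlike StringZilla's views, Python `str` slices are copies, but no list of all parts is ever materialized.

```python
from typing import Iterator


def split_iter(text: str, separator: str = " ") -> Iterator[str]:
    """Yield the parts of `text` one at a time instead of building a list."""
    if not separator:
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(separator, start)
        if end == -1:
            yield text[start:]  # the final part, possibly empty
            return
        yield text[start:end]
        start = end + len(separator)


# Counting tokens keeps only one part alive at any moment.
print(sum(1 for _ in split_iter("the quick brown fox")))
```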

### Low-Level Python API

Aside from calling the methods on the `Str` and `Strs` classes, you can also call the global functions directly on `str` and `bytes` instances.
Because the StringZilla CPython bindings are implemented [without any intermediate tools like SWIG or PyBind](https://ashvardanian.com/posts/pybind11-cpython-tutorial/), the call latency should be similar to native classes.

```py
import sys
import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=sys.maxsize)
offset: int = sz.find("haystack", "needle", start=0, end=sys.maxsize)
count: int = sz.count("haystack", "needle", start=0, end=sys.maxsize, allowoverlap=False)
```

### Edit Distances

```py
assert sz.edit_distance("apple", "aple") == 1 # skip one ASCII character
assert sz.edit_distance("αβγδ", "αγδ") == 2 # skip two bytes forming one rune
assert sz.edit_distance_unicode("αβγδ", "αγδ") == 1 # one unicode rune
```
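
The difference between byte-level and rune-level distances can be reproduced with a textbook Wagner-Fischer implementation. This is a reference sketch in pure Python, far slower than StringZilla and unrelated to its internals; the `levenshtein` helper is hypothetical.

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences, keeping only two rows in memory."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Comparing UTF-8 bytes counts both bytes of the missing Greek letter...
print(levenshtein("αβγδ".encode("utf-8"), "αγδ".encode("utf-8")))
# ...while comparing `str` objects counts code points.
print(levenshtein("αβγδ", "αγδ"))
```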

Several Python libraries provide edit distance computation.
Most of them are implemented in C, but are not always as fast as StringZilla.
Taking a set of 1,000 proteins, each around 10,000 characters long, and computing just 100 distances:

- [JellyFish](https://github.com/jamesturk/jellyfish): 62.3s
- [EditDistance](https://github.com/roy-ht/editdistance): 32.9s
- StringZilla: __0.8s__

Moreover, you can pass custom substitution matrices to compute the Needleman-Wunsch alignment scores.
That task is very common in bioinformatics and computational biology.
It's natively supported in BioPython, and its BLOSUM matrices can be converted to StringZilla's format.
Alternatively, you can construct an arbitrary 256 by 256 cost matrix using NumPy.
Depending on arguments, the result may be equal to the negative Levenshtein distance.

```py
import numpy as np
import stringzilla as sz

costs = np.zeros((256, 256), dtype=np.int8)
costs.fill(-1)
np.fill_diagonal(costs, 0)

a, b = "first", "second"
assert sz.alignment_score(a, b, substitution_matrix=costs, gap_score=-1) == -sz.edit_distance(a, b)
```
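
Why the score equals the negative Levenshtein distance is easier to see in a reference implementation of the Needleman-Wunsch recurrence with a linear gap penalty. The sketch below is pure Python and is not StringZilla's algorithm; `alignment_score` and `unit_cost` are hypothetical names, and the substitution matrix is replaced by a callable for brevity.

```python
def alignment_score(a, b, substitution, gap_score: int) -> int:
    """Global alignment score, maximized over all alignments of `a` and `b`."""
    previous = [j * gap_score for j in range(len(b) + 1)]
    for i, item_a in enumerate(a, start=1):
        current = [i * gap_score]
        for j, item_b in enumerate(b, start=1):
            current.append(max(
                previous[j - 1] + substitution(item_a, item_b),  # match or mismatch
                previous[j] + gap_score,                         # gap in `b`
                current[j - 1] + gap_score,                      # gap in `a`
            ))
        previous = current
    return previous[-1]


def unit_cost(x, y) -> int:
    """0 for a match and -1 for a mismatch, like the `costs` matrix above."""
    return 0 if x == y else -1


# "kitten" and "sitting" are 3 edits apart, so the score is the negated distance.
print(alignment_score("kitten", "sitting", unit_cost, gap_score=-1))
```

With every mismatch and every gap costing one point, maximizing the score is the same as minimizing the number of edits.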

Using the same proteins as for Levenshtein distance benchmarks:

- [BioPython](https://github.com/biopython/biopython): 25.8s
- StringZilla: __7.8s__

§ Example converting from BioPython to StringZilla.

```py
import numpy as np
import stringzilla as sz
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = 1
aligner.extend_gap_score = 1

# Convert the matrix to NumPy
subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
subs_reconstructed = np.zeros((256, 256), dtype=np.int8)

# Initialize all banned characters to the largest possible penalty
subs_reconstructed.fill(127)
for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
    for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
        # Scatter each 20-ish letter alphabet entry into the full 256 x 256 byte-indexed matrix
        reconstructed_row = ord(packed_row_aminoacid)
        reconstructed_column = ord(packed_column_aminoacid)
        subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]

# Let's pick two examples of tri-peptides (made of 3 aminoacids)
glutathione = "ECG" # Need to rebuild human tissue?
thyrotropin_releasing_hormone = "QHP" # Or to regulate your metabolism?

assert sz.alignment_score(
    glutathione,
    thyrotropin_releasing_hormone,
    substitution_matrix=subs_reconstructed,
    gap_score=1) == aligner.score(glutathione, thyrotropin_releasing_hormone) # Equal to 6
```

### Serialization

#### Filesystem

Similar to how `File` can be used to read a large file, other interfaces can be used to dump strings to disk faster.
The `Str` class has `write_to` to write the string to a file, and `offset_within` to obtain the integer offset of a substring view within a larger string for navigation.

```py
web_archive = Str("......")
_, end_tag, next_doc = web_archive.partition("