# StringZilla 🦖

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅

- ✅ Single-header pure C 99 implementation [docs](#quick-start-c-🛠️)
- ✅ [Direct CPython bindings](https://ashvardanian.com/posts/pybind11-cpython-tutorial/) with minimal call latency [docs](#quick-start-python-🐍)
- ✅ [SWAR](https://en.wikipedia.org/wiki/SWAR) and [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) acceleration on x86 (AVX2) and ARM (NEON)
- ✅ [Radix](https://en.wikipedia.org/wiki/Radix_sort)-like sorting faster than C++ `std::sort`
- ✅ [Memory-mapping](https://en.wikipedia.org/wiki/Memory-mapped_file) to work with larger-than-RAM datasets
- ✅ Memory-efficient compressed arrays to work with sequences
- 🔜 JavaScript bindings are on their way.

This library saved me tens of thousands of dollars pre-processing large datasets for machine learning, even on the scale of a single experiment.
So if you want to process the 6 billion images from [LAION](https://laion.ai/blog/laion-5b/), the 250 billion web pages from the [CommonCrawl](https://commoncrawl.org/), or even just a few million lines of server logs, and you are haunted by Python's `open(...).readlines()` and `str().splitlines()` taking forever, this should help 😊

## Performance

StringZilla is built on a very simple heuristic:

> If the first 4 bytes of the string are the same, the strings are likely to be equal.
> Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.

Thanks to that, it can avoid scalar code that processes one `char` at a time and instead use hyper-scalar code to achieve `memcpy`-like speeds.
__The implementation fits into a single C 99 header file__ and uses different SIMD flavors and SWAR on older platforms.
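
To illustrate the idea (not the actual SIMD kernels), here is the heuristic sketched in plain Python: pack the first four bytes of each string into a machine word, compare the words, and fall back to a full scalar comparison only when the prefixes tie. The function name and zero-padding scheme below are illustrative, not part of the library.

```python
def likely_order(a: bytes, b: bytes) -> int:
    """Illustration of the 4-byte-prefix heuristic, not the C implementation."""
    # Big-endian packing keeps the integer order identical to the
    # lexicographic order of the (zero-padded) 4-byte prefixes.
    prefix_a = int.from_bytes(a[:4].ljust(4, b'\0'), 'big')
    prefix_b = int.from_bytes(b[:4].ljust(4, b'\0'), 'big')
    if prefix_a != prefix_b:
        return -1 if prefix_a < prefix_b else 1
    # Prefixes match: only now do the full, character-by-character comparison.
    return (a > b) - (a < b)
```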

### Substring Search

| Backend \ Device | IoT | Laptop | Server |
| :----------------------- | ---------------------: | -----------------------: | ------------------------: |
| __Speed Comparison__ 🐇 | | | |
| Python `for` loop | 4 MB/s | 14 MB/s | 11 MB/s |
| C++ `for` loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
| C++ `string.find` | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
| Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
| Hyper-Scalar StringZilla | __4.3 GB/s__ | __12 GB/s__ | __12.1 GB/s__ |
| __Efficiency Metrics__ 📊 | | | |
| CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core | 2.1 - 3.3 GB/s | __11 GB/s__ | 10.5 GB/s |
| Bytes/Joule | __4.2 GB/J__ | 2 GB/J | 1.6 GB/J |

### Split, Partition, Sort, and Shuffle

Coming soon.

## Quick Start: Python 🐍

1. Install via pip: `pip install stringzilla`
2. Import the classes you need: `from stringzilla import Str, Strs, File`

### Basic Usage

StringZilla offers two mostly interchangeable core classes:

```python
from stringzilla import Str, File

text_from_str = Str('some-string')
text_from_file = Str(File('some-file.txt'))
```

The `Str` class is designed to replace long Python `str` strings and wraps our C-level API.
The `File` class, on the other hand, memory-maps a file from persistent storage without loading a copy into RAM.
The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
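
As a sketch of that pattern, each worker below re-maps the same file and counts newlines in its own byte range; the file name, shard count, and per-shard task are placeholders, not part of the StringZilla API.

```python
from multiprocessing import Pool

from stringzilla import Str, File

SHARDS = 4  # placeholder worker count

def lines_in_shard(shard: int) -> int:
    # Each worker memory-maps the same file; the OS shares the pages,
    # so no copy of the dataset lands in every process.
    text = Str(File('common_crawl_shard.txt'))  # placeholder file name
    begin = len(text) * shard // SHARDS
    end = len(text) * (shard + 1) // SHARDS
    return text.count('\n', start=begin, end=end)

if __name__ == '__main__':
    with Pool(processes=SHARDS) as pool:
        print(sum(pool.map(lines_in_shard, range(SHARDS))))
```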

### Basic Operations

- Length: `len(text) -> int`
- Indexing: `text[42] -> str`
- Slicing: `text[42:46] -> Str`
- String conversion: `str(text) -> str`
- Substring check: `'substring' in text -> bool`
- Hashing: `hash(text) -> int`
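
A minimal example of those operations, using a made-up literal:

```python
from stringzilla import Str

text = Str('Hello, StringZilla!')

assert len(text) == 19
assert text[7] == 'S'                      # indexing yields a one-character `str`
assert str(text[7:18]) == 'StringZilla'    # slicing yields a `Str`
assert 'Zilla' in text                     # substring check
assert isinstance(hash(text), int)
```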

### Advanced Operations

- `text.contains('substring', start=0, end=9223372036854775807) -> bool`
- `text.find('substring', start=0, end=9223372036854775807) -> int`
- `text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int`
- `text.splitlines(keeplinebreaks=False, separator='\n') -> Strs`
- `text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs`
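
For example, with the defaults shown above (the sample text is made up):

```python
from stringzilla import Str

text = Str('line one\nline two\nline three')

assert text.contains('two')        # `start`/`end` restrict the search to a byte range
assert text.find('two') == 14      # byte offset of the first match
assert text.count('line') == 3     # pass `allowoverlap=True` to count overlapping matches
lines = text.splitlines()          # -> `Strs` with three slices
words = text.split(separator=' ')  # -> `Strs` with the space-separated chunks
```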

### Collection-Level Operations

Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices.

```python
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
```

Need copies?

```python
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
```

Basic `list`-like operations are also supported:

```python
lines.append('Pythonic string')
lines.extend(shuffled_copy)
```

### Low-Level Python API

The StringZilla CPython bindings implement the vector-call convention for lower call overhead.

```py
import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
levenstein: int = sz.levenstein("needle", "nidl")
```

## Quick Start: C 🛠️

There is an ABI-stable C 99 interface, in case you want to integrate StringZilla into a database, an operating system, or a language runtime.

```c
#include "stringzilla.h"

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
sz_size_t substring_position = sz_find_substring(haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);

// Perform collection level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);
```

## Contributing 👾

Future development plans include:

- [x] [Replace PyBind11 with CPython](https://github.com/ashvardanian/StringZilla/issues/35), [blog](https://ashvardanian.com/posts/pybind11-cpython-tutorial/)
- [x] [Bindings for JavaScript](https://github.com/ashvardanian/StringZilla/issues/25)
- [ ] [Faster string sorting algorithm](https://github.com/ashvardanian/StringZilla/issues/45)
- [ ] [Reverse-order operations in Python](https://github.com/ashvardanian/StringZilla/issues/12)
- [ ] [Splitting with multiple separators at once](https://github.com/ashvardanian/StringZilla/issues/29)
- [ ] Splitting CSV rows into columns
- [ ] UTF-8 validation
- [ ] Arm SVE backend
- [ ] Bindings for Java and Rust

Here's how to set up your dev environment and run some tests.

### Development

CPython:

```sh
# Clean up, install, and test!
rm -rf build && pip install -e . && pytest scripts/ -s -x

# Install without dependencies
pip install -e . --no-index --no-deps
```

Node.js:

```sh
npm install && npm test
```

### Benchmarking

To benchmark on some custom file and pattern combinations:

```sh
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
```

To benchmark on synthetic data:

```sh
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
```

### Packaging

To validate packaging:

```sh
cibuildwheel --platform linux
```

### Compiling C++ Tests

```sh
cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test
```

On macOS, it's recommended to use a non-default toolchain:

```sh
# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
-DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
-DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
-DSTRINGZILLA_USE_OPENMP=1 \
-DSTRINGZILLA_BUILD_TEST=1 \
&& \
make -C ./build_release -j && ./build_release/stringzilla_test
```

## License 📜

Feel free to use the project under Apache 2.0 or the three-clause BSD license, at your preference.

---

If you like this project, you may also enjoy [USearch][usearch], [UCall][ucall], [UForm][uform], [UStore][ustore], [SimSIMD][simsimd], and [TenPack][tenpack] 🤗

[usearch]: https://github.com/unum-cloud/usearch
[ucall]: https://github.com/unum-cloud/ucall
[uform]: https://github.com/unum-cloud/uform
[ustore]: https://github.com/unum-cloud/ustore
[simsimd]: https://github.com/ashvardanian/simsimd
[tenpack]: https://github.com/ashvardanian/tenpack