Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ashvardanian/StringZilla
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖
beautifulsoup common-crawl csv dataset html information-retrieval json laion ndjson parser pattern-recognition simd sorting-algorithms string string-manipulation string-matching string-parsing string-search substring
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/ashvardanian/StringZilla
- Owner: ashvardanian
- License: apache-2.0
- Created: 2020-08-14T20:15:16.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-04-14T14:45:30.000Z (2 months ago)
- Last Synced: 2024-04-14T16:33:45.697Z (2 months ago)
- Topics: beautifulsoup, common-crawl, csv, dataset, html, information-retrieval, json, laion, ndjson, parser, pattern-recognition, simd, sorting-algorithms, string, string-manipulation, string-matching, string-parsing, string-search, substring
- Language: C++
- Homepage: https://ashvardanian.com/posts/stringzilla/
- Size: 7.91 MB
- Stars: 1,749
- Watchers: 17
- Forks: 51
- Open Issues: 23
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Lists
- awesome-cpp - StringZilla - the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower". [Apache-2.0] (Miscellaneous)
- awesome-password-cracking - StringZilla - Fastest string sort, search, split, and shuffle for long strings and multi-gigabyte files in Python and C. (Wordlist tools / Generation/Manipulation)
- fucking-awesome-cpp - StringZilla - the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower". [Apache-2.0] (Miscellaneous)
- awesome-github-repos - ashvardanian/StringZilla - Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, et (C++)
- awesome-rust - ashvardanian/stringzilla - SIMD-accelerated string search, sort, edit distances, alignments, and generators for x86 AVX2 & AVX-512, and Arm NEON [![crates.io](https://img.shields.io/crates/v/stringzilla.svg)](https://crates.io/crates/stringzilla) Stars:`1.8K`. (Applications / Text processing)
- awesome-rust - ashvardanian/stringzilla - accelerated string search, sort, edit distances, alignments, and generators for x86 AVX2 & AVX-512, and Arm NEON [![crates.io](https://img.shields.io/crates/v/stringzilla.svg)](https://crates.io/crates/stringzilla) (Applications / Text processing)
- awesome-simd - StringZilla - C: Substring search, edit-distances, sorting, fuzzy matching, etc. (Parsing)
- awesome-stars - ashvardanian/StringZilla - `★1860` Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖 (C++)
README
# StringZilla 🦖
StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅
- ✅ Single-header pure C 99 implementation [docs](#quick-start-c-🛠️)
- ✅ [Direct CPython bindings](https://ashvardanian.com/posts/pybind11-cpython-tutorial/) with minimal call latency [docs](#quick-start-python-🐍)
- ✅ [SWAR](https://en.wikipedia.org/wiki/SWAR) and [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) acceleration on x86 (AVX2) and ARM (NEON)
- ✅ [Radix](https://en.wikipedia.org/wiki/Radix_sort)-like sorting faster than C++ `std::sort`
- ✅ [Memory-mapping](https://en.wikipedia.org/wiki/Memory-mapped_file) to work with larger-than-RAM datasets
- ✅ Memory-efficient compressed arrays to work with sequences
- 🔜 JavaScript bindings are on their way.

This library saved me tens of thousands of dollars pre-processing large datasets for machine learning, even on the scale of a single experiment.
So if you want to process the 6 billion images from [LAION](https://laion.ai/blog/laion-5b/), or the 250 billion web pages from the [CommonCrawl](https://commoncrawl.org/), or even just a few million lines of server logs, and you are haunted by Python's `open(...).readlines()` and `str().splitlines()` taking forever, this should help 😊

## Performance
StringZilla is built on a very simple heuristic:
> If the first 4 bytes of the string are the same, the strings are likely to be equal.
> Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.

Thanks to that, it can avoid scalar code processing one `char` at a time and use hyper-scalar code to achieve `memcpy` speeds.
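
To make the heuristic concrete, here is a minimal pure-Python sketch. It only illustrates the idea and is not StringZilla's C implementation; the helper names `prefix_key` and `compare_with_prefix` are hypothetical. The first 4 bytes are packed into one big-endian integer, so a single integer comparison usually settles the order, and the full comparison runs only when the prefixes tie.

```python
from functools import cmp_to_key


def prefix_key(data: bytes) -> int:
    """Pack the first 4 bytes into one big-endian integer (hypothetical helper)."""
    # Zero-padding lets inputs shorter than 4 bytes take part in the comparison.
    return int.from_bytes(data[:4].ljust(4, b"\x00"), "big")


def compare_with_prefix(a: bytes, b: bytes) -> int:
    """Three-way comparison that inspects the 4-byte prefixes first."""
    key_a, key_b = prefix_key(a), prefix_key(b)
    if key_a != key_b:
        # A single integer comparison settles the relative order.
        return -1 if key_a < key_b else 1
    # The prefixes tie, so fall back to a full byte-wise comparison.
    return (a > b) - (a < b)


words = [b"tokyo", b"tower", b"to", b"godzilla", b"tokyo-tower", b"", b"to\x00"]
sorted_words = sorted(words, key=cmp_to_key(compare_with_prefix))
```

In Python this is slower than the built-in comparison; the point is the control flow, which in C maps to one 32-bit load and compare per string instead of a byte-by-byte loop.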
__The implementation fits into a single C 99 header file__ and uses different SIMD flavors and SWAR on older platforms.

### Substring Search
| Backend \ Device | IoT | Laptop | Server |
| :----------------------- | ---------------------: | -----------------------: | ------------------------: |
| __Speed Comparison__ 🐇 | | | |
| Python `for` loop | 4 MB/s | 14 MB/s | 11 MB/s |
| C++ `for` loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
| C++ `string.find` | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
| Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
| Hyper-Scalar StringZilla | __4.3 GB/s__ | __12 GB/s__ | __12.1 GB/s__ |
| __Efficiency Metrics__ 📊 | | | |
| CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core | 2.1 - 3.3 GB/s | __11 GB/s__ | 10.5 GB/s |
| Bytes/Joule | __4.2 GB/J__ | 2 GB/J | 1.6 GB/J |

### Split, Partition, Sort, and Shuffle
Coming soon.
## Quick Start: Python 🐍
1. Install via pip: `pip install stringzilla`
2. Import the classes you need: `from stringzilla import Str, Strs, File`

### Basic Usage
StringZilla offers two mostly interchangeable core classes:
```python
from stringzilla import Str, File

text_from_str = Str('some-string')
text_from_file = Str(File('some-file.txt'))
```

The `Str` is designed to replace long Python `str` strings and wrap our C-level API.
On the other hand, `File` memory-maps a file from persistent storage without loading a copy of it into RAM.
The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
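
The general technique can be tried with Python's standard `mmap` module. The sketch below is an assumption-laden illustration of read-only memory mapping, not the way StringZilla's `File` is implemented; the function name `count_lines_mapped` and the temporary demo file are hypothetical.

```python
import mmap
import os
import tempfile


def count_lines_mapped(path: str) -> int:
    """Count newline bytes in a file through a read-only memory mapping."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping immutable.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = 0
            position = mapped.find(b"\n")
            while position != -1:
                count += 1
                position = mapped.find(b"\n", position + 1)
            return count


# Small demo on a temporary file, removed afterwards.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
    demo_path = tmp.name

demo_line_count = count_lines_mapped(demo_path)
os.remove(demo_path)
```

The operating system pages the file in on demand, so the search touches the data without first copying the whole file into a Python `bytes` object.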
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

### Basic Operations
- Length: `len(text) -> int`
- Indexing: `text[42] -> str`
- Slicing: `text[42:46] -> Str`
- String conversion: `str(text) -> str`
- Substring check: `'substring' in text -> bool`
- Hashing: `hash(text) -> int`

### Advanced Operations
- `text.contains('substring', start=0, end=9223372036854775807) -> bool`
- `text.find('substring', start=0, end=9223372036854775807) -> int`
- `text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int`
- `text.splitlines(keeplinebreaks=False, separator='\n') -> Strs`
- `text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs`

### Collection-Level Operations
Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices.
```python
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
```

Need copies?
```python
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
```

Basic `list`-like operations are also supported:
```python
lines.append('Pythonic string')
lines.extend(shuffled_copy)
```

### Low-Level Python API
The StringZilla CPython bindings implement vector-call conventions for faster calls.
```py
import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
levenstein: int = sz.levenstein("needle", "nidl")
```

## Quick Start: C 🛠️
There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.
```c
#include "stringzilla.h"

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
sz_size_t substring_position = sz_find_substring(haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);

// Perform collection level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);
```

## Contributing 👾
Future development plans include:
- [x] [Replace PyBind11 with CPython](https://github.com/ashvardanian/StringZilla/issues/35), [blog](https://ashvardanian.com/posts/pybind11-cpython-tutorial/)
- [x] [Bindings for JavaScript](https://github.com/ashvardanian/StringZilla/issues/25)
- [ ] [Faster string sorting algorithm](https://github.com/ashvardanian/StringZilla/issues/45)
- [ ] [Reverse-order operations in Python](https://github.com/ashvardanian/StringZilla/issues/12)
- [ ] [Splitting with multiple separators at once](https://github.com/ashvardanian/StringZilla/issues/29)
- [ ] Splitting CSV rows into columns
- [ ] UTF-8 validation.
- [ ] Arm SVE backend
- [ ] Bindings for Java and Rust

Here's how to set up your dev environment and run some tests.
### Development
CPython:
```sh
# Clean up, install, and test!
rm -rf build && pip install -e . && pytest scripts/ -s -x

# Install without dependencies
pip install -e . --no-index --no-deps
```

NodeJS:
```sh
npm install && npm test
```

### Benchmarking
To benchmark on some custom file and pattern combinations:
```sh
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
```

To benchmark on synthetic data:
```sh
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
```

### Packaging
To validate packaging:
```sh
cibuildwheel --platform linux
```

### Compiling C++ Tests
```sh
cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test
```

On macOS, it's recommended to use a non-default toolchain:
```sh
# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
-DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
-DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
-DSTRINGZILLA_USE_OPENMP=1 \
-DSTRINGZILLA_BUILD_TEST=1 \
&& \
make -C ./build_release -j && ./build_release/stringzilla_test
```

## License 📜
Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.
---
If you like this project, you may also enjoy [USearch][usearch], [UCall][ucall], [UForm][uform], [UStore][ustore], [SimSIMD][simsimd], and [TenPack][tenpack] 🤗
[usearch]: https://github.com/unum-cloud/usearch
[ucall]: https://github.com/unum-cloud/ucall
[uform]: https://github.com/unum-cloud/uform
[ustore]: https://github.com/unum-cloud/ustore
[simsimd]: https://github.com/ashvardanian/simsimd
[tenpack]: https://github.com/ashvardanian/tenpack