Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/rusticstuff/simdutf8

SIMD-accelerated UTF-8 validation for Rust.
https://github.com/rusticstuff/simdutf8

aarch64 arm64 avx2 neon rust rust-crate simd simd-extensions sse41 unicode utf-8 wasm

Last synced: 3 months ago
JSON representation

SIMD-accelerated UTF-8 validation for Rust.

Awesome Lists containing this project

README

        

[![CI](https://github.com/rusticstuff/simdutf8/actions/workflows/ci.yml/badge.svg)](https://github.com/rusticstuff/simdutf8/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/simdutf8.svg)](https://crates.io/crates/simdutf8)
[![docs.rs](https://docs.rs/simdutf8/badge.svg)](https://docs.rs/simdutf8)

# simdutf8 – High-speed UTF-8 validation

Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from
[simdjson](https://github.com/simdjson/simdjson). Originally ported to Rust by the developers of [simd-json.rs](https://simd-json.rs), but now heavily improved.

## Status
This library has been thoroughly tested with sample data as well as fuzzing and there are no known bugs.

## Features
* `basic` API for the fastest validation, optimized for valid UTF-8
* `compat` API as a fully compatible replacement for `std::str::from_utf8()`
* Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64
* ARM64 (aarch64) SIMD is supported since Rust 1.61
* WASM (wasm32) SIMD is supported
* 🆕 armv7 NEON support with the `armv7_neon` feature on nightly Rust
* x86-64: Up to 23 times faster than the std library on valid non-ASCII, up to four times faster on ASCI
* aarch64: Up to eleven times faster than the std library on valid non-ASCII, up to four times faster on ASCII (Apple Silicon)
* Faster than the original simdjson implementation
* Selects the fastest implementation at runtime based on CPU support (on x86)
* Falls back to the excellent std implementation if SIMD extensions are not supported
* Written in pure Rust
* No dependencies
* No-std support

## Quick start
Add the dependency to your Cargo.toml file:
```toml
[dependencies]
simdutf8 = "0.1.5"
```

Use `simdutf8::basic::from_utf8()` as a drop-in replacement for `std::str::from_utf8()`.

```rust
use simdutf8::basic::from_utf8;

println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
```

If you need detailed information on validation failures, use `simdutf8::compat::from_utf8()`
instead.

```rust
use simdutf8::compat::from_utf8;

let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));
```

## APIs

### Basic flavor
Use the `basic` API flavor for maximum speed. It is fastest on valid UTF-8, but only checks
for errors after processing the whole byte sequence and does not provide detailed information if the data
is not valid UTF-8. `simdutf8::basic::Utf8Error` is a zero-sized error struct.

### Compat flavor
The `compat` flavor is fully API-compatible with `std::str::from_utf8()`. In particular, `simdutf8::compat::from_utf8()`
returns a `simdutf8::compat::Utf8Error`, which has `valid_up_to()` and `error_len()` methods. The first is useful for
verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.

It also fails early: errors are checked on the fly as the string is processed and once
an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data.
This comes at a slight performance penalty compared to the `basic` API even if the input is valid UTF-8.

## Implementation selection

### X86
The fastest implementation is selected at runtime using the `std::is_x86_feature_detected!` macro, unless the CPU
targeted by the compiler supports the fastest available implementation.
So if you compile with `RUSTFLAGS="-C target-cpu=native"` on a recent x86-64 machine, the AVX 2 implementation is selected at
compile-time and runtime selection is disabled.

For no-std support (compiled with `--no-default-features`) the implementation is always selected at compile time based on
the targeted CPU. Use `RUSTFLAGS="-C target-feature=+avx2"` for the AVX 2 implementation or `RUSTFLAGS="-C target-feature=+sse4.2"`
for the SSE 4.2 implementation.

### ARM64
The SIMD implementation is used automatically since Rust 1.61.

### WASM32
For wasm32 support, the implementation is selected at compile time based on the presence of the `simd128` target feature.
Use `RUSTFLAGS="-C target-feature=+simd128"` to enable the WASM SIMD implementation. WASM, at
the time of this writing, doesn't have a way to detect SIMD through WASM itself. Although this capability
is available in various WASM host environments (e.g., [wasm-feature-detect] in the web browser), there is no portable
way from within the library to detect this.

[wasm-feature-detect]: https://github.com/GoogleChromeLabs/wasm-feature-detect

#### Building/Targeting WASM
See [this document](./wasm32-development.md) for more details.

### Access to low-level functionality

If you want to be able to call a SIMD implementation directly, use the `public_imp` feature flag. The validation implementations are then accessible in the `simdutf8::{basic, compat}::imp` hierarchy. Traits
facilitating streaming validation are available there as well.

## Optimisation flags
Do not use [`opt-level = "z"`](https://doc.rust-lang.org/cargo/reference/profiles.html), which prevents inlining and makes
the code quite slow.

## Minimum Supported Rust Version (MSRV)
This crate's minimum supported Rust version is 1.38.0.

## Benchmarks
The benchmarks have been done with [criterion](https://bheisler.github.io/criterion.rs/book/index.html), the tables
are created with [critcmp](https://github.com/BurntSushi/critcmp). Source code and data are in the
[bench directory](https://github.com/rusticstuff/simdutf8/tree/main/bench).

The naming schema is id-charset/size. _0-empty_ is the empty byte slice, _x-error/66536_ is a 64KiB slice where the very
first character is invalid UTF-8. Library versions are simdutf8 v0.1.2 and simdjson v0.9.2. When comparing
with simdjson simdutf8 is compiled with `#inline(never)`.

Configurations:
* X86-64: PC with an AMD Ryzen 7 PRO 3700 CPU (Zen2) on Linux with Rust 1.52.0
* Aarch64: Macbook Air with an Apple M1 CPU (Apple Silicon) on macOS with Rust rustc 1.54.0-nightly (881c1ac40 2021-05-08).

### simdutf8 basic vs std library on x86-64 (AMD Zen2)
![image](https://user-images.githubusercontent.com/3736990/117568104-1c00f900-b0bf-11eb-938f-4c253d192480.png)
Simdutf8 is up to 23 times faster than the std library on valid non-ASCII, up to four times on pure ASCII.

### simdutf8 basic vs std library on aarch64 (Apple Silicon)
![image](https://user-images.githubusercontent.com/3736990/117568160-42bf2f80-b0bf-11eb-86a4-9aeee4cee87d.png)
Simdutf8 is up to to eleven times faster than the std library on valid non-ASCII, up to four times faster on
pure ASCII.

### simdutf8 basic vs simdjson on x86-64
![image](https://user-images.githubusercontent.com/3736990/117568231-80bc5380-b0bf-11eb-8e90-1dcc6d966ebd.png)
Simdutf8 is faster than simdjson on almost all inputs.

### simdutf8 basic vs simdutf8 compat UTF-8 on x86-64
![image](https://user-images.githubusercontent.com/3736990/117568270-af3a2e80-b0bf-11eb-8ec4-e5a0a4ad7210.png)
There is a small performance penalty to continuously checking the error status while processing data, but detecting
errors early provides a huge benefit for the _x-error/66536_ benchmark.

## Technical details
For inputs shorter than 64 bytes validation is delegated to `core::str::from_utf8()` except for the direct-access
functions in `simdutf8::{basic, compat}::imp`.

The SIMD implementation is mostly similar to the one in simdjson except that it is has additional optimizations
for the pure ASCII case. Also it uses prefetch with AVX 2 on x86 which leads to slightly better performance with
some Intel CPUs on synthetic benchmarks.

For the compat API, we need to check the error status vector on each 64-byte block instead of just aggregating it. If an
error is found, the last bytes of the previous block are checked for a cross-block continuation and then
`std::str::from_utf8()` is run to find the exact location of the error.

Care is taken that all functions are properly inlined up to the public interface.

## Thanks
* to the authors of simdjson for coming up with the high-performance SIMD implementation and in particular to Daniel Lemire
for his feedback. It was very helpful.
* to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.

## License
This code is dual-licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) and the [MIT License](https://opensource.org/licenses/MIT).

It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under
the MIT license and Apache 2.0 license as well.

simdjson itself is distributed under the Apache License 2.0.

## References
John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021