https://github.com/killercup/simd-utf8-check
- Host: GitHub
- URL: https://github.com/killercup/simd-utf8-check
- Owner: killercup
- Created: 2018-05-17T16:42:29.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-05-29T10:45:49.000Z (almost 8 years ago)
- Last Synced: 2025-04-04T12:11:26.758Z (12 months ago)
- Language: Rust
- Homepage: https://killercup.github.io/simd-utf8-check/report/index.html
- Size: 6.33 MB
- Stars: 13
- Watchers: 4
- Forks: 1
- Open Issues: 5
Metadata Files:
- Readme: Readme.md
README
# SIMD UTF8 Validation in Rust
After reading the post [Validating UTF-8 strings using as little as 0.7 cycles per byte](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/),
I was curious if this algorithm might be a good fit for Rust's standard library.
Because Rust's `String` type is guaranteed to be valid UTF-8,
you need to either use `from_utf8` to convert a slice of bytes to a string,
or, if you trust the input, use the `unsafe fn from_utf8_unchecked`.
The faster `from_utf8` is, the less reason anyone has to reach for the unsafe version.
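To make the trade-off concrete, here is a quick illustration of the two conversion paths in the standard library (the byte sequence is just an example):

```rust
use std::str;

fn main() {
    // "✓" (U+2713) encoded as three UTF-8 bytes.
    let bytes: &[u8] = &[0xE2, 0x9C, 0x93];

    // Safe path: validates the bytes and returns a Result.
    let checked = str::from_utf8(bytes).expect("valid UTF-8");
    assert_eq!(checked, "✓");

    // Unsafe path: skips validation entirely. This is only sound
    // if the bytes are already known to be valid UTF-8.
    let unchecked = unsafe { str::from_utf8_unchecked(bytes) };
    assert_eq!(unchecked, "✓");

    // Invalid input makes the safe version return an Err instead of
    // producing a broken string.
    assert!(str::from_utf8(&[0xFF, 0xFE]).is_err());
}
```

The validation cost of `from_utf8` is exactly what this repository benchmarks.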
Of course, I'm not the first person to think of this,
and [this Rust PR](https://github.com/rust-lang/rust/pull/30740)
already contains a super fast implementation,
albeit one that does not use explicit SIMD intrinsics.
## Benchmarks
### Results
```
$ env RUSTFLAGS='-C target-cpu=native' cargo bench --quiet
# ...
$ open target/criterion/report/index.html
```
[You can also find the rendered report here.](https://killercup.github.io/simd-utf8-check/report/index.html)
There are two runs, the first without and the second with the `target-cpu=native` flag.
This was benchmarked on a late 2016 MacBook Pro with an Intel i7 6700HQ CPU.
Currently, it looks like the std implementation is a bit faster for inputs that are mostly ASCII,
but the SIMD version gives a significant speedup when dealing with multi-byte codepoints.
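One intuition for why ASCII-heavy input is such a favorable case: every ASCII byte has its high bit clear, so a whole block of bytes can be cleared as valid with a single bitwise test per machine word, and only blocks that fail the test need full multi-byte decoding. The following is a minimal word-at-a-time sketch of that fast path, not the actual std or SIMD implementation:

```rust
/// Returns true if every byte in `chunk` is ASCII, by checking the
/// high bit of all eight bytes with one u64 mask operation.
fn chunk_is_ascii(chunk: &[u8; 8]) -> bool {
    let word = u64::from_ne_bytes(*chunk);
    word & 0x8080_8080_8080_8080 == 0
}

fn main() {
    // Pure ASCII: the whole chunk passes in one test.
    assert!(chunk_is_ascii(b"hello!!!"));

    // "héllo!!" contains a two-byte codepoint ("é"), so the high bit
    // is set somewhere and the fast path rejects the chunk.
    let non_ascii: &[u8; 8] = "h\u{e9}llo!!".as_bytes().try_into().unwrap();
    assert!(!chunk_is_ascii(non_ascii));
}
```

A SIMD validator applies the same idea to 16 or 32 bytes per instruction, which is why text dominated by multi-byte codepoints (where this shortcut rarely fires) is where the fully vectorized continuation-byte checks pay off most.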
### Data
- jawik10: `curl -L http://dumps.wikimedia.org/archive/2006/2006-07/jawiki/20061016/jawiki-20061016-pages-articles.xml.bz2 | bunzip2 > test/fixtures/jawik10`
- enwiki8: From
- `big10` is the dataset in (see )