https://github.com/killercup/simd-utf8-check
- Host: GitHub
- URL: https://github.com/killercup/simd-utf8-check
- Owner: killercup
- Created: 2018-05-17T16:42:29.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-05-29T10:45:49.000Z (almost 8 years ago)
- Last Synced: 2025-04-04T12:11:26.758Z (12 months ago)
- Language: Rust
- Homepage: https://killercup.github.io/simd-utf8-check/report/index.html
- Size: 6.33 MB
- Stars: 13
- Watchers: 4
- Forks: 1
- Open Issues: 5
Metadata Files:
- Readme: Readme.md
README
# SIMD UTF8 Validation in Rust
After reading the post [Validating UTF-8 strings using as little as 0.7 cycles per byte](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/),
I was curious if this algorithm might be a good fit for Rust's standard library.
Because Rust's `String` type is guaranteed to be valid UTF-8,
you need to either use `from_utf8` to convert a slice of bytes to a string,
or, if you trust the input, use the `unsafe fn from_utf8_unchecked`.
The faster `from_utf8` is, the less reason anyone has to reach for the unsafe version.
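To make the trade-off concrete, here is a quick illustration of the two conversion paths in the standard library (the byte sequence is just an example):

```rust
use std::str;

fn main() {
    // "✓" (U+2713) encoded as three UTF-8 bytes.
    let bytes: &[u8] = &[0xE2, 0x9C, 0x93];

    // Safe path: validates the bytes and returns a Result.
    let checked = str::from_utf8(bytes).expect("valid UTF-8");
    assert_eq!(checked, "✓");

    // Unsafe path: skips validation entirely. This is only sound
    // if the bytes are already known to be valid UTF-8.
    let unchecked = unsafe { str::from_utf8_unchecked(bytes) };
    assert_eq!(unchecked, "✓");

    // Invalid input makes the safe version return an Err instead of
    // producing a broken string.
    assert!(str::from_utf8(&[0xFF, 0xFE]).is_err());
}
```

The validation cost of `from_utf8` is exactly what this repository benchmarks.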
Of course, I'm not the first person to think of this,
and [this Rust PR](https://github.com/rust-lang/rust/pull/30740)
already contains a super fast implementation,
albeit one that does not use explicit SIMD intrinsics.
## Benchmarks
### Results
```
$ env RUSTFLAGS='-C target-cpu=native' cargo bench --quiet
# ...
$ open target/criterion/report/index.html
```
[You can also find the rendered report here.](https://killercup.github.io/simd-utf8-check/report/index.html)
There are two runs, the first without and the second with the `target-cpu=native` flag.
This was benchmarked on a late 2016 MacBook Pro with an Intel i7 6700HQ CPU.
Currently, it looks like the std implementation is a bit faster for inputs that are mostly ASCII,
but the SIMD version gives a significant speedup when dealing with multi-byte codepoints.
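One intuition for why ASCII-heavy input is such a favorable case: every ASCII byte has its high bit clear, so a whole block of bytes can be cleared as valid with a single bitwise test per machine word, and only blocks that fail the test need full multi-byte decoding. The following is a minimal word-at-a-time sketch of that fast path, not the actual std or SIMD implementation:

```rust
/// Returns true if every byte in `chunk` is ASCII, by checking the
/// high bit of all eight bytes with one u64 mask operation.
fn chunk_is_ascii(chunk: &[u8; 8]) -> bool {
    let word = u64::from_ne_bytes(*chunk);
    word & 0x8080_8080_8080_8080 == 0
}

fn main() {
    // Pure ASCII: the whole chunk passes in one test.
    assert!(chunk_is_ascii(b"hello!!!"));

    // "héllo!!" contains a two-byte codepoint ("é"), so the high bit
    // is set somewhere and the fast path rejects the chunk.
    let non_ascii: &[u8; 8] = "h\u{e9}llo!!".as_bytes().try_into().unwrap();
    assert!(!chunk_is_ascii(non_ascii));
}
```

A SIMD validator applies the same idea to 16 or 32 bytes per instruction, which is why text dominated by multi-byte codepoints (where this shortcut rarely fires) is where the fully vectorized continuation-byte checks pay off most.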
### Data
- jawik10: `curl -L http://dumps.wikimedia.org/archive/2006/2006-07/jawiki/20061016/jawiki-20061016-pages-articles.xml.bz2 | bunzip2 > test/fixtures/jawik10`
- enwiki8: From
- `big10` is the dataset in (see )