https://github.com/burntsushi/utf8-ranges
Convert contiguous ranges of Unicode codepoints to UTF-8 byte ranges.
- Host: GitHub
- URL: https://github.com/burntsushi/utf8-ranges
- Owner: BurntSushi
- License: unlicense
- Created: 2015-10-16T01:20:38.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2022-04-04T18:56:54.000Z (over 3 years ago)
- Last Synced: 2025-04-09T10:02:05.902Z (6 months ago)
- Language: Rust
- Size: 30.3 KB
- Stars: 43
- Watchers: 5
- Forks: 9
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: COPYING
README
**DEPRECATED:** This crate has been folded into the
[`regex-syntax`](https://docs.rs/regex-syntax) crate and is now deprecated.

utf8-ranges
===========
This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte
ranges. This is useful when constructing byte based automata from Unicode.
Stated differently, this lets one embed UTF-8 decoding as part of one's
automaton.

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).

### Documentation
https://docs.rs/utf8-ranges
### Example
This shows how to convert a scalar value range (e.g., the basic multilingual
plane) to a sequence of byte based character classes.

```rust
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}
```

The output:

```text
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
```

These ranges can then be used to build an automaton. Namely:

1. Every arbitrary sequence of bytes matches exactly one of the sequences of
   ranges or none of them.
2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous
   encodings of surrogate codepoints in UTF-8 cannot match any of the byte
   ranges above.)
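
To make the two properties concrete, here is a minimal sketch that uses only the standard library. It hardcodes the six sequences from the output above as a table and checks which one, if any, matches the start of a byte slice. The names `BMP_SEQUENCES` and `match_prefix` are hypothetical and not part of this crate's API; a real automaton would compile the ranges into states rather than scan a table.

```rust
// Byte-range sequences copied by hand from the output above (U+0000..=U+FFFF).
// Each inner slice is one sequence; each (lo, hi) pair is an inclusive byte range.
// This table is an illustration, not something the crate exports.
const BMP_SEQUENCES: &[&[(u8, u8)]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Returns the index of the sequence matching a prefix of `bytes`, if any.
/// The first-byte ranges are disjoint, so at most one sequence can match.
fn match_prefix(bytes: &[u8]) -> Option<usize> {
    BMP_SEQUENCES.iter().position(|seq| {
        bytes.len() >= seq.len()
            && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}

fn main() {
    // U+00E9 is encoded as C3 A9, which matches the two-byte sequence.
    println!("{:?}", match_prefix("\u{e9}".as_bytes())); // Some(1)
    // ED A0 80 would be an encoding of the surrogate U+D800: nothing matches.
    println!("{:?}", match_prefix(&[0xED, 0xA0, 0x80])); // None
}
```

The rejected surrogate case follows from the `[ED][80-9F][80-BF]` row: the second byte `A0` falls outside `80-9F`, and no other sequence starts with `ED`.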