https://github.com/ilyvion/encoding-next
Character encoding support for Rust
- Host: GitHub
- URL: https://github.com/ilyvion/encoding-next
- Owner: ilyvion
- License: mit
- Created: 2022-06-30T09:58:01.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2022-07-11T10:09:36.000Z (over 2 years ago)
- Last Synced: 2024-12-23T08:35:58.210Z (16 days ago)
- Language: Rust
- Size: 7.22 MB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# Encoding
[![Crates.io](https://img.shields.io/crates/v/encoding-next)](https://crates.io/crates/encoding-next)
[![Crates.io](https://img.shields.io/crates/l/encoding-next)](https://crates.io/crates/encoding-next)
[![Crates.io](https://img.shields.io/crates/d/encoding-next)](https://crates.io/crates/encoding-next)
[![Docs.io](https://docs.rs/encoding-next/badge.svg)](https://docs.rs/encoding-next)
[![Docs master](https://img.shields.io/static/v1?label=docs&message=master&color=5479ab)](https://alexschrod.github.io/encoding-next/)
[![Rust](https://github.com/alexschrod/encoding-next/actions/workflows/CI.yml/badge.svg)](https://github.com/alexschrod/encoding-next/actions/workflows/CI.yml)
[![codecov](https://codecov.io/gh/alexschrod/encoding-next/branch/master/graph/badge.svg?token=C8UJJM7BVJ)](https://codecov.io/gh/alexschrod/encoding-next)

Character encoding support for Rust.
It is based on the [WHATWG Encoding Standard](http://encoding.spec.whatwg.org/),
and also provides an advanced interface for error detection and recovery.

## Usage
Put this in your `Cargo.toml`:
```toml
[dependencies]
encoding-next = "0.3"
```

### Data Table

By default, Encoding comes with ~480 KB of data tables ("indices").
This allows Encoding to encode and decode legacy encodings efficiently,
but this might not be desirable for some applications.

Encoding provides the `no-optimized-legacy-encoding` Cargo feature
to reduce the size of encoding tables (to ~185 KB)
at the expense of encoding performance (typically 5x to 20x slower).
The decoding performance remains identical.
**This feature is intended strictly for end users.
Do not try to enable this feature from library crates, ever.**

For finer-tuned optimization, see `src/index/gen_index.py` for
custom table generation.

## Overview
To encode a string:
```rust
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),
           Ok(vec![99,97,102,233]));
```

To encode a string with unrepresentable characters:
```rust
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_2;

assert!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Strict).is_err());
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Replace),
           Ok(vec![65,99,109,101,63]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Ignore),
           Ok(vec![65,99,109,101]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::NcrEscape),
           Ok(vec![65,99,109,101,38,35,49,54,57,59]));
```

To decode a byte sequence:
```rust
use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.decode(&[99,97,102,233], DecoderTrap::Strict),
           Ok("caf\u{e9}".to_string()));
```

To decode a byte sequence with invalid sequences:
```rust
use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_6;

assert!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Strict).is_err());
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Replace),
           Ok("Acme\u{fffd}".to_string()));
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Ignore),
           Ok("Acme".to_string()));
```

To encode or decode the input into an already allocated buffer:
```rust
use encoding::{Encoding, EncoderTrap, DecoderTrap};
use encoding::all::{ISO_8859_2, ISO_8859_6};

let mut bytes = Vec::new();
let mut chars = String::new();

assert!(ISO_8859_2.encode_to("Acme\u{a9}", EncoderTrap::Ignore, &mut bytes).is_ok());
assert!(ISO_8859_6.decode_to(&[65,99,109,101,169], DecoderTrap::Replace, &mut chars).is_ok());

assert_eq!(bytes, [65,99,109,101]);
assert_eq!(chars, "Acme\u{fffd}");
```

A practical example of custom encoder traps:
```rust
use encoding::{Encoding, ByteWriter, EncoderTrap, DecoderTrap};
use encoding::types::RawEncoder;
use encoding::all::ASCII;

// hexadecimal numeric character reference replacement
fn hex_ncr_escape(_encoder: &mut dyn RawEncoder, input: &str, output: &mut dyn ByteWriter) -> bool {
    let escapes: Vec<String> =
        input.chars().map(|ch| format!("&#x{:x};", ch as isize)).collect();
    let escapes = escapes.concat();
    output.write_bytes(escapes.as_bytes());
    true
}
static HEX_NCR_ESCAPE: EncoderTrap = EncoderTrap::Call(hex_ncr_escape);

let orig = "Hello, 世界!".to_string();
let encoded = ASCII.encode(&orig, HEX_NCR_ESCAPE).unwrap();
// Each non-ASCII character has been replaced by its hexadecimal NCR.
assert_eq!(ASCII.decode(&encoded, DecoderTrap::Strict),
           Ok("Hello, &#x4e16;&#x754c;!".to_string()));
```

Getting the encoding from the string label, as specified in the WHATWG Encoding Standard:
```rust
use encoding::{Encoding, DecoderTrap};
use encoding::label::encoding_from_whatwg_label;
use encoding::all::WINDOWS_949;

let euckr = encoding_from_whatwg_label("euc-kr").unwrap();
assert_eq!(euckr.name(), "windows-949");
assert_eq!(euckr.whatwg_name(), Some("euc-kr")); // for the sake of compatibility
let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
assert_eq!(euckr.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

// corresponding Encoding native API:
assert_eq!(WINDOWS_949.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));
```

## Types and Stuffs
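The trap variants used in the examples above amount to a per-character policy for input that the target encoding cannot represent. The following is a minimal std-only sketch of that idea, not the crate's implementation: the `Trap` enum and the `encode_latin1` helper are hypothetical names introduced only for illustration, and the encoder covers just the Latin-1 range.

```rust
/// Hypothetical stand-in for the crate's `EncoderTrap`; not the real type.
#[derive(Clone, Copy)]
pub enum Trap {
    Strict,
    Replace,
    Ignore,
    NcrEscape,
}

/// Hypothetical Latin-1 encoder: code points up to U+00FF map to one byte each.
/// With `Trap::Strict`, the first unrepresentable character is returned as the error.
pub fn encode_latin1(input: &str, trap: Trap) -> Result<Vec<u8>, char> {
    let mut out = Vec::with_capacity(input.len());
    for ch in input.chars() {
        let cp = ch as u32;
        if cp <= 0xFF {
            out.push(cp as u8);
            continue;
        }
        // The character is unrepresentable: the trap decides what happens next.
        match trap {
            Trap::Strict => return Err(ch),
            Trap::Replace => out.push(b'?'),
            Trap::Ignore => {}
            // Decimal numeric character reference, e.g. `&#169;`.
            Trap::NcrEscape => out.extend_from_slice(format!("&#{};", cp).as_bytes()),
        }
    }
    Ok(out)
}
```

Under these assumptions, `encode_latin1("Hi\u{4e16}", Trap::Replace)` yields the bytes of `Hi?`, while `Trap::Strict` stops and reports the offending character.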
There are three main entry points to Encoding.
**`Encoding`** is a single character encoding.
It contains `encode` and `decode` methods for converting `String` to `Vec<u8>` and vice versa.
For error handling, they receive **traps** (`EncoderTrap` and `DecoderTrap` respectively)
which replace any error with some string (e.g. `U+FFFD`) or sequence (e.g. `?`).
You can also use `EncoderTrap::Strict` and `DecoderTrap::Strict` traps to stop on an error.

There are two ways to get `Encoding`:
- `encoding::all` has static items for every supported encoding.
  You should use them when the encoding would not change or only a handful of them are required.
  Combined with link-time optimization, any unused encoding would be discarded from the binary.
- `encoding::label` has functions to dynamically get an encoding from a given string ("label").
  They will return a static reference to the encoding,
  whose type is also known as `EncodingRef`.
  It is useful when a list of required encodings is not available in advance,
  but it will result in a larger binary and missed optimization opportunities.

**`RawEncoder`** is an experimental incremental encoder.
At each step of `raw_feed`, it receives a slice of string
and emits any encoded bytes to a generic `ByteWriter` (normally `Vec<u8>`).
It will stop at the first error, if any, and return a `CodecError` struct in that case.
The caller is responsible for calling `raw_finish` at the end of the encoding process.

**`RawDecoder`** is an experimental incremental decoder.
At each step of `raw_feed`, it receives a slice of byte sequence
and emits any decoded characters to a generic `StringWriter` (normally `String`).
Otherwise it is identical to `RawEncoder`.

One should prefer `Encoding::{encode,decode}` as the primary interface.
`RawEncoder` and `RawDecoder` are experimental and can change substantially.
See the additional documents on the `encoding::types` module for more information on them.

## Supported Encodings
Encoding covers all encodings specified by WHATWG Encoding Standard and some more:
- 7-bit strict ASCII (`ascii`)
- ArmSCII-8 (`armscii-8`)
- UTF-8 (`utf-8`)
- UTF-16 in little endian (`utf-16` or `utf-16le`) and big endian (`utf-16be`)
- All single byte encodings in WHATWG Encoding Standard:
  - IBM code page 866
  - ISO 8859-{2,3,4,5,6,7,8,10,13,14,15,16}
  - KOI8-R, KOI8-U
  - MacRoman (`macintosh`), Macintosh Cyrillic encoding (`x-mac-cyrillic`)
  - Windows code pages 874, 1250, 1251, 1252 (instead of ISO 8859-1), 1253,
    1254 (instead of ISO 8859-9), 1255, 1256, 1257, 1258
- All multi byte encodings in WHATWG Encoding Standard:
  - Windows code page 949 (`euc-kr`, since the strict EUC-KR is hardly used)
  - EUC-JP and Windows code page 932 (`shift_jis`,
    since it's the most widespread extension to Shift_JIS)
  - ISO-2022-JP with asymmetric JIS X 0212 support
    (Note: this is not yet up to date to the current standard)
  - GBK
  - GB 18030
  - Big5-2003 with HKSCS-2008 extensions
- Encodings that were originally specified by WHATWG Encoding Standard:
  - HZ
- ISO 8859-1 (distinct from Windows code page 1252)
- Code page 437 (`cp437`)

Parenthesized names refer to the encoding's primary name assigned by WHATWG Encoding Standard.
Many legacy character encodings lack a proper specification,
and even those that have a specification are highly dependent on the actual implementation.
Consequently one should be careful when picking a desired character encoding.
The only standards reliable in this regard are WHATWG Encoding Standard and
[vendor-provided mappings from the Unicode consortium](http://www.unicode.org/Public/MAPPINGS/).
Whenever in doubt, look at the source code and specifications for detailed explanations.
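
As an illustration of why the choice matters, the list above treats ISO 8859-1 as distinct from Windows code page 1252, and the two differ only in the byte range 0x80 to 0x9F. The following std-only sketch shows the difference for a few bytes. Both function names are hypothetical, and the windows-1252 table is deliberately partial (three well-known overrides only), so it is not a usable decoder.

```rust
/// ISO 8859-1: every byte maps to the code point with the same value.
pub fn decode_iso_8859_1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Partial windows-1252 sketch: only three of the 0x80..=0x9F overrides are
/// listed. Every other byte falls back to the Latin-1 mapping, which is NOT
/// correct for the rest of that range.
pub fn decode_windows_1252_partial(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x80 => '\u{20AC}', // euro sign
            0x93 => '\u{201C}', // left double quotation mark
            0x94 => '\u{201D}', // right double quotation mark
            _ => b as char,
        })
        .collect()
}
```

Pure ASCII input decodes identically under both functions, but the byte 0x80 becomes the C1 control character U+0080 in one case and the euro sign in the other, which is the kind of silent mismatch the paragraph above warns about.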