https://github.com/ilyvion/encoding-next

Character encoding support for Rust

# Encoding

[![Crates.io](https://img.shields.io/crates/v/encoding-next)](https://crates.io/crates/encoding-next)
[![Crates.io](https://img.shields.io/crates/l/encoding-next)](https://crates.io/crates/encoding-next)
[![Crates.io](https://img.shields.io/crates/d/encoding-next)](https://crates.io/crates/encoding-next)
[![Docs.io](https://docs.rs/encoding-next/badge.svg)](https://docs.rs/encoding-next)
[![Docs master](https://img.shields.io/static/v1?label=docs&message=master&color=5479ab)](https://alexschrod.github.io/encoding-next/)
[![Rust](https://github.com/alexschrod/encoding-next/actions/workflows/CI.yml/badge.svg)](https://github.com/alexschrod/encoding-next/actions/workflows/CI.yml)
[![codecov](https://codecov.io/gh/alexschrod/encoding-next/branch/master/graph/badge.svg?token=C8UJJM7BVJ)](https://codecov.io/gh/alexschrod/encoding-next)

Character encoding support for Rust.
It is based on [WHATWG Encoding Standard](http://encoding.spec.whatwg.org/),
and also provides an advanced interface for error detection and recovery.

## Usage

Put this in your `Cargo.toml`:

```toml
[dependencies]
encoding-next = "0.3"
```

### Data Table

By default, Encoding comes with ~480 KB of data table ("indices").
This allows Encoding to encode and decode legacy encodings efficiently,
but this might not be desirable for some applications.

Encoding provides the `no-optimized-legacy-encoding` Cargo feature
to reduce the size of encoding tables (to ~185 KB)
at the expense of encoding performance (typically 5x to 20x slower).
The decoding performance remains identical.
**This feature is strongly intended for end users.
Do not try to enable this feature from library crates, ever.**
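
As a configuration sketch, an application (not a library) that prefers the smaller tables could enable the feature in its own `Cargo.toml`. The feature name is the one given above; the version simply repeats the one from the Usage section.

```toml
[dependencies]
# Enable only in the final binary crate, never from a library crate.
encoding-next = { version = "0.3", features = ["no-optimized-legacy-encoding"] }
```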

For finer-tuned optimization, see `src/index/gen_index.py` for
custom table generation.
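
To see why dropping encoder-side tables can slow down encoding while leaving decoding untouched, here is a minimal std-only sketch. Everything in it is an assumption made for illustration: the four-entry `FORWARD` table belongs to a made-up 8-bit encoding, and the function names are hypothetical. It does not describe the crate's actual table layout.

```rust
/// Hypothetical forward table for a made-up 8-bit encoding:
/// entry `i` is the character assigned to byte `0x80 + i`.
const FORWARD: [char; 4] = ['\u{20ac}', '\u{141}', '\u{df}', '\u{436}'];

/// Optional backward table: the same pairs, sorted by character so that
/// they can be binary-searched. This is the extra data that costs space.
const BACKWARD: [(char, u8); 4] = [
    ('\u{df}', 0x82),
    ('\u{141}', 0x81),
    ('\u{436}', 0x83),
    ('\u{20ac}', 0x80),
];

/// Decoding needs only the forward table: a single indexed lookup.
fn decode_byte(byte: u8) -> Option<char> {
    if byte < 0x80 {
        Some(byte as char)
    } else {
        FORWARD.get((byte - 0x80) as usize).copied()
    }
}

/// Encoding without a backward table: scan the forward table linearly.
fn encode_char_by_scan(ch: char) -> Option<u8> {
    if ch.is_ascii() {
        return Some(ch as u8);
    }
    FORWARD.iter().position(|&c| c == ch).map(|i| 0x80 + i as u8)
}

/// Encoding with the backward table: binary search instead of a scan.
fn encode_char_by_table(ch: char) -> Option<u8> {
    if ch.is_ascii() {
        return Some(ch as u8);
    }
    BACKWARD
        .binary_search_by_key(&ch, |&(c, _)| c)
        .ok()
        .map(|i| BACKWARD[i].1)
}
```

Both encoders return the same bytes; only the lookup cost differs, and that cost grows with the table size in the scanning variant.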

## Overview

To encode a string:

```rust
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),
           Ok(vec![99,97,102,233]));
```

To encode a string with unrepresentable characters:

```rust
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_2;

assert!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Strict).is_err());
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Replace),
           Ok(vec![65,99,109,101,63]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Ignore),
           Ok(vec![65,99,109,101]));
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::NcrEscape),
           Ok(vec![65,99,109,101,38,35,49,54,57,59]));
```

To decode a byte sequence:

```rust
use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.decode(&[99,97,102,233], DecoderTrap::Strict),
           Ok("caf\u{e9}".to_string()));
```

To decode a byte sequence with invalid sequences:

```rust
use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_6;

assert!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Strict).is_err());
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Replace),
           Ok("Acme\u{fffd}".to_string()));
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Ignore),
           Ok("Acme".to_string()));
```

To encode or decode the input into the already allocated buffer:

```rust
use encoding::{Encoding, EncoderTrap, DecoderTrap};
use encoding::all::{ISO_8859_2, ISO_8859_6};

let mut bytes = Vec::new();
let mut chars = String::new();

assert!(ISO_8859_2.encode_to("Acme\u{a9}", EncoderTrap::Ignore, &mut bytes).is_ok());
assert!(ISO_8859_6.decode_to(&[65,99,109,101,169], DecoderTrap::Replace, &mut chars).is_ok());

assert_eq!(bytes, [65,99,109,101]);
assert_eq!(chars, "Acme\u{fffd}");
```

A practical example of custom encoder traps:

```rust
use encoding::{Encoding, ByteWriter, EncoderTrap, DecoderTrap};
use encoding::types::RawEncoder;
use encoding::all::ASCII;

// hexadecimal numeric character reference replacement
fn hex_ncr_escape(_encoder: &mut dyn RawEncoder, input: &str, output: &mut dyn ByteWriter) -> bool {
    // one `&#x...;` reference per unrepresentable character
    let escapes: Vec<String> =
        input.chars().map(|ch| format!("&#x{:x};", ch as isize)).collect();
    let escapes = escapes.concat();
    output.write_bytes(escapes.as_bytes());
    true
}
static HEX_NCR_ESCAPE: EncoderTrap = EncoderTrap::Call(hex_ncr_escape);

let orig = "Hello, 世界!".to_string();
let encoded = ASCII.encode(&orig, HEX_NCR_ESCAPE).unwrap();
assert_eq!(ASCII.decode(&encoded, DecoderTrap::Strict),
           Ok("Hello, &#x4e16;&#x754c;!".to_string()));
```

Getting the encoding from the string label, as specified in WHATWG Encoding standard:

```rust
use encoding::{Encoding, DecoderTrap};
use encoding::label::encoding_from_whatwg_label;
use encoding::all::WINDOWS_949;

let euckr = encoding_from_whatwg_label("euc-kr").unwrap();
assert_eq!(euckr.name(), "windows-949");
assert_eq!(euckr.whatwg_name(), Some("euc-kr")); // for the sake of compatibility
let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
assert_eq!(euckr.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

// corresponding Encoding native API:
assert_eq!(WINDOWS_949.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));
```
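
In the WHATWG Encoding Standard, labels are matched ASCII case-insensitively after stripping leading and trailing ASCII whitespace. The std-only sketch below illustrates that matching for a small, hand-picked subset of labels. The function name `canonical_name_from_label` is hypothetical, the names are returned in lowercase to match the style used in this README, and the code is not the crate's implementation.

```rust
/// Maps a label to an encoding name for a handful of labels taken from
/// the WHATWG Encoding Standard. Hypothetical helper, for illustration only.
fn canonical_name_from_label(label: &str) -> Option<&'static str> {
    // Strip ASCII whitespace (tab, LF, FF, CR, space) and fold ASCII case.
    let normalized = label
        .trim_matches(|c: char| c.is_ascii_whitespace())
        .to_ascii_lowercase();

    match normalized.as_str() {
        "utf-8" | "utf8" | "unicode-1-1-utf-8" => Some("utf-8"),
        "euc-kr" | "korean" | "ks_c_5601-1987" => Some("euc-kr"),
        // The standard maps these Latin-1 and ASCII labels to windows-1252.
        "windows-1252" | "latin1" | "iso-8859-1" | "ascii" | "us-ascii" => {
            Some("windows-1252")
        }
        _ => None,
    }
}
```

With this sketch, `canonical_name_from_label("  EUC-KR\n")` yields `Some("euc-kr")`, while a label with inner whitespace such as `"utf 8"` yields `None`.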

## Types and Stuffs

There are three main entry points to Encoding.

**`Encoding`** is a single character encoding.
It contains `encode` and `decode` methods for converting `String` to `Vec<u8>` and vice versa.
For the error handling, they receive **traps** (`EncoderTrap` and `DecoderTrap` respectively)
which replace any error with some string (e.g. `U+FFFD`) or sequence (e.g. `?`).
You can also use `EncoderTrap::Strict` and `DecoderTrap::Strict` traps to stop on an error.
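
To make the trap idea concrete, here is a std-only sketch of a 7-bit ASCII encoder that consults a trap for every unrepresentable character. The `Trap` enum and `encode_ascii` function are hypothetical stand-ins that only mirror the variant names used in this README. They are not the crate's types.

```rust
/// Hypothetical stand-in for an encoder trap; not the crate's `EncoderTrap`.
#[derive(Clone, Copy)]
enum Trap {
    /// Stop at the first unrepresentable character.
    Strict,
    /// Emit `?` in place of the character.
    Replace,
    /// Drop the character silently.
    Ignore,
    /// Emit a decimal numeric character reference such as `&#169;`.
    NcrEscape,
}

/// Encodes `input` as 7-bit ASCII, asking `trap` what to do with every
/// character that ASCII cannot represent.
fn encode_ascii(input: &str, trap: Trap) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(input.len());
    for ch in input.chars() {
        if ch.is_ascii() {
            out.push(ch as u8);
            continue;
        }
        match trap {
            Trap::Strict => {
                return Err(format!("unrepresentable character {:?}", ch));
            }
            Trap::Replace => out.push(b'?'),
            Trap::Ignore => {}
            Trap::NcrEscape => {
                out.extend_from_slice(format!("&#{};", ch as u32).as_bytes());
            }
        }
    }
    Ok(out)
}
```

For the input `"Acme\u{a9}"`, this sketch fails under `Trap::Strict`, produces `Acme?` under `Trap::Replace`, `Acme` under `Trap::Ignore`, and `Acme&#169;` under `Trap::NcrEscape`.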

There are two ways to get `Encoding`:

- `encoding::all` has static items for every supported encoding.
  You should use them when the encoding will not change or only a handful of them are required.
  Combined with link-time optimization, any unused encoding is discarded from the binary.
- `encoding::label` has functions to dynamically get an encoding from a given string ("label").
  They return a static reference to the encoding,
  whose type is also known as `EncodingRef`.
  This is useful when the list of required encodings is not available in advance,
  but it results in a larger binary and missed optimization opportunities.

**`RawEncoder`** is an experimental incremental encoder.
At each step of `raw_feed`, it receives a slice of string
and emits any encoded bytes to a generic `ByteWriter` (normally `Vec<u8>`).
It stops at the first error, if any, and returns a `CodecError` struct in that case.
The caller is responsible for calling `raw_finish` at the end of the encoding process.

**`RawDecoder`** is an experimental incremental decoder.
At each step of `raw_feed`, it receives a slice of byte sequence
and emits any decoded characters to a generic `StringWriter` (normally `String`).
Otherwise it is identical to `RawEncoder`s.
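
The need for a final `raw_finish` call is easiest to see with input that is split in the middle of a multi-byte sequence. The std-only sketch below is a hypothetical incremental UTF-8 decoder: the type `IncrementalUtf8` and its `feed`/`finish` methods only mirror the feed-then-finish pattern described above and are not the crate's API. As a simplification, it discards the rest of a chunk after the first invalid sequence.

```rust
use std::str;

/// Hypothetical incremental UTF-8 decoder, for illustration only.
#[derive(Default)]
struct IncrementalUtf8 {
    /// Bytes of a sequence that was cut off by a chunk boundary.
    pending: Vec<u8>,
}

impl IncrementalUtf8 {
    /// Decodes as much of `input` as possible into `output`.
    /// On the first invalid sequence, returns its bytes as the error.
    fn feed(&mut self, input: &[u8], output: &mut String) -> Result<(), Vec<u8>> {
        self.pending.extend_from_slice(input);
        let buf = std::mem::take(&mut self.pending);
        match str::from_utf8(&buf) {
            Ok(s) => {
                output.push_str(s);
                Ok(())
            }
            Err(e) => {
                let (valid, rest) = buf.split_at(e.valid_up_to());
                // `from_utf8` has already validated this prefix.
                output.push_str(str::from_utf8(valid).expect("validated prefix"));
                match e.error_len() {
                    // Input ended mid-sequence: keep the tail for the next call.
                    None => {
                        self.pending = rest.to_vec();
                        Ok(())
                    }
                    // Definitely invalid bytes: stop at the first error.
                    Some(len) => Err(rest[..len].to_vec()),
                }
            }
        }
    }

    /// Signals the end of input; leftover bytes are a truncated sequence.
    fn finish(&mut self) -> Result<(), Vec<u8>> {
        if self.pending.is_empty() {
            Ok(())
        } else {
            Err(std::mem::take(&mut self.pending))
        }
    }
}
```

Feeding `[0x63, 0x61, 0x66, 0xC3]` and then `[0xA9]` produces `"caf\u{e9}"` across the two calls, whereas feeding only a truncated sequence is reported by `finish` rather than by `feed`.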

One should prefer `Encoding::{encode,decode}` as a primary interface.
`RawEncoder` and `RawDecoder` are experimental and can change substantially.
See the additional documentation in the `encoding::types` module for more information on them.

## Supported Encodings

Encoding covers all encodings specified by WHATWG Encoding Standard and some more:

- 7-bit strict ASCII (`ascii`)
- ArmSCII-8 (`armscii-8`)
- UTF-8 (`utf-8`)
- UTF-16 in little endian (`utf-16` or `utf-16le`) and big endian (`utf-16be`)
- All single byte encodings in WHATWG Encoding Standard:
  - IBM code page 866
  - ISO 8859-{2,3,4,5,6,7,8,10,13,14,15,16}
  - KOI8-R, KOI8-U
  - MacRoman (`macintosh`), Macintosh Cyrillic encoding (`x-mac-cyrillic`)
  - Windows code pages 874, 1250, 1251, 1252 (instead of ISO 8859-1), 1253,
    1254 (instead of ISO 8859-9), 1255, 1256, 1257, 1258
- All multi byte encodings in WHATWG Encoding Standard:
  - Windows code page 949 (`euc-kr`, since the strict EUC-KR is hardly used)
  - EUC-JP and Windows code page 932 (`shift_jis`,
    since it's the most widespread extension to Shift_JIS)
  - ISO-2022-JP with asymmetric JIS X 0212 support
    (Note: this is not yet up to date to the current standard)
  - GBK
  - GB 18030
  - Big5-2003 with HKSCS-2008 extensions
- Encodings that were originally specified by WHATWG Encoding Standard:
  - HZ
- ISO 8859-1 (distinct from Windows code page 1252)
- Code page 437 (`cp437`)

Parenthesized names refer to the encoding's primary name assigned by WHATWG Encoding Standard.

Many legacy character encodings lack a proper specification,
and even those that have one depend heavily on the actual implementation.
Consequently, one should be careful when picking a desired character encoding.
The only standards reliable in this regard are WHATWG Encoding Standard and
[vendor-provided mappings from the Unicode consortium](http://www.unicode.org/Public/MAPPINGS/).
Whenever in doubt, look at the source code and specifications for detailed explanations.