Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/ilyvion/encoding-next

Character encoding support for Rust
https://github.com/ilyvion/encoding-next
Last synced: 8 days ago
JSON representation
Character encoding support for Rust
Host: GitHub
URL: https://github.com/ilyvion/encoding-next
Owner: ilyvion
License: mit
Created: 2022-06-30T09:58:01.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2022-07-11T10:09:36.000Z (over 2 years ago)
Last Synced: 2024-12-23T08:35:58.210Z (16 days ago)
Language: Rust
Size: 7.22 MB
Stars: 5
Watchers: 3
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # Encoding

[![Crates.io](https://img.shields.io/crates/v/encoding-next)](https://crates.io/crates/encoding-next)

[![Crates.io](https://img.shields.io/crates/l/encoding-next)](https://crates.io/crates/encoding-next)

[![Crates.io](https://img.shields.io/crates/d/encoding-next)](https://crates.io/crates/encoding-next)

[![Docs.io](https://docs.rs/encoding-next/badge.svg)](https://docs.rs/encoding-next)

[![Docs master](https://img.shields.io/static/v1?label=docs&message=master&color=5479ab)](https://alexschrod.github.io/encoding-next/)

[![Rust](https://github.com/alexschrod/encoding-next/actions/workflows/CI.yml/badge.svg)](https://github.com/alexschrod/encoding-next/actions/workflows/CI.yml)

[![codecov](https://codecov.io/gh/alexschrod/encoding-next/branch/master/graph/badge.svg?token=C8UJJM7BVJ)](https://codecov.io/gh/alexschrod/encoding-next)

Character encoding support for Rust.

It is based on [WHATWG Encoding Standard](http://encoding.spec.whatwg.org/),

and also provides an advanced interface for error detection and recovery.

## Usage

Put this in your `Cargo.toml`:

```toml

[dependencies]

encoding-next = "0.3"

```

### Data Table

By default, Encoding comes with ~480 KB of data table ("indices").

This allows Encoding to encode and decode legacy encodings efficiently,

but this might not be desirable for some applications.

Encoding provides the `no-optimized-legacy-encoding` Cargo feature

to reduce the size of encoding tables (to ~185 KB)

at the expense of encoding performance (typically 5x to 20x slower).

The decoding performance remains identical.

**This feature is strongly intended for end users.

Do not try to enable this feature from library crates, ever.**

For finer-tuned optimization, see `src/index/gen_index.py` for

custom table generation.

## Overview

To encode a string:

```rust

use encoding::{Encoding, EncoderTrap};

use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),

           Ok(vec![99,97,102,233]));

```

To encode a string with unrepresentable characters:

```rust

use encoding::{Encoding, EncoderTrap};

use encoding::all::ISO_8859_2;

assert!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Strict).is_err());

assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Replace),

           Ok(vec![65,99,109,101,63]));

assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Ignore),

           Ok(vec![65,99,109,101]));

assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::NcrEscape),

           Ok(vec![65,99,109,101,38,35,49,54,57,59]));

```

To decode a byte sequence:

```rust

use encoding::{Encoding, DecoderTrap};

use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.decode(&[99,97,102,233], DecoderTrap::Strict),

           Ok("caf\u{e9}".to_string()));

```

To decode a byte sequence with invalid sequences:

```rust

use encoding::{Encoding, DecoderTrap};

use encoding::all::ISO_8859_6;

assert!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Strict).is_err());

assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Replace),

           Ok("Acme\u{fffd}".to_string()));

assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Ignore),

           Ok("Acme".to_string()));

```

To encode or decode the input into the already allocated buffer:

```rust

use encoding::{Encoding, EncoderTrap, DecoderTrap};

use encoding::all::{ISO_8859_2, ISO_8859_6};

let mut bytes = Vec::new();

let mut chars = String::new();

assert!(ISO_8859_2.encode_to("Acme\u{a9}", EncoderTrap::Ignore, &mut bytes).is_ok());

assert!(ISO_8859_6.decode_to(&[65,99,109,101,169], DecoderTrap::Replace, &mut chars).is_ok());

assert_eq!(bytes, [65,99,109,101]);

assert_eq!(chars, "Acme\u{fffd}");

```

A practical example of custom encoder traps:

```rust

use encoding::{Encoding, ByteWriter, EncoderTrap, DecoderTrap};

use encoding::types::RawEncoder;

use encoding::all::ASCII;

// hexadecimal numeric character reference replacement

fn hex_ncr_escape(_encoder: &mut dyn RawEncoder, input: &str, output: &mut dyn ByteWriter) -> bool {

    let escapes: Vec =

        input.chars().map(|ch| format!("{:x};", ch as isize)).collect();

    let escapes = escapes.concat();

    output.write_bytes(escapes.as_bytes());

    true

}

static HEX_NCR_ESCAPE: EncoderTrap = EncoderTrap::Call(hex_ncr_escape);

let orig = "Hello, 世界!".to_string();

let encoded = ASCII.encode(&orig, HEX_NCR_ESCAPE).unwrap();

assert_eq!(ASCII.decode(&encoded, DecoderTrap::Strict),

           Ok("Hello, 世界!".to_string()));

```

Getting the encoding from the string label, as specified in WHATWG Encoding standard:

```rust

use encoding::{Encoding, DecoderTrap};

use encoding::label::encoding_from_whatwg_label;

use encoding::all::WINDOWS_949;

let euckr = encoding_from_whatwg_label("euc-kr").unwrap();

assert_eq!(euckr.name(), "windows-949");

assert_eq!(euckr.whatwg_name(), Some("euc-kr")); // for the sake of compatibility

let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];

assert_eq!(euckr.decode(broken, DecoderTrap::Replace),

           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

// corresponding Encoding native API:

assert_eq!(WINDOWS_949.decode(broken, DecoderTrap::Replace),

           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

```

## Types and Stuffs

There are three main entry points to Encoding.

**`Encoding`** is a single character encoding.

It contains `encode` and `decode` methods for converting `String` to `Vec` and vice versa.

For the error handling, they receive **traps** (`EncoderTrap` and `DecoderTrap` respectively)

which replace any error with some string (e.g. `U+FFFD`) or sequence (e.g. `?`).

You can also use `EncoderTrap::Strict` and `DecoderTrap::Strict` traps to stop on an error.

There are two ways to get `Encoding`:

-   `encoding::all` has static items for every supported encoding.

    You should use them when the encoding would not change or only handful of them are required.

    Combined with link-time optimization, any unused encoding would be discarded from the binary.

-   `encoding::label` has functions to dynamically get an encoding from given string ("label").

    They will return a static reference to the encoding,

    which type is also known as `EncodingRef`.

    It is useful when a list of required encodings is not available in advance,

    but it will result in the larger binary and missed optimization opportunities.

**`RawEncoder`** is an experimental incremental encoder.

At each step of `raw_feed`, it receives a slice of string

and emits any encoded bytes to a generic `ByteWriter` (normally `Vec`).

It will stop at the first error if any, and would return a `CodecError` struct in that case.

The caller is responsible for calling `raw_finish` at the end of encoding process.

**`RawDecoder`** is an experimental incremental decoder.

At each step of `raw_feed`, it receives a slice of byte sequence

and emits any decoded characters to a generic `StringWriter` (normally `String`).

Otherwise it is identical to `RawEncoder`s.

One should prefer `Encoding::{encode,decode}` as a primary interface.

`RawEncoder` and `RawDecoder` is experimental and can change substantially.

See the additional documents on `encoding::types` module for more information on them.

## Supported Encodings

Encoding covers all encodings specified by WHATWG Encoding Standard and some more:

-   7-bit strict ASCII (`ascii`)

-   ArmSCII-8 (`armscii-8`)

-   UTF-8 (`utf-8`)

-   UTF-16 in little endian (`utf-16` or `utf-16le`) and big endian (`utf-16be`)

-   All single byte encoding in WHATWG Encoding Standard:

    -   IBM code page 866

    -   ISO 8859-{2,3,4,5,6,7,8,10,13,14,15,16}

    -   KOI8-R, KOI8-U

    -   MacRoman (`macintosh`), Macintosh Cyrillic encoding (`x-mac-cyrillic`)

    -   Windows code pages 874, 1250, 1251, 1252 (instead of ISO 8859-1), 1253,

        1254 (instead of ISO 8859-9), 1255, 1256, 1257, 1258

-   All multi byte encodings in WHATWG Encoding Standard:

    -   Windows code page 949 (`euc-kr`, since the strict EUC-KR is hardly used)

    -   EUC-JP and Windows code page 932 (`shift_jis`,

        since it's the most widespread extension to Shift_JIS)

    -   ISO-2022-JP with asymmetric JIS X 0212 support

        (Note: this is not yet up to date to the current standard)

    -   GBK

    -   GB 18030

    -   Big5-2003 with HKSCS-2008 extensions

-   Encodings that were originally specified by WHATWG Encoding Standard:

    -   HZ

-   ISO 8859-1 (distinct from Windows code page 1252)

-   Code page 437 (`cp437`)

Parenthesized names refer to the encoding's primary name assigned by WHATWG Encoding Standard.

Many legacy character encodings lack the proper specification,

and even those that have a specification are highly dependent of the actual implementation.

Consequently one should be careful when picking a desired character encoding.

The only standards reliable in this regard are WHATWG Encoding Standard and

[vendor-provided mappings from the Unicode consortium](http://www.unicode.org/Public/MAPPINGS/).

Whenever in doubt, look at the source code and specifications for detailed explanations.