https://github.com/hsivonen/encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust

charset encoding rust unicode web

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/hsivonen/encoding_rs
Owner: hsivonen
License: other
Created: 2016-01-04T08:47:55.000Z (over 10 years ago)
Default Branch: main
Last Pushed: 2024-11-13T16:38:58.000Z (over 1 year ago)
Last Synced: 2025-05-03T01:57:17.211Z (about 1 year ago)
Topics: charset, encoding, rust, unicode, web
Language: Rust
Homepage: https://docs.rs/encoding_rs/
Size: 4.6 MB
Stars: 405
Watchers: 13
Forks: 56
Open Issues: 21
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE

Awesome Lists containing this project

awesome-rust-cn - hsivonen/encoding_rs - A Gecko-oriented implementation of the Encoding Standard in Rust (Libraries / Encoding)
awesome-rust - hsivonen/encoding_rs - A Gecko-oriented implementation of the Encoding Standard in Rust (Libraries / Encoding)
awesome-rust - hsivonen/encoding_rs - A Gecko-oriented implementation of the Encoding Standard (Libraries / Encoding)
awesome-rust-cn - hsivonen/encoding_rs
awesome-rust-zh - hsivonen/encoding_rs - A Gecko-oriented implementation of the Encoding Standard in Rust (Libraries / Encoding)
fucking-awesome-rust - hsivonen/encoding_rs - A Gecko-oriented implementation of the Encoding Standard (Libraries / Encoding)
awesome-rust-with-stars - hsivonen/encoding_rs - A Gecko-oriented implementation of the Encoding Standard | 2025-12-19 | (Libraries / Encoding)

README

# encoding_rs

[![Build Status](https://github.com/hsivonen/encoding_rs/actions/workflows/ci.yml/badge.svg)](https://github.com/hsivonen/encoding_rs/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/encoding_rs.svg)](https://crates.io/crates/encoding_rs)
[![docs.rs](https://docs.rs/encoding_rs/badge.svg)](https://docs.rs/encoding_rs/)

encoding_rs is an implementation of (the non-JavaScript parts of) the
[Encoding Standard](https://encoding.spec.whatwg.org/) written in Rust.

The Encoding Standard defines the Web-compatible set of character encodings,
which means this crate can be used to decode Web content. encoding_rs is
used in Gecko starting with Firefox 56. Due to the notable overlap between
the legacy encodings on the Web and the legacy encodings used on Windows,
this crate may be of use for non-Web-related situations as well; see below
for links to adjacent crates.

Additionally, the `mem` module provides various operations for dealing with
in-RAM text (as opposed to data that's coming from or going to an IO boundary).
The `mem` module is a module instead of a separate crate due to internal
implementation detail efficiencies.

## Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from
UTF-16 in addition to supporting the usual Rust use case of decoding to and
encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly
to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

* Decodes a stream of bytes in an Encoding Standard-defined character encoding
into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
* Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16
(units of `u16` / `char16_t`) into a sequence of bytes in an Encoding
Standard-defined character encoding as if the lone surrogates had been
replaced with the REPLACEMENT CHARACTER before performing the encode.
(Gecko's UTF-16 is potentially invalid.)
* Decodes a stream of bytes in an Encoding Standard-defined character
encoding into valid UTF-8.
* Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding
Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
* Does the above in streaming (input and output split across multiple
buffers) and non-streaming (whole input in a single buffer and whole
output in a single buffer) variants.
* Avoids copying (borrows) when possible in the non-streaming cases when
decoding to or encoding from UTF-8.
* Resolves textual labels that identify character encodings in
protocol text into type-safe objects representing those encodings
conceptually.
* Maps the type-safe encoding objects onto strings suitable for
returning from `document.characterSet`.
* Validates UTF-8 (in common instruction set scenarios a bit faster for Web
workloads than the standard library; hopefully will get upstreamed some
day) and ASCII.

Additionally, `encoding_rs::mem` does the following:

* Checks if a byte buffer contains only ASCII.
* Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
buffer contains only Latin1 code points (below U+0100).
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior
(suitable for checking if the Unicode Bidirectional Algorithm can be optimized
out).
* Combined versions of the above two checks.
* Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
* Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
* Converts UTF-8 and UTF-16 to Latin1 (if in range).
* Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
* Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
* Copies ASCII from one buffer to another up to the first non-ASCII byte.
* Converts ASCII to UTF-16 up to the first non-ASCII byte.
* Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.

## Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap
a `std::io::Read`, decode it into UTF-8 and present the result via
`std::io::Read`. The [`encoding_rs_io`](https://crates.io/crates/encoding_rs_io)
crate provides that capability.

## `no_std` Environment

The crate works in a `no_std` environment. By default, the `alloc` feature,
which assumes that an allocator is present, is enabled. For a no-allocator
environment, the default features (i.e. `alloc`) can be turned off. This
makes the parts of the API that return `Vec`/`String`/`Cow` unavailable.

## Decoding Email

For decoding character encodings that occur in email, use the
[`charset`](https://crates.io/crates/charset) crate instead of using this
one directly. (It wraps this crate and adds UTF-7 decoding.)

## Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the
[`codepage`](https://crates.io/crates/codepage) crate.

## DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by
the Web Platform, but the [`oem_cp`](https://crates.io/crates/oem_cp) crate does.

## Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into
a legacy encoding minimizes unmappable characters. Text can be normalized to
Unicode Normalization Form C using the
[`icu_normalizer`](https://crates.io/crates/icu_normalizer) crate.

The exception is windows-1258, which after normalizing to Unicode Normalization
Form C requires tone marks to be decomposed in order to minimize unmappable
characters. Vietnamese tone marks can be decomposed using the
[`detone`](https://crates.io/crates/detone) crate.

## Licensing

TL;DR: `(Apache-2.0 OR MIT) AND BSD-3-Clause` for the code and data combination.

Please see the file named
[COPYRIGHT](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT).

The non-test code that isn't generated from the WHATWG data in this crate is
under Apache-2.0 OR MIT. Test code is under CC0.

This crate contains code/data generated from WHATWG-supplied data. The WHATWG
upstream changed its license for portions of specs incorporated into source code
from CC0 to BSD-3-Clause between the initial release of this crate and the present
version of this crate. The in-source licensing legends have been updated for the
parts of the generated code that have changed since the upstream license change.

## Documentation

Generated [API documentation](https://docs.rs/encoding_rs/) is available
online.

There is a [long-form write-up](https://hsivonen.fi/encoding_rs/) about the
design and internals of the crate.

## C and C++ bindings

An FFI layer for encoding_rs is available as a
[separate crate](https://github.com/hsivonen/encoding_c). The crate comes
with a [demo C++ wrapper](https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h)
using the C++ standard library and [GSL](https://github.com/Microsoft/GSL/) types.

The bindings for the `mem` module are in the
[encoding_c_mem crate](https://github.com/hsivonen/encoding_c_mem).

For the Gecko context, there's a
[C++ wrapper using the MFBT/XPCOM types](https://searchfox.org/mozilla-central/source/intl/Encoding.h#100).

There's a [write-up](https://hsivonen.fi/modern-cpp-in-rust/) about the C++
wrappers.

## Sample programs

* [Rust](https://github.com/hsivonen/recode_rs)
* [C](https://github.com/hsivonen/recode_c)
* [C++](https://github.com/hsivonen/recode_cpp)

## Optional features

There are currently these optional cargo features:

### `simd-accel`

Enables SIMD acceleration using the nightly-dependent `portable_simd` standard
library feature.

This is an opt-in feature, because enabling this feature _opts out_ of Rust's
guarantees of future compilers compiling old code (a.k.a. the "stability story").

Currently, this has not been tested to be an improvement except for these
targets, and enabling the `simd-accel` feature is expected to break the build
on other targets:

* x86_64
* i686
* aarch64
* thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the
above, and you are prepared _to have to revise your configuration when updating
Rust_, you should enable this feature. Otherwise, please _do not_ enable this
feature.

Used by Firefox.

### `serde`

Enables support for serializing and deserializing `&'static Encoding`-typed
struct fields using [Serde][1].

[1]: https://serde.rs/

Not used by Firefox.

### `fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. _Does not
affect decode speed or UTF-8 encode speed._

At present, this option is equivalent to enabling the following options:
* `fast-hangul-encode`
* `fast-hanja-encode`
* `fast-kanji-encode`
* `fast-gb-hanzi-encode`
* `fast-big5-hanzi-encode`

Adds 176 KB to the binary size.

Not used by Firefox.

### `fast-hangul-encode`

Changes the encoding of precomposed Hangul syllables into EUC-KR from binary
search over the decode-optimized tables to lookup by index, making Korean
plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.
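
The trade-off between the two lookup strategies can be illustrated with a toy sketch. The table below is a made-up five-entry example, not the crate's data, and all names are hypothetical. It only shows why an index-based table is faster but larger than a binary search over a table laid out for decoding.

```rust
/// Toy decode-optimized table: position = pointer, value = UTF-16 code unit.
/// The values are sorted, which is what makes binary search possible.
const DECODE_TABLE: [u16; 5] = [0xAC00, 0xAC01, 0xAC04, 0xAC07, 0xAC08];

const FIRST: u16 = 0xAC00;
/// Number of code units covered by the index-based table (0xAC00..=0xAC08).
const SPAN: usize = 9;

/// Encode without extra data: O(log n) binary search over the decode table.
pub fn encode_by_binary_search(code_unit: u16) -> Option<usize> {
    DECODE_TABLE.binary_search(&code_unit).ok()
}

/// Builds the encode-optimized table: one slot per code unit in the range,
/// including slots for code units that have no pointer. This is the extra
/// binary size that buys the faster lookup.
pub fn build_encode_table() -> [Option<u8>; SPAN] {
    let mut table = [None; SPAN];
    for (pointer, &code_unit) in DECODE_TABLE.iter().enumerate() {
        table[(code_unit - FIRST) as usize] = Some(pointer as u8);
    }
    table
}

/// Encode with the extra table: a single O(1) index operation.
pub fn encode_by_index(table: &[Option<u8>; SPAN], code_unit: u16) -> Option<usize> {
    let offset = code_unit.checked_sub(FIRST)? as usize;
    table.get(offset).copied().flatten().map(usize::from)
}
```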