https://github.com/b4n/wtf8tools

# WTF-8 conversion tools

A set of naive tools to convert between broken UTF-16 and WTF-8.
See https://en.wikipedia.org/wiki/UTF-8#WTF-8

The only purpose of these tools is to convert to and from broken UTF-16 (that
is, UTF-16 with unpaired surrogates), which Windows seems to happily generate.

Basically, all they do is happily read or write unpaired surrogate halves.

## (Broken) UTF-16 to WTF-8

`wtf162wtf8` reads UTF-16 code units and tries to decode code points from
them. If that succeeds, it writes the decoded code point as UTF-8. If it
doesn't succeed, i.e. if it finds a high or low surrogate without its other
half, it writes the surrogate half as UTF-8 (which makes it WTF-8).

The result is WTF-8, and even UTF-8 if the input is valid UTF-16.
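
As an illustration, the decoding loop described above can be sketched in
Python. This is only a minimal sketch and not the tools' actual code: the
function names `encode_utf8_like` and `utf16_units_to_wtf8` are made up for
this example, and it works on a list of already-read 16-bit code units rather
than on `stdin`.

```python
def encode_utf8_like(cp):
    """Encode a code point with the generic UTF-8 bit layout.

    Surrogate code points (0xD800..0xDFFF) are deliberately not rejected.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units_to_wtf8(units):
    """Convert a sequence of 16-bit code units to WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        # A high surrogate directly followed by a low surrogate is a
        # proper pair and decodes to one supplementary code point.
        if (0xD800 <= cp <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # Otherwise cp is a plain BMP code point or an unpaired surrogate
        # half; both are written out the same way.
        out += encode_utf8_like(cp)
    return bytes(out)
```

For example, `utf16_units_to_wtf8([0x41, 0xD800])` yields the byte `A`
followed by `ED A0 80`, the three-byte encoding of the lone high surrogate.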

## WTF-8 to (broken) UTF-16

`wtf82utf16` does the reverse conversion: given WTF-8 input, it reconstructs
the possibly broken UTF-16 data. All it actually does is write every code
point below `0x10000` as a plain UTF-16 unit, even surrogate halves.
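
The reverse direction can be sketched in the same hedged way. The function
name `wtf8_to_utf16_units` is hypothetical, and the sketch assumes that code
points at or above `0x10000` are split into a surrogate pair as in regular
UTF-16. It performs only basic structural checks and does not reject
overlong sequences.

```python
def wtf8_to_utf16_units(data):
    """Convert WTF-8 bytes to a list of 16-bit UTF-16 code units."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Determine the sequence length and payload bits from the lead byte.
        if lead < 0x80:
            cp, length = lead, 1
        elif lead >> 5 == 0b110:
            cp, length = lead & 0x1F, 2
        elif lead >> 4 == 0b1110:
            cp, length = lead & 0x0F, 3
        elif lead >> 3 == 0b11110:
            cp, length = lead & 0x07, 4
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Fold in the six payload bits of each continuation byte.
        for cont in data[i + 1:i + length]:
            if cont >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (cont & 0x3F)
        i += length
        if cp < 0x10000:
            # Written as-is, even when it is a surrogate half.
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units
```

With this sketch, `wtf8_to_utf16_units(b"A\xed\xa0\x80")` gives back
`[0x41, 0xD800]`, i.e. the unpaired surrogate survives the round trip.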

## UTF-32 support

As a proof of concept, there is also support for broken UTF-32. Just like
WTF-8 and broken UTF-16, it allows reserved code points to appear and encodes
and decodes them happily. Only WTF-8/UTF-32 pairs are provided, but they can
be streamed together to convert directly between UTF-16 and UTF-32, using e.g.
`wtf162wtf8 < input | wtf82utf32 > output`.
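
To illustrate what "broken UTF-32" looks like, here is a rough Python sketch
of a WTF-8 to UTF-32 conversion. It is not how the tools are implemented: the
function name `wtf8_to_utf32le` is hypothetical, the output byte order is
fixed to little endian for the sake of the example, and Python's
`surrogatepass` error handler is used as a stand-in for a WTF-8 decoder.

```python
import struct


def wtf8_to_utf32le(data):
    """Convert WTF-8 bytes to little-endian UTF-32, surrogates included."""
    # "surrogatepass" lets the UTF-8 codec decode surrogate code points
    # instead of raising an error on them.
    text = data.decode("utf-8", "surrogatepass")
    # Every code point, reserved or not, becomes one 32-bit unit.
    return b"".join(struct.pack("<I", ord(ch)) for ch in text)
```

Here `wtf8_to_utf32le(b"A\xed\xa0\x80")` produces the two units `0x00000041`
and `0x0000D800`, the second being a reserved code point that strict UTF-32
would not allow.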

## Regarding Endianness

These tools are naive, and don't actually do anything about endianness. The
result is that if they are run on a Big Endian machine, they read and write
UTF-16BE, and if they are run on a Little Endian machine (far more common),
they read and write UTF-16LE.

As these tools are typically useful with UTF-16LE, and most machines are
Little Endian, it should generally work fine. Hopefully.
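
If the data and the machine disagree on byte order, one possible workaround
is to swap the two bytes of every code unit before or after running the
tools. The following Python sketch shows the idea; both helper names are
hypothetical and nothing like them ships with these tools.

```python
import struct
import sys


def swap_utf16_byte_order(data):
    """Swap the two bytes of every 16-bit unit (UTF-16LE <-> UTF-16BE)."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    swapped = bytearray(data)
    # Even-indexed bytes take the odd-indexed ones and vice versa.
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


def read_units(data, byteorder=sys.byteorder):
    """Read 16-bit code units from bytes using the given byte order.

    The default mimics a naive reader that uses the machine's byte order.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack(prefix + "H" * (len(data) // 2), data))
```

Reading the bytes `41 00` as little endian gives `0x0041`, while reading them
as big endian gives `0x4100`, which is why feeding UTF-16LE data to a tool
running on a Big Endian machine would produce garbage unless the bytes are
swapped first.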

## Usage

To convert from (broken) UTF-16 to WTF-8, use `wtf162wtf8 < input > output`.
Similarly, to convert from WTF-8 to (broken) UTF-16, use
`wtf82utf16 < input > output`.

You can control the verbosity through the `VERBOSE` environment variable: set
it to a positive integer to get verbose/debugging output on `stderr`.