https://github.com/b4n/wtf8tools

WTF-8 conversion tools
encoding-convertors unicode utf-16 utf-32 utf-8 wtf-8

Last synced: 2 months ago
Host: GitHub
URL: https://github.com/b4n/wtf8tools
Owner: b4n
License: gpl-3.0
Created: 2016-09-19T17:38:22.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2016-09-19T18:12:32.000Z (over 8 years ago)
Last Synced: 2025-01-24T21:14:44.433Z (4 months ago)
Topics: encoding-convertors, unicode, utf-16, utf-32, utf-8, wtf-8
Language: C
Size: 18.6 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING

README

# WTF-8 conversion tools

A set of naive tools to convert between broken UTF-16 and WTF-8.
See https://en.wikipedia.org/wiki/UTF-8#WTF-8

The only purpose of these tools is to convert to and from broken UTF-16 (that
is, UTF-16 with unpaired surrogates), which Windows seems to happily generate.

Basically, all they do is happily read or write unpaired surrogate halves.

## (Broken) UTF-16 to WTF-8

`wtf162wtf8` reads UTF-16 code units and tries to decode code points from
them. If that succeeds, it writes the decoded code point as UTF-8. If it
doesn't succeed, i.e. if the unit is a high or low surrogate without its other
half, it writes the surrogate half as UTF-8 (which makes the output WTF-8).

The result is WTF-8, and even UTF-8 if the input is valid UTF-16.
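
The conversion described above can be sketched roughly as follows. This is an
illustrative sketch, not the actual source of `wtf162wtf8`: the function names
`wtf8_encode` and `utf16_to_wtf8` are hypothetical, and it works on an
in-memory buffer of native-endian code units rather than on a stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point with the generic UTF-8 scheme.
 * Surrogates are NOT rejected, which is what makes the output WTF-8.
 * Returns the number of bytes written to out (1 to 4). */
static size_t wtf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: convert n UTF-16 code units to WTF-8.
 * out must have room for at least 3 * n bytes. Returns the bytes written. */
static size_t utf16_to_wtf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        /* A high surrogate directly followed by a low one forms a pair. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        }
        /* Otherwise cp is a BMP code point or a lone surrogate half,
         * and is encoded as is. */
        len += wtf8_encode(cp, out + len);
    }
    return len;
}
```

With this sketch, a lone `0xD800` unit comes out as the three bytes
`ED A0 80`, which strict UTF-8 decoders reject but WTF-8 allows.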

## WTF-8 to (broken) UTF-16

`wtf82utf16` does the reverse conversion: given WTF-8 input, it reconstructs
the possibly broken UTF-16 data. All it actually does is write every code
point below `0x10000` as a plain UTF-16 unit, even surrogate halves.
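
A rough sketch of this direction is shown below. It is not the actual source
of `wtf82utf16`: the names `wtf8_decode` and `wtf8_to_utf16` are hypothetical,
and the sketch assumes the input is complete, well-formed WTF-8 (it performs
no validation and would misbehave on truncated or malformed sequences).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one sequence starting at in.
 * Assumes well-formed input. Stores the code point in *cp and returns
 * the number of bytes consumed (1 to 4). */
static size_t wtf8_decode(const unsigned char *in, uint32_t *cp)
{
    if (in[0] < 0x80) {
        *cp = in[0];
        return 1;
    }
    if ((in[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(in[0] & 0x1F) << 6) | (uint32_t)(in[1] & 0x3F);
        return 2;
    }
    if ((in[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(in[0] & 0x0F) << 12) |
              ((uint32_t)(in[1] & 0x3F) << 6) |
              (uint32_t)(in[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(in[0] & 0x07) << 18) |
          ((uint32_t)(in[1] & 0x3F) << 12) |
          ((uint32_t)(in[2] & 0x3F) << 6) |
          (uint32_t)(in[3] & 0x3F);
    return 4;
}

/* Hypothetical helper: convert n bytes of WTF-8 to UTF-16 code units.
 * out must have room for at least n units. Returns the units written. */
static size_t wtf8_to_utf16(const unsigned char *in, size_t n, uint16_t *out)
{
    size_t len = 0;
    size_t i = 0;

    while (i < n) {
        uint32_t cp;

        i += wtf8_decode(in + i, &cp);
        if (cp < 0x10000) {
            /* Written as a single unit, even if it is a surrogate half. */
            out[len++] = (uint16_t)cp;
        } else {
            /* Supplementary code point: split into a surrogate pair. */
            assert(cp <= 0x10FFFF);
            cp -= 0x10000;
            out[len++] = (uint16_t)(0xD800 + (cp >> 10));
            out[len++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    return len;
}
```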

## UTF-32 support

As a proof of concept, there is also support for broken UTF-32. Just like
WTF-8 and broken UTF-16, it allows reserved code points to appear, and encodes
and decodes them happily. Only WTF-8/UTF-32 pairs are provided, but they can
be piped together to convert directly between UTF-16 and UTF-32, using e.g.
`wtf162wtf8 < input | wtf82utf32 > output`.
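
To illustrate what such a pipeline amounts to, here is a sketch of the same
UTF-16 to UTF-32 conversion done in a single step, without the WTF-8
intermediate. The function name `utf16_to_utf32_lenient` is hypothetical and
this is not code from the tools themselves; it only shows that lone surrogate
halves are passed through as UTF-32 units instead of being rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n UTF-16 code units to UTF-32 units,
 * keeping unpaired surrogates as they are.
 * out must have room for at least n units. Returns the units written. */
static size_t utf16_to_utf32_lenient(const uint16_t *in, size_t n,
                                     uint32_t *out)
{
    size_t len = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        /* Combine a proper high/low surrogate pair into one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        }
        /* A lone surrogate half falls through unchanged ("broken" UTF-32). */
        out[len++] = cp;
    }
    return len;
}
```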

## Regarding Endianness

These tools are naive and don't actually do anything about endianness. As a
result, on a big-endian machine they read and write UTF-16BE, and on a
little-endian machine (far more common) they read and write UTF-16LE.

As these tools are typically useful with UTF-16LE, and most machines are
little-endian, it should generally work fine. Hopefully.
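
The difference can be sketched as follows. None of these helpers exist in the
tools; the names are hypothetical. `read_u16_native` shows what a naive read
into a 16-bit integer presumably amounts to, while `read_u16le` shows how a
unit could be assembled as UTF-16LE regardless of the host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;

    return *(const unsigned char *)&probe == 1;
}

/* Reinterpret two bytes in host byte order (endianness-naive). */
static uint16_t read_u16_native(const unsigned char b[2])
{
    uint16_t unit;

    assert(b != NULL);
    memcpy(&unit, b, sizeof unit);
    return unit;
}

/* Assemble a UTF-16LE code unit explicitly, on any host. */
static uint16_t read_u16le(const unsigned char b[2])
{
    assert(b != NULL);
    return (uint16_t)(b[0] | (b[1] << 8));
}
```

On a little-endian host both functions return the same value, which is why
the naive approach works there for UTF-16LE input.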

## Usage

To convert from (broken) UTF-16 to WTF-8, use `wtf162wtf8 < input > output`.
Similarly, to convert from WTF-8 to (broken) UTF-16, use
`wtf82utf16 < input > output`.

You can control the verbosity through the `VERBOSE` environment variable: set
it to a positive integer to get verbose/debugging output on `stderr`.
