https://github.com/b4n/wtf8tools
WTF-8 conversion tools
encoding-convertors unicode utf-16 utf-32 utf-8 wtf-8
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/b4n/wtf8tools
- Owner: b4n
- License: gpl-3.0
- Created: 2016-09-19T17:38:22.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-09-19T18:12:32.000Z (over 8 years ago)
- Last Synced: 2025-01-24T21:14:44.433Z (4 months ago)
- Topics: encoding-convertors, unicode, utf-16, utf-32, utf-8, wtf-8
- Language: C
- Size: 18.6 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING
README
# WTF-8 conversion tools
A set of naive tools to convert between broken UTF-16 and WTF-8.
See https://en.wikipedia.org/wiki/UTF-8#WTF-8

The only purpose of these tools is to convert to and from broken UTF-16 (that
is, UTF-16 with unpaired surrogates), which Windows seems to happily generate.

Basically, all they do is happily read or write unpaired surrogate halves.
## (Broken) UTF-16 to WTF-8
`wtf162wtf8` reads UTF-16 code units and tries to decode code points from
them. If that succeeds, it writes the code point as UTF-8. If it doesn't
succeed, i.e. if the unit is a high or low surrogate without its other half,
it writes the surrogate half as UTF-8 (which makes the output WTF-8).

The result is WTF-8, and it is even valid UTF-8 if the input is valid UTF-16.
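
The conversion described above can be sketched in a few lines of C. This is only an illustration of the technique, not the code of `wtf162wtf8` itself: the names `put_wtf8` and `wtf16_to_wtf8` are hypothetical, and the sketch works on an in-memory buffer rather than on `stdin`/`stdout`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as (generalized) UTF-8.
 * Surrogate code points are deliberately not rejected, which is what
 * turns the output into WTF-8. Returns the number of bytes written. */
static size_t put_wtf8(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical converter: turn n UTF-16 code units (in host byte order)
 * into WTF-8. The caller must provide room for at least 3 * n bytes.
 * Returns the number of bytes written. */
size_t wtf16_to_wtf8(const uint16_t *in, size_t n, uint8_t *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        /* A high surrogate directly followed by a low surrogate forms
         * a proper pair: combine both units into one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        }
        /* Otherwise cp is a BMP code point or an unpaired surrogate
         * half; either way it is written as is. */
        len += put_wtf8(cp, out + len);
    }
    return len;
}
```

With this sketch, the units `0x0041 0xD83D 0xDE00 0xD800` become the bytes `41 F0 9F 98 80 ED A0 80`: the valid pair is encoded as a regular four-byte sequence, while the trailing lone high surrogate becomes the three-byte sequence `ED A0 80` that strict UTF-8 forbids.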
## WTF-8 to (broken) UTF-16
`wtf82utf16` does the reverse conversion: given WTF-8 input, it reconstructs
the possibly broken UTF-16 data. All it actually does is write every code
point below `0x10000` as a plain UTF-16 unit, even surrogate halves.

## UTF-32 support
As a proof of concept, there is also support for broken UTF-32. Just like
WTF-8 and broken UTF-16, it allows reserved code points to appear and encodes
and decodes them happily. Only WTF-8/UTF-32 pairs are provided, but they can
be piped together to convert directly between UTF-16 and UTF-32, using e.g.
`wtf162wtf8 < input | wtf82utf32 > output`.

## Regarding Endianness
These tools are naive and don't actually do anything about endianness. The
result is that if they are run on a big-endian machine, they read and write
UTF-16BE, and if they are run on a little-endian machine (far more common),
they read and write UTF-16LE.

As these tools are typically useful with UTF-16LE, and most machines are
little-endian, it should generally work fine. Hopefully.

## Usage
To convert from (broken) UTF-16 to WTF-8, use `wtf162wtf8 < input > output`.
Similarly, to convert from WTF-8 to (broken) UTF-16, use
`wtf82utf16 < input > output`.

You can control the verbosity through the `VERBOSE` environment variable: set
it to a positive integer to get verbose/debugging output on `stderr`.
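
For illustration, here is one way a C program might interpret such a variable. How the tools actually parse `VERBOSE` is not described here, so this is a sketch under assumptions: the helper names `parse_verbosity` and `verbosity_from_env` are hypothetical, and treating anything that is not a positive integer as "quiet" is a guess based on the description above.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: turn the text of an environment variable into a
 * verbosity level. Returns 0 (quiet) if the value is missing, does not
 * start with a number, or is not positive. */
int parse_verbosity(const char *value)
{
    char *end;
    long level;

    if (value == NULL)
        return 0;

    level = strtol(value, &end, 10);
    if (end == value || level <= 0)
        return 0;

    /* Clamp so that the result always fits in an int. */
    return level > INT_MAX ? INT_MAX : (int)level;
}

/* Hypothetical wrapper: read the level from the VERBOSE variable. */
int verbosity_from_env(void)
{
    return parse_verbosity(getenv("VERBOSE"));
}
```

A program using this sketch would call `verbosity_from_env()` once at startup and write its debugging messages to `stderr` only when the result is greater than zero, as with `VERBOSE=1 wtf162wtf8 < input > output`.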