https://github.com/codebrainz/libutfxx

C++ UTF encoding conversion routines
https://github.com/codebrainz/libutfxx
Last synced: about 2 months ago
JSON representation
C++ UTF encoding conversion routines
Host: GitHub
URL: https://github.com/codebrainz/libutfxx
Owner: codebrainz
Created: 2014-08-25T23:25:24.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2014-08-26T03:38:22.000Z (almost 11 years ago)
Last Synced: 2025-03-26T08:23:29.627Z (2 months ago)
Language: C++
Size: 168 KB
Stars: 12
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project

README

        LibUTF++

========

LibUTF++ is a simple C++ library for converting between [UTF-8][utf8],

[UTF-16][utf16], and [UTF-32][utf32] encodings. The API consists of a set

of free functions taking particular `std::basic_string` specialized types

depending on the encoding.

[utf8]: http://en.wikipedia.org/wiki/UTF-8

[utf16]: http://en.wikipedia.org/wiki/UTF-16

[utf32]: http://en.wikipedia.org/wiki/UTF-32

Using LibUTF++ in your Project

------------------------------

The recommended way to use LibUTF++ is to copy the (generated) `utf.cxx` file

and the header `utf.h` into your own project tree and compile them with your

existing build system. This is the intended way to use LibUTF++ since it

reduces distribution and versioning complexities and compatibility problems

from using different compilers and such, even if it's not a best practice on

all platforms.

Using the shared library

------------------------

LibUTF++ comes with a very simple GNU Make build system that can compile

LibUTF++ as a shared library for UNIX-like platforms. To compile the library

simply run `make` from the source directory.

Dependencies

------------

Not much, basically any relatively modern C++ compiler should do.

### C++11

While not required, it is recommended to enable C++11-mode in the C++

compiler, where supported. For GCC-like compilers the `-std=c++0x` or

`-std=c++11` options should do this. Using C++11-mode allows use of Unicode

string literals `u8""` (UTF-8), `u""` (UTF-16), and `U""` (UTF-32) as well as

2 of the 3 proper character types needed by LibUTF++, `char16_t` and

`char32_t`, with the old `char` type filling the place of the missing

`char8_t` type.

### Make Build System

To use the simplistic GNU Make build system requires:

- GNU Make

- A GCC-like C++ compiler

- Python 2.7+

- Various other UNIX-like tools (cp, rm, sed, etc)

__Note:__ it is not recommended to use the GNU Make build system for anything

more than generating the built files (namely `utf.cxx` and `index.html`). See

"Using LibUTF++ in your Project" for more details on integrating LibUTF++

into your own source tree.

WTF are the ConvertUTF.[ch] files?

----------------------------------

The files `ConvertUTF.c` and `ConvertUTF.h` are plain C files that contain

algorithms for converting between UTF encodings. They used to be distributed

on the official Unicode website but are no longer hosted or supported.

Several projects include these same files in their source tree such as

[LLVM/Clang/LLDB][llvm] and [Gears][gears] (the top hits I found when search

Google for "ConvertUTF.c"). I chose to use these existing conversion routines

rather than re-write them myself from scratch (likely much, much more buggy)

or cobble together routines from several different sources. Some day it would

be nice to remove these files and just use the features built in to standard

C++.

To make distribution simpler I have chose to inline these files straight

into the C++ code to avoid numerous files and to possibly provide some more

optimization oportunities for the optimizing compiler. This is similar to the

[SQLite Amalgamation][sqlite].

[llvm]: http://llvm.org/docs/doxygen/html/ConvertUTF_8c_source.html

[gears]: http://gears.googlecode.com/svn/trunk/third_party/convert_utf/ConvertUTF.c

[sqlite]: http://www.sqlite.org/amalgamation.html

Similar and Related Projects

----------------------------

There are many open source and commercial alternatives to LibUTF++, I can

recommened the following projects:

- [UTF8-CPP][utf8cpp]: A nice and simple to use header-only library that provides

routines to convert to and from UTF-8.

- [ICU][icu]: If you need full-blown Unicode support (and more), you probably

won't find a better library than this.

[utf8cpp]: http://utfcpp.sourceforge.net/

[icu]: http://site.icu-project.org/

The API

-------

The functions exposed are very simple to use and are intended to convert

between UTF encodings of whole strings at time. To do streaming-style

conversion of massive amounts of data, consider using the ConvertUTF.[ch]

files directly or using a much better library like ICU.

When the API refers to the numbers 8, 16, and 32, it's referring to the

UTF-8, UTF-16, and UTF-32 encodings, respectively.

### Types

The `utf.h` header typedef's a few types in the `utf` namespace.

#### utf::char8

This is always typedef'd to the builtin C++ `char` type.

#### utf::char16

This is typedef'd differently depending on the compiler's support for C++11

and the size of `wchar_t`. When C++11 support is enabled, this is typedef'd

to `char16_t` (from `cuchar` header), otherwise if the platform uses a 16-bit

`wchar_t` type (ex. Win32), it's typedef'd to that. In all other cases it's

typedef'd to the `uint16_t` type.

When using C++11 mode, you can use the u""-style Unicode string literals

with this type, or else if in 16-bit `wchar_t` mode (ex. Win32) you can use

wide character string literals L"" (not recommended).

#### utf::char32

This is just like `utf::char16` except it's it's 32-bits wide and so uses

`char32_t` in C++11 mode, `wchar_t` if in 32-bit `wchar_t` mode (ex. Linux

and most UNIXes), or `uint32_t` otherwise.

When using C++11 mode, you can use the U""-style Unicode string literals

with this type, or else if in 32-bit `wchar_t` mode you can use wide character

string literals L"" (not recommended).

#### utf::string8

This is a typedef of `std::basic_string`, which is the same as the

`std::string` type.

#### utf::string16

This is a typedef of `std::basic_string`, which, depending on the

`utf::char16` type may be equivalent to `std::u16string`, `std::wstring`

or `std::basic_string`.

#### utf::string32

This is a typedef of `std::basic_string`, which, depending on the

`utf::char32` type may be equivalent to `std::u32string`, `std::wstring`

or `std::basic_string`.

#### utf::conversion_error

This is the top-level exception and any exceptions in the API inherit from

this. It itself derives from `std::runtime_error` and so provides the

`what()` member function to retrieve a string explaining the exception. It

also provides a `code()` member function which gives an error number based

on the ConvertUTF.[ch] result type (mostly useless in C++, just catch the

specific dervied exception type).

#### utf::source_exhausted

This type of exception is thrown when the end of the input string is reached

in the middle of decoding a code point. This class derives from

`utf::conversion_error`.

#### utf::illegal_input

This type of exception is thrown when invalid UTF-encoded data is encountered

in the input string. This class derives from `utf::conversion_error`.

### Conversion Functions

There's a few different types of functions that can be used to perform the

conversions, which one to use is mostly a matter of taste/style and mostly

they are simple inline wrappers around the type-specific conversion functions.

#### Type-specific Conversion Functions

These functions are named according to the input and output encoding. You

probably won't want to use these directly but rather through the `utf::convert()`

function.

The prototype of these functions are like:

	void cvt_N1_to_N2(const utf::stringN1& in, utf::stringN2& out);

Where N1 and N2 are one of 8, 16, or 32 depending on the encoding.

#### Generic Overloaded Conversion Function

This function is probably the best choice in most cases. The signature is

the same as the type-specific conversion functions but uses the argument

types and C++ function overloading to choose the correct type-specific

conversion function automatically.

The prototype of this function is:

	void convert(const utf::stringN1& in, utf::stringN2& out);

Where N1 and N2 are one of 8, 16, or 32 depending on the encoding. For example

to convert from UTF-8 to UTF-32:

	utf::string8 s8 = "Hello World";

	utf::string32 s32;

	try {

		utf::convert(s8, s32);

	} catch (utf::conversion_error& e) {

		std::cerr << "Failed: " << e.what() << std::endl;

	}

#### Return Type-specific Functions

These functions are specific to the return type and rather than use an

output argument for the target string, a new string is created and returned

to the called using the return value (and hopefull RVO).

The prototype for these functions is:

	utf::stringN2 to_utfN2(const utf::stringN1& in);

Where N1 and N2 are one of 8, 16, or 32 depending on the encoding. For example

to convert from UTF-16 to UTF-8 (exception handling not shown):

	utf::string16 s16 = u"Hello World";

	utf::string8 s = utf::to_utf8(s16);

The functions are overloaded to accept any of the UTF-8, UTF-16 or UTF-32

string types defined in the `utf` namespace.

### String Class

LibUTF++ also provides a class in the `utfstring.h` header file that behaves

like a `std::basic_string` by actually containing one and forwarding all the

calls to it, performing conversions where needed. It should be pretty obvious

how to use it if you've used `std::string` and friend before.

There are 3 typedef's for the `utf::string` template class: `utf::u8string`,

`utf::u16string` and `utf::u32string` for UTF-8, 16, and 32, respectively.

Choose the flavour depending on how you want to trade off time and space.

A `utf::u32string` will hold 32-bit code points and so take more memory, while

a `utf::u8string` will hold 8-bit encoded data and so take more time doing

conversions while being more space-efficient.

Here's a little demo using `utf::string` with C++11:

	#include 

	#include 

	...

	int main()

	{

		utf::u8string s1 = U"Some 32-bit string";  // UTF-32 -> UTF-8 conversion

		utf::u16string s2 = u8"Some 8-bit string"; // UTF-8 -> UTF-16 conversion

		utf::u32string s3;

		s3 += s1; // UTF-8 to UTF-32 conversion

		s3 += s2; // UTF-16 to UTF-32 conversion

		std::cout << s3 << std::endl; // UTF-32 to UTF-8 (or other) conversion

		return 0;

	}

Legal

-----

The C++ wrapper code is distributed under the MIT license to make it easier

to embed the files inside other projects. The ConvertUTF.[ch] files from

Unicode, Inc. also have their own license (see below) that is compatible with

LibUTF++'s MIT license.

For using the LibUTF++ files in your project, all you need to do is copy the

(generated) `utf.cxx` file and the header `utf.h` into your source tree and

simply leave the license/copyright comments in the files as is.

### The LibUTF++ MIT License

> Copyright (c) 2014 Matthew Brush 

Permission is hereby granted, free of charge, to any person obtaining a copy

of this software and associated documentation files (the "Software"), to deal

in the Software without restriction, including without limitation the rights

to use, copy, modify, merge, publish, distribute, sublicense, and/or sell

copies of the Software, and to permit persons to whom the Software is

furnished to do so, subject to the following conditions:

- The above copyright notice and this permission notice shall be included in

all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR

IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER

LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,

OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN

THE SOFTWARE.

### The Unicode, Inc. license for ConvertUTF.[ch] files

> Copyright 2001-2004 Unicode, Inc.

#### Disclaimer

This source code is provided as is by Unicode, Inc. No claims are

made as to fitness for any particular purpose. No warranties of any

kind are expressed or implied. The recipient agrees to determine

applicability of information provided. If this file has been

purchased on magnetic or optical media from Unicode, Inc., the

sole remedy for any claim will be exchange of defective media

within 90 days of receipt.

#### Limitations on Rights to Redistribute This Code

Unicode, Inc. hereby grants the right to freely use the information

supplied in this file in the creation of products supporting the

Unicode Standard, and to make copies of this file in any form

for internal or external distribution as long as this notice

remains attached.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/codebrainz/libutfxx

Awesome Lists containing this project

README