{"id":20400081,"url":"https://github.com/codebrainz/libutfxx","last_synced_at":"2025-08-08T01:20:09.001Z","repository":{"id":20063351,"uuid":"23332033","full_name":"codebrainz/libutfxx","owner":"codebrainz","description":"C++ UTF encoding conversion routines","archived":false,"fork":false,"pushed_at":"2014-08-26T03:38:22.000Z","size":172,"stargazers_count":12,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-26T08:23:29.627Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codebrainz.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-08-25T23:25:24.000Z","updated_at":"2024-09-10T19:44:16.000Z","dependencies_parsed_at":"2022-09-02T13:41:46.224Z","dependency_job_id":null,"html_url":"https://github.com/codebrainz/libutfxx","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codebrainz%2Flibutfxx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codebrainz%2Flibutfxx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codebrainz%2Flibutfxx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codebrainz%2Flibutfxx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codebrainz","download_url":"https://codeload.github.com/codebrainz/libutfxx/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248575636,"
owners_count":21127224,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T04:38:20.783Z","updated_at":"2025-04-12T13:50:56.221Z","avatar_url":"https://github.com/codebrainz.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"LibUTF++\n========\n\nLibUTF++ is a simple C++ library for converting between [UTF-8][utf8],\n[UTF-16][utf16], and [UTF-32][utf32] encodings. The API consists of a set\nof free functions taking particular `std::basic_string` specialized types\ndepending on the encoding.\n\n[utf8]: http://en.wikipedia.org/wiki/UTF-8\n[utf16]: http://en.wikipedia.org/wiki/UTF-16\n[utf32]: http://en.wikipedia.org/wiki/UTF-32\n\nUsing LibUTF++ in your Project\n------------------------------\n\nThe recommended way to use LibUTF++ is to copy the (generated) `utf.cxx` file\nand the header `utf.h` into your own project tree and compile them with your\nexisting build system. This is the intended way to use LibUTF++ since it\nreduces distribution and versioning complexities and compatibility problems\nfrom using different compilers and such, even if it's not a best practice on\nall platforms.\n\nUsing the shared library\n------------------------\n\nLibUTF++ comes with a very simple GNU Make build system that can compile\nLibUTF++ as a shared library for UNIX-like platforms. 
To compile the library\nsimply run `make` from the source directory.\n\nDependencies\n------------\n\nNot much; basically any relatively modern C++ compiler should do.\n\n### C++11\n\nWhile not required, it is recommended to enable C++11-mode in the C++\ncompiler, where supported. For GCC-like compilers the `-std=c++0x` or\n`-std=c++11` options should do this. Using C++11-mode allows use of Unicode\nstring literals `u8\"\"` (UTF-8), `u\"\"` (UTF-16), and `U\"\"` (UTF-32) as well as\n2 of the 3 proper character types needed by LibUTF++, `char16_t` and\n`char32_t`, with the old `char` type filling the place of the missing\n`char8_t` type.\n\n### Make Build System\n\nUsing the simplistic GNU Make build system requires:\n\n- GNU Make\n- A GCC-like C++ compiler\n- Python 2.7+\n- Various other UNIX-like tools (cp, rm, sed, etc.)\n\n__Note:__ it is not recommended to use the GNU Make build system for anything\nmore than generating the built files (namely `utf.cxx` and `index.html`). See\n\"Using LibUTF++ in your Project\" for more details on integrating LibUTF++\ninto your own source tree.\n\nWTF are the ConvertUTF.[ch] files?\n----------------------------------\n\nThe files `ConvertUTF.c` and `ConvertUTF.h` are plain C files that contain\nalgorithms for converting between UTF encodings. They used to be distributed\non the official Unicode website but are no longer hosted or supported.\n\nSeveral projects include these same files in their source tree such as\n[LLVM/Clang/LLDB][llvm] and [Gears][gears] (the top hits I found when searching\nGoogle for \"ConvertUTF.c\"). I chose to use these existing conversion routines\nrather than re-write them myself from scratch (likely much, much more buggy)\nor cobble together routines from several different sources. 
Some day it would\nbe nice to remove these files and just use the features built in to standard\nC++.\n\nTo make distribution simpler I have chosen to inline these files straight\ninto the C++ code to avoid numerous files and to possibly provide some more\noptimization opportunities for the optimizing compiler. This is similar to the\n[SQLite Amalgamation][sqlite].\n\n[llvm]: http://llvm.org/docs/doxygen/html/ConvertUTF_8c_source.html\n[gears]: http://gears.googlecode.com/svn/trunk/third_party/convert_utf/ConvertUTF.c\n[sqlite]: http://www.sqlite.org/amalgamation.html\n\nSimilar and Related Projects\n----------------------------\n\nThere are many open source and commercial alternatives to LibUTF++. I can\nrecommend the following projects:\n\n- [UTF8-CPP][utf8cpp]: A nice and simple to use header-only library that provides\nroutines to convert to and from UTF-8.\n- [ICU][icu]: If you need full-blown Unicode support (and more), you probably\nwon't find a better library than this.\n\n[utf8cpp]: http://utfcpp.sourceforge.net/\n[icu]: http://site.icu-project.org/\n\nThe API\n-------\n\nThe functions exposed are very simple to use and are intended to convert\nbetween UTF encodings of whole strings at a time. To do streaming-style\nconversion of massive amounts of data, consider using the ConvertUTF.[ch]\nfiles directly or using a much better library like ICU.\n\nWhen the API refers to the numbers 8, 16, and 32, it's referring to the\nUTF-8, UTF-16, and UTF-32 encodings, respectively.\n\n### Types\n\nThe `utf.h` header typedef's a few types in the `utf` namespace.\n\n#### utf::char8\n\nThis is always typedef'd to the builtin C++ `char` type.\n\n#### utf::char16\n\nThis is typedef'd differently depending on the compiler's support for C++11\nand the size of `wchar_t`. When C++11 support is enabled, this is typedef'd\nto `char16_t` (a built-in type in C++11), otherwise if the platform uses a 16-bit\n`wchar_t` type (ex. Win32), it's typedef'd to that. 
In all other cases it's\ntypedef'd to the `uint16_t` type.\n\nWhen using C++11 mode, you can use the u\"\"-style Unicode string literals\nwith this type, or else if in 16-bit `wchar_t` mode (ex. Win32) you can use\nwide character string literals L\"\" (not recommended).\n\n#### utf::char32\n\nThis is just like `utf::char16` except it's 32 bits wide and so uses\n`char32_t` in C++11 mode, `wchar_t` if in 32-bit `wchar_t` mode (ex. Linux\nand most UNIXes), or `uint32_t` otherwise.\n\nWhen using C++11 mode, you can use the U\"\"-style Unicode string literals\nwith this type, or else if in 32-bit `wchar_t` mode you can use wide character\nstring literals L\"\" (not recommended).\n\n#### utf::string8\n\nThis is a typedef of `std::basic_string\u003cchar8\u003e`, which is the same as the\n`std::string` type.\n\n#### utf::string16\n\nThis is a typedef of `std::basic_string\u003cchar16\u003e`, which, depending on the\n`utf::char16` type, may be equivalent to `std::u16string`, `std::wstring`\nor `std::basic_string\u003cuint16_t\u003e`.\n\n#### utf::string32\n\nThis is a typedef of `std::basic_string\u003cchar32\u003e`, which, depending on the\n`utf::char32` type, may be equivalent to `std::u32string`, `std::wstring`\nor `std::basic_string\u003cuint32_t\u003e`.\n\n#### utf::conversion_error\n\nThis is the top-level exception and any exceptions in the API inherit from\nthis. It itself derives from `std::runtime_error` and so provides the\n`what()` member function to retrieve a string explaining the exception. It\nalso provides a `code()` member function which gives an error number based\non the ConvertUTF.[ch] result type (mostly useless in C++; just catch the\nspecific derived exception type).\n\n#### utf::source_exhausted\n\nThis type of exception is thrown when the end of the input string is reached\nin the middle of decoding a code point. 
This class derives from\n`utf::conversion_error`.\n\n#### utf::illegal_input\n\nThis type of exception is thrown when invalid UTF-encoded data is encountered\nin the input string. This class derives from `utf::conversion_error`.\n\n### Conversion Functions\n\nThere are a few different types of functions that can be used to perform the\nconversions. Which one to use is mostly a matter of taste/style, since most of\nthem are simple inline wrappers around the type-specific conversion functions.\n\n#### Type-specific Conversion Functions\n\nThese functions are named according to the input and output encoding. You\nprobably won't want to use these directly but rather through the `utf::convert()`\nfunction.\n\nThe prototypes of these functions look like:\n\n\tvoid cvt_N1_to_N2(const utf::stringN1\u0026 in, utf::stringN2\u0026 out);\n\nWhere N1 and N2 are one of 8, 16, or 32 depending on the encoding.\n\n#### Generic Overloaded Conversion Function\n\nThis function is probably the best choice in most cases. The signature is\nthe same as the type-specific conversion functions but uses the argument\ntypes and C++ function overloading to choose the correct type-specific\nconversion function automatically.\n\nThe prototype of this function is:\n\n\tvoid convert(const utf::stringN1\u0026 in, utf::stringN2\u0026 out);\n\nWhere N1 and N2 are one of 8, 16, or 32 depending on the encoding. 
For example\nto convert from UTF-8 to UTF-32:\n\n\tutf::string8 s8 = \"Hello World\";\n\tutf::string32 s32;\n\ttry {\n\t\tutf::convert(s8, s32);\n\t} catch (const utf::conversion_error\u0026 e) {\n\t\tstd::cerr \u003c\u003c \"Failed: \" \u003c\u003c e.what() \u003c\u003c std::endl;\n\t}\n\n#### Return Type-specific Functions\n\nThese functions are specific to the return type; rather than using an\noutput argument for the target string, a new string is created and returned\nto the caller using the return value (and hopefully RVO).\n\nThe prototype for these functions is:\n\n\tutf::stringN2 to_utfN2(const utf::stringN1\u0026 in);\n\nWhere N1 and N2 are one of 8, 16, or 32 depending on the encoding. For example\nto convert from UTF-16 to UTF-8 (exception handling not shown):\n\n\tutf::string16 s16 = u\"Hello World\";\n\tutf::string8 s = utf::to_utf8(s16);\n\nThe functions are overloaded to accept any of the UTF-8, UTF-16 or UTF-32\nstring types defined in the `utf` namespace.\n\n### String Class\n\nLibUTF++ also provides a class in the `utfstring.h` header file that behaves\nlike a `std::basic_string` by actually containing one and forwarding all the\ncalls to it, performing conversions where needed. 
It should be pretty obvious\nhow to use it if you've used `std::string` and friends before.\n\nThere are 3 typedef's for the `utf::string` template class: `utf::u8string`,\n`utf::u16string` and `utf::u32string` for UTF-8, 16, and 32, respectively.\nChoose the flavour depending on how you want to trade off time and space.\nA `utf::u32string` will hold 32-bit code points and so take more memory, while\na `utf::u8string` will hold 8-bit encoded data and so take more time doing\nconversions while being more space-efficient.\n\nHere's a little demo using `utf::string` with C++11:\n\n\t#include \u003cutfstring.h\u003e\n\t#include \u003ciostream\u003e\n\t...\n\tint main()\n\t{\n\t\tutf::u8string s1 = U\"Some 32-bit string\";  // UTF-32 -\u003e UTF-8 conversion\n\t\tutf::u16string s2 = u8\"Some 8-bit string\"; // UTF-8 -\u003e UTF-16 conversion\n\t\tutf::u32string s3;\n\t\ts3 += s1; // UTF-8 to UTF-32 conversion\n\t\ts3 += s2; // UTF-16 to UTF-32 conversion\n\t\tstd::cout \u003c\u003c s3 \u003c\u003c std::endl; // UTF-32 to UTF-8 (or other) conversion\n\t\treturn 0;\n\t}\n\nLegal\n-----\n\nThe C++ wrapper code is distributed under the MIT license to make it easier\nto embed the files inside other projects. The ConvertUTF.[ch] files from\nUnicode, Inc. 
also have their own license (see below) that is compatible with\nLibUTF++'s MIT license.\n\nFor using the LibUTF++ files in your project, all you need to do is copy the\n(generated) `utf.cxx` file and the header `utf.h` into your source tree and\nsimply leave the license/copyright comments in the files as is.\n\n### The LibUTF++ MIT License\n\n\u003e Copyright (c) 2014 Matthew Brush \u003cmbrush@codebrainz.ca\u003e\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\n- The above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n\n### The Unicode, Inc. license for ConvertUTF.[ch] files\n\n\u003e Copyright 2001-2004 Unicode, Inc.\n\n#### Disclaimer\n\nThis source code is provided as is by Unicode, Inc. No claims are\nmade as to fitness for any particular purpose. No warranties of any\nkind are expressed or implied. The recipient agrees to determine\napplicability of information provided. 
If this file has been\npurchased on magnetic or optical media from Unicode, Inc., the\nsole remedy for any claim will be exchange of defective media\nwithin 90 days of receipt.\n\n#### Limitations on Rights to Redistribute This Code\n\nUnicode, Inc. hereby grants the right to freely use the information\nsupplied in this file in the creation of products supporting the\nUnicode Standard, and to make copies of this file in any form\nfor internal or external distribution as long as this notice\nremains attached.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodebrainz%2Flibutfxx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodebrainz%2Flibutfxx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodebrainz%2Flibutfxx/lists"}