{"id":27696962,"url":"https://github.com/rmawatson/utf","last_synced_at":"2026-03-02T22:38:56.665Z","repository":{"id":37592639,"uuid":"158967204","full_name":"rmawatson/utf","owner":"rmawatson","description":"utf iterators \u0026 converters for modern c++","archived":false,"fork":false,"pushed_at":"2024-02-06T01:51:50.000Z","size":179,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-25T15:26:41.075Z","etag":null,"topics":["cpp","iterators","unicode","utf16","utf32","utf8"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rmawatson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-11-24T19:28:22.000Z","updated_at":"2024-06-28T01:22:41.000Z","dependencies_parsed_at":"2024-02-06T02:53:17.718Z","dependency_job_id":null,"html_url":"https://github.com/rmawatson/utf","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rmawatson/utf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmawatson%2Futf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmawatson%2Futf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmawatson%2Futf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmawatson%2Futf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rmawatson","download_url":"https://c
odeload.github.com/rmawatson/utf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmawatson%2Futf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30022939,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T22:30:10.381Z","status":"ssl_error","status_checked_at":"2026-03-02T22:23:34.650Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","iterators","unicode","utf16","utf32","utf8"],"created_at":"2025-04-25T15:26:33.911Z","updated_at":"2026-03-02T22:38:56.621Z","avatar_url":"https://github.com/rmawatson.png","language":"C++","readme":"[![Build status](https://ci.appveyor.com/api/projects/status/ix95xf1mv55v9pag/branch/master?svg=true)](https://ci.appveyor.com/project/rmawatson/utf/branch/master)\r\n[![Build Status](https://travis-ci.com/rmawatson/utf.svg?branch=master)](https://travis-ci.com/rmawatson/utf)\r\n[![GitHub 
License](https://img.shields.io/badge/license-Boost%201.0-blue.svg)](https://github.com/rmawatson/utf/blob/master/LICENSE)\r\n![platform](https://img.shields.io/badge/platform-visual%20studio-blue.svg?logo=windows\u0026longCache=true\u0026style=flat\u0026logoColor=white)\r\n![platform](https://img.shields.io/badge/platform-xcode-blue.svg?logo=apple\u0026longCache=true\u0026style=flat\u0026logoColor=white)\r\n![platform](https://img.shields.io/badge/platform-gcc%20%2F%20clang-blue.svg?logo=linux\u0026longCache=true\u0026style=flat\u0026logoColor=white)\r\n# UTF Iterators for modern C++\r\n\r\n## Introduction\r\n\r\n**UTF** is a header-only library providing simple Unicode iterator adapters for converting between Unicode encodings and byte orders. The adapters are similar to those available in Boost, but without the Boost dependency. \r\n\r\nUtility functions that build on top of these iterators are available to provide standard conversions and byte order mark detection. Invalid byte sequences can be handled either by replacing them with a user-provided replacement character or the Unicode default (U+FFFD �), or by throwing an exception.\r\n\r\n* [Iterators](#iterators)\r\n* [Utility Functions](#utility-functions)\r\n* [Examples](#examples)\r\n* [CMake](#cmake)\r\n* [License](#license)\r\n\r\n## Iterators\r\n\r\nEach iterator adapter models the category of the underlying base_iterator. 
To ensure iterators do not iterate past the end when consuming multiple elements from the base_iterator, different constructors are used for certain iterators depending on the category of base_iterator.\r\n\r\n#### **|```utf8_to_utf32_iterator\u003cbase_iterator,policies...\u003e```**\r\n\r\n\r\n***constructors:***\r\n\r\n*base_iterator::iterator_category is bidirectional_iterator or random_access_iterator*\u003cbr/\u003e\r\n**```utf8_to_utf32_iterator(start_iterator,range_begin_iterator,range_end_iterator)```**\r\n \r\n*base_iterator::iterator_category is forward_iterator or input_iterator*\u003cbr/\u003e\r\n**```utf8_to_utf32_iterator(start_iterator,range_end_iterator)```**\r\n \r\n***policies:***\r\n\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```utf8_to_utf16_iterator\u003cbase_iterator,policies...\u003e```**\r\n\r\n***constructors:***\r\n\r\n*base_iterator::iterator_category is bidirectional_iterator*\u003cbr/\u003e\r\n**```utf8_to_utf16_iterator(start_iterator,range_begin_iterator,range_end_iterator)```**\r\n \r\n*base_iterator::iterator_category is forward_iterator or input_iterator*\u003cbr/\u003e\r\n**```utf8_to_utf16_iterator(start_iterator,range_end_iterator)```**\r\n \r\n***policies:***\r\n\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```utf32_to_utf8_iterator\u003cbase_iterator,policies...\u003e```**\r\n\r\n***constructors:***\r\n\r\n**```utf32_to_utf8_iterator(base_iterator)```**\r\n \r\n***policies:***\r\n\r\n**```from\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### 
**|```utf32_to_utf16_iterator\u003cbase_iterator,policies...\u003e```**\r\n\r\n***constructors:***\r\n\r\n**```utf32_to_utf16_iterator(base_iterator)```**\r\n \r\n***policies:***\r\n\r\n**```from\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n\r\n## Utility Functions\r\n\r\nThe ```utfX_to_utfY()``` utility functions provide an easy way to convert between different unicode types. The ```base_iterator``` \r\ndereference operator should yield a type convertible to uintX_t.\r\n\r\nEach ```utfX_to_utfY()``` function returns a uYstring by default, which can be changed by providing any type that has begin(), end() and value_type as one of the template parameters.\r\nFor example,\r\n\r\n```c++\r\nstd::string u8text = u8\"text\";\r\nutf8_to_utf16\u003cstd::vector\u003cchar16_t\u003e\u003e(u8text);\r\n```\r\n\r\nTwo overloads exist for each ```utfX_to_utfY()``` conversion function: one whose arguments are a start and end iterator range to convert, the other an iterable object.\r\nThe same set of policies that would be used with the respective iterator can be passed as template parameters to these functions. The order is unimportant. 
However, policies\r\nare not checked, and unused policies will be silently ignored (for example, a to\u003cbig_endian\u003e in a to_utf8 conversion has no meaning).\r\n\r\n```c++\r\nutf8_to_utf16\u003conerror\u003creplace_with_fffd\u003e,std::vector\u003cchar16_t\u003e\u003e(text);\r\n// the same as\r\nutf8_to_utf16\u003cstd::vector\u003cchar16_t\u003e,onerror\u003creplace_with_fffd\u003e\u003e(text);\r\n```\r\n\r\n#### **|```u16string utf8_to_utf16\u003cresult_type,policies...\u003e(start_iterator,end_iterator)```**\r\n#### **|```u16string utf8_to_utf16\u003cresult_type,policies...\u003e(iterable)```**\r\n***policies:***\r\n\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```u32string utf8_to_utf32\u003cresult_type,policies...\u003e(start_iterator,end_iterator)```**\r\n#### **|```u32string utf8_to_utf32\u003cresult_type,policies...\u003e(iterable)```**\r\n***policies:***\r\n\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```u8string utf16_to_utf8\u003cresult_type,policies...\u003e(start_iterator,end_iterator)```**\r\n#### **|```u8string utf16_to_utf8\u003cresult_type,policies...\u003e(iterable)```**\r\n\r\n***policies:***\r\n\r\n\r\n**```from\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n\r\n#### **|```u32string utf16_to_utf32\u003cresult_type,policies...\u003e(start_iterator,end_iterator)```**\r\n#### **|```u32string 
utf16_to_utf32\u003cresult_type,policies...\u003e(iterable)```**\r\n\r\n***policies:***\r\n\r\n**```from\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```u16string utf32_to_utf16\u003cresult_type,policies...\u003e(start_iterator,end_iterator)```**\r\n#### **|```u16string utf32_to_utf16\u003cresult_type,policies...\u003e(iterable)```**\r\n\r\n***policies:***\r\n\r\n**```from\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```to\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```u8string utf32_to_utf8\u003cresult_type,policies...\u003e(start_iterator,end_iterator)```**\r\n#### **|```u8string utf32_to_utf8\u003cresult_type,policies...\u003e(iterable)```**\r\n\r\n***policies:***\r\n\r\n**```from\u003cbig_endian|little_endian\u003e```**\u003cbr/\u003e\r\n**```onerror\u003creplace_with\u003cchar32_t\u003e|throw_exception\u003e```**\r\n\r\n---\r\n\r\n#### **|```bom_type detect_bom(start_iterator,end_iterator)```**\r\nDetects a BOM from a byte sequence. Reads up to 4 bytes to detect the BOM.\r\n\r\nThis is necessary to distinguish between\r\nutf32_little_endian and utf16_little_endian, where the first two bytes are identical. 
\r\n\r\n*Note: There is potential for ambiguity when the first two bytes after the BOM in a utf16_little_endian byte sequence are 0x0.*\r\n\r\nReturns one of:\r\n\r\n```\r\nnone,\r\nutf8,\r\nutf16_little_endian,\r\nutf16_big_endian,\r\nutf32_little_endian,\r\nutf32_big_endian,\r\n```\r\n\r\n## Constants\r\n\r\n**```constexpr uint8_t utf8_bom[]```**\u003cbr/\u003e\r\n**```constexpr uint8_t utf16_little_endian_bom[]```**\u003cbr/\u003e\r\n**```constexpr uint8_t utf16_big_endian_bom[]```**\u003cbr/\u003e\r\n**```constexpr uint8_t utf32_little_endian_bom[]```**\u003cbr/\u003e\r\n**```constexpr uint8_t utf32_big_endian_bom[]```**\u003cbr/\u003e\r\n\r\n## CMake\r\n\r\nThis project can be used as an external project through CMake's find_package(),\r\n\r\n```cmake\r\n# CMakeLists.txt\r\nfind_package(utf REQUIRED)\r\n...\r\nadd_library(somelib ...)\r\n...\r\ntarget_link_libraries(somelib PRIVATE utf::utf)\r\n```\r\nor placed in a thirdparty folder and used through add_subdirectory,\r\n```cmake\r\n# Disable building tests\r\nset(UTF_BUILD_TESTS OFF CACHE INTERNAL \"\")\r\n..\r\nadd_subdirectory(thirdparty/utf)\r\n...\r\nadd_library(somelib ...)\r\n...\r\ntarget_link_libraries(somelib PRIVATE utf::utf)\r\n```\r\n\r\n## Examples\r\n\r\n* Convert an existing utf16 string to utf8 with utility functions.\r\n\r\n```c++\r\n#include \u003citerator\u003e\r\n#include \u003cutf/utf.h\u003e\r\n\r\nint main()\r\n{\r\n    using namespace utf;\r\n    \r\n    // u\"...\" is a UTF-16 (char16_t) literal, matching u16string.\r\n    u16string u16_text = u\"ɦΈ˪˪ʘ\";\r\n    //use default platform endianness, default onerror policy.\r\n    u8string u8_text = utf16_to_utf8(u16_text.begin(),u16_text.end());\r\n}\r\n\r\n```\r\n\r\n* Convert from an existing utf32 string to utf8 with iterators.\r\n\r\n```c++\r\n#include \u003cutf/utf.h\u003e\r\n\r\nint main()\r\n{\r\n    using namespace utf;\r\n\r\n    u32string u32_text = U\"ɦΈ˪˪ʘ\";\r\n\r\n    //use the bidirectional_iterator 
constructor, default platform endianness, default error policy.\r\n    utf32_to_utf8_iterator\u003cu32string::iterator\u003e pos(u32_text.begin());\r\n    utf32_to_utf8_iterator\u003cu32string::iterator\u003e end(u32_text.end());\r\n\r\n    u8string u8_text(pos, end);\r\n}\r\n\r\n```\r\n\r\n* Reading from a UTF-32 little endian file, checking for a byte order mark, and converting to UTF-8.\r\n\r\n```c++\r\n#include \u003cfstream\u003e\r\n#include \u003citerator\u003e\r\n#include \u003ccassert\u003e\r\n#include \u003cutf/utf.h\u003e\r\n\r\nint main()\r\n{\r\n    using namespace utf;\r\n\r\n    // open the file\r\n    std::ifstream uc_file(\"unicode_file.txt\", std::ios::binary);\r\n\r\n    //The iterators to use for the stream.\r\n    using base_iterator = std::istreambuf_iterator\u003cchar\u003e;\r\n\r\n    base_iterator base_pos(uc_file);\r\n    base_iterator base_end{};\r\n\r\n    //detect whether the file contains a byte order mark and check that it is a UTF-32 little endian encoded file.\r\n    bom_type uc_type = detect_bom(base_pos, base_end);\r\n    assert( uc_type == bom_type::utf32_little_endian );\r\n\r\n    // detect_bom will read up to 4 bytes to detect the bom, even if the bom is two bytes (utf16) or three bytes (utf8). For input_iterators,\r\n    // bytes may already have been read that are part of the unicode text and cannot be re-read. In this case the bom is 4 bytes. 
However, when\r\n    // detecting utf16 or utf8 boms, the stream should be reset to the beginning and the iterator advanced by exactly the size of the bom.\r\n    // This is not strictly necessary here, as sizeof(utf32_little_endian_bom) == 4.\r\n\r\n    uc_file.clear();\r\n    uc_file.seekg(0, std::ios::beg);\r\n    std::advance(base_pos, sizeof(utf32_little_endian_bom));\r\n\r\n    //read from the stream 4 bytes at a time for the utf32_to_utf8_iterator\r\n    using stridel_iterator = stride_long_iterator\u003cbase_iterator\u003e;\r\n\r\n    // using the input iterator constructor\r\n    stridel_iterator b2l_pos(base_pos, base_end);\r\n    stridel_iterator b2l_end(base_end, base_end);\r\n\r\n    //specify the source endianness. The default will be the platform endianness. Using the default onerror\u003cthrow_exception\u003e policy.\r\n    using utf_iterator = utf32_to_utf8_iterator\u003cstridel_iterator, from\u003clittle_endian\u003e\u003e;\r\n\r\n\r\n    utf_iterator utf_pos(b2l_pos);\r\n    utf_iterator utf_end(b2l_end);\r\n\r\n    // construct a new std::basic_string\u003cchar\u003e from the iterators.\r\n    u8string result_u8(utf_pos, utf_end);\r\n\r\n    return 0;\r\n}\r\n```\r\n## License\r\n\r\nThe files in this repository are licensed under the Boost Software License 1.0. A copy of the license is available in the root of the repository.\r\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frmawatson%2Futf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frmawatson%2Futf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frmawatson%2Futf/lists"}