{"id":13591421,"url":"https://github.com/DuffsDevice/tiny-utf8","last_synced_at":"2025-04-08T17:31:31.682Z","repository":{"id":39422237,"uuid":"115557923","full_name":"DuffsDevice/tiny-utf8","owner":"DuffsDevice","description":"Unicode (UTF-8) capable std::string","archived":false,"fork":false,"pushed_at":"2025-01-18T11:13:19.000Z","size":874,"stargazers_count":548,"open_issues_count":7,"forks_count":44,"subscribers_count":26,"default_branch":"master","last_synced_at":"2025-03-31T19:08:00.981Z","etag":null,"topics":["codepoints","conversion","cplusplus","cplusplus-11","cpp","decoder","drop-in","encoder","header-only","std","string","string-conversion","string-manipulation","tiny-utf8","unicode","utf-32","utf-8","utf8","utf8-string"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DuffsDevice.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-27T21:11:34.000Z","updated_at":"2025-03-23T10:32:44.000Z","dependencies_parsed_at":"2025-03-10T15:47:49.390Z","dependency_job_id":null,"html_url":"https://github.com/DuffsDevice/tiny-utf8","commit_stats":null,"previous_names":[],"tags_count":31,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuffsDevice%2Ftiny-utf8","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuffsDevice%2Ftiny-utf8/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuffsDevice%2Ftiny-utf8/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/DuffsDevice%2Ftiny-utf8/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DuffsDevice","download_url":"https://codeload.github.com/DuffsDevice/tiny-utf8/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247721898,"owners_count":20985084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["codepoints","conversion","cplusplus","cplusplus-11","cpp","decoder","drop-in","encoder","header-only","std","string","string-conversion","string-manipulation","tiny-utf8","unicode","utf-32","utf-8","utf8","utf8-string"],"created_at":"2024-08-01T16:00:57.255Z","updated_at":"2025-04-08T17:31:31.660Z","avatar_url":"https://github.com/DuffsDevice.png","language":"C++","readme":"# TINY \u003cimg src=\"https://github.com/DuffsDevice/tiny-utf8/raw/master/docs/UTF8.png\" width=\"47\" height=\"47\" align=\"top\" alt=\"UTF8 Art\" style=\"display:inline;\"\u003e 4.4\r\n\r\n[![Build Status](https://api.travis-ci.com/DuffsDevice/tiny-utf8.svg?branch=master)](https://travis-ci.com/github/DuffsDevice/tiny-utf8)\u0026nbsp;\u0026nbsp;[![Licence](https://img.shields.io/badge/licence-BSD--3-e20000.svg)](https://github.com/DuffsDevice/tiny-utf8/blob/master/LICENCE)\u0026nbsp;\u0026nbsp;[![Donation](https://img.shields.io/badge/buy%20me%20a%20coffee-paypal-fcd303.svg)](https://www.paypal.me/jakobriedle)\r\n\r\n### DESCRIPTION\r\n**Tiny-utf8** is a library for extremely easy integration of Unicode into an arbitrary C++11 project.\r\nThe library consists solely of the class `utf8_string`, 
which acts as a drop-in replacement for `std::string`.\r\nIts implementation strikes a balance between a small memory footprint and fast access. All functionality of `std::string` is therefore replaced by the corresponding codepoint-based UTF-32 version - translating every access to UTF-8 under the hood.\r\n\r\n#### *CHANGES BETWEEN Version 4.4 and 4.3*\r\n\r\n- **tiny-utf8** used to only work with byte-index-based iterator types. The set of iterator types has now been completed with codepoint-based versions and\r\n- the **default has been changed**. That means (`c`)(`r`)`begin`/`end` now return codepoint-based iterators, while `raw_`(`c`)(`r`)`begin`/`end` now return byte-based iterators.\r\n- The upside with byte-based iterators is: they are usually quicker than code-point-based iterators. The downside is: They get invalidated **very quickly**. Example:\r\n`str.erase( std::remove( str.begin() , str.end() , U'W' ) , str.end() )` will work, but `str.erase( std::remove(`**`str.raw_begin()`**`,`**`str.raw_end()`**`, U'W' ) ,`**`str.raw_end()`**`)` will not (at least not always). The reason is: after the call to `std::remove`, the size of the string data might have changed and the second call to `str.raw_end()` might have yielded a now-invalidated iterator.\r\n\r\n### FEATURES\r\n- **Drop-in replacement for `std::string`**\r\n- **Lightweight and self-contained** (~5K SLOC)\r\n- **Very fast**, i.e. highly optimized decoder, encoder and traversal routines\r\n- **Advanced Memory Layout**, i.e. Random Access is\r\n   - ***O(1) for ASCII-only strings (!)*** and\r\n   - O(#Codepoints ∉ ASCII) for the average case.\r\n   - O(n) for strings with a high amount of non-ASCII code points (\u003e25%)\r\n- **Small String Optimization** (SSO) for strings up to a UTF8-encoded length of `sizeof(utf8_string)`! 
That is, including the trailing `\\0`\r\n- **Growth in Constant Time** (Amortized)\r\n- **On-the-fly Conversion between UTF32 and UTF8**\r\n- **`size()`** returns the size of the data **in bytes**, **`length()`** returns the number of **codepoints** contained.\r\n- Codepoint Range of `0x0` - `0xFFFFFFFF`, i.e. 1-7 Code Units/Bytes per Codepoint (Note: This is more than specified by UTF8, but until now otherwise considered out of scope)\r\n- Complete support for **embedded zeros** (Note: all methods taking `const char*`/`const char32_t*` also have an overload for `const char (\u0026)[N]`/`const char32_t (\u0026)[N]`, allowing correct interpretation of string literals with embedded zeros)\r\n- Single Header File\r\n- Straightforward C++11 Design\r\n- Possibility to prepend the UTF8 BOM (Byte Order Mark) to any string when converting it to an std::string\r\n- Supports raw (Byte-based) access for occasions where Speed is needed\r\n- Supports `shrink_to_fit()`\r\n- Malformed UTF8 sequences will **lead to defined behaviour**\r\n\r\n## THE PURPOSE OF TINY-UTF8\r\nBack when I decided to write a UTF8 solution for C++, I knew I wanted a drop-in replacement for `std::string`. At the time mostly because I found it neat to have one and felt C++ always lacked accessible support for UTF8. Since then, several years have passed and the situation has not improved much. That said, things currently look like they are about to improve - but that doesn't say much, eh?\r\n\r\nThe opinion shared by many \"experienced Unicode programmers\" (e.g. published on [UTF-8 Everywhere](https://www.utf8everywhere.org)) is that \"non-experienced\" programmers both *under* and *over*estimate the need for Unicode- and encoding-specific treatment: This need is...\r\n  1. **overestimated**, because many times we really should care less about codepoint/grapheme borders within string data;\r\n  2. 
**underestimated**, because if we really want to \"support\" Unicode, we need to think about *normalizations*, *visual character comparisons*, *reserved codepoint values*, *illegal code unit sequences* and so on and so forth.\r\n\r\nUnicode is not rocket science but nonetheless hard to get *right*. **Tiny-utf8** does not intend to be an enterprise solution like [ICU](http://site.icu-project.org/) for C++. The goal of **tiny-utf8** is to\r\n  - bridge as many gaps to \"supporting Unicode\" as possible by 'just' replacing `std::string` with a custom class which means to\r\n  - provide you with a Codepoint Abstraction Layer that takes care of the variable-length encoding, without you noticing.\r\n\r\n**Tiny-utf8** aims to be the simple-and-dependable groundwork which you build Unicode infrastructure upon. And, if *1)* C++2xyz should happen to make your Unicode life easier than **tiny-utf8** or *2)* you decide to go enterprise, you have not wasted much time replacing `std::string` with `tiny_utf8::string` either. That's what makes **tiny-utf8** so agreeable.\r\n\r\n#### WHAT TINY-UTF8 IS NOT AIMED AT\r\n- Conversion between ISO encodings and UTF8\r\n- Interfacing with UTF16\r\n- Visible character comparison (`'ch'` vs. `'c'+'h'`)\r\n- Codepoint Normalization\r\n- Correction of invalid Code Unit sequences\r\n- Detection of Grapheme Clusters\r\n\r\nNote: ANSI support was dropped in Version 2.0 in favor of execution speed.\r\n\r\n## EXAMPLE\r\n\r\n```cpp\r\n#include \u003ciostream\u003e\r\n#include \u003calgorithm\u003e\r\n#include \u003ctinyutf8/tinyutf8.h\u003e\r\nusing namespace std;\r\n\r\nint main()\r\n{\r\n    tiny_utf8::string str = u8\"!🌍 olleH\";\r\n    for_each( str.rbegin() , str.rend() , []( char32_t codepoint ){\r\n      cout \u003c\u003c codepoint;\r\n    } );\r\n    return 0;\r\n}\r\n```\r\n\r\n## EXCEPTION BEHAVIOR\r\n\r\n- **Tiny-utf8** should automatically detect whether your build system allows the use of exceptions or not. 
This is done by checking for the feature test macro `__cpp_exceptions`.\r\n- If you would like **tiny-utf8** to be `noexcept` anyway, `#define` the macro `TINY_UTF8_NOEXCEPT`.\r\n- If you would like **tiny-utf8** to use a different exception strategy, `#define` the macro `TINY_UTF8_THROW( location , failing_predicate )`. For using assertions, you would write `#define TINY_UTF8_THROW( _ , pred ) assert( pred )`.\r\n- *Hint:* If exceptions are disabled, `TINY_UTF8_THROW( ... )` is automatically defined as `void()`. This works well, because all uses of `TINY_UTF8_THROW` are immediately followed by a `;` as well as a proper `return` statement with a fallback value. That also means, `TINY_UTF8_THROW` can safely be a NO-OP.\r\n\r\n## BACKWARDS-COMPATIBILITY\r\n\r\n#### *CHANGES BETWEEN Version 4.3 and 4.2*\r\n\r\n- Class `tiny_utf8::basic_utf8_string` has been renamed to `basic_string`, which better resembles its drop-in-capabilities for `std::string`.\r\n\r\n#### *CHANGES BETWEEN Version 4.1 and 4.0*\r\n\r\n- `tinyutf8.h` has been moved into the folder `include/tinyutf8/` in order to mimic the structuring of many other C++-based open source projects.\r\n\r\n#### *CHANGES BETWEEN Version 4.0 and 3.2.4*\r\n\r\n- Class `utf8_string` is now defined inside `namespace tiny_utf8`. 
If you want the old declaration in the global namespace, `#define TINY_UTF8_GLOBAL_NAMESPACE`\r\n- Support for C++20: Use class `tiny_utf8::u8string`, which uses `char8_t` as underlying data type (instead of `char`)\r\n\r\n#### *CHANGES BETWEEN Version 4.0 and Version 3.2*\r\n\r\n- If you would like to stay compatible with 3.2.* and have `utf8_string` defined in the global namespace, `#define` the macro `TINY_UTF8_GLOBAL_NAMESPACE`.\r\n\r\n## BUGS\r\n\r\nIf you encounter any bugs, please file a bug report through the \"Issues\" tab.\r\nI'll try to answer it soon!\r\n\r\n## THANK YOU\r\n\r\n- @iainchesworth\r\n- @vadim-berman\r\n- @MattHarrington\r\n- @evanmoran\r\n- @bakerstu\r\n- @revel8n\r\n- @githubuser0xFFFF\r\n- @marekfoltyn\r\n- @Megaxela\r\n- @vfiksdal\r\n- @maddouri\r\n- @Abdullah-AlAttar\r\n- @s9w\r\n\r\nfor taking your time to improve **tiny-utf8**.\r\n\r\nCheers,\r\nJakob\r\n","funding_links":["https://www.paypal.me/jakobriedle"],"categories":["C++"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDuffsDevice%2Ftiny-utf8","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDuffsDevice%2Ftiny-utf8","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDuffsDevice%2Ftiny-utf8/lists"}