{"id":13439418,"url":"https://github.com/vitiral/stfu8","last_synced_at":"2025-05-08T23:45:13.246Z","repository":{"id":28266928,"uuid":"117478814","full_name":"vitiral/stfu8","owner":"vitiral","description":"Sorta Text Format in UTF-8","archived":false,"fork":false,"pushed_at":"2024-01-09T23:36:46.000Z","size":92,"stargazers_count":25,"open_issues_count":2,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-08T23:45:07.414Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vitiral.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-15T00:29:52.000Z","updated_at":"2025-01-30T21:26:13.000Z","dependencies_parsed_at":"2024-06-21T02:36:04.833Z","dependency_job_id":"8822a160-8722-479a-9961-90e192736438","html_url":"https://github.com/vitiral/stfu8","commit_stats":{"total_commits":43,"total_committers":6,"mean_commits":7.166666666666667,"dds":"0.34883720930232553","last_synced_commit":"06dfc823a86fb7f8b51e52fa4be937ca870771cd"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitiral%2Fstfu8","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitiral%2Fstfu8/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitiral%2Fstfu8/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vitiral%2Fstfu8/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vitiral","download_url":"https://codeload.github.com/vitiral/stfu8/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253166474,"owners_count":21864467,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:01:13.742Z","updated_at":"2025-05-08T23:45:13.228Z","avatar_url":"https://github.com/vitiral.png","language":"Rust","readme":"# STFU-8: Sorta Text Format in UTF-8\n\n[![Build Status](https://travis-ci.org/vitiral/stfu8.svg?branch=master)](https://travis-ci.org/vitiral/stfu8)\n\nSTFU-8 is a hacky text encoding/decoding protocol for data that might be *not\nquite* UTF-8 but is still mostly UTF-8. It is based on the syntax of the `repr`\ncreated when you write (or print) binary text in Rust, Python, C or other\ncommon programming languages.\n\nIts primary purpose is to allow a human to **visualize and edit**\n\"data\" that is mostly (or fully) **visible** UTF-8 text. It encodes all non\nvisible or non UTF-8 compliant bytes as longform text (i.e. ESC becomes the\nfull string `r\"\\x1B\"`).  It can also encode/decode ill-formed UTF-16.\n\nComparison to other formats:\n- **UTF-8** (i.e. [`std::str`](https://doc.rust-lang.org/std/str/index.html)):\n  UTF-8 is a standardized format for encoding human understandable text in any\n  language on the planet. It is the reason the internet can be understood by\n  almost anyone and should be the primary way that text is encoded. 
However,\n  not everything that is \"UTF-8 like\" follows the standard exactly. For\n  instance:\n  - The Linux command line defines ANSI escape codes to provide styles like\n    color, bold, italic, etc. Even though almost everything printed to a\n    terminal is UTF-8 text these \"escape codes\" might not be, and even\n    if they are UTF-8, they are not visible characters.\n  - Windows paths are not *necessarily* UTF-8 compliant as they can\n    have [ill formed text][utf-16-ill-formed-text].\n  - There might be other cases you can think of or want to create. In general,\n    try _not_ to create more use cases if you don't have to.\n- **Rust's [OsStr](https://doc.rust-lang.org/std/ffi/struct.OsStr.html)**:\n  OsStr is the \"cross platform\" type for handling system specific strings,\n  mainly in file paths. Unlike STFU-8 it is not (always) coercible into UTF-8\n  and therefore cannot be serialized into JSON or other formats.\n- **WTF-8** ([rust-wtf8](https://github.com/SimonSapin/rust-wtf8)): is great\n  for interoperating with different UTF standards but cannot be used to\n  transmit data over the internet. The\n  [spec states](https://simonsapin.github.io/wtf-8/): \"WTF-8 must not be used\n  to represent text in a file format or for transmission over the Internet.\"\n- **base64** ([`base64`](https://crates.io/crates/base64)): also encodes binary\n  data as UTF-8. If your data is *actually binary* (i.e. not text) then use\n  base64. However, if your data was formerly text (or mostly text) then\n  encoding to base64 will make it completely un(human)readable.\n- **Array[u8]**: obviously great if your data is *actually binary* (i.e. NOT\n  TEXT) and you don't need to put it into a UTF-8 encoding.  However, an array\n  of bytes (i.e. `[0x72, 0x65, 0x61, 0x64, 0x20, 0x69, 0x74]`) is\n  not human readable. 
Even if it were in pure ASCII the only ones who can read\n  it efficiently are low-level programming Gods who have never figured out how\n  to debug-print their ASCII.\n- **STFU-8** (this crate): is \"good\" when you want to have only\n  printable/hand-editable text (and your data is _mostly_ UTF-8) but the data\n  might have a couple of binary/non-printable/ill-formed pieces. It is _very\n  poor_ if your data is actually binary, requiring (on average) four encoded\n  characters for every byte of binary data.\n\n[1]: https://simonsapin.github.io/wtf-8/\n\n# Specification\nIn simple terms, encoded STFU-8 is itself *always valid Unicode* which decodes\nto binary (the binary is not necessarily UTF-8). It differs from Unicode in\nthat single `\\` items are illegal. The following patterns are legal:\n- `\\\\`: decodes to the backward-slash (`\\`) byte (`\\x5c`)\n- `\\t`: decodes to the tab byte (`\\x09`)\n- `\\n`: decodes to the newline byte (`\\x0A`)\n- `\\r`: decodes to the carriage return byte (`\\x0D`)\n- `\\xXX` where XX are exactly two case-insensitive hexadecimal digits: decodes\n  to the `\\xXX` byte, where `XX` is a hexadecimal number (example: `\\x9F`,\n  `\\xaB` or `\\x05`). 
This *never* gets resolved into a code point; the value\n  is pushed directly into the decoder stream.\n- `\\uXXXXXX` where `XXXXXX` are exactly six case-insensitive hexadecimal digits,\n  decodes to a 24-bit number that *typically* represents a Unicode code point.\n  If the value *is* a Unicode code point it will always be decoded as such.\n  Otherwise `stfu8` will attempt to store the value into the decoder (if the\n  value is too large for the decoding type it will be an error).\n\n`stfu8` provides 2 different categories of functions for encoding/decoding data\nthat are *not necessarily interoperable* (don't decode output created from `encode_u8`\nwith `decode_u16`).\n- `encode_u8(\u0026[u8]) -\u003e String` and `decode_u8(\u0026str) -\u003e Vec\u003cu8\u003e`: encodes or\n  decodes an array of `u8` values to/from STFU-8, primarily used for interfacing\n  with binary/nonvisible data that is *almost* UTF-8.\n- `encode_u16(\u0026[u16]) -\u003e String` and `decode_u16(\u0026str) -\u003e Vec\u003cu16\u003e`: encodes\n  or decodes an array of `u16` values to/from STFU-8, primarily used for\n  interfacing with legacy UTF-16 formats that may contain\n  [ill formed text][utf-16-ill-formed-text] but also converts unprintable\n  characters.\n\nThere are some general rules for encoding and decoding:\n- If `\\u...` cannot be resolved into a valid Unicode code point it *must* fit into\n  the decoder. For instance, trying to decode `\"\\u00DEED\"` (which is a UTF-16\n  trail surrogate) using `decode_u8` will fail, but will succeed with\n  `decode_u16`.\n- No escaped values are *ever chained*. For example, `\"\\x01\\x02\"` will be\n  `[0x01, 0x02]` **not** `[0x0102]` -- even if you use `decode_u16`.\n- Values escaped with `\\x...` are always copied verbatim into the decoder.\n  For example, `\\xFE` is a valid UTF-32 code point, but if decoded with `decode_u8`\n  it will be `0xFE` in the buffer, not two bytes of data as the UTF-8 character\n  `'þ'`. 
Note that with `decode_u16` `0xFE` is a valid UTF-16 code point, so\n  when re-encoded it would be the `'þ'` character. Moral of the story: _don't mix\n  inputs/outputs of the `u8` and `u16` functions_.\n\n\u003e tab, newline, and carriage return characters are \"visible\", so encoding with them\n\u003e in \"pretty form\" is optional.\n\n## UTF-16 Ill Formed Text\nThe problem is succinctly stated here:\n\n\u003e http://unicode.org/faq/utf_bom.html\n\u003e\n\u003e Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?\n\u003e\n\u003e A different issue arises if an unpaired surrogate is encountered when\n\u003e converting ill-formed UTF-16 data. By representing such an unpaired surrogate\n\u003e on its own as a 3-byte sequence, the resulting UTF-8 data stream would become\n\u003e ill-formed. While it faithfully reflects the nature of the input, Unicode\n\u003e conformance requires that encoding form conversion always results in a valid\n\u003e data stream. Therefore a converter must treat this as an error. [AF]\n\nAlso, from the [WTF-8 spec](https://simonsapin.github.io/wtf-8/#motivation):\n\n\u003e As a result, [unpaired] surrogates do occur in practice and need to be\n\u003e preserved. For example:\n\u003e\n\u003e In ECMAScript (a.k.a. JavaScript), a String value is defined as a sequence\n\u003e of 16-bit integers that usually represents UTF-16 text but may or may not\n\u003e be well-formed.  Windows applications normally use UTF-16, but the file\n\u003e system treats path and file names as an opaque sequence of WCHARs (16-bit\n\u003e code units).\n\u003e\n\u003e We say that strings in these systems are encoded in potentially\n\u003e ill-formed UTF-16 or WTF-16.\n\nBasically: you can't (always) convert from UTF-16 to UTF-8 and it's a real\nbummer. WTF-8, while _kind of_ an answer to this problem, doesn't allow me\nto serialize UTF-16 into a UTF-8 format, send it to my webapp, edit it (as a\nhuman), and send it back. 
That is what STFU-8 is for.\n\n# LICENSE\nThe source code in this repository is licensed under either of\n- Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or\n  http://www.apache.org/licenses/LICENSE-2.0)\n- MIT license ([LICENSE-MIT](LICENSE-MIT) or\n  http://opensource.org/licenses/MIT)\n\nat your option.\n\nUnless you explicitly state otherwise, any contribution intentionally submitted\nfor inclusion in the work by you, as defined in the Apache-2.0 license, shall\nbe dual licensed as above, without any additional terms or conditions.\n\nThe STFU-8 protocol/specification itself (including the name) is licensed under\nCC0 Community commons and anyone should be able to reimplement or change it for\nany purpose without need of attribution. However, using the same name for a\ncompletely different protocol would probably confuse people so please don't do\nit.\n\n","funding_links":[],"categories":["Libraries","库 Libraries","库"],"sub_categories":["Encoding","编码 Encoding","加密 Encoding","编码(Encoding)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitiral%2Fstfu8","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvitiral%2Fstfu8","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvitiral%2Fstfu8/lists"}