{"id":15107659,"url":"https://github.com/svenvc/utf8string","last_synced_at":"2025-06-11T23:03:23.532Z","repository":{"id":146208787,"uuid":"466188592","full_name":"svenvc/UTF8String","owner":"svenvc","description":"A proof of concept / prototype alternative String implementation for Pharo using a variable length UTF8 encoded internal representation","archived":false,"fork":false,"pushed_at":"2022-05-07T16:22:53.000Z","size":46,"stargazers_count":12,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-06T15:39:26.436Z","etag":null,"topics":["pharo","pharo-smalltalk","string","utf8"],"latest_commit_sha":null,"homepage":"","language":"Smalltalk","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/svenvc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-04T16:11:34.000Z","updated_at":"2024-02-18T23:18:05.000Z","dependencies_parsed_at":"2023-04-11T17:32:14.230Z","dependency_job_id":null,"html_url":"https://github.com/svenvc/UTF8String","commit_stats":{"total_commits":12,"total_committers":2,"mean_commits":6.0,"dds":"0.16666666666666663","last_synced_commit":"fbc8cc387a310042e1a81e6c25714b2bc4c6f0e1"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/svenvc/UTF8String","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/svenvc%2FUTF8String","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/svenvc%2FUTF8String/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/svenvc%2FUTF8String/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/svenvc%2FUTF8String/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/svenvc","download_url":"https://codeload.github.com/svenvc/UTF8String/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/svenvc%2FUTF8String/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259360728,"owners_count":22845817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pharo","pharo-smalltalk","string","utf8"],"created_at":"2024-09-25T21:40:47.988Z","updated_at":"2025-06-11T23:03:23.509Z","avatar_url":"https://github.com/svenvc.png","language":"Smalltalk","readme":"# UTF8String\n\nA proof of concept / prototype alternative String implementation for Pharo\nusing a variable length UTF8 encoded internal representation.\n\n\n## Introduction\n\nIn Pharo Strings, sequences of Characters, are implemented by storing the Unicode code points of the Characters.\nIn general, 32 bits are needed for Unicode code points. 
However, the most common ASCII and Latin1 code points fit in 8 bits.\nTwo subclasses of String, WideString and ByteString respectively, cover these cases, transparently.\n\nWhen doing IO or using FFI, Strings must be encoded using an encoder, to and from a ByteArray or binary stream.\nMany encodings are in common use, but today, UTF8 has basically won, as it is the default encoding almost everywhere.\n\nThere is a real cost associated with encoding and decoding, especially with a variable length encoding such as UTF8.\n\nSo one might ask the question: could we not use UTF8 as the internal representation of Strings?\nSome other programming languages, most notably Swift, took this road years ago.\n\n\n## Implementation\n\nUTF8String is a proof of concept / prototype alternative String implementation for Pharo\nusing a variable length UTF8 encoded internal representation to explore this idea.\nFurthermore, UTF8String is readonly (no #at:put:).\n\nThe main problem with UTF8 is that it is a variable length encoding, with Characters being encoded using 1 to 4 bytes.\nThis means two things: indexing is much harder, as it basically comes down to a linear scan,\nand similarly, knowing the length in number of Characters can only be done after a linear scan.\n\nReplacing one character with another is almost impossible, since the new character may need a different number of bytes, which shifts everything after it.\n\nThere are two clear advantages: IO and FFI can be done with zero cost (to UTF8 obviously, not to other encodings)\nand space usage is more efficient in most cases (when at least one character does not fit in 8 bits).\n\n\n## Indexing and length caching\n\nThe UTF8String implementation just stores the UTF8 encoded bytes.\nIt tries to avoid indexing and counting if at all possible.\nIf indexing or the character count is needed, a single scan is performed\nthat creates an index entry every stride (32) characters,\nwhile also storing the length (#computeCountAndIndex).\nFurther operations can then be performed faster.\nThe key internal operations 
being:\n\n- #byteIndexAt: characterIndex\n- #characterIndexAt: byteIndex\n\nBy using the index, the linear search is limited to stride (32) characters at most.\n\n\n## Operations\n\nA surprising number of operations are possible that avoid indexing\nor the character count:\n\n- equality (#=)\n- hashing (#hash)\n- character inclusion (#includes:)\n- empty test (#isEmpty)\n- substring searching (#includesSubstring:)\n- prefix/suffix matching (#beginsWith: #endsWith:)\n- concatenation (#,)\n\nMany other operations can be written using only a single (partial) scan:\n\n- finding tokens (#findTokens:)\n- formatting by interpolation (#format:)\n- printing (#printOn:)\n- comparing/sorting (#threeWayCompareTo: #\u003c #\u003c= #\u003e= #\u003e)\n- partial copying (#copyUpTo:)\n- enumeration (#do: #reverseDo: #collect: #readStream)\n\nOn the other hand, many traditional operations trigger indexing and character counting:\n\n- indexing (#at:)\n- counting the characters (#size)\n- convenience accessors (#first #last)\n- finding the index of a character or substring (#indexOf:[startingAt:] #indexOfSubCollection:)\n- substring selection (#copyFrom:to:)\n\n\n## Discussion\n\nThe implementation was written to see if it could be done and how it would feel.\nNot every algorithm is fully optimal; more specialized loops are possible.\n\nCreating a UTF8String on UTF8 encoded bytes is a zero cost operation\nonly if we assume the encoding is correct. 
A validate operation is available\nto check this, but that defeats the speed advantage for the most part.\nBTW, validate automatically does indexing and character counting.\n\nAn aspect that was ignored is the concept of Unicode normalization with respect to concatenation.\nThis is a hard subject that has been solved in Pharo using external code, but it is not integrated in this implementation.\n\nThe concept of readonly strings is worth considering and feels acceptable, but requires a certain mindset.\n\n\n## Conclusion\n\nAlthough this experiment went well, it is not meant for actual use.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsvenvc%2Futf8string","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsvenvc%2Futf8string","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsvenvc%2Futf8string/lists"}