{"id":18006089,"url":"https://github.com/brianhicks/elm-string-graphemes","last_synced_at":"2025-03-26T10:32:18.446Z","repository":{"id":57674508,"uuid":"194597116","full_name":"BrianHicks/elm-string-graphemes","owner":"BrianHicks","description":"Do string operations based on graphemes instead of codepoints or bytes.","archived":false,"fork":false,"pushed_at":"2023-06-27T17:22:50.000Z","size":400,"stargazers_count":24,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-21T16:06:18.885Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://package.elm-lang.org/packages/BrianHicks/elm-string-graphemes/latest/","language":"Elm","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BrianHicks.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-01T04:01:21.000Z","updated_at":"2024-05-02T15:05:10.000Z","dependencies_parsed_at":"2024-10-30T00:47:07.244Z","dependency_job_id":null,"html_url":"https://github.com/BrianHicks/elm-string-graphemes","commit_stats":{"total_commits":196,"total_committers":1,"mean_commits":196.0,"dds":0.0,"last_synced_commit":"20b2a4aa8204b6b844a7a2367deafbaac04353a8"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianHicks%2Felm-string-graphemes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianHicks%2Felm-string-graphemes/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/BrianHicks%2Felm-string-graphemes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianHicks%2Felm-string-graphemes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BrianHicks","download_url":"https://codeload.github.com/BrianHicks/elm-string-graphemes/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245637302,"owners_count":20648125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-30T00:23:14.649Z","updated_at":"2025-03-26T10:32:17.949Z","avatar_url":"https://github.com/BrianHicks.png","language":"Elm","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Graphemes\n\nDo string operations based on graphemes instead of codepoints or bytes.\nCompare:\n\n```elm\nimport String.Graphemes\n\nString.toList \"🦸🏽‍♂️\" --\u003e [ '🦸', '🏽', '\\u{200D}', '♂', '\\u{FE0F}' ]\n\nString.Graphemes.toList \"🦸🏽‍♂️\" --\u003e [ \"🦸🏽‍♂️\" ]\n```\n\nThis package currently supports **Unicode 15**.\n\n## What's going on here? Graphemes? 
What are those?\n\nUnicode defines a system for encoding characters as numbers.\nThese numbers are called codepoints!\nFor example, `a` is codepoint 97, usually written in hex like `0x0061`.\nThere is a huge range of possible codepoints (from `0x0000` to `0x10FFFF`), although not all of these match a symbol.\n\nCodepoints are more complex than numbers, though: for a variety of reasons, a codepoint is encoded using a variable number of bytes (one to four in UTF-8) instead of a fixed width.\nThat means a string is not simply an array of fixed-width 32-bit integers!\n\nWe do this partially for historical compatibility with ASCII, and partially to save space.\nFor example, you can encode `a` (`0x0061`) in 1 byte, but 🦸 (`0x1F9B8`) takes four.\nIf they didn't vary in length, you would have to pad out `a` with 3 bytes worth of zeros just to support both in the same string!\n\nThere's another layer of optimization, though!\nImagine if you had to store a separate character for each accent mark like a, à, ā, ä, and á.\nYou'd have a lot of characters on your hands, even before considering capital and lowercase letters!\nPlus, some languages use multiple accents for some characters!\nThe combinations get ridiculous really fast, but we only have 1,114,111 (`0x10FFFF`) possible codepoints!\nSo what we do is hardcode some combinations (like ä) for efficiency, but make separate codepoints for accents and let the software figure out how to combine them.\nThese are called diacritic marks.\nSo in addition to the hardcoded ä, you can put `a` and `¨` together to get the same thing.\nYou can do this with more-or-less whatever characters and marks you want.\n\nIf you get really wild, you end up with z̴̙͒ả̴̫̼̫̀̅ĺ̴̔̿͜g̷̨͇͉̊͐̚o̶̳̣̯͌̓ text!\n\nThis raises another problem, though… if I have ä, I think of that as a single character, not two.\nBut if I've encoded it as two codepoints and ask for the string length, it may tell me I have two characters!\nWe deal with that using our final level: the grapheme.\n\nA grapheme is what you'd intuitively 
think of as \"a character\" in a writing system.\nWhenever you combine codepoints, you're working with graphemes.\nThis applies to diacritic marks, as we've already explained, and tons of writing systems use graphemes: Hangul, Devanagari, Thai, and Tamil among others!\nBut it also applies to emoji!\nFor example: 🦸🏽‍♂️ is composed of 🦸 + 🏽 + zero-width joiner (200D) + ♂ + variation selector 16 (FE0F).\nYou tend to think of 🦸🏽‍♂️ as a single character—a very definite expression which can't really be broken up into constituent parts.\nThat means it's a grapheme!\n\nBut, final subtlety: if you used 🦸 by itself it's a grapheme too.\nThe point is not \"what codepoints are there?\", it's \"what is the smallest useful unit when expressing meaning?\"\n\n### So what?\n\nThe above means that when we ask questions like \"how long is this string?\" or \"what is the first character here?\" we sometimes mix three levels:\n\n1. **the byte level.**\n   Operations like `String.length` and `String.left` operate here (or, more specifically, they operate at the UTF-16 level, which assumes that codepoints are two bytes wide.)\n   You should probably never operate here when working with `String` in Elm.\n   It will result in subtle bugs and corrupt data!\n   If you know you're working at the byte level, use [`elm/bytes`](https://package.elm-lang.org/packages/elm/bytes/latest/) instead.\n\n2. **the codepoint level.**\n   Here, our base superhero emoji is only one character, but our skin tone and gender (🦸🏽‍♂️) take more, as discussed.\n   This particular combination happens to be 17 *bytes* but only 5 *codepoints*.\n   Operations like `String.foldl` operate here (so you can safely measure codepoint length with operations like `String.foldl (\\_ len -\u003e len + 1) 0 \"whatever string\"`.)\n   You should operate here if you're implementing higher-level operations on the codepoints, like grapheme segmentation (hi!) or normalization.\n\n3. 
**the grapheme level.**\n   Despite being 5 codepoints, 🦸🏽‍♂️ is only one grapheme.\n   Operations like `String.Graphemes.toList` operate here.\n   You should operate here if you're working with unicode text in ways meaningful to a user.\n\nTo underscore, if you're modifying text that the user has entered, work at the grapheme level.\nThis reduces the possibility of errors and increases the possibility that your program will \"do the right thing.\"\n\nStill not convinced?\nHere are some practical reasons you should work at the grapheme level in the browser:\n\n- If you operate at the *byte* level, you will split multi-byte characters into invalid unicode sequences.\n  If you do the wrong thing with these sequences, you'll crash your user's browser.\n  In fact, that's what started me writing this library!\n  Everyone does it occasionally, but there are better ways.\n\n- If you operate at the *codepoint* level, you will split skin tones and genders off of people emoji, split flags into country codes, and move diacritic marks around.\n  Your user entered this text precisely in these cases. Don't lose their meaning!\n\n- Think your app doesn't need those pesky diacritic marks?\n  Think again!\n  They're crucial to understanding in a lot of languages!\n  For example, in Spanish, papa (potato) is different than papá (father.)\n  Don't make your users call their dad a potato!\n\n## Frequently Asked Questions\n\n### What spec does this package implement?\n\nThe [Grapheme Cluster Boundaries](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) section of [UAX #29](https://unicode.org/reports/tr29/).\n\n### Does this package correctly reverse strings with diacritics?\n\nYes!\nIt reverses the order of the graphemes, not the codepoints.\nThis means that it does not move diacritics around and emoji are perfectly safe.\n\n```elm\nimport String.Graphemes\n\n-- äo without normalization\nString.Graphemes.reverse \"a\\u{0308}o\" --\u003e \"oa\\u{0308}\"\n\n-- compare with 
String\nString.reverse \"a\\u{0308}o\" --\u003e \"o\\u{0308}a\"\n```\n\n### Does this package do normalization?\n\nNo, and it probably never will.\nIt's a [whole 'nother spec](https://unicode.org/reports/tr15/#Norm_Forms) in the Unicode standard which doesn't really fit in this package.\n\nThat said, it *looks* like you could implement it in a similar way as the internal `String.Graphemes.Parser`, so give it a go in a new package of your own!\n\n(n.b. normalization in this case means turning `\"a\\u{0308}\"` into `\"ä\"`, usually for the purposes of improving equality checks.)\n\n### Does this package segment words or sentences?\n\nNo, and it probably never will.\nSegmenting words and sentences is locale- and implementation-dependent, so it's really hard to address them in a general way.\nRather than introducing confusion (\"it *should* segment here… why doesn't it?\") we only segment graphemes.\n\nThat said, word and sentence segmentation rely on grapheme segmentation, so you're on the right track by asking this!\n[UAX #29](https://unicode.org/reports/tr29/) has guidance here.\n\n### Why not \"fix\" `elm/core`'s `String` instead of writing a new package?\n\nThe `String` module solves a different—but overlapping—set of problems.\nFor example, you do not always want to work with graphemes: sometimes you need to be able to decompose into codepoints or operate at the byte level.\nAs usual, it's all tradeoffs.\n\nThat said, if it eventually becomes obvious that merging into core would be a good thing we may do that.\nIn that case, we would probably just keep equivalents of `String.Graphemes.uncons` and `String.Graphemes.foldl`.\nEverything else is implemented in terms of those two operations.\n\n### Why a drop-in replacement? 
/ Why does the code refer to `String` functions so much?\n\nUnless you've worked with unicode strings a lot, it can be tricky to know which level (bytes, codepoints, or graphemes) you're operating on at any given time.\nSo instead of giving you the functions you *might* need, and leaving you to implement the rest on your own, we provide all of them and only change the ones where you'd run into trouble.\n\nBut not *all* of the functions in `String` need to be modified.\nIn those cases, we just pass through to the `String` functions!\n\nThis way, you don't have to worry about it.\nYou could potentially do `import String.Graphemes as String` in a module, fix the type errors, and all of a sudden all your string operations work with graphemes.\n\n## Climate Action\n\nI want my open-source activities to support projects addressing the climate crisis (for example, projects in clean energy, public transit, reforestation, or sustainable agriculture.)\nIf you are working on such a project, and find a bug or missing feature in any of my libraries, **please let me know and I will treat your issue as high priority.**\nI'd also be happy to support such projects in other ways.\nIn particular, I've worked with Elm for a long time and would be happy to advise on your implementation.\n\n## License\n\nThe code in this project is licensed under the BSD 3-Clause license, located at LICENSE in the source.\n\nThe documentation strings in `String.Graphemes` are derived from those in `elm/core`'s `String`, © 2019 Evan Czaplicki, and licensed under the BSD 3-Clause license.\n\nThe grapheme break property data used here are © 2019 Unicode®, Inc., and licensed under their [terms of 
use](http://www.unicode.org/terms_of_use.html).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrianhicks%2Felm-string-graphemes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrianhicks%2Felm-string-graphemes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrianhicks%2Felm-string-graphemes/lists"}