{"id":23579877,"url":"https://github.com/elixir-unicode/unicode_string","last_synced_at":"2026-04-14T07:00:45.110Z","repository":{"id":62430657,"uuid":"223293006","full_name":"elixir-unicode/unicode_string","owner":"elixir-unicode","description":"String utilities based upon Unicode sets","archived":false,"fork":false,"pushed_at":"2026-01-18T18:41:37.000Z","size":907,"stargazers_count":20,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-07T23:42:45.360Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elixir-unicode.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-11-22T00:48:53.000Z","updated_at":"2026-01-18T18:32:51.000Z","dependencies_parsed_at":"2025-05-06T18:43:19.575Z","dependency_job_id":"f362b20d-fd82-41ba-886f-e0bf11bf92de","html_url":"https://github.com/elixir-unicode/unicode_string","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/elixir-unicode/unicode_string","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode_string","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode_string/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode_string/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicod
e%2Funicode_string/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elixir-unicode","download_url":"https://codeload.github.com/elixir-unicode/unicode_string/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode_string/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31785681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-26T23:13:09.260Z","updated_at":"2026-04-14T07:00:45.094Z","avatar_url":"https://github.com/elixir-unicode.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unicode String\n\n![Build status](https://github.com/elixir-unicode/unicode_string/actions/workflows/ci.yml/badge.svg)\n[![Hex.pm](https://img.shields.io/hexpm/v/unicode_string.svg)](https://hex.pm/packages/unicode_string)\n[![Hex.pm](https://img.shields.io/hexpm/dw/unicode_string.svg?)](https://hex.pm/packages/unicode_string)\n[![Hex.pm](https://img.shields.io/hexpm/l/unicode_string.svg)](https://hex.pm/packages/unicode_string)\n\nAdds functions supporting some string algorithms in the Unicode standard. 
For example:\n\n* The [Unicode Case Folding](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm to provide case-independent equality checking irrespective of language or script with `Unicode.String.fold/2` and `Unicode.String.equals_ignoring_case?/2`.\n\n* The [Unicode Case Mapping](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm that implements locale-aware `Unicode.String.upcase/2`, `Unicode.String.downcase/2` and `Unicode.String.titlecase/2`.\n\n* The [Unicode Segmentation](https://unicode.org/reports/tr29/) algorithm to detect, break, split or stream strings into grapheme clusters, words and sentences.\n\n* The [Unicode Line Breaking](https://www.unicode.org/reports/tr14/) algorithm to determine line breaks (breaks meaning where word-wrapping would be acceptable).\n\n## Installation\n\nThe package can be installed by adding `:unicode_string` to your list of dependencies in `mix.exs`:\n\n```elixir\ndef deps do\n  [\n    {:unicode_string, \"~\u003e 1.0\"},\n    ...\n  ]\nend\n```\n\nThen run `mix deps.get`.\n\n\u003e #### Word Break Dictionary Download {: .info}\n\u003e\n\u003e If you plan to perform word break segmentation on Chinese, Japanese, Lao,\n\u003e Burmese, Thai or Khmer languages you will need to download the word break dictionaries\n\u003e by running `mix unicode.string.download.dictionaries`.\n\n## Casing\n\n### Case Folding\n\nThe [Unicode Case Folding](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm defines how to perform case folding. This allows comparison of strings in a case-insensitive fashion. It does not define the means to compare ignoring diacritical marks (accents). Some examples follow; for details see:\n\n* `Unicode.String.fold/2`\n* `Unicode.String.equals_ignoring_case?/3`\n\n\u003e #### Note {: .info}\n\u003e\n\u003e Although the folding algorithm commonly downcases characters, folding is not a general-purpose downcasing process. 
It exists only to facilitate case-insensitive string comparison.\n\n\n```elixir\niex\u003e Unicode.String.equals_ignoring_case? \"ABC\", \"abc\"\ntrue\n\niex\u003e Unicode.String.equals_ignoring_case? \"beißen\", \"beissen\"\ntrue\n\niex\u003e Unicode.String.equals_ignoring_case? \"grüßen\", \"grussen\"\nfalse\n```\n\n### Case Mapping\n\nThe [Unicode Case Mapping](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm defines the process and data to transform text into upper case, lower case or title case. Since most languages are not bicameral, characters which have no case mapping remain unchanged.\n\nThree case mapping functions are provided:\n\n* `Unicode.String.upcase/2` which will convert text to upper case characters.\n* `Unicode.String.downcase/2` which will convert text to lower case characters.\n* `Unicode.String.titlecase/2` which will convert text to title case. Title case means that the first character of each word is set to upper case and all other characters in the word are set to lower case. 
`Unicode.String.split/2` is used to split the string into words before title casing.\n\nEach function operates in a locale-aware manner implementing some basic capabilities:\n\n* Casing rules for the Turkish dotted capital `I` and dotless small `i`.\n* Casing rules for the retention of dots over `i` for Lithuanian letters with additional accents.\n* Titlecasing of IJ at the start of words in Dutch.\n* Removal of diacritics when upper casing letters in Greek.\n\nThere are other casing rules that are not currently implemented such as:\n\n* Titlecasing of second or subsequent letters in words in orthographies that include caseless letters such as apostrophes.\n* Uppercasing of U+00DF `ß` latin small letter sharp `s` to U+1E9E `ẞ` latin capital letter sharp `s`.\n\n```elixir\n# Basic case transformation\niex\u003e Unicode.String.upcase(\"the quick brown fox\")\n\"THE QUICK BROWN FOX\"\n\n# Dotted-I in Turkish and Azeri\niex\u003e Unicode.String.upcase(\"Diyarbakır\", locale: :tr)\n\"DİYARBAKIR\"\n\n# Upper case in Greek removes diacritics\niex\u003e Unicode.String.upcase(\"Πατάτα, Αέρας, Μυστήριο\", locale: :el)\n\"ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ\"\n\n# Lower case Greek with a final sigma\niex\u003e Unicode.String.downcase(\"ὈΔΥΣΣΕΎΣ\", locale: :el)\n\"ὀδυσσεύς\"\n\n# Title case Dutch with leading diphthong\niex\u003e Unicode.String.titlecase(\"ijsselmeer\", locale: :nl)\n\"IJsselmeer\"\n```\n\n## Segmentation\n\nThe [Unicode Segmentation](https://unicode.org/reports/tr29/) annex details the algorithm to be applied when segmenting text (Elixir strings) into words, sentences, graphemes and line breaks. Some examples follow; for details see:\n\n* `Unicode.String.split/2`\n* `Unicode.String.break?/2`\n* `Unicode.String.break/2`\n* `Unicode.String.splitter/2`\n* `Unicode.String.next/2`\n* `Unicode.String.stream/2`\n\n```elixir\n# Split text at a word boundary.\niex\u003e Unicode.String.split \"This is a sentence. 
And another.\", break: :word\n[\"This\", \" \", \"is\", \" \", \"a\", \" \", \"sentence\", \".\", \" \", \"And\", \" \", \"another\", \".\"]\n\n# Split text at a word boundary but omit any whitespace\niex\u003e Unicode.String.split \"This is a sentence. And another.\", break: :word, trim: true\n[\"This\", \"is\", \"a\", \"sentence\", \".\", \"And\", \"another\", \".\"]\n\n# Split text at a sentence boundary.\niex\u003e Unicode.String.split \"This is a sentence. And another.\", break: :sentence\n[\"This is a sentence. \", \"And another.\"]\n\n# By default, common abbreviations are suppressed (ie\n# they do not cause a break)\niex\u003e Unicode.String.split \"No, I don't have a Ph.D. but I don't think it matters.\", break: :word, trim: true\n[\"No\", \",\", \"I\", \"don't\", \"have\", \"a\", \"Ph.D\", \".\", \"but\", \"I\", \"don't\",\n \"think\", \"it\", \"matters\", \".\"]\n\niex\u003e Unicode.String.split \"No, I don't have a Ph.D. but I don't think it matters.\", break: :sentence, trim: true\n[\"No, I don't have a Ph.D. but I don't think it matters.\"]\n\n# Sentence Break suppressions are locale sensitive.\niex\u003e Unicode.String.Segment.known_locales\n[\"de\", \"el\", \"en\", \"en-US\", \"en-US-POSIX\", \"es\", \"fi\", \"fr\", \"it\", \"ja\", \"pt\",\n \"root\", \"ru\", \"sv\", \"zh\", \"zh-Hant\"]\n\niex\u003e Unicode.String.split \"Non, c'est M. Dubois.\", break: :sentence, trim: true, locale: \"fr\"\n[\"Non, c'est M. Dubois.\"]\n\n# Note that break: :line does NOT mean split the string\n# at newlines. It splits the string where a line break would be\n# acceptable. This is very useful for calculating where\n# to perform word-wrap on some text.\niex\u003e Unicode.String.split \"This is a sentence. And another.\", break: :line\n[\"This \", \"is \", \"a \", \"sentence. 
\", \"And \", \"another.\"]\n```\n\n### Dictionary-based word segmentation\n\nSome languages, commonly East Asian and Southeast Asian languages, don't typically use whitespace to separate words, so a dictionary lookup is needed for word-break segmentation.\n\nThis implementation supports dictionary-based word breaking for:\n\n* Chinese (`zh`, `zh-Hant`, `zh-Hans`, `zh-Hant-HK`, `yue`, `yue-Hans`) locales,\n* Japanese (`ja`) using the same dictionary as for Chinese,\n* Thai (`th`),\n* Lao (`lo`),\n* Khmer (`km`) and\n* Burmese (`my`).\n\nThe dictionaries are those used in [CLDR](https://cldr.unicode.org) since they are under an open source license and are consistent with [ICU](https://icu.unicode.org).\n\nNote that these dictionaries need to be downloaded with `mix unicode.string.download.dictionaries` prior to use. Each dictionary will be parsed and loaded into [persistent_term](https://www.erlang.org/doc/man/persistent_term) on demand. Note that each dictionary has a sizable memory footprint as measured by `:persistent_term.info/0`:\n\n| Dictionary  | Memory Mb   |\n| ----------- | ----------: |\n| Chinese     | 104.8       |\n| Thai        | 9.6         |\n| Lao         | 11.4        |\n| Khmer       | 38.8        |\n| Burmese     | 23.1        |\n\n#### How dictionary break works\n\nFor Thai, Lao, Khmer, and Burmese the dictionary break algorithm is implemented in `Unicode.String.DictionaryBreak`. It uses the same approach as ICU's `DictionaryBreakEngine`: a cost-based lookahead that considers multiple word candidates at each position to find the best segmentation.\n\nThe algorithm proceeds through the text as follows:\n\n1. **Candidate gathering.** At each position, all dictionary words that start at that position are found (shortest to longest match) using prefix search against the trie-structured dictionary.\n\n2. **Single candidate.** If exactly one candidate matches, it is accepted immediately.\n\n3. 
**Multiple candidates with 3-word lookahead.** When multiple candidates exist, each is tested by looking ahead up to two more words. The candidate that leads to the longest chain of consecutive dictionary words wins. Candidates are tried longest-first, and the first candidate confirmed by a 3-word chain is accepted.\n\n4. **Non-dictionary resync.** When no dictionary word is found (or only a very short one), the algorithm scans forward through non-dictionary characters until reaching a position where dictionary words resume. The non-dictionary stretch is combined with the preceding word.\n\n5. **Combining mark absorption.** After each word boundary, any following Unicode combining marks (General Category M — vowel signs, tone marks, virama/coeng characters) are absorbed into the preceding word so that diacritics remain attached to their base.\n\n6. **Thai suffix handling.** For Thai, the suffix characters PAIYANNOI (U+0E2F) and MAIYAMOK (U+0E46) are absorbed into the preceding word when no dictionary word follows.\n\nFor Chinese and Japanese, the standard [UAX #29](https://unicode.org/reports/tr29/) word-break rules are used with dictionary lookups for ideographic character sequences. The dictionary determines word boundaries within runs of CJK ideographs.\n\n#### Mixed-script text\n\nWhen text contains a mix of dictionary-script characters and other scripts (e.g., a Khmer sentence with embedded Latin words), the `split_with_fallback/3` function partitions the text into same-script runs. Dictionary breaking is applied to the target-script ranges, and a fallback function (typically the standard UAX #29 word breaker) handles the rest. The results are concatenated to produce a single segmentation covering the full string.\n\nSee `conformance.md` for details on conformance with the UAX #29 break algorithm and differences between this implementation and ICU.\n\n## Segment Streaming\n\nSegmentation can also be streamed using `Unicode.String.stream/2`. 
For large strings this may reduce memory usage since the intermediate segments will be garbage collected when they fall out of scope.\n\n```elixir\niex\u003e Enum.to_list Unicode.String.stream(\"this is a list of words\", trim: true)\n[\"this\", \"is\", \"a\", \"list\", \"of\", \"words\"]\n\niex\u003e Enum.map Unicode.String.stream(\"this is a list of words\", trim: true),\n...\u003e   fn word -\u003e %{word: word, length: String.length(word)} end\n[\n  %{length: 4, word: \"this\"},\n  %{length: 2, word: \"is\"},\n  %{length: 1, word: \"a\"},\n  %{length: 4, word: \"list\"},\n  %{length: 2, word: \"of\"},\n  %{length: 5, word: \"words\"}\n]\n```\n\n## References\n\n* Unicode maintains a [break testing utility](https://util.unicode.org/UnicodeJsps/breaks.jsp).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felixir-unicode%2Funicode_string","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felixir-unicode%2Funicode_string","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felixir-unicode%2Funicode_string/lists"}