{"id":22826884,"url":"https://github.com/elixir-unicode/unicode","last_synced_at":"2026-04-29T05:00:54.469Z","repository":{"id":51082339,"uuid":"112465323","full_name":"elixir-unicode/unicode","owner":"elixir-unicode","description":"Unicode codepoint introspection and fast detection (lower, upper, alpha, numeric, whitespace, ...) in Elixir","archived":false,"fork":false,"pushed_at":"2024-04-28T05:19:03.000Z","size":5218,"stargazers_count":37,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-04-29T22:21:42.667Z","etag":null,"topics":["elixir","unicode"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elixir-unicode.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-29T11:10:00.000Z","updated_at":"2024-04-28T05:19:07.000Z","dependencies_parsed_at":"2023-10-16T08:39:45.869Z","dependency_job_id":"389f7c47-0dc0-482a-a748-b5887455a67c","html_url":"https://github.com/elixir-unicode/unicode","commit_stats":null,"previous_names":["elixir-cldr/cldr_unicode"],"tags_count":32,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-unicode%2Funicode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elixir-u
nicode%2Funicode/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elixir-unicode","download_url":"https://codeload.github.com/elixir-unicode/unicode/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230494917,"owners_count":18235046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elixir","unicode"],"created_at":"2024-12-12T18:06:38.784Z","updated_at":"2026-04-29T05:00:54.461Z","avatar_url":"https://github.com/elixir-unicode.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unicode\n\n![Build status](https://github.com/elixir-unicode/unicode/actions/workflows/ci.yml/badge.svg)\n[![Hex.pm](https://img.shields.io/hexpm/v/unicode.svg)](https://hex.pm/packages/unicode)\n[![Hex.pm](https://img.shields.io/hexpm/dw/unicode.svg?)](https://hex.pm/packages/unicode)\n[![Hex.pm](https://img.shields.io/hexpm/dt/unicode.svg?)](https://hex.pm/packages/unicode)\n[![Hex.pm](https://img.shields.io/hexpm/l/unicode.svg)](https://hex.pm/packages/unicode)\n\nFunctions to return information about Unicode codepoints.\n\nElixir strings are UTF-8-encoded [Unicode](https://unicode.org) binaries. This is a flexible and complete encoding scheme for the world's many scripts, characters and emojis. 
However since it's a variable length encoding (using between one and four bytes for UTF-8) it is harder to use high-performance byte-oriented functions to decompose strings.\n\nSince checking strings and codepoints for certain attributes - like whether they are upper case, or symbols, or whitespace - is a common occurrence, a performant approach to such detection is useful.\n\nIt is tempting to assume the use of [US ASCII](https://en.wikipedia.org/wiki/ASCII) encoding and checking only for characters in that range. For example it is very common to see code in Elixir checking `codepoint in ?a..?z` to check for lowercase alphabetic characters. When the underlying programming language has no canonical form for a string beyond bytes this may be considered acceptable - the programmer is defining the script domain as he or she sees fit.\n\nHowever since Elixir strings are declared to be [UTF-8 encoded Unicode strings](https://unicode.org/faq/utf_bom.html#utf8-1) it seems appropriate to make it easier to determine the characteristics of codepoints (and strings) using this standard.\n\nThe Elixir standard library does not provide introspection beyond that required to support casing (String.downcase/1, String.upcase/1, String.capitalize/1).  This library aims to *fill in the blanks* a little bit.\n\n### Unicode version\n\nAs of [unicode version 1.21.0](https://hex.pm/packages/unicode/1.21.0) published on January 19th, 2026, [Unicode 17.0](https://www.unicode.org/versions/Unicode17.0.0/) forms the underlying data.\n\n## Additional Unicode libraries\n\n[ex_unicode](https://hex.pm/packages/unicode) provides basic introspection of Unicode codepoints and strings.  
Additional libraries (either released or in development) build upon this library:\n\n* [unicode_set](https://github.com/elixir-unicode/unicode_set) implements functions to parse and match on [unicode sets](http://unicode.org/reports/tr35/#Unicode_Sets)\n\n* [unicode_guards](https://github.com/elixir-unicode/unicode_guards) is a simple library implementing common function guards using `unicode_set` and `unicode`\n\n* [unicode_string](https://github.com/elixir-unicode/unicode_string) is a library to implement efficient string splitting into words and sentences based upon the [Unicode Segmentation](https://unicode.org/reports/tr29/) algorithm.\n\n* [unicode_transform](https://github.com/elixir-unicode/unicode_transform) implements the [Unicode transform](https://unicode.org/reports/tr35/tr35-general.html#Transforms) specification.\n\n## Unicode Functions\n\nThe following is a partial list of functions included in the library. See the documentation for the relevant module for further information:\n\n### Codepoint ranges\n\nThese functions return the codepoints as a list of 2-tuples for the given property:\n\n* `Unicode.Block.blocks/0`\n* `Unicode.Script.scripts/0`\n* `Unicode.GeneralCategory.categories/0`\n* `Unicode.CombiningClass.combining_classes/0`\n* `Unicode.GraphemeBreak.grapheme_breaks/0`\n* `Unicode.LineBreak.line_breaks/0`\n* `Unicode.SentenceBreak.sentence_breaks/0`\n* `Unicode.IndicSyllabicCategory.indic_syllabic_categories/0`\n* `Unicode.Property.properties/0`\n\n### Introspection of codepoints and strings\n\nThe following functions return the block, script and category for codepoints and strings:\n\n*   `Unicode.script/1`\n\n    ```elixir\n    iex\u003e Unicode.script ?ä\n    :latin\n\n    iex\u003e Unicode.script ?خ\n    :arabic\n\n    iex\u003e Unicode.script ?अ\n    :devanagari\n    ```\n\n*   `Unicode.block/1`\n\n    ```elixir\n    iex\u003e Unicode.block ?ä\n    :latin_1_supplement\n\n    iex\u003e Unicode.block ?A\n    :basic_latin\n\n    iex\u003e 
Unicode.block \"äA\"\n    [:latin_1_supplement, :basic_latin]\n    ```\n\n*   `Unicode.category/1`\n\n    ```elixir\n    iex\u003e Unicode.category ?ä\n    :Ll\n    iex\u003e Unicode.category ?A\n    :Lu\n    iex\u003e Unicode.category ?🧐\n    :So\n    ```\n\n*   `Unicode.properties/1`\n\n    ```elixir\n    iex\u003e Unicode.properties 0x1bf0\n    [\n      :alphabetic,\n      :case_ignorable,\n      :grapheme_extend,\n      :id_continue,\n      :other_alphabetic,\n      :xid_continue\n    ]\n\n    iex\u003e Unicode.properties ?A\n    [\n      :alphabetic,\n      :ascii_hex_digit,\n      :cased,\n      :changes_when_casefolded,\n      :changes_when_casemapped,\n      :changes_when_lowercased,\n      :grapheme_base,\n      :hex_digit,\n      :id_continue,\n      :id_start,\n      :uppercase,\n      :xid_continue,\n      :xid_start\n    ]\n\n    iex\u003e Unicode.properties ?+\n    [:grapheme_base, :math, :pattern_syntax]\n\n    iex\u003e Unicode.properties \"a1+\"\n    [\n      [\n        :alphabetic,\n        :ascii_hex_digit,\n        :cased,\n        :changes_when_casemapped,\n        :changes_when_titlecased,\n        :changes_when_uppercased,\n        :grapheme_base,\n        :hex_digit,\n        :id_continue,\n        :id_start,\n        :lowercase,\n        :xid_continue,\n        :xid_start\n      ],\n      [\n        :ascii_hex_digit,\n        :emoji,\n        :grapheme_base,\n        :hex_digit,\n        :id_continue,\n        :xid_continue\n      ],\n      [:grapheme_base, :math, :pattern_syntax]\n    ]\n    ```\n\n### Character classes\n\nThese functions help filter codepoints and strings based upon their properties. 
They return a boolean result.\n\n* `Unicode.alphabetic?/1`\n* `Unicode.alphanumeric?/1`\n* `Unicode.digits?/1`\n* `Unicode.numeric?/1`\n* `Unicode.emoji?/1`\n* `Unicode.math?/1`\n* `Unicode.cased?/1`\n* `Unicode.lowercase?/1`\n* `Unicode.uppercase?/1`\n\nAny known property can be called as a function `Unicode.Property.\u003cproperty_name\u003e(codepoint_or_string)` or `Unicode.Property.\u003cproperty_name\u003e?(codepoint_or_string)` to return a boolean.\n\n### Transformations\n\nThe function `Unicode.unaccent/1` attempts to transform a Unicode string into a subset of the Latin-1 alphabet by removing diacritical marks from text. It is not a full transformation; that will be available in the upcoming `unicode_transform` library.\n\n## Recognition\n\nThe information functions are heavily inspired by [@qqwy's elixir-unicode package](https://github.com/Qqwy/elixir-unicode) and compatibility with some of the API is represented by including some of the doctests from that package. Originally published under the `:unicode` package name on Hex, this original work is now replaced with this library code.\n\n## Installation\n\nThe package can be installed by adding `unicode` to your list of dependencies in `mix.exs`:\n\n\u003c!-- BEGIN: VERSION --\u003e\n```elixir\ndef deps do\n  [\n    {:unicode, \"~\u003e 1.21\"}\n  ]\nend\n```\n\u003c!-- END: VERSION --\u003e\n\nThe docs can be found at [https://hexdocs.pm/unicode](https://hexdocs.pm/unicode).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felixir-unicode%2Funicode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felixir-unicode%2Funicode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felixir-unicode%2Funicode/lists"}