{"id":20776736,"url":"https://github.com/sjorek/unicode-normalization","last_synced_at":"2025-12-25T02:03:56.300Z","repository":{"id":57052296,"uuid":"121472575","full_name":"sjorek/unicode-normalization","owner":"sjorek","description":"An enhanced facade to existing unicode-normalization implementations.","archived":false,"fork":false,"pushed_at":"2018-03-25T16:52:48.000Z","size":7676,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-11T21:44:24.716Z","etag":null,"topics":["composer","composer-package","php","stream-filter","unicode"],"latest_commit_sha":null,"homepage":"https://sjorek.github.io/unicode-normalization/","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sjorek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-14T04:57:51.000Z","updated_at":"2020-07-07T06:37:04.000Z","dependencies_parsed_at":"2022-08-24T05:10:16.712Z","dependency_job_id":null,"html_url":"https://github.com/sjorek/unicode-normalization","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/sjorek/unicode-normalization","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjorek%2Funicode-normalization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjorek%2Funicode-normalization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjorek%2Funicode-normalization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjorek%2Funicode-norm
alization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sjorek","download_url":"https://codeload.github.com/sjorek/unicode-normalization/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sjorek%2Funicode-normalization/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28017003,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-25T02:00:05.988Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["composer","composer-package","php","stream-filter","unicode"],"created_at":"2024-11-17T13:11:22.941Z","updated_at":"2025-12-25T02:03:56.282Z","avatar_url":"https://github.com/sjorek.png","language":"PHP","readme":"# [Unicode-Normalization](https://sjorek.github.io/unicode-normalization/)\n\nA [composer](http://getcomposer.org)-package providing an enhanced facade to existing unicode-normalization\nimplementations.\n\n\n## Installation\n\n```bash\nphp composer.phar require sjorek/unicode-normalization\n```\n\n\n## Usage\n\n### Unicode Normalization\n\n```php\n\u003c?php\n\n/**\n * Class for normalizing unicode.\n *\n *    “Normalization: A process of removing alternate representations of equivalent\n *    sequences from textual data, to convert the data into a form that can be\n *    binary-compared for equivalence. 
In the Unicode Standard, normalization refers\n *    specifically to processing to ensure that canonical-equivalent (and/or\n *    compatibility-equivalent) strings have unique representations.”\n *\n *     -- quoted from unicode glossary linked below\n *\n * @see http://www.unicode.org/glossary/#normalization\n * @see http://www.php.net/manual/en/class.normalizer.php\n * @see http://www.w3.org/wiki/I18N/CanonicalNormalization\n * @see http://www.w3.org/TR/charmod-norm/\n * @see http://blog.whatwg.org/tag/unicode\n * @see http://en.wikipedia.org/wiki/Unicode_equivalence\n * @see http://stackoverflow.com/questions/7931204/what-is-normalized-utf-8-all-about\n * @see http://php.net/manual/en/class.normalizer.php\n */\nclass Sjorek\\UnicodeNormalization\\Normalizer\n    implements Sjorek\\UnicodeNormalization\\Implementation\\NormalizerInterface\n{\n\n    /**\n     * Constructor.\n     *\n     * @param null|bool|int|string $form (optional) Set normalization form, default: NFC\n     *\n     * Besides the normalization form class constants defined below,\n     * the following case-insensitive aliases are supported:\n     * \u003cpre\u003e\n     * - Disable unicode-normalization     : 0,  false, null, empty\n     * - Ignore/skip unicode-normalization : 1,  NONE, true, binary, default, validate\n     * - Normalization form D              : 2,  NFD, FORM_D, D, form-d, decompose, collation\n     * - Normalization form D (mac)        : 18, NFD_MAC, FORM_D_MAC, D_MAC, form-d-mac, d-mac, mac\n     * - Normalization form KD             : 3,  NFKD, FORM_KD, KD, form-kd\n     * - Normalization form C              : 4,  NFC, FORM_C, C, form-c, compose, recompose, legacy, html5\n     * - Normalization form KC             : 5,  NFKC, FORM_KC, KC, form-kc, matching\n     * \u003c/pre\u003e\n     *\n     * Hints:\n     * \u003cpre\u003e\n     * - The W3C recommends NFC for HTML5 Output.\n     * - Mac OS X's HFS+ filesystem uses a NFD variant to store paths. 
We provide one implementation for this\n     *   special variant, but plain NFD works in most cases too. Even if you use something other than NFD or its\n     *   variant, HFS+ will always use decomposed NFD path-strings if needed.\n     * \u003c/pre\u003e\n     */\n    public function __construct($form = null);\n\n    /**\n     * Ignore any decomposition/composition.\n     *\n     * Ignoring Unicode decomposition/composition means nothing is automatically normalized.\n     * Many Linux- and BSD-filesystems do not normalize paths and filenames, but treat them as binary data.\n     * Apple™'s APFS filesystem treats paths and filenames as binary data.\n     *\n     * @var int\n     */\n    const NONE = 1;\n\n    /**\n     * Canonical decomposition.\n     *\n     *    “A normalization form that erases any canonical differences, and produces a\n     *    decomposed result. For example, ä is converted to a + umlaut in this form.\n     *    This form is most often used in internal processing, such as in collation.”\n     *\n     *    -- quoted from unicode glossary linked below\n     *\n     * @var int\n     *\n     * @see http://www.unicode.org/glossary/#normalization_form_d\n     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html\n     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html\n     */\n    const NFD = 2;\n\n    /**\n     * Compatibility decomposition.\n     *\n     *    “A normalization form that erases both canonical and compatibility differences,\n     *    and produces a decomposed result: for example, the single ǆ character is\n     *    converted to d + z + caron in this form.”\n     *\n     *    -- quoted from unicode glossary linked below\n     *\n     * @var int\n     *\n     * @see http://www.unicode.org/glossary/#normalization_form_kd\n     */\n    const NFKD = 3;\n\n    /**\n     * Canonical decomposition followed by canonical composition.\n     *\n     *    “A normalization form that erases any 
canonical differences, and generally produces\n     *    a composed result. For example, a + umlaut is converted to ä in this form. This form\n     *    most closely matches legacy usage.”\n     *\n     *    -- quoted from unicode glossary linked below\n     *\n     * W3C recommends NFC for HTML5 output and requires NFC for HTML5-compliant parser implementations.\n     *\n     * @var int\n     * @var int $FORM_C\n     *\n     * @see http://www.unicode.org/glossary/#normalization_form_c\n     */\n    const NFC = 4;\n\n    /**\n     * Compatibility Decomposition followed by Canonical Composition.\n     *\n     *    “A normalization form that erases both canonical and compatibility differences,\n     *    and generally produces a composed result: for example, the single ǆ character\n     *    is converted to d + ž in this form. This form is commonly used in matching.”\n     *\n     *    -- quoted from unicode glossary linked below\n     *\n     * @var int\n     * @var int $FORM_KC\n     *\n     * @see http://www.unicode.org/glossary/#normalization_form_kc\n     */\n    const NFKC = 5;\n\n    /**\n     * Apple™ Canonical decomposition for HFS Plus filesystems.\n     *\n     *    “For example, HFS Plus (OS X Extended) uses a variant of Normal Form D in\n     *    which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF\n     *    are not decomposed …”\n     *\n     *    -- quoted from Apple™'s Technical Q\u0026A 1173 linked below\n     *\n     *    “The characters with codes in the range u+2000 through u+2FFF are punctuation,\n     *    symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has\n     *    single characters for things like u+249c \"⒜\". The characters in this range are\n     *    not fully decomposed; they are left unchanged in HFS Plus strings. This allows\n     *    strings in Mac OS encodings to be converted to Unicode and back without loss of\n     *    information. 
This is not unnatural since a user would not necessarily expect a\n     *    dingbat \"⒜\" to be equivalent to the three character sequence \"(a)\" in a file name.\n     *\n     *    The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs,\n     *    and are not decomposed in HFS Plus strings.\n     *\n     *    So, for the example given earlier, u+00E9 (\"é\") must be stored as the two Unicode\n     *    characters u+0065 and u+0301 (in that order). The Unicode character u+00E9 (\"é\")\n     *    may not appear in a Unicode string used as part of an HFS Plus B-tree key.”\n     *\n     *    -- quoted from Apple™'s Technical Note 1150 linked below\n     *\n     * @var int\n     *\n     * @see NormalizerInterface::NFD\n     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html\n     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html\n     * @see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.html#CanonicalDecomposition\n     * @see https://opensource.apple.com/source/libiconv/libiconv-50/libiconv/lib/utf8mac.h.auto.html\n     */\n    const NFD_MAC = 18; // 0x02 (NFD) | 0x10 = 0x12 (18)\n\n    /**\n     * Set the default normalization form to the given value.\n     *\n     * @param int|string $form\n     *\n     * @see \\Sjorek\\UnicodeNormalization\\NormalizationUtility::parseForm()\n     *\n     * @throws \\Sjorek\\UnicodeNormalization\\Exception\\InvalidNormalizationForm\n     */\n    public function setForm($form);\n\n    /**\n     * Retrieve the current normalization-form constant.\n     *\n     * @return int\n     */\n    public function getForm();\n\n    /**\n     * Normalizes the input provided and returns the normalized string.\n     *\n     * @param string $input the input string to normalize\n     * @param int    $form  (optional) One of the normalization forms\n     *\n     * @throws \\Sjorek\\UnicodeNormalization\\Exception\\InvalidNormalizationForm\n   
  *\n     * @return string the normalized string or FALSE if an error occurred\n     *\n     * @see http://php.net/manual/en/normalizer.normalize.php\n     */\n    public function normalize($input, $form = null);\n\n    /**\n     * Checks if the provided string is already in the specified normalization form.\n     *\n     * @param string $input The input string to normalize\n     * @param int    $form  (optional) One of the normalization forms\n     *\n     * @throws \\Sjorek\\UnicodeNormalization\\Exception\\InvalidNormalizationForm\n     *\n     * @return bool TRUE if normalized, FALSE otherwise or if an error occurred\n     *\n     * @see http://php.net/manual/en/normalizer.isnormalized.php\n     */\n    public function isNormalized($input, $form = null);\n\n    /**\n     * Normalizes the $string provided to the given or default $form and returns the normalized string.\n     *\n     * Calls underlying implementation even if given $form is NONE, but finally it normalizes only if needed.\n     *\n     * @param string $input the string to normalize\n     * @param int    $form  (optional) normalization form to use, overriding the default\n     *\n     * @throws \\Sjorek\\UnicodeNormalization\\Exception\\InvalidNormalizationForm\n     *\n     * @return null|string Normalized string or null if an error occurred\n     */\n    public function normalizeTo($input, $form = null);\n\n    /**\n     * Normalizes the $string provided to the given or default $form and returns the normalized string.\n     *\n     * Does not call underlying implementation if given normalization is NONE and normalizes only if needed.\n     *\n     * @param string $input the string to normalize\n     * @param int    $form  (optional) normalization form to use, overriding the default\n     *\n     * @throws \\Sjorek\\UnicodeNormalization\\Exception\\InvalidNormalizationForm\n     *\n     * @return null|string Normalized string or null if an error occurred\n     */\n    public function 
normalizeStringTo($input, $form = null);\n\n    /**\n     * Get the supported unicode version level as version triple (\"X.Y.Z\").\n     *\n     * @return string\n     */\n    public static function getUnicodeVersion();\n\n    /**\n     * Get the supported unicode normalization forms as array.\n     *\n     * @return int[]\n     */\n    public static function getNormalizationForms();\n}\n```\n\n### Stream filtering\n\n```php\n\u003c?php\n\n/**\n * @var $stream        resource    The stream to filter.\n * @var $form          string      The form to normalize unicode to.\n * @var $read_write    int         (optional) STREAM_FILTER_* constant to override the filter injection point\n * @var $params        string|int  (optional) A normalization-form alias or value\n *\n * @link http://php.net/manual/en/function.stream-filter-append.php\n * @link http://php.net/manual/en/function.stream-filter-prepend.php\n */\nstream_filter_append($stream, \"convert.unicode-normalization.$form\"[, $read_write[, $params]]);\n```\n\nNote: Be careful when using on streams in `r+` or `w+` (or similar) modes; by default PHP will assign the\nfilter to both the reading and writing chain. 
This means it will attempt to convert the data twice - first when\nreading from the stream, and once again when writing to it.\n\n\n## Examples\n\n### Unicode Normalization\n\n```php\n\u003c?php\n\nuse Sjorek\\UnicodeNormalization\\Normalizer;\n\n$string = 'äöü';\n\n$normalizer = new Normalizer(Normalizer::NONE);\n$nfc = new Normalizer();\n$nfd = new Normalizer(Normalizer::NFD);\n$nfkc = new Normalizer('matching');\n\nvar_dump(\n    // yields false as form NONE is never normalized\n    $normalizer-\u003eisNormalized($string),\n\n    // yields true, as NFC is the default for utf8 in the web.\n    $nfc-\u003eisNormalized($string),\n\n    // yields false\n    $nfd-\u003eisNormalized($string),\n\n    // yields false\n    $nfkc-\u003eisNormalized($string),\n\n    // yields false\n    $normalizer-\u003eisNormalized($string, Normalizer::NFKD),\n\n    // yields true\n    $normalizer-\u003enormalize($string) === $string,\n\n    // yields true\n    $nfc-\u003enormalize($string) === $string,\n\n    // yields false\n    $nfd-\u003enormalize($string) === $string,\n\n    // yields true, as only combined characters (means two or more letters in one\n    // character, like the single ǆ character) are decomposed (for faster matching).\n    $nfkc-\u003enormalize($string) === $string,\n\n    Normalizer::getUnicodeVersion(),\n    Normalizer::getNormalizationForms()\n);\n\n```\n\n### Stream filtering\n\n```php\n\u003c?php\n\n$in_file = fopen('utf8-file.txt', 'r');\n$out_file = fopen('utf8-normalized-to-nfc-file.txt', 'w');\n\n// It works as a read filter:\nstream_filter_append($in_file, 'convert.unicode-normalization.NFC');\n\n// Normalization form may be given as fourth parameter:\n// stream_filter_append($in_file, 'convert.unicode-normalization', null, 'NFC');\n\n// And it also works as a write filter:\n// stream_filter_append($out_file, 'convert.unicode-normalization.NFC');\n\nstream_copy_to_stream($in_file, $out_file);\n```\n\n\n## Contributing\n\nLook at the [contribution 
guidelines](CONTRIBUTING.md)\n\n## Links\n\n### Status\n\n[![Build Status](https://img.shields.io/travis/sjorek/unicode-normalization.svg)](https://travis-ci.org/sjorek/unicode-normalization)\n\n\n### GitHub\n\n[![GitHub Issues](https://img.shields.io/github/issues/sjorek/unicode-normalization.svg)](https://github.com/sjorek/unicode-normalization/issues)\n[![GitHub Latest Tag](https://img.shields.io/github/tag/sjorek/unicode-normalization.svg)](https://github.com/sjorek/unicode-normalization/tags)\n[![GitHub Total Downloads](https://img.shields.io/github/downloads/sjorek/unicode-normalization/total.svg)](https://github.com/sjorek/unicode-normalization/releases)\n\n\n### Packagist\n\n[![Packagist Latest Stable Version](https://poser.pugx.org/sjorek/unicode-normalization/version)](https://packagist.org/packages/sjorek/unicode-normalization)\n[![Packagist Total Downloads](https://poser.pugx.org/sjorek/unicode-normalization/downloads)](https://packagist.org/packages/sjorek/unicode-normalization)\n[![Packagist Latest Unstable Version](https://poser.pugx.org/sjorek/unicode-normalization/v/unstable)](https://packagist.org/packages/sjorek/unicode-normalization)\n[![Packagist License](https://poser.pugx.org/sjorek/unicode-normalization/license)](https://packagist.org/packages/sjorek/unicode-normalization)\n\n\n### Social\n\n[![GitHub Forks](https://img.shields.io/github/forks/sjorek/unicode-normalization.svg?style=social)](https://github.com/sjorek/unicode-normalization/network)\n[![GitHub Stars](https://img.shields.io/github/stars/sjorek/unicode-normalization.svg?style=social)](https://github.com/sjorek/unicode-normalization/stargazers)\n[![GitHub 
Watchers](https://img.shields.io/github/watchers/sjorek/unicode-normalization.svg?style=social)](https://github.com/sjorek/unicode-normalization/watchers)\n[![Twitter](https://img.shields.io/twitter/url/https/github.com/sjorek/unicode-normalization.svg?style=social)](https://twitter.com/intent/tweet?url=https%3A%2F%2Fsjorek.github.io%2Funicode-normalization%2F)\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsjorek%2Funicode-normalization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsjorek%2Funicode-normalization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsjorek%2Funicode-normalization/lists"}