https://github.com/sjorek/unicode-normalization

An enhanced facade to existing unicode-normalization implementations.
Host: GitHub
URL: https://github.com/sjorek/unicode-normalization
Owner: sjorek
License: bsd-3-clause
Created: 2018-02-14T04:57:51.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2018-03-25T16:52:48.000Z (almost 8 years ago)
Last Synced: 2025-03-11T21:44:24.716Z (12 months ago)
Topics: composer, composer-package, php, stream-filter, unicode
Language: PHP
Homepage: https://sjorek.github.io/unicode-normalization/
Size: 7.32 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# [Unicode-Normalization](https://sjorek.github.io/unicode-normalization/)

A [composer](http://getcomposer.org)-package providing an enhanced facade to existing unicode-normalization implementations.

## Installation

```bash
php composer.phar require sjorek/unicode-normalization
```

## Usage

### Unicode Normalization

```php
<?php

interface NormalizerInterface
{
    /**
     * Create a normalizer for the given normalization form.
     *
     * Accepted values for $form, with their aliases:
     *
     * - Disable unicode-normalization     : 0,  false, null, empty
     * - Ignore/skip unicode-normalization : 1,  NONE, true, binary, default, validate
     * - Normalization form D              : 2,  NFD, FORM_D, D, form-d, decompose, collation
     * - Normalization form D (mac)        : 18, NFD_MAC, FORM_D_MAC, D_MAC, form-d-mac, d-mac, mac
     * - Normalization form KD             : 3,  NFKD, FORM_KD, KD, form-kd
     * - Normalization form C              : 4,  NFC, FORM_C, C, form-c, compose, recompose, legacy, html5
     * - Normalization form KC             : 5,  NFKC, FORM_KC, KC, form-kc, matching
     *
     * Hints:
     *
     * - The W3C recommends NFC for HTML5 output.
     * - Mac OS X's HFS+ filesystem uses an NFD variant to store paths. We provide one implementation for this
     *   special variant, but plain NFD works in most cases too. Even if you use something other than NFD or its
     *   variant, HFS+ will always use decomposed NFD path-strings if needed.
     */
    public function __construct($form = null);

    /**
     * Ignore any decomposition/composition.
     *
     * Ignoring Unicode decomposition/composition means nothing is automatically normalized.
     * Many Linux- and BSD-filesystems do not normalize paths and filenames, but treat them as binary data.
     * Apple™'s APFS filesystem treats paths and filenames as binary data.
     *
     * @var int
     */
    const NONE = 1;

    /**
     * Canonical decomposition.
     *
     *    “A normalization form that erases any canonical differences, and produces a
     *    decomposed result. For example, ä is converted to a + umlaut in this form.
     *    This form is most often used in internal processing, such as in collation.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * @var int
     *
     * @see http://www.unicode.org/glossary/#normalization_form_d
     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html
     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html
     */
    const NFD = 2;

    /**
     * Compatibility decomposition.
     *
     *    “A normalization form that erases both canonical and compatibility differences,
     *    and produces a decomposed result: for example, the single ǆ character is
     *    converted to d + z + caron in this form.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * @var int
     *
     * @see http://www.unicode.org/glossary/#normalization_form_kd
     */
    const NFKD = 3;

    /**
     * Canonical decomposition followed by canonical composition.
     *
     *    “A normalization form that erases any canonical differences, and generally produces
     *    a composed result. For example, a + umlaut is converted to ä in this form. This form
     *    most closely matches legacy usage.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * W3C recommends NFC for HTML5 output and requires NFC for HTML5-compliant parser implementations.
     *
     * @var int
     * @var int $FORM_C
     *
     * @see http://www.unicode.org/glossary/#normalization_form_c
     */
    const NFC = 4;

    /**
     * Compatibility Decomposition followed by Canonical Composition.
     *
     *    “A normalization form that erases both canonical and compatibility differences,
     *    and generally produces a composed result: for example, the single ǆ character
     *    is converted to d + ž in this form. This form is commonly used in matching.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * @var int
     * @var int $FORM_KC
     *
     * @see http://www.unicode.org/glossary/#normalization_form_kc
     */
    const NFKC = 5;

    /**
     * Apple™ Canonical decomposition for HFS Plus filesystems.
     *
     *    “For example, HFS Plus (OS X Extended) uses a variant of Normal Form D in
     *    which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF
     *    are not decomposed …”
     *
     *    -- quoted from Apple™'s Technical Q&A 1173 linked below
     *
     *    “The characters with codes in the range u+2000 through u+2FFF are punctuation,
     *    symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has
     *    single characters for things like u+249c "⒜". The characters in this range are
     *    not fully decomposed; they are left unchanged in HFS Plus strings. This allows
     *    strings in Mac OS encodings to be converted to Unicode and back without loss of
     *    information. This is not unnatural since a user would not necessarily expect a
     *    dingbat "⒜" to be equivalent to the three character sequence "(a)" in a file name.
     *
     *    The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs,
     *    and are not decomposed in HFS Plus strings.
     *
     *    So, for the example given earlier, u+00E9 ("é") must be stored as the two Unicode
     *    characters u+0065 and u+0301 (in that order). The Unicode character u+00E9 ("é")
     *    may not appear in a Unicode string used as part of an HFS Plus B-tree key.”
     *
     *    -- quoted from Apple™'s Technical Note 1150 linked below
     *
     * @var int
     *
     * @see NormalizerInterface::NFD
     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html
     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html
     * @see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.html#CanonicalDecomposition
     * @see https://opensource.apple.com/source/libiconv/libiconv-50/libiconv/lib/utf8mac.h.auto.html
     */
    const NFD_MAC = 18; // 0x02 (NFD) | 0x10 = 0x12 (18)

    /**
     * Set the default normalization form to the given value.
     *
     * @param int|string $form
     *
     * @see \Sjorek\UnicodeNormalization\NormalizationUtility::parseForm()
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     */
    public function setForm($form);

    /**
     * Retrieve the current normalization-form constant.
     *
     * @return int
     */
    public function getForm();

    /**
     * Normalizes the input provided and returns the normalized string.
     *
     * @param string $input the input string to normalize
     * @param int    $form  (optional) One of the normalization forms
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return string|false the normalized string or FALSE if an error occurred
     *
     * @see http://php.net/manual/en/normalizer.normalize.php
     */
    public function normalize($input, $form = null);

    /**
     * Checks if the provided string is already in the specified normalization form.
     *
     * @param string $input the input string to check
     * @param int    $form  (optional) One of the normalization forms
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return bool TRUE if normalized, FALSE otherwise or if an error occurred
     *
     * @see http://php.net/manual/en/normalizer.isnormalized.php
     */
    public function isNormalized($input, $form = null);

    /**
     * Normalizes the $input provided to the given or default $form and returns the normalized string.
     *
     * Calls the underlying implementation even if the given $form is NONE, but normalizes only if needed.
     *
     * @param string $input the string to normalize
     * @param int    $form  (optional) normalization form to use, overriding the default
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return null|string Normalized string or null if an error occurred
     */
    public function normalizeTo($input, $form = null);

    /**
     * Normalizes the $input provided to the given or default $form and returns the normalized string.
     *
     * Does not call the underlying implementation if the given $form is NONE, and normalizes only if needed.
     *
     * @param string $input the string to normalize
     * @param int    $form  (optional) normalization form to use, overriding the default
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return null|string Normalized string or null if an error occurred
     */
    public function normalizeStringTo($input, $form = null);

    /**
     * Get the supported unicode version level as version triple ("X.Y.Z").
     *
     * @return string
     */
    public static function getUnicodeVersion();

    /**
     * Get the supported unicode normalization forms as array.
     *
     * @return int[]
     */
    public static function getNormalizationForms();
}
```
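
The alias table in the constructor documentation can be read as a lookup from loosely typed input to one of the integer form constants. The sketch below illustrates that idea in isolation. `resolveFormAlias()` and its alias map are hypothetical and exist only for this example; the package's own parser is `NormalizationUtility::parseForm()`, which may behave differently. Treating "empty" as the empty string and matching aliases case-insensitively are both assumptions.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the package): map a form value or alias
 * to its integer constant. Returns 0 when normalization is disabled.
 */
function resolveFormAlias($form): int
{
    // Aliases copied from the table above, lower-cased for lookup.
    static $aliases = [
        1  => ['none', 'binary', 'default', 'validate'],
        2  => ['nfd', 'form_d', 'd', 'form-d', 'decompose', 'collation'],
        18 => ['nfd_mac', 'form_d_mac', 'd_mac', 'form-d-mac', 'd-mac', 'mac'],
        3  => ['nfkd', 'form_kd', 'kd', 'form-kd'],
        4  => ['nfc', 'form_c', 'c', 'form-c', 'compose', 'recompose', 'legacy', 'html5'],
        5  => ['nfkc', 'form_kc', 'kc', 'form-kc', 'matching'],
    ];

    // "Disable" row: 0, false, null, empty (assumed to mean the empty string).
    if ($form === null || $form === false || $form === '' || $form === 0 || $form === '0') {
        return 0;
    }
    // "Ignore/skip" row: boolean true maps to NONE.
    if ($form === true) {
        return 1;
    }
    // Integers and numeric strings must match a known constant.
    if (is_int($form) || (is_string($form) && preg_match('/^[0-9]+$/', $form) === 1)) {
        $value = (int) $form;
        if (isset($aliases[$value])) {
            return $value;
        }
        throw new InvalidArgumentException("Unknown normalization form: {$form}");
    }
    if (!is_string($form)) {
        throw new InvalidArgumentException('Unsupported normalization form type: ' . gettype($form));
    }
    // Textual aliases, compared case-insensitively (an assumption).
    $needle = strtolower($form);
    foreach ($aliases as $value => $names) {
        if (in_array($needle, $names, true)) {
            return $value;
        }
    }
    throw new InvalidArgumentException("Unknown normalization form: {$form}");
}

var_dump(resolveFormAlias('html5'));              // int(4)
var_dump(resolveFormAlias('NFD_MAC'));            // int(18)
var_dump((resolveFormAlias('mac') & 0x02) === 2); // bool(true)
```

The last line reflects the comment on `NFD_MAC`: its value 18 is the NFD value 2 combined with the flag 0x10, so the NFD bit can still be tested with a bitwise AND.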

### Checking and normalizing strings

```php
<?php

// $string holds the UTF-8 text under test; $normalizer, $nfc, $nfd and $nfkc
// are previously created Normalizer instances.
var_dump(
    $normalizer->isNormalized($string),
    // yields true, as NFC is the default for UTF-8 on the web.
    $nfc->isNormalized($string),
    // yields false
    $nfd->isNormalized($string),
    // yields false
    $nfkc->isNormalized($string),
    // yields false
    $normalizer->isNormalized($string, Normalizer::NFKD),
    // yields true
    $normalizer->normalize($string) === $string,
    // yields true
    $nfc->normalize($string) === $string,
    // yields false
    $nfd->normalize($string) === $string,
    // yields true, as only combined characters (means two or more letters in one
    // character, like the single ǆ character) are decomposed (for faster matching).
    $nfkc->normalize($string) === $string,
    Normalizer::getUnicodeVersion(),
    Normalizer::getNormalizationForms()
);
```
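
For comparison, the glossary examples quoted in the interface documentation (ä and ǆ) can be reproduced with PHP's native `Normalizer` class from the intl extension, whose `normalize()` and `isNormalized()` methods the interface documentation links to. This is a minimal sketch that assumes the intl extension is loaded. It uses only the native class, not this package, and `codePoints()` is a helper made up for the example.

```php
<?php

// Assumes ext-intl; \Normalizer here is PHP's native class, not this package.
if (!class_exists(\Normalizer::class)) {
    exit("The intl extension is required for this example.\n");
}

/** Hypothetical helper: list the code points of a UTF-8 string as U+XXXX. */
function codePoints(string $text): string
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return implode(' ', array_map(
        static fn (string $char): string => sprintf('U+%04X', \IntlChar::ord($char)),
        $chars
    ));
}

$umlaut  = "\u{00E4}"; // ä, precomposed
$digraph = "\u{01C6}"; // ǆ, a single compatibility character

// NFD: ä becomes a + combining diaeresis.
echo codePoints(\Normalizer::normalize($umlaut, \Normalizer::FORM_D)), "\n";   // U+0061 U+0308
// NFKD: ǆ becomes d + z + combining caron.
echo codePoints(\Normalizer::normalize($digraph, \Normalizer::FORM_KD)), "\n"; // U+0064 U+007A U+030C
// NFKC: ǆ becomes d + ž.
echo codePoints(\Normalizer::normalize($digraph, \Normalizer::FORM_KC)), "\n"; // U+0064 U+017E
// NFC only erases canonical differences, so ǆ is left unchanged.
echo codePoints(\Normalizer::normalize($digraph, \Normalizer::FORM_C)), "\n";  // U+01C6

var_dump(\Normalizer::isNormalized($umlaut, \Normalizer::FORM_C)); // bool(true)
var_dump(\Normalizer::isNormalized($umlaut, \Normalizer::FORM_D)); // bool(false)
```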

### Stream filtering
