https://github.com/sjorek/unicode-normalization

# [Unicode-Normalization](https://sjorek.github.io/unicode-normalization/)

A [composer](http://getcomposer.org)-package providing an enhanced facade to existing unicode-normalization
implementations.

## Installation

```bash
php composer.phar require sjorek/unicode-normalization
```

## Usage

### Unicode Normalization

```php

<?php

// The opening of this listing was lost in extraction. The interface name below is
// taken from the "@see NormalizerInterface::NFD" reference further down.
interface NormalizerInterface
{
/**
* Values accepted for $form:
*
* - Disable unicode-normalization : 0, false, null, empty
* - Ignore/skip unicode-normalization : 1, NONE, true, binary, default, validate
* - Normalization form D : 2, NFD, FORM_D, D, form-d, decompose, collation
* - Normalization form D (mac) : 18, NFD_MAC, FORM_D_MAC, D_MAC, form-d-mac, d-mac, mac
* - Normalization form KD : 3, NFKD, FORM_KD, KD, form-kd
* - Normalization form C : 4, NFC, FORM_C, C, form-c, compose, recompose, legacy, html5
* - Normalization form KC : 5, NFKC, FORM_KC, KC, form-kc, matching
*
* Hints:
*
* - The W3C recommends NFC for HTML5 output.
* - Mac OS X's HFS+ filesystem uses an NFD variant to store paths. We provide one implementation
*   for this special variant, but plain NFD works in most cases too. Even if you use something
*   other than NFD or its variant, HFS+ will always store paths as decomposed NFD strings
*   where needed.
*/
public function __construct($form = null);

/**
* Ignore any decomposition/composition.
*
* Ignoring Unicode decomposition/composition means nothing is automatically normalized.
* Many Linux- and BSD-filesystems do not normalize paths and filenames, but treat them as binary data.
* Apple™'s APFS filesystem treats paths and filenames as binary data.
*
* @var int
*/
const NONE = 1;

/**
* Canonical decomposition.
*
* “A normalization form that erases any canonical differences, and produces a
* decomposed result. For example, ä is converted to a + umlaut in this form.
* This form is most often used in internal processing, such as in collation.”
*
* -- quoted from unicode glossary linked below
*
* @var int
*
* @see http://www.unicode.org/glossary/#normalization_form_d
* @see https://developer.apple.com/library/content/qa/qa1173/_index.html
* @see https://developer.apple.com/library/content/qa/qa1235/_index.html
*/
const NFD = 2;

/**
* Compatibility decomposition.
*
* “A normalization form that erases both canonical and compatibility differences,
* and produces a decomposed result: for example, the single dž character is
* converted to d + z + caron in this form.”
*
* -- quoted from unicode glossary linked below
*
* @var int
*
* @see http://www.unicode.org/glossary/#normalization_form_kd
*/
const NFKD = 3;

/**
* Canonical decomposition followed by canonical composition.
*
* “A normalization form that erases any canonical differences, and generally produces
* a composed result. For example, a + umlaut is converted to ä in this form. This form
* most closely matches legacy usage.”
*
* -- quoted from unicode glossary linked below
*
* W3C recommends NFC for HTML5 output and requires NFC for HTML5-compliant parser implementations.
*
* @var int
* @var int $FORM_C
*
* @see http://www.unicode.org/glossary/#normalization_form_c
*/
const NFC = 4;

/**
* Compatibility Decomposition followed by Canonical Composition.
*
* “A normalization form that erases both canonical and compatibility differences,
* and generally produces a composed result: for example, the single dž character
* is converted to d + ž in this form. This form is commonly used in matching.”
*
* -- quoted from unicode glossary linked below
*
* @var int
* @var int $FORM_KC
*
* @see http://www.unicode.org/glossary/#normalization_form_kc
*/
const NFKC = 5;

/**
* Apple™ Canonical decomposition for HFS Plus filesystems.
*
* “For example, HFS Plus (OS X Extended) uses a variant of Normal Form D in
* which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF
* are not decomposed …”
*
* -- quoted from Apple™'s Technical Q&A 1173 linked below
*
* “The characters with codes in the range u+2000 through u+2FFF are punctuation,
* symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has
* single characters for things like u+249c "⒜". The characters in this range are
* not fully decomposed; they are left unchanged in HFS Plus strings. This allows
* strings in Mac OS encodings to be converted to Unicode and back without loss of
* information. This is not unnatural since a user would not necessarily expect a
* dingbat "⒜" to be equivalent to the three character sequence "(a)" in a file name.
*
* The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs,
* and are not decomposed in HFS Plus strings.
*
* So, for the example given earlier, u+00E9 ("é") must be stored as the two Unicode
* characters u+0065 and u+0301 (in that order). The Unicode character u+00E9 ("é")
* may not appear in a Unicode string used as part of an HFS Plus B-tree key.”
*
* -- quoted from Apple™'s Technical Note 1150 linked below
*
* @var int
*
* @see NormalizerInterface::NFD
* @see https://developer.apple.com/library/content/qa/qa1173/_index.html
* @see https://developer.apple.com/library/content/qa/qa1235/_index.html
* @see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.html#CanonicalDecomposition
* @see https://opensource.apple.com/source/libiconv/libiconv-50/libiconv/lib/utf8mac.h.auto.html
*/
const NFD_MAC = 18; // 0x02 (NFD) | 0x10 = 0x12 (18)

/**
* Set the default normalization form to the given value.
*
* @param int|string $form
*
* @see \Sjorek\UnicodeNormalization\NormalizationUtility::parseForm()
*
* @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
*/
public function setForm($form);

/**
* Retrieve the current normalization-form constant.
*
* @return int
*/
public function getForm();

/**
* Normalizes the input provided and returns the normalized string.
*
* @param string $input the input string to normalize
* @param int $form (optional) One of the normalization forms
*
* @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
*
* @return string|false the normalized string, or FALSE if an error occurred
*
* @see http://php.net/manual/en/normalizer.normalize.php
*/
public function normalize($input, $form = null);

/**
* Checks if the provided string is already in the specified normalization form.
*
* @param string $input the input string to check
* @param int $form (optional) One of the normalization forms
*
* @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
*
* @return bool TRUE if normalized, FALSE otherwise or if an error occurred
*
* @see http://php.net/manual/en/normalizer.isnormalized.php
*/
public function isNormalized($input, $form = null);

/**
* Normalizes the $input provided to the given or default $form and returns the normalized string.
*
* Calls the underlying implementation even if the given $form is NONE, but normalizes only if needed.
*
* @param string $input the string to normalize
* @param int $form (optional) normalization form to use, overriding the default
*
* @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
*
* @return null|string Normalized string or null if an error occurred
*/
public function normalizeTo($input, $form = null);

/**
* Normalizes the $input provided to the given or default $form and returns the normalized string.
*
* Does not call the underlying implementation if the given $form is NONE, and normalizes only if needed.
*
* @param string $input the string to normalize
* @param int $form (optional) normalization form to use, overriding the default
*
* @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
*
* @return null|string Normalized string or null if an error occurred
*/
public function normalizeStringTo($input, $form = null);

/**
* Get the supported Unicode version level as a version triple ("X.Y.Z").
*
* @return string
*/
public static function getUnicodeVersion();

/**
* Get the supported Unicode normalization forms as an array.
*
* @return int[]
*/
public static function getNormalizationForms();
}
```
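
The glossary examples quoted in the constants above can be reproduced with PHP's built-in `Normalizer` class from the intl extension. The following is a minimal sketch that bypasses this package's facade entirely; it assumes the intl extension is loaded, and the variable names are illustrative only.

```php
<?php

// Sketch only: uses PHP's intl Normalizer directly, not this package's facade.
// Assumes the intl extension is available.

$composed = "\u{00E4}"; // "ä" as a single code point
$digraph  = "\u{01C6}"; // "dž" as a single code point

// NFD: canonical decomposition, "ä" becomes "a" + combining diaeresis (U+0308).
$nfd = Normalizer::normalize($composed, Normalizer::FORM_D);

// NFC: canonical composition, "a" + U+0308 becomes the single "ä" again.
$nfc = Normalizer::normalize($nfd, Normalizer::FORM_C);

// NFKD: compatibility decomposition, "dž" becomes "d" + "z" + combining caron (U+030C).
$nfkd = Normalizer::normalize($digraph, Normalizer::FORM_KD);

// NFKC: compatibility decomposition followed by composition, "dž" becomes "d" + "ž" (U+017E).
$nfkc = Normalizer::normalize($digraph, Normalizer::FORM_KC);

// Print the UTF-8 bytes so the differences are visible even when glyphs look identical.
foreach (['NFD' => $nfd, 'NFC' => $nfc, 'NFKD' => $nfkd, 'NFKC' => $nfkc] as $name => $value) {
    printf("%-4s %s\n", $name, bin2hex($value));
}
```

Comparing the byte dumps rather than the rendered strings makes it clear why two visually identical paths or filenames can fail a plain `===` comparison.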

### Normalization example

```php
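<?php

// NOTE: the opening of this example was lost in extraction. The setup below is a
// reconstruction: the class namespace, the sample string, and the var_dump() wrapper
// are assumptions, not the author's original code.
use Sjorek\UnicodeNormalization\Normalizer;

$string = 'äöü'; // assumed sample input, NFC-encoded
$normalizer = new Normalizer(); // default form
$nfc = new Normalizer(Normalizer::NFC);
$nfd = new Normalizer(Normalizer::NFD);
$nfkc = new Normalizer(Normalizer::NFKC);

var_dump(
$normalizer->isNormalized($string),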

// yields true, as NFC is the default for UTF-8 on the web.
$nfc->isNormalized($string),

// yields false
$nfd->isNormalized($string),

// yields false
$nfkc->isNormalized($string),

// yields false
$normalizer->isNormalized($string, Normalizer::NFKD),

// yields true
$normalizer->normalize($string) === $string,

// yields true
$nfc->normalize($string) === $string,

// yields false
$nfd->normalize($string) === $string,

// yields true, as only compatibility characters (two or more letters in one
// character, such as the single dž character) are decomposed (for faster matching).
$nfkc->normalize($string) === $string,

Normalizer::getUnicodeVersion(),
Normalizer::getNormalizationForms()
);

```
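
The comparisons above suggest a "check first, normalize only if needed" pattern, which is also how `normalizeStringTo()` is documented. Below is a rough sketch of that idea written against PHP's intl `Normalizer` rather than this package. `normalizeIfNeeded()` and `FORM_NONE` are hypothetical names introduced for illustration and are not part of the library; the intl extension is assumed to be loaded.

```php
<?php

// Hypothetical stand-in for NormalizerInterface::NONE (value 1 in the listing above).
const FORM_NONE = 1;

/**
 * Hypothetical helper: normalize $input to $form only when it is not already
 * in that form. Returns null if the underlying intl call fails.
 */
function normalizeIfNeeded(string $input, int $form): ?string
{
    // NONE means "treat the string as binary data": return it untouched
    // without calling the underlying implementation.
    if ($form === FORM_NONE) {
        return $input;
    }

    // Skip the normalization pass when the input already conforms.
    if (Normalizer::isNormalized($input, $form)) {
        return $input;
    }

    // intl signals failure with false; map that to null.
    $result = Normalizer::normalize($input, $form);

    return $result === false ? null : $result;
}

$decomposed = "a\u{0308}"; // "a" + combining diaeresis

var_dump(
    normalizeIfNeeded($decomposed, FORM_NONE) === $decomposed,
    normalizeIfNeeded($decomposed, Normalizer::FORM_C) === "\u{00E4}"
);
```

Returning the input unchanged for the "none" case keeps the helper safe for filesystems that treat paths as binary data, at the cost of not validating the string at all.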

### Stream filtering

```php