https://github.com/composewell/unicode-transforms
Fast Unicode normalization in Haskell
- Host: GitHub
- URL: https://github.com/composewell/unicode-transforms
- Owner: composewell
- License: bsd-3-clause
- Created: 2016-03-23T15:13:39.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2025-02-05T07:08:09.000Z (12 months ago)
- Last Synced: 2025-05-08T23:48:12.358Z (9 months ago)
- Topics: haskell, haskell-library, unicode, unicode-normalization
- Language: Haskell
- Homepage:
- Size: 29.5 MB
- Stars: 47
- Watchers: 8
- Forks: 16
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: Changelog.md
- License: LICENSE
# Unicode Transforms
[Hackage](https://hackage.haskell.org/package/unicode-transforms)
[Travis CI](https://travis-ci.com/composewell/unicode-transforms)
[AppVeyor](https://ci.appveyor.com/project/harendra-kumar/unicode-transforms)
[Coveralls](https://coveralls.io/github/composewell/unicode-transforms?branch=master)
Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).
## What is normalization?
Unicode characters with adornments (e.g. Á) can be represented in two different
forms, as a single composed character (U+00C1 = Á) or as multiple decomposed
characters (U+0041(A) U+0301( ́ ) = Á). They are differently encoded byte
sequences but for humans they have exactly the same visual appearance.
A plain byte comparison may report that two strings are different even though
they are equivalent. To compare them for equivalence, both strings must first
be converted into the same [`normalized`](http://unicode.org/reports/tr15/)
form, using data from the [Unicode
Character Database](http://www.unicode.org/Public/UCD/latest/). For example:
```
>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
```
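The same comparison can be written as a compiled program. The following is a minimal sketch, assuming the `text` and `unicode-transforms` packages are available. `canonicallyEqual` is a hypothetical helper introduced here for illustration and is not part of the package; only `normalize` and the `NormalizationMode` constructors come from `Data.Text.Normalize`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (..), normalize)

-- Hypothetical helper (not part of unicode-transforms): two texts are
-- treated as equal when their NFC forms are identical.
canonicallyEqual :: Text -> Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

main :: IO ()
main = do
  let composed, decomposed :: Text
      composed   = "\193"     -- U+00C1
      decomposed = "\65\769"  -- U+0041 followed by U+0301
  -- Plain comparison of the two code point sequences.
  print (composed == decomposed)
  -- Comparison after normalizing both sides to NFC.
  print (canonicallyEqual composed decomposed)
  -- Number of code points once the composed form is decomposed (NFD).
  print (T.length (normalize NFD composed))
  -- Compatibility normalization (NFKC) of the ligature U+FB01.
  print (normalize NFKC "\64257" == ("fi" :: Text))
```

NFC and NFD only apply canonical mappings, while NFKC and NFKD also apply compatibility mappings, so which mode is appropriate depends on whether characters such as ligatures should compare equal to their plain-letter spellings.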
## Performance
The table below compares the normalization performance of this package
(v0.3.7) with that of the
[text-icu](http://hackage.haskell.org/package/text-icu) package, which uses
the [ICU C++ library](http://site.icu-project.org/download), version
ICU4C 65.1, on macOS. The benchmarks measure the time taken, in
milliseconds, to normalize files in different languages and normalization
forms with each package. `unicode-transforms` is faster than ICU in 19 of
the 28 benchmarks; a positive `% Diff` means `unicode-transforms` was faster.
```
Benchmark        unicode-transforms(ms)  ICU(ms)   % Diff
---------------  ----------------------  -------  -------
NFKD/Korean                        7.78    37.10  +376.87
NFD/Korean                         7.86    37.06  +371.50
NFKD/Vietnamese                    6.85    12.48   +82.20
NFKD/Deutsch                       2.17     3.55   +63.30
NFKD/English                       1.71     2.78   +62.30
NFKC/Korean                        4.77     7.65   +60.28
NFD/Deutsch                        2.24     3.53   +57.41
NFD/English                        1.76     2.77   +57.32
NFC/Vietnamese                    10.66    16.63   +56.00
NFKC/Vietnamese                   10.95    16.58   +51.43
NFD/Devanagari                     6.48     8.68   +34.10
NFC/Devanagari                     6.77     8.49   +25.48
NFD/AllChars                       6.18     7.41   +19.91
NFD/Japanese                       7.80     9.20   +17.99
NFKC/Devanagari                    7.33     8.48   +15.74
NFKD/Japanese                      8.71    10.05   +15.39
NFD/Vietnamese                     5.94     6.83   +14.99
NFKD/Devanagari                    7.59     8.68   +14.27
NFKD/AllChars                      9.80    10.66    +8.82
NFKC/Deutsch                       3.21     3.18    -0.72
NFC/Korean                         4.62     4.38    -5.35
NFKC/English                       2.21     2.06    -6.88
NFC/English                        2.19     2.04    -7.21
NFKC/AllChars                     14.67     9.75   -50.51
NFC/Deutsch                        3.02     1.95   -54.39
NFKC/Japanese                     12.46     5.42  -129.93
NFC/AllChars                       9.72     3.58  -171.63
NFC/Japanese                      11.90     3.04  -292.04
```
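The README does not say how the `% Diff` column is computed. The published numbers are consistent with a signed difference taken relative to the faster of the two timings, and the sketch below works under that assumption. `percentDiff` and `samples` are hypothetical names introduced for illustration. Because the timings in the table are rounded to two decimals, recomputed values can differ from the table in the last digits.

```haskell
import Text.Printf (printf)

-- Signed percentage difference relative to the faster timing.
-- Assumption: this formula is inferred from the table, not taken from
-- the benchmark code. Positive means unicode-transforms was faster.
percentDiff :: Double -> Double -> Double
percentDiff ut icu
  | ut <= icu = (icu - ut) / ut * 100
  | otherwise = negate ((ut - icu) / icu * 100)

-- A few rows copied from the table:
-- (benchmark, unicode-transforms ms, ICU ms).
samples :: [(String, Double, Double)]
samples =
  [ ("NFKD/Korean", 7.78, 37.10)
  , ("NFKD/Vietnamese", 6.85, 12.48)
  , ("NFKC/AllChars", 14.67, 9.75)
  , ("NFC/Japanese", 11.90, 3.04)
  ]

main :: IO ()
main = mapM_ report samples
  where
    -- Print the benchmark name and the recomputed signed difference.
    report :: (String, Double, Double) -> IO ()
    report (name, ut, icu) =
      printf "%-16s %+8.2f\n" name (percentDiff ut icu)
```

Read this way, `+82.20` for NFKD/Vietnamese means ICU took about 82% longer than `unicode-transforms`, while `-50.51` for NFKC/AllChars means `unicode-transforms` took about 50% longer than ICU.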
## Talks
* [Functional Conf 2018 Video](https://www.youtube.com/watch?v=aJvwORrBJ0o) | [Functional Conf 2018 Slides](https://www.slideshare.net/HarendraKumar10/high-performance-haskell)
## Contributing
Please use https://github.com/harendra-kumar/unicode-transforms to raise
issues or send pull requests.