Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/composewell/unicode-transforms
Fast Unicode normalization in Haskell
https://github.com/composewell/unicode-transforms
haskell haskell-library unicode unicode-normalization
Last synced: 7 days ago
JSON representation
Fast Unicode normalization in Haskell
- Host: GitHub
- URL: https://github.com/composewell/unicode-transforms
- Owner: composewell
- License: bsd-3-clause
- Created: 2016-03-23T15:13:39.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2023-10-25T03:26:27.000Z (over 1 year ago)
- Last Synced: 2024-04-25T22:30:40.942Z (9 months ago)
- Topics: haskell, haskell-library, unicode, unicode-normalization
- Language: Haskell
- Homepage:
- Size: 29.5 MB
- Stars: 47
- Watchers: 9
- Forks: 15
- Open Issues: 10
-
Metadata Files:
- Readme: README.md
- Changelog: Changelog.md
- License: LICENSE
Awesome Lists containing this project
README
# Unicode Transforms
[![Hackage](https://img.shields.io/hackage/v/unicode-transforms.svg?style=flat)](https://hackage.haskell.org/package/unicode-transforms)
[![Build Status](https://travis-ci.com/composewell/unicode-transforms.svg?branch=master)](https://travis-ci.com/composewell/unicode-transforms)
[![Windows Build status](https://ci.appveyor.com/api/projects/status/5wov8m1m0asvbv32?svg=true)](https://ci.appveyor.com/project/harendra-kumar/unicode-transforms)
[![Coverage Status](https://coveralls.io/repos/composewell/unicode-transforms/badge.svg?branch=master&service=github)](https://coveralls.io/github/composewell/unicode-transforms?branch=master)Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).
## What is normalization?
Unicode characters with adornments (e.g. Á) can be represented in two different
forms, as a single composed character (U+00C1 = Á) or as multiple decomposed
characters (U+0041(A) U+0301( ́ ) = Á). They are differently encoded byte
sequences but for humans they have exactly the same visual appearance.A regular byte comparison may tell that two strings are different even though
they might be equivalent. We need to convert both the strings in a
[`normalized`](http://unicode.org/reports/tr15/) form using the [Unicode
Character Database](http://www.unicode.org/Public/UCD/latest/) before we can
compare them for equivalence. For example:
```
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
```## Performance
Normalization performance comparison of this package (v0.3.7) with
the [text-icu](http://hackage.haskell.org/package/text-icu) package
using the [ICU C++ library](http://site.icu-project.org/download)
version ICU4C 65.1 on macOS. The benchmarks compare the time taken in
milliseconds to normalize files in different languages and normalization
forms using both the packages. In most cases `unicode-transforms`
outperforms ICU.```
Benchmark unicode-transforms(ms) ICU(ms) % Diff
--------------- ---------------------- ------- --------
NFKD/Korean 7.78 37.10 +376.87
NFD/Korean 7.86 37.06 +371.50
NFKD/Vietnamese 6.85 12.48 +82.20
NFKD/Deutsch 2.17 3.55 +63.30
NFKD/English 1.71 2.78 +62.30
NFKC/Korean 4.77 7.65 +60.28
NFD/Deutsch 2.24 3.53 +57.41
NFD/English 1.76 2.77 +57.32
NFC/Vietnamese 10.66 16.63 +56.00
NFKC/Vietnamese 10.95 16.58 +51.43
NFD/Devanagari 6.48 8.68 +34.10
NFC/Devanagari 6.77 8.49 +25.48
NFD/AllChars 6.18 7.41 +19.91
NFD/Japanese 7.80 9.20 +17.99
NFKC/Devanagari 7.33 8.48 +15.74
NFKD/Japanese 8.71 10.05 +15.39
NFD/Vietnamese 5.94 6.83 +14.99
NFKD/Devanagari 7.59 8.68 +14.27
NFKD/AllChars 9.80 10.66 +8.82
NFKC/Deutsch 3.21 3.18 -0.72
NFC/Korean 4.62 4.38 -5.35
NFKC/English 2.21 2.06 -6.88
NFC/English 2.19 2.04 -7.21
NFKC/AllChars 14.67 9.75 -50.51
NFC/Deutsch 3.02 1.95 -54.39
NFKC/Japanese 12.46 5.42 -129.93
NFC/AllChars 9.72 3.58 -171.63
NFC/Japanese 11.90 3.04 -292.04
```## Talks
* Talks: [Functional Conf 2018 Video](https://www.youtube.com/watch?v=aJvwORrBJ0o) | [Functional Conf 2018 Slides](https://www.slideshare.net/HarendraKumar10/high-performance-haskell)
## Contributing
Please use https://github.com/harendra-kumar/unicode-transforms to raise
issues, or send pull requests.