{"id":16880194,"url":"https://github.com/jgm/unicode-collation","last_synced_at":"2025-03-22T07:32:11.906Z","repository":{"id":45786728,"uuid":"354396827","full_name":"jgm/unicode-collation","owner":"jgm","description":"Haskell implementation of the Unicode Collation Algorithm","archived":false,"fork":false,"pushed_at":"2023-12-20T19:14:15.000Z","size":2839,"stargazers_count":16,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-05-09T14:05:45.155Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Haskell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jgm.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["jgm"]}},"created_at":"2021-04-03T21:21:02.000Z","updated_at":"2024-06-18T03:24:16.943Z","dependencies_parsed_at":"2024-06-18T03:24:15.573Z","dependency_job_id":"2cd6f39e-acbc-42d0-a721-4777c0503123","html_url":"https://github.com/jgm/unicode-collation","commit_stats":{"total_commits":255,"total_committers":3,"mean_commits":85.0,"dds":0.0117647058823529,"last_synced_commit":"d31128a41c26adda8d2743b818609437ea0c28c5"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgm%2Funicode-collation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgm%2Funicode-collation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgm%2Funicode-collation/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jgm%2Funicode-collation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jgm","download_url":"https://codeload.github.com/jgm/unicode-collation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244925175,"owners_count":20532873,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T15:57:49.345Z","updated_at":"2025-03-22T07:32:11.151Z","avatar_url":"https://github.com/jgm.png","language":"Haskell","readme":"# unicode-collation\n\n[![GitHub\nCI](https://github.com/jgm/unicode-collation/workflows/CI%20tests/badge.svg)](https://github.com/jgm/unicode-collation/actions)\n[![Hackage](https://img.shields.io/hackage/v/unicode-collation.svg?logo=haskell)](https://hackage.haskell.org/package/unicode-collation)\n[![BSD-2-Clause license](https://img.shields.io/badge/license-BSD--2--Clause-blue.svg)](LICENSE)\n\nHaskell implementation of [unicode collation algorithm].\n\n[unicode collation algorithm]:  https://www.unicode.org/reports/tr10\n\n## Motivation\n\nPreviously there was no way to do correct unicode collation\n(sorting) in Haskell without depending on the C library `icu`\nand the barely maintained Haskell wrapper `text-icu`.  
This\nlibrary offers a pure Haskell solution.\n\n## Conformance\n\nThe library passes all UCA conformance tests.\n\nLocalized collations have not been tested as extensively.\n\n## Performance\n\nAs might be expected, this library is slower than `text-icu`,\nwhich wraps a heavily optimized C library.  How much slower\ndepends quite a bit on the input.\n\nOn a sample of ten thousand random Unicode strings, we get a\nfactor of about 3:\n\n```\n  sort a list of 10000 random Texts (en):\n    5.9 ms ± 487 μs,  22 MB allocated, 899 KB copied\n  sort same list with text-icu (en):\n    2.1 ms ±  87 μs, 7.1 MB allocated, 148 KB copied\n```\n\nPerformance is worse on a sample drawn from a smaller character\nset including predominantly composed accented letters, which must\nbe decomposed as part of the algorithm:\n\n```\n  sort a list of 10000 Texts (composed latin) (en):\n     12 ms ± 1.1 ms,  34 MB allocated, 910 KB copied\n  sort same list with text-icu (en):\n    2.3 ms ±  56 μs, 7.0 MB allocated, 146 KB copied\n```\n\nMuch of the impact here comes from normalization (decomposition).\nIf we use a pre-normalized sample and disable normalization\nin the collator, it's much faster:\n\n```\n  sort same list but pre-normalized (en-u-kk-false):\n    5.4 ms ± 168 μs,  19 MB allocated, 909 KB copied\n```\n\nOn plain ASCII, we get a factor of 3 again:\n\n```\n  sort a list of 10000 ASCII Texts (en):\n    4.6 ms ± 405 μs,  17 MB allocated, 880 KB copied\n  sort same list with text-icu (en):\n    1.6 ms ± 114 μs, 6.2 MB allocated, 130 KB copied\n```\n\nNote that this library does incremental normalization,\nso when strings can mostly be distinguished on the basis\nof the first two characters, as in the first sample, the\nimpact is much less.  
On the other hand, performance is\nmuch slower on a sample of texts which differ only after\nthe first 32 characters:\n\n```\n  sort a list of 10000 random Texts that agree in first 32 chars:\n    116 ms ± 8.6 ms, 430 MB allocated, 710 KB copied\n  sort same list with text-icu (en):\n    3.2 ms ± 251 μs, 8.8 MB allocated, 222 KB copied\n```\n\nHowever, in the special case where the texts are identical,\nthe algorithm can be short-circuited entirely and sorting\nis very fast:\n\n```\n  sort a list of 10000 identical Texts (en):\n    877 μs ±  54 μs, 462 KB allocated, 9.7 KB copied\n```\n\n## Localized collations\n\nThe following localized collations are available.\nFor languages not listed here, the root collation is\nused.\n\n```\naf\nar\nas\naz\nbe\nbn\nca\ncs\ncu\ncy\nda\nde-AT-u-co-phonebk\nde-u-co-phonebk\ndsb\nee\neo\nes\nes-u-co-trad\net\nfa\nfi\nfi-u-co-phonebk\nfil\nfo\nfr-CA\ngu\nha\nhaw\nhe\nhi\nhr\nhu\nhy\nig\nis\nja\nkk\nkl\nkn\nko\nkok\nlkt\nln\nlt\nlv\nmk\nml\nmr\nmt\nnb\nnn\nnso\nom\nor\npa\npl\nro\nsa\nse\nsi\nsi-u-co-dict\nsk\nsl\nsq\nsr\nsv\nsv-u-co-reformed\nta\nte\nth\ntn\nto\ntr\nug-Cyrl\nuk\nur\nvi\nvo\nwae\nwo\nyo\nzh\nzh-u-co-big5han\nzh-u-co-gb2312\nzh-u-co-pinyin\nzh-u-co-stroke\nzh-u-co-zhuyin\n```\n\nCollation reordering (e.g. 
`[reorder Latn Kana Hani]`)\nis not supported.\n\n## Data files\n\nVersion 13.0.0 of the Unicode data is used:\n\u003chttp://www.unicode.org/Public/UCA/13.0.0/\u003e\n\nLocale-specific tailorings are derived from the Perl\nmodule Unicode::Collate:\nhttps://cpan.metacpan.org/authors/id/S/SA/SADAHIRO/Unicode-Collate-1.29.tar.gz\n\n## Executable\n\nThe package includes an executable component, `unicode-collate`,\nwhich may be used for testing and for collating in scripts.\nTo build it, enable the `executable` flag.\nFor usage instructions, run `unicode-collate --help`.\n\n## References\n\n- Unicode Technical Standard #35:\n  Unicode Locale Data Markup Language (LDML):\n  \u003chttp://www.unicode.org/reports/tr35/\u003e\n- Unicode Technical Standard #10:\n  Unicode Collation Algorithm:\n  \u003chttps://www.unicode.org/reports/tr10\u003e\n- Unicode Standard Annex #15:\n  Unicode Normalization Forms:\n  \u003chttps://unicode.org/reports/tr15/\u003e\n\n","funding_links":["https://github.com/sponsors/jgm"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjgm%2Funicode-collation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjgm%2Funicode-collation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjgm%2Funicode-collation/lists"}