Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jtauber/pyuca
a Python implementation of the Unicode Collation Algorithm
- Host: GitHub
- URL: https://github.com/jtauber/pyuca
- Owner: jtauber
- License: mit
- Created: 2012-06-21T13:24:23.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2024-03-08T13:18:44.000Z (10 months ago)
- Last Synced: 2024-12-30T09:09:52.349Z (12 days ago)
- Topics: unicode, unicode-collation-algorithm
- Language: Python
- Homepage:
- Size: 14.8 MB
- Stars: 219
- Watchers: 13
- Forks: 23
- Open Issues: 14
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Authors: AUTHORS
Awesome Lists containing this project
- starred-awesome - pyuca - a Python implementation of the Unicode Collation Algorithm (Python)
README
# pyuca: Python Unicode Collation Algorithm implementation
[![Build Status](http://img.shields.io/travis/jtauber/pyuca.svg)](https://travis-ci.org/jtauber/pyuca)
[![Coverage Status](http://img.shields.io/coveralls/jtauber/pyuca.svg)](https://coveralls.io/r/jtauber/pyuca?branch=master)
![MIT License](http://img.shields.io/badge/license-MIT-brightgreen.svg)
[![DOI](https://zenodo.org/badge/3769/jtauber/pyuca.svg)](https://zenodo.org/badge/latestdoi/3769/jtauber/pyuca)
[![JOSS](http://joss.theoj.org/papers/10.21105/joss.00021/status.svg)](http://joss.theoj.org/papers/10.21105/joss.00021)

This is a Python implementation of the
[Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/). It
passes 100% of the UCA conformance tests for Unicode 5.2.0 (Python 2.7),
Unicode 6.3.0 (Python 3.3+), Unicode 8.0.0 (Python 3.5+), Unicode 9.0.0
(Python 3.6+), and Unicode 10.0.0 (Python 3.7+) with a variable-weighting
setting of Non-ignorable.

## What do you use it for?

In short, sorting non-English strings properly.
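As a rough illustration of the problem, here is a standard-library-only sketch. It is not pyuca's implementation and does not use real collation weights: the `two_level_key` helper is hypothetical and relies on NFD decomposition as a crude stand-in, comparing base letters first and using accents only as a tie-breaker.

```python
import unicodedata

words = ["cafe", "caff", "caf\u00e9"]  # "\u00e9" is precomposed "é"

# Python's default string sort compares code points, so "é" (U+00E9)
# sorts after "f" and the accented word ends up last.
by_code_point = sorted(words)


def two_level_key(s):
    """Toy key: base letters first, combining marks as a tie-breaker.

    This only approximates the idea of primary and secondary levels;
    it is an assumption-laden sketch, not the Unicode Collation Algorithm.
    """
    decomposed = unicodedata.normalize("NFD", s)
    # Primary level: characters with combining class 0 (base letters).
    primary = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: the combining marks (accents) that were stripped.
    secondary = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    return (primary, secondary)


by_levels = sorted(words, key=two_level_key)

print(by_code_point)
print(by_levels)
```

With this toy key the accented word sorts directly after its unaccented form instead of after every word that shares the prefix `caf`.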
The core of the algorithm involves multi-level comparison. For example,
``café`` comes before ``caff`` because at the primary level, the accent is
ignored and the first word is treated as if it were ``cafe``. The secondary
level (which considers accents) then applies only to words that are equivalent
at the primary level.

The Unicode Collation Algorithm and pyuca also support contraction and
expansion. **Contraction** is where multiple letters are treated as a single
unit. In traditional Spanish ordering, ``ch`` is treated as a letter coming
between ``c`` and ``d`` so that, for example, words beginning with ``ch`` sort
after all other words beginning with ``c``. **Expansion** is where a single
letter is treated as though it were multiple letters. In German phone-book
ordering, ``ä`` is sorted as if it were ``ae``, i.e. after ``ad`` but before
``af``.

## How to use it

Here is how to use the ``pyuca`` module.

    pip install pyuca

Usage example:

    from pyuca import Collator
    c = Collator()

    assert sorted(["cafe", "caff", "café"]) == ["cafe", "caff", "café"]
    assert sorted(["cafe", "caff", "café"], key=c.sort_key) == ["cafe", "café", "caff"]

``Collator`` can also take an optional filename for specifying a custom
collation element table.

You can also import collators for specific Unicode versions,
e.g. `from pyuca.collator import Collator_8_0_0`.
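
To see which Unicode version your interpreter's standard library carries, `unicodedata.unidata_version` can be printed directly. The sketch below also derives a class name following the `Collator_8_0_0` pattern; the helper `collator_class_name` is hypothetical, and this does not check whether pyuca actually ships a class for that version.

```python
import unicodedata

# Unicode version of the character database bundled with this interpreter,
# given as a dotted string; the exact value depends on the Python build.
version = unicodedata.unidata_version


def collator_class_name(unicode_version):
    """Hypothetical helper: mirror the naming pattern of `Collator_8_0_0`.

    It only builds a string; whether such a class exists in pyuca for a
    given version is an assumption that must be checked separately.
    """
    return "Collator_" + unicode_version.replace(".", "_")


print(version, collator_class_name(version))
```
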
But just `from pyuca import Collator` will ensure that the collator version
matches the version of `unicodedata` provided by the standard library for your
version of Python.

## How to cite it

Tauber, J. K. (2016). pyuca: a Python implementation of the Unicode Collation Algorithm. The Journal of Open Source Software. DOI: 10.21105/joss.00021
## License
Python code is made available under an MIT license (see `LICENSE`).
`allkeys.txt` is made available under the similar license defined in
`LICENSE-allkeys`.

## Contacting the Developer

If you have any problems, questions or suggestions, it's best to file an issue
on GitHub although you can also contact me at [email protected].

For more of my work on linguistics and Ancient Greek, see
.