https://github.com/leonidessaguisagjr/unicodeutil
Python (currently tested against Python 2.7, Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10, PyPy 2.7 and PyPy 3.7) classes and functions for working with Unicode® data. Based on v13.0.0 of the Unicode® Character Database (UCD).
https://github.com/leonidessaguisagjr/unicodeutil
casefold hangul jamo unicode unicode-casefold unicode-character-database unicode-characters unicode-data unicode-support
Last synced: 2 months ago
JSON representation
Python (currently tested against Python 2.7, Python 3.6, Python 3.7, Python 3.8, Python 3.9, Python 3.10, PyPy 2.7 and PyPy 3.7) classes and functions for working with Unicode® data. Based on v13.0.0 of the Unicode® Character Database (UCD).
- Host: GitHub
- URL: https://github.com/leonidessaguisagjr/unicodeutil
- Owner: leonidessaguisagjr
- License: mit
- Created: 2018-04-23T05:45:18.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2023-05-22T21:54:42.000Z (about 2 years ago)
- Last Synced: 2025-03-22T00:34:10.587Z (3 months ago)
- Topics: casefold, hangul, jamo, unicode, unicode-casefold, unicode-character-database, unicode-characters, unicode-data, unicode-support
- Language: Python
- Homepage: https://pypi.org/project/unicodeutil/
- Size: 748 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
README
``unicodeutil``
===============.. image:: https://img.shields.io/pypi/v/unicodeutil.svg
:target: https://pypi.python.org/pypi/unicodeutil.. image:: https://img.shields.io/github/workflow/status/leonidessaguisagjr/unicodeutil/Python%20unicodeutil
:target: https://github.com/leonidessaguisagjr/unicodeutil/actions/workflows/python-app.ymlPython classes and functions for working with Unicode® data. This was initially built with Python 2 in mind but has also been tested with Python 3, PyPy and PyPy3.
Dependencies
------------This package has the following external dependencies:
* `six `_ - for Python 2 to 3 compatibility
Case folding function
---------------------``casefold(s)`` is a function for performing case folding per section 3.13 of the `Unicode® Standard `_. Also see the `W3C page on case folding `_ for more information on what case folding is.
Python 3.3 and newer has ``str.casefold()`` already built in. This is my attempt at building a case folding function to use with Python 2 and as such was initially only tested with Python 2.7.14. It essentially parses the ``CaseFolding.txt`` file that is included in the `Unicode® Character Database `_ to build a dictionary that is then used as a lookup table to create a copy of the input string that has been transformed to facilitate caseless comparisons.
A bit more information about how I put this together on my `blog `_.
By default, the ``casefold(s)`` function performs full case folding. To use simple case folding, pass the parameter ``fullcasefold=False`` (the default is ``fullcasefold=True``). See the comments in ``CaseFolding.txt`` for an explanation of the difference between simple and full case folding.
By default, the ``casefold(s)`` function will not use the Turkic special case mappings for dotted and dotless 'i'. To use the Turkic mapping, pass the parameter ``useturkicmapping=True`` to the function. See the following web pages for more information on the dotted vs dotless 'i':
* https://en.wikipedia.org/wiki/Dotted_and_dotless_I
* http://www.i18nguy.com/unicode/turkish-i18n.html#problemExample usage
^^^^^^^^^^^^^Using Python 2::
>>> from unicodeutil import casefold
>>> s1 = u"weiß"
>>> s2 = u"WEISS"
>>> casefold(s1) == casefold(s2)
True
>>> s1 = u"LİMANI"
>>> s2 = u"limanı"
>>> casefold(s1) == casefold(s2)
False
>>> casefold(s1, useturkicmapping=True) == casefold(s2, useturkicmapping=True)
TrueSplitting a Python 2 string into chars, preserving surrogate pairs
-------------------------------------------------------------------------The ``preservesurrogates(s)`` function will split a string into a list of characters, preserving `surrogate pairs `_.
Example usage
^^^^^^^^^^^^^Using Python 2::
>>> from unicodeutil import preservesurrogates
>>> s = u"ABC\U0001e900DeF\U000118a0gHıİ"
>>> list(s)
[u'A', u'B', u'C', u'\ud83a', u'\udd00', u'D', u'e', u'F', u'\ud806', u'\udca0', u'g', u'H', u'\u0131', u'\u0130']
>>> for c in s:
... print c
...
A
B
C
???
???
D
e
F
???
???
g
H
ı
İ
>>> list(preservesurrogates(s))
[u'A', u'B', u'C', u'\U0001e900', u'D', u'e', u'F', u'\U000118a0', u'g', u'H', u'\u0131', u'\u0130']
>>> for c in preservesurrogates(s):
... print(c)
...
A
B
C
𞤀
D
e
F
𑢠
g
H
ı
İUsing the latest Unicode® Character Database (UCD)
--------------------------------------------------For the Python 2.7.x line, the `unicodedata module in Python 2.7.18 `_ is still using data from version 5.2.0 of the UCD. Even Python 3 releases up to the 3.10.x line are also still not on the latest version of the UCD e.g. the `unicodedata module in Python 3.10.7 `_ is still using data from version 13.0.0 of the UCD. The UCD is `currently up to version 15.0.0 `_.
The ``UnicodeCharacter`` namedtuple encapsulates the various properties associated with each Unicode® character, as explained in `Unicode Standard Annex #44, UnicodeData.txt `_.
The ``UnicodeData`` class represents the contents of the UCD as parsed from the `latest UnicodeData.txt `_ found on the Unicode Consortium FTP site. Once an instance of the ``UnicodeData`` class has been created, it is possible to do ``dict`` style lookups using the Unicode scalar value, lookup by Unicode character by using the ``lookup_by_char(c)`` method, or lookups by name using the ``lookup_by_name(name)`` and ``lookup_by_partial_name(partial_name)`` methods. The name lookup uses the `UAX44-LM2 `_ loose matching rule when doing lookups. Iterating through all of the data is also possible via ``items()``, ``keys()`` and ``values()`` methods.
The ``UnicodeBlocks`` class encapsulates the block information associated with a Unicode character. Once an instance of the ``UnicodeBlocks`` class has been created, it is possible to get the Block name associated with a particular Unicode character by either doing ``dict`` style lookups using the Unicode scalar value, or using the ``lookup_by_char(c)`` method to lookup by Unicode character. Iterating through all of the data is also possible via the ``items()``, ``keys()`` and ``values()`` methods.
Example usage
^^^^^^^^^^^^^Using Python 2::
>>> from unicodeutil import UnicodeBlocks, UnicodeData
>>> ucd = UnicodeData()
>>> ucd[0x00df]
UnicodeCharacter(code=u'U+00DF', name='LATIN SMALL LETTER SHARP S', category='Ll', combining=0, bidi='L', decomposition='', decimal='', digit='', numeric='', mirrored='N', unicode_1_name='', iso_comment='', uppercase='', lowercase='', titlecase='')
>>> ucd[0x0130].name
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> ucd.lookup_by_char(u"ᜊ")
UnicodeCharacter(code=u'U+170A', name=u'TAGALOG LETTER BA', category=u'Lo', combining=0, bidi=u'L', decomposition=u'', decimal=u'', digit=u'', numeric=u'', mirrored=u'N', unicode_1_name=u'', iso_comment=u'', uppercase=u'', lowercase=u'', titlecase=u'')
>>> ucd.lookup_by_name("latin small letter sharp_s")
UnicodeCharacter(code=u'U+00DF', name='LATIN SMALL LETTER SHARP S', category='Ll', combining=0, bidi='L', decomposition='', decimal='', digit='', numeric='', mirrored='N', unicode_1_name='', iso_comment='', uppercase='', lowercase='', titlecase='')
>>> blocks = UnicodeBlocks()
>>> blocks[0x00DF]
u'Latin-1 Supplement'
>>> blocks.lookup_by_char(u"ẞ")
u'Latin Extended Additional'Composing and decomposing Hangul Syllables
------------------------------------------The function ``compose_hangul_syllable(jamo)`` takes a tuple or list of Unicode scalar values of Jamo and returns its equivalent precomposed Hangul syllable. The complementary function ``decompose_hangul_syllable(hangul_syllable, fully_decompose=False)`` takes the Unicode scalar value of a hangul syllable and will either do a canonical decomposition (default, fully_decompose=False) or a full canonical decomposition (fully_decompose=True) of a Hangul syllable. The return value will be a tuple of Unicode scalar values corresponding to the Jamo that the Hangul syllable has been decomposed into. For example (taken from the `Unicode Standard, ch. 03, section 3.12, Conjoing Jamo Behavior `_)::
U+D4DB <-> # Canonical Decomposition (default)
U+D4CC <->
U+D4DB <-> # Full Canonical DecompositionExample usage:
^^^^^^^^^^^^^^The following sample code snippet::
import sys
from unicodeutil import UnicodeData, compose_hangul_syllable, \
decompose_hangul_syllableucd = None
def pprint_composed(jamo):
hangul = compose_hangul_syllable(jamo)
hangul_data = ucd[hangul]
print("<{0}> -> {1}".format(
", ".join([" ".join([jamo_data.code, jamo_data.name])
for jamo_data in [ucd[j] for j in jamo]]),
" ".join([hangul_data.code, hangul_data.name])
))def pprint_decomposed(hangul, decomposition):
hangul_data = ucd[hangul]
print("{0} -> <{1}>".format(
" ".join([hangul_data.code, hangul_data.name]),
", ".join([" ".join([jamo_data.code, jamo_data.name])
for jamo_data in [ucd[jamo]
for jamo in decomposition if jamo]])
))def main():
if len(sys.argv) not in {2, 3, 4}:
print("Invalid number of arguments!")
sys.exit(1)
global ucd
ucd = UnicodeData()
if len(sys.argv) == 2:
hangul = int(sys.argv[1], 16)
print("Canonical Decomposition:")
pprint_decomposed(hangul,
decompose_hangul_syllable(hangul,
fully_decompose=False))
print("Full Canonical Decomposition:")
pprint_decomposed(hangul,
decompose_hangul_syllable(hangul,
fully_decompose=True))
elif len(sys.argv) in {3, 4}:
print("Composition:")
pprint_composed(tuple([int(arg, 16) for arg in sys.argv[1:]]))if __name__ == "__main__":
main()Will produce the following (tested in Python 2 and Python 3)::
$ python pprint_hangul.py 0xD4DB
Canonical Decomposition:
U+D4DB HANGUL SYLLABLE PWILH ->
Full Canonical Decomposition:
U+D4DB HANGUL SYLLABLE PWILH ->
$ python3 pprint_hangul.py 0xD4CC 0x11B6
Composition:
-> U+D4DB HANGUL SYLLABLE PWILH
$ pypy pprint_hangul.py 0x1111 0x1171 0x11b6
Composition:
-> U+D4DB HANGUL SYLLABLE PWILHLicense
-------This is released under an MIT license. See the ``LICENSE`` file in this repository for more information.
The included ``Blocks.txt``, ``CaseFolding.txt``, ``HangulSyllableType.txt``, ``Jamo.txt`` and ``UnicodeData.txt`` files are part of the Unicode® Character Database that is published by Unicode, Inc. Please consult the `Unicode® Terms of Use `_ prior to use.