Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dylan-profiler/tangled-up-in-unicode
Access to the Unicode Character Database (UCD)
- Host: GitHub
- URL: https://github.com/dylan-profiler/tangled-up-in-unicode
- Owner: dylan-profiler
- License: other
- Created: 2019-09-22T12:30:34.000Z
- Default Branch: master
- Last Pushed: 2022-11-08T17:51:33.000Z
- Last Synced: 2024-09-20T01:18:13.798Z
- Topics: data-analysis, data-quality, exploration, linguistic-analysis, linguistics, python, unicode
- Language: Python
- Homepage:
- Size: 7.2 MB
- Stars: 3
- Watchers: 4
- Forks: 5
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Tangled up in Unicode
This module provides access to character properties for all Unicode characters, from the Unicode Character Database (UCD).
It is an alternative to Python's standard library [`unicodedata`](https://docs.python.org/3/library/unicodedata.html).
`Tangled up in Unicode` provides four main benefits compared to the standard library:
- Uses the [latest version](http://www.unicode.org/versions/latest/) of the Unicode database.
- Adds human-readable class names (property value aliases).
- Exposes more of the database's properties, such as Script, Block and Age.
- Keeps the UCD version independent of the Python version (Python 3.6 has UCD 9.0.0, 3.7 has UCD 11.0.0, 3.8 has 12.1.0, 3.9 has 13.0.0).

Note that Python 3's Unicode support is not the same thing as the UCD.
Unicode support handles storing and manipulating Unicode characters, while this package aims to provide properties of specific characters.

## Example
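To make the distinction between Unicode support and UCD properties concrete, here is a minimal sketch for `$`. It uses only the standard-library `unicodedata` module, so it runs without this package installed. The variable names are illustrative, not part of either library.

```python
import unicodedata

char = "$"

# Unicode *support*: storing and manipulating the character as text.
code_point = ord(char)             # integer code point of the character
utf8_bytes = char.encode("utf-8")  # byte representation in UTF-8

# UCD *properties*: facts the Unicode Character Database records
# about the character. These are the standard-library lookups.
properties = {
    "name": unicodedata.name(char),
    "category": unicodedata.category(char),
    "bidirectional": unicodedata.bidirectional(char),
    "combining": unicodedata.combining(char),
    "mirrored": unicodedata.mirrored(char),
    "east_asian_width": unicodedata.east_asian_width(char),
    "decomposition": unicodedata.decomposition(char),
}

print(f"U+{code_point:04X} {utf8_bytes!r}")
for key, value in properties.items():
    print(f"{key}: {value!r}")
```

The standard library returns the name in upper case (`DOLLAR SIGN`) and an empty string for a character without a decomposition.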
The default lookup in `unicodedata` for `$`:
| Property | Value |
|---------------------------|-------------------|
| Name | Dollar Sign |
| Category (Short) | Sc |
| Bidirectional (Short) | ET |
| Combining | 0 |
| Mirrored | 0 |
| East Asian Width (Short) | Na |
| Decomposition | |

Extra information provided by this package:
| Property | Value |
|-------------------------------|-----------------------|
| Category Alias (Long) | Currency_Symbol |
| Bidirectional Alias (Long) | European_Terminator |
| East Asian Width Alias (Long) | Narrow |
| Script (Long) | Common |
| Script (Short) | Zyyy |
| Block (Long) | Basic_Latin |
| Block (Short) | ASCII |
| PropList | Pattern_Syntax |
| Uppercase Character | |
| Lowercase Character | |
| Titlecase Character | |

## Properties comparison
| Property | `tangled-up-in-unicode` | `unicodedata` |
|---------------------------|-------------------------------|-----------------------|
| Name | ☑ | ☑ |
| Decimal | ☑ | ☑ |
| Digit | ☑ | ☑ |
| Numeric | ☑ | ☑ |
| Combining | ☑ + alias | ☑ |
| Mirrored | ☑ | ☑ |
| Decomposition | ☑ | ☑ |
| Category | ☑ + alias | ☑ |
| Bidirectional | ☑ + alias | ☑ |
| East Asian Width | ☑ + alias | ☑ |
| Script | ☑ + alias | - |
| Block | ☑ + alias | - |
| Age | ☑ + alias | - |
| Binary Property Values | ☑ | - |
| Version | 14.0.0 ([latest](http://www.unicode.org/versions/latest/)) | 12.1.0 (Python 3.8) |

_Table 1: presence of properties is denoted by ☑ (Unicode Character 'BALLOT BOX WITH CHECK' (U+2611))._
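Because the standard library's UCD version depends on the interpreter, it can help to check which version a given Python actually bundles. The sketch below uses only the standard library (`unicodedata.unidata_version` and `unicodedata.category`). The helper names `ucd_version` and `is_assigned` are hypothetical and not part of either library.

```python
import sys
import unicodedata


def ucd_version(version: str = unicodedata.unidata_version) -> tuple:
    """Parse a UCD version string such as '14.0.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_assigned(char: str) -> bool:
    """Return True if the bundled UCD gives `char` a category other than Cn.

    Cn covers unassigned code points (and noncharacters), so a character
    added in a newer Unicode version reports Cn on an older interpreter.
    """
    return unicodedata.category(char) != "Cn"


print("Python:", ".".join(str(part) for part in sys.version_info[:3]))
print("Bundled UCD:", unicodedata.unidata_version)
print("At least UCD 14.0.0:", ucd_version() >= (14, 0, 0))
print("'$' is assigned:", is_assigned("$"))
```

Comparing the parsed tuples avoids the pitfalls of comparing version strings lexically, where `"9.0.0"` would sort after `"14.0.0"`.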
## Usage
```python
import tangled_up_in_unicode as unicodedata
```

The package can be installed via pip:
```
pip install tangled-up-in-unicode
```

## Performance
The module is written in Python.
It can be compiled with Cython to gain performance competitive with the native library, meaning that the null hypothesis of the two libraries having the same average runtime could not be rejected.

## Unsupported features
Some of the features in `unicodedata` are not supported.
| Feature | `tangled-up-in-unicode` | `unicodedata` |
|-----------------------|-------------------------------|-----------------------|
| lookup | - | ☑ |
| normalize | - | ☑ |
| ucd_3_2_0 | - | ☑ |

## Acknowledgements
Where possible, code and documentation of the original module are used.
This repository is part of the Dylan Profiling project.