https://github.com/dylan-profiler/tangled-up-in-unicode

Access to the Unicode Character Database (UCD)
data-analysis data-quality exploration linguistic-analysis linguistics python unicode

# Tangled up in Unicode

This module provides access to the properties of all Unicode characters, as defined in the Unicode Character Database (UCD).
It is an alternative to Python's standard library module [`unicodedata`](https://docs.python.org/3/library/unicodedata.html).
`Tangled up in Unicode` provides four main benefits compared to the standard library:
- Uses the [latest version](http://www.unicode.org/versions/latest/) of the Unicode database.
- Adds human-readable class names (property value aliases).
- Exposes more properties from the database than the standard library does.
- Keeps the UCD version independent of the Python version (Python 3.6 ships UCD 9.0.0, 3.7 ships 11.0.0, 3.8 ships 12.1.0, 3.9 ships 13.0.0).

Note that Python 3 added Unicode support, but this is distinct from the UCD.
Unicode support covers storing and manipulating Unicode characters, while this package provides the properties of specific characters.
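
The coupling between the Python version and the UCD version can be checked directly, since the standard library exposes the version of its bundled database as `unicodedata.unidata_version`. A minimal sketch using only the standard library (the helper name `bundled_ucd_version` is illustrative and not part of either package):

```python
import sys
import unicodedata


def bundled_ucd_version():
    """Return the UCD version bundled with the stdlib as a tuple of ints."""
    # unidata_version is a dotted string such as "13.0.0".
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


python_version = "{}.{}".format(*sys.version_info[:2])
ucd_version = bundled_ucd_version()
print("Python", python_version, "bundles UCD", ".".join(map(str, ucd_version)))
```

Running this under different interpreters shows the bundled UCD version changing with the Python release.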

## Example

The default lookup in `unicodedata` for `$`:

| Property | Value |
|---------------------------|-------------------|
| Name | Dollar Sign |
| Category (Short) | Sc |
| Bidirectional (Short) | ET |
| Combining | 0 |
| Mirrored | 0 |
| East Asian Width (Short) | Na |
| Decomposition | |
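
The values in the table above correspond to the following standard-library calls. This sketch uses only documented `unicodedata` functions; the dictionary keys simply mirror the table rows. Note that `unicodedata.name` returns the name in upper case, whereas the table shows it in title case.

```python
import unicodedata

char = "$"

# One standard-library call per row of the table above.
stdlib_properties = {
    "Name": unicodedata.name(char),
    "Category (Short)": unicodedata.category(char),
    "Bidirectional (Short)": unicodedata.bidirectional(char),
    "Combining": unicodedata.combining(char),
    "Mirrored": unicodedata.mirrored(char),
    "East Asian Width (Short)": unicodedata.east_asian_width(char),
    "Decomposition": unicodedata.decomposition(char),
}

for prop, value in stdlib_properties.items():
    print(f"{prop}: {value!r}")
```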

Extra information provided by this package:

| Property | Value |
|-------------------------------|-----------------------|
| Category Alias (Long) | Currency_Symbol |
| Bidirectional Alias (Long) | European_Terminator |
| East Asian Width Alias (Long) | Narrow |
| Script (Long) | Common |
| Script (Short) | Zyyy |
| Block (Long) | Basic_Latin |
| Block (Short) | ASCII |
| PropList | Pattern_Syntax |
| Uppercase Character | |
| Lowercase Character | |
| Titlecase Character | |

## Properties comparison

| Property | `tangled-up-in-unicode` | `unicodedata` |
|---------------------------|-------------------------------|-----------------------|
| Name | ☑ | ☑ |
| Decimal | ☑ | ☑ |
| Digit | ☑ | ☑ |
| Numeric | ☑ | ☑ |
| Combining | ☑ + alias | ☑ |
| Mirrored | ☑ | ☑ |
| Decomposition | ☑ | ☑ |
| Category | ☑ + alias | ☑ |
| Bidirectional | ☑ + alias | ☑ |
| East Asian Width | ☑ + alias | ☑ |
| Script | ☑ + alias | - |
| Block | ☑ + alias | - |
| Age | ☑ + alias | - |
| Binary Property Values | ☑ | - |
| Version | 14.0.0 ([latest](http://www.unicode.org/versions/latest/)) | 12.1.0 |

_Table 1: presence of properties is denoted by ☑ (Unicode Character 'BALLOT BOX WITH CHECK' (U+2611))._

## Usage

```python
import tangled_up_in_unicode as unicodedata
```

The package can be installed via pip:

```
pip install tangled-up-in-unicode
```
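
Because the import alias makes the package stand in for `unicodedata`, existing call sites can stay unchanged. The sketch below assumes the package exposes `category` with the same signature as the standard library, which the properties comparison table suggests but this README does not spell out. The fallback to the standard library is an illustrative pattern, not something the package prescribes.

```python
try:
    # Alias from the usage section, so existing call sites keep working.
    import tangled_up_in_unicode as unicodedata
except ImportError:
    # Fall back to the standard library when the package is not installed.
    import unicodedata

# `category` is listed for both modules in the properties comparison table.
symbol_category = unicodedata.category("$")
print(symbol_category)
```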

## Performance

The module is written in Python.
It can be compiled with Cython to reach performance competitive with the native library, meaning that the null hypothesis of the two libraries having the same average runtime could not be rejected.

## Unsupported features

Some of the features in `unicodedata` are not supported.

| Feature | `tangled-up-in-unicode` | `unicodedata` |
|-----------------------|-------------------------------|-----------------------|
| lookup | - | ☑ |
| normalize | - | ☑ |
| ucd_3_2_0 | - | ☑ |
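
Code that needs these features can keep using the standard library alongside this package. The sketch below imports it under a separate name (`stdlib_unicodedata`, chosen here for illustration) so that it does not clash with the `unicodedata` alias from the usage section.

```python
import unicodedata as stdlib_unicodedata

# lookup: resolve a character from its Unicode name.
dollar = stdlib_unicodedata.lookup("DOLLAR SIGN")

# normalize: compose "e" + COMBINING ACUTE ACCENT into a single code point.
composed = stdlib_unicodedata.normalize("NFC", "e\u0301")

# ucd_3_2_0: a database object frozen at Unicode 3.2.0.
legacy_version = stdlib_unicodedata.ucd_3_2_0.unidata_version

print(dollar, composed, legacy_version)
```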

## Acknowledgements

Where possible, code and documentation from the original `unicodedata` module are reused.
This repository is part of the Dylan Profiling project.