https://github.com/explosion/tokenizations
Robust and Fast tokenizations alignment library for Rust and Python https://tamuhey.github.io/tokenizations/
- Host: GitHub
- URL: https://github.com/explosion/tokenizations
- Owner: explosion
- License: mit
- Archived: true
- Created: 2019-12-30T10:02:22.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-10-04T15:52:19.000Z (over 2 years ago)
- Last Synced: 2025-01-17T14:35:16.402Z (12 months ago)
- Language: Rust
- Homepage:
- Size: 3.26 MB
- Stars: 189
- Watchers: 10
- Forks: 20
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# Robust and Fast tokenizations alignment library for Rust and Python
[crates.io](https://crates.io/crates/tokenizations)
[PyPI](https://pypi.org/project/pytokenizations/)
[GitHub Actions](https://github.com/explosion/tokenizations/actions)

Demo: [demo](https://tamuhey.github.io/tokenizations/)
Rust document: [docs.rs](https://docs.rs/tokenizations)
Blog post: [How to calculate the alignment between BERT and spaCy tokens effectively and robustly](https://gist.github.com/tamuhey/af6cbb44a703423556c32798e1e1b704)
## Usage (Python)
- Installation
```bash
$ pip install -U pip # update pip
$ pip install pytokenizations
```
- Or, install from source
This library uses [maturin](https://github.com/PyO3/maturin) to build the wheel.
```console
$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build
```
The wheel is now created in the `python/target/wheels` directory, and you can install it with `pip install *.whl`.
### `get_alignments`
```python
def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...
```
Returns the alignment mappings between two different tokenizations of the same text:
```python
>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]
```
`a2b[i]` is a list of the indices in `tokens_b` that align with `tokens_a[i]`; `b2a` is the reverse mapping.
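A common use of these mappings is projecting a token span from one tokenization onto the other (for example, carrying entity labels from spaCy tokens over to BERT tokens). A minimal sketch using the alignment lists from the example above; `project_span` is a hypothetical helper, not part of the library:

```python
# Alignments from the example: tokens_a = ["å", "BC"], tokens_b = ["abc"]
a2b = [[0], [0]]
b2a = [[0, 1]]

def project_span(start, end, mapping):
    """Project a half-open token span [start, end) through an alignment mapping."""
    targets = [t for i in range(start, end) for t in mapping[i]]
    if not targets:
        return None  # the span has no counterpart in the other tokenization
    return (min(targets), max(targets) + 1)

# Both tokens of tokens_a map onto the single token of tokens_b, and vice versa.
print(project_span(0, 2, a2b))  # (0, 1)
print(project_span(0, 1, b2a))  # (0, 2)
```

Because `mapping[i]` can be empty (e.g. a token deleted by normalization), a robust projection should handle the `None` case.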
## Usage (Rust)
See here: [docs.rs](https://docs.rs/tokenizations)
## Related
- [Algorithm overview](./note/algorithm.md)
- [Blog post](./note/blog_post.md)
- [seqdiff](https://github.com/tamuhey/seqdiff) is used for the diff process.
- [textspan](https://github.com/tamuhey/textspan)
- [explosion/spacy-alignments: 💫 A spaCy package for Yohei Tamura's Rust tokenizations library](https://github.com/explosion/spacy-alignments)
- Python bindings for this library, maintained by Explosion, the makers of spaCy. If you have trouble installing pytokenizations, try this package instead.