https://github.com/explosion/tokenizations
Robust and Fast tokenizations alignment library for Rust and Python https://tamuhey.github.io/tokenizations/
- Host: GitHub
- URL: https://github.com/explosion/tokenizations
- Owner: explosion
- License: mit
- Archived: true
- Created: 2019-12-30T10:02:22.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-10-04T15:52:19.000Z (over 2 years ago)
- Last Synced: 2025-01-17T14:35:16.402Z (12 months ago)
- Language: Rust
- Homepage:
- Size: 3.26 MB
- Stars: 189
- Watchers: 10
- Forks: 20
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# Robust and Fast tokenizations alignment library for Rust and Python
[crates.io](https://crates.io/crates/tokenizations)
[PyPI](https://pypi.org/project/pytokenizations/)
[GitHub Actions](https://github.com/explosion/tokenizations/actions)

Demo: [demo](https://tamuhey.github.io/tokenizations/)
Rust document: [docs.rs](https://docs.rs/tokenizations)
Blog post: [How to calculate the alignment between BERT and spaCy tokens effectively and robustly](https://gist.github.com/tamuhey/af6cbb44a703423556c32798e1e1b704)
## Usage (Python)
- Installation
```bash
$ pip install -U pip # update pip
$ pip install pytokenizations
```
- Or, install from source
This library uses [maturin](https://github.com/PyO3/maturin) to build the wheel.
```console
$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build
```
The wheel is now created in the `python/target/wheels` directory, and you can install it with `pip install *.whl`.
### `get_alignments`
```python
def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...
```
Returns the alignment mappings between two different tokenizations of the same text:
```python
>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]
```
`a2b[i]` is a list of the indices in `tokens_b` that align with `tokens_a[i]`; `b2a` is the reverse mapping.
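A common use of these mappings is projecting a token span from one tokenization onto the other (for example, carrying entity labels from spaCy tokens over to BERT tokens). A minimal sketch using the alignment lists from the example above; `project_span` is a hypothetical helper, not part of the library:

```python
# Alignments from the example: tokens_a = ["å", "BC"], tokens_b = ["abc"]
a2b = [[0], [0]]
b2a = [[0, 1]]

def project_span(start, end, mapping):
    """Project a half-open token span [start, end) through an alignment mapping."""
    targets = [t for i in range(start, end) for t in mapping[i]]
    if not targets:
        return None  # the span has no counterpart in the other tokenization
    return (min(targets), max(targets) + 1)

# Both tokens of tokens_a map onto the single token of tokens_b, and vice versa.
print(project_span(0, 2, a2b))  # (0, 1)
print(project_span(0, 1, b2a))  # (0, 2)
```

Because `mapping[i]` can be empty (e.g. a token deleted by normalization), a robust projection should handle the `None` case.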
## Usage (Rust)
See here: [docs.rs](https://docs.rs/tokenizations)
## Related
- [Algorithm overview](./note/algorithm.md)
- [Blog post](./note/blog_post.md)
- [seqdiff](https://github.com/tamuhey/seqdiff) is used for the diff process.
- [textspan](https://github.com/tamuhey/textspan)
- [explosion/spacy-alignments: 💫 A spaCy package for Yohei Tamura's Rust tokenizations library](https://github.com/explosion/spacy-alignments)
- Python bindings for this library, maintained by Explosion, the makers of spaCy. If you have trouble installing pytokenizations, try this package instead.