{"id":15103580,"url":"https://github.com/explosion/tokenizations","last_synced_at":"2025-09-27T02:31:43.770Z","repository":{"id":39012752,"uuid":"230892092","full_name":"explosion/tokenizations","owner":"explosion","description":"Robust and Fast tokenizations alignment library for Rust and Python https://tamuhey.github.io/tokenizations/","archived":true,"fork":false,"pushed_at":"2023-10-04T15:52:19.000Z","size":3415,"stargazers_count":189,"open_issues_count":12,"forks_count":20,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-01-17T14:35:16.402Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"tamuhey"}},"created_at":"2019-12-30T10:02:22.000Z","updated_at":"2024-12-31T11:46:31.000Z","dependencies_parsed_at":"2024-09-16T01:05:51.677Z","dependency_job_id":"e6d5f868-9a74-477c-9848-1a56ea7e3181","html_url":"https://github.com/explosion/tokenizations","commit_stats":{"total_commits":252,"total_committers":4,"mean_commits":63.0,"dds":"0.13095238095238093","last_synced_commit":"20139107b953cbef58ae58be75bb8892765ef07e"},"previous_names":[],"tags_count":64,"template":false,"template_full_name":null,"purl":"pkg:github/explosion/tokenizations","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ftokenizations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ftokenizations/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ftokenizations/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ftokenizations/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/tokenizations/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ftokenizations/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277171521,"owners_count":25773234,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-27T02:00:08.978Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-25T19:40:36.406Z","updated_at":"2025-09-27T02:31:43.250Z","avatar_url":"https://github.com/explosion.png","language":"Rust","readme":"# Robust and Fast tokenizations alignment library for Rust and Python\n[![crates.io](https://img.shields.io/crates/v/tokenizations.svg)](https://crates.io/crates/tokenizations)\n[![pypi](https://img.shields.io/pypi/v/pytokenizations.svg)](https://pypi.org/project/pytokenizations/)\n[![Actions Status](https://github.com/explosion/tokenizations/workflows/Test/badge.svg)](https://github.com/explosion/tokenizations/actions)\n\n![sample](./img/demo.png)\n\nDemo: 
[demo](https://tamuhey.github.io/tokenizations/)  \nRust document: [docs.rs](https://docs.rs/tokenizations)  \nBlog post: [How to calculate the alignment between BERT and spaCy tokens effectively and robustly](https://gist.github.com/tamuhey/af6cbb44a703423556c32798e1e1b704)\n\n## Usage (Python)\n\n- Installation\n\n```bash\n$ pip install -U pip # update pip\n$ pip install pytokenizations\n```\n\n- Or, install from source\n\nThis library uses [maturin](https://github.com/PyO3/maturin) to build the wheel.\n\n```console\n$ git clone https://github.com/tamuhey/tokenizations\n$ cd tokenizations/python\n$ pip install maturin\n$ maturin build\n```\n\nThe wheel is now created in the `python/target/wheels` directory, and you can install it with `pip install *.whl`.\n\n### `get_alignments`\n\n```python\ndef get_alignments(a: Sequence[str], b: Sequence[str]) -\u003e Tuple[List[List[int]], List[List[int]]]: ...\n```\n\nReturns alignment mappings for two different tokenizations:\n\n```python\n\u003e\u003e\u003e import tokenizations\n\u003e\u003e\u003e tokens_a = [\"å\", \"BC\"]\n\u003e\u003e\u003e tokens_b = [\"abc\"] # the accent is dropped (å -\u003e a) and the letters are lowercased (BC -\u003e bc)\n\u003e\u003e\u003e a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)\n\u003e\u003e\u003e print(a2b)\n[[0], [0]]\n\u003e\u003e\u003e print(b2a)\n[[0, 1]]\n```\n\n`a2b[i]` is the list of indices in `tokens_b` that `tokens_a[i]` aligns to; likewise, `b2a[j]` lists the indices in `tokens_a` aligned to `tokens_b[j]`.\n\n## Usage (Rust)\n\nSee here: [docs.rs](https://docs.rs/tokenizations)  \n\n## Related\n\n- [Algorithm overview](./note/algorithm.md)  \n- [Blog post](./note/blog_post.md)  \n- [seqdiff](https://github.com/tamuhey/seqdiff) is used for the diff process.\n- [textspan](https://github.com/tamuhey/textspan)\n- [explosion/spacy-alignments: 💫 A spaCy package for Yohei Tamura's Rust tokenizations library](https://github.com/explosion/spacy-alignments)\n  - Python bindings for this library, maintained by Explosion, the makers of spaCy. 
If you have difficulty installing pytokenizations, try this package instead.\n","funding_links":["https://github.com/sponsors/tamuhey"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Ftokenizations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Ftokenizations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Ftokenizations/lists"}