https://github.com/explosion/curated-tokenizers
Lightweight piece tokenization library
- Host: GitHub
- URL: https://github.com/explosion/curated-tokenizers
- Owner: explosion
- License: other
- Created: 2022-07-08T19:05:49.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-04-15T14:13:07.000Z (about 2 years ago)
- Last Synced: 2025-01-29T01:27:49.504Z (about 1 year ago)
- Language: Cython
- Homepage:
- Size: 265 KB
- Stars: 12
- Watchers: 7
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🥢 Curated Tokenizers
This Python library provides word-/sentencepiece tokenizers. The following
types of tokenizers are currently supported:
| Tokenizer | Binding | Example model |
| --------- | ------------- | ------------- |
| BPE | sentencepiece | |
| Byte BPE | Native | RoBERTa/GPT-2 |
| Unigram | sentencepiece | XLM-RoBERTa |
| Wordpiece | Native | BERT |
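
To give a feel for what a piece tokenizer does, here is a minimal, self-contained sketch of greedy longest-match wordpiece splitting in the style used by BERT. It is not this library's implementation or API: the function name `wordpiece_tokenize`, the toy vocabulary, the `##` continuation prefix, and the `[UNK]` fallback are all assumptions made for illustration.

```python
from typing import Dict, List


def wordpiece_tokenize(
    word: str,
    vocab: Dict[str, int],
    unk: str = "[UNK]",
    prefix: str = "##",
) -> List[str]:
    """Split one word into pieces by greedy longest-match lookup.

    Illustrative only: pieces that do not start the word carry a
    continuation prefix, and a word with no valid split maps to `unk`.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position, so give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"[UNK]": 0, "token": 1, "##izer": 2, "##s": 3}

print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, `tokenizers` splits into `token`, `##izer`, and `##s`, while `xyz` falls back to `[UNK]`. Real models ship much larger vocabularies and apply normalization and pre-tokenization before this step.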
## ⚠️ Warning: experimental package
This package is experimental and it is likely that the APIs will change in
incompatible ways.
## ⏳ Install
Curated Tokenizers is available through PyPI:
```bash
pip install curated_tokenizers
```
## 🚀 Quickstart
The best way to get started with curated tokenizers is through the
[`curated-transformers`](https://github.com/explosion/curated-transformers)
library. `curated-transformers` also provides functionality to load tokenization
models from the Hugging Face Hub.