https://github.com/openvinotoolkit/openvino_tokenizers
OpenVINO Tokenizers extension
- Host: GitHub
- URL: https://github.com/openvinotoolkit/openvino_tokenizers
- Owner: openvinotoolkit
- License: apache-2.0
- Created: 2024-02-01T15:38:57.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2025-04-03T08:08:16.000Z (about 1 year ago)
- Last Synced: 2025-04-03T09:23:26.102Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 245 MB
- Stars: 31
- Watchers: 25
- Forks: 36
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
# OpenVINO Tokenizers
OpenVINO Tokenizers adds text processing operations to OpenVINO.
## Features
- Perform tokenization and detokenization without third-party dependencies
- Convert a Hugging Face tokenizer into an OpenVINO tokenizer and detokenizer model
- Combine OpenVINO models into a single model
- Add a greedy decoding pipeline to a text generation model
## Installation
(Recommended) Create and activate a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
# or
conda create --name openvino_tokenizers
conda activate openvino_tokenizers
```
### Minimal Installation
Use the minimal installation when you already have a converted OpenVINO tokenizer:
```bash
pip install openvino-tokenizers
# or
conda install -c conda-forge openvino openvino-tokenizers
```
### Convert Tokenizers Installation
If you want to convert HuggingFace tokenizers into OpenVINO tokenizers:
```bash
pip install openvino-tokenizers[transformers]
# or
conda install -c conda-forge openvino openvino-tokenizers && pip install transformers[sentencepiece] tiktoken
```
### Install Pre-release Version
Use `openvino-tokenizers[transformers]` to install tokenizers conversion dependencies.
```bash
pip install --pre -U openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
### Build and Install from Source
#### Using OpenVINO PyPI package
The openvino-tokenizers build depends on the [openvino](https://pypi.org/project/openvino/) package, which is installed automatically from PyPI during the build process. To build against an unreleased version, install the openvino package from the nightly distribution channel using `--extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly`:
```bash
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install . --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
This command is equivalent to the minimal installation. Install the tokenizer conversion dependencies if needed:
```bash
pip install transformers[sentencepiece] tiktoken
```
:warning: The latest commit of OpenVINO Tokenizers might rely on features that are not present in the released OpenVINO version.
Use [a nightly build](https://docs.openvino.ai/2024/get-started/install-openvino.html?VERSION=NIGHTLY) of OpenVINO or build
OpenVINO Tokenizers from a release branch if you have issues with the build process.
#### Using OpenVINO archive
Install the [OpenVINO archive](https://docs.openvino.ai/2024/get-started/install-openvino.html) distribution. Use `--no-deps` to avoid installing OpenVINO from PyPI into your current environment.
`--extra-index-url` is needed only to resolve build dependencies.
```bash
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install --no-deps . --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
This command is equivalent to the minimal installation. Install the tokenizer conversion dependencies if needed:
```bash
pip install transformers[sentencepiece] tiktoken
```
:warning: The latest commit of OpenVINO Tokenizers might rely on features that are not present in the released OpenVINO version.
Use [a nightly build](https://docs.openvino.ai/2024/get-started/install-openvino.html?VERSION=NIGHTLY) of OpenVINO or build
OpenVINO Tokenizers from a release branch if you have issues with the build process.
### Build and install for development
#### Using OpenVINO PyPI package
```bash
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install -e .[all] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
# verify installation by running tests
cd tests/
pytest .
```
#### Using OpenVINO archive
Install the [OpenVINO archive](https://docs.openvino.ai/2024/get-started/install-openvino.html) distribution. Use `--no-deps` to avoid installing OpenVINO from PyPI into your current environment.
`--extra-index-url` is needed only to resolve build dependencies.
```bash
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install -e .[all] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
# verify installation by running tests
cd tests/
pytest .
```
### C++ Installation
You can use converted tokenizers in C++ pipelines with prebuilt binaries.
1. Download the OpenVINO archive distribution for your OS from [here](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html) and extract the archive.
2. Download the OpenVINO Tokenizers prebuilt libraries from [here](https://storage.openvinotoolkit.org/repositories/openvino_tokenizers/packages/). To ensure compatibility, the first three numbers of the OpenVINO Tokenizers version must match the OpenVINO version, and the package must match your OS.
3. Extract the OpenVINO Tokenizers archive into the OpenVINO installation directory. The OpenVINO Tokenizers archive keeps the same structure as the OpenVINO archive:
   - Windows: `\runtime\bin\intel64\Release\`
   - MacOS_x86: `/runtime/lib/intel64/Release`
   - MacOS_arm64: `/runtime/lib/arm64/Release/`
   - Linux_x86: `/runtime/lib/intel64/`
   - Linux_arm64: `/runtime/lib/aarch64/`
After that, you can add the binary extension in your code with:
- `core.add_extension("openvino_tokenizers.dll")` for Windows
- `core.add_extension("libopenvino_tokenizers.dylib")` for MacOS
- `core.add_extension("libopenvino_tokenizers.so")` for Linux

and then `read`/`compile` the converted (de)tokenizer models.
If you use version `2023.3.0.0`, the binary extension file is called `(lib)user_ov_extension.(dll/dylib/so)`.
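
The two checks described above (matching versions and picking the platform-specific library name) can be sketched in Python. This is a minimal illustration rather than part of the package: the helper names `versions_compatible` and `extension_library_name` are hypothetical, and the sketch assumes version strings are dot-separated with an optional `-build` suffix.

```python
import platform


def versions_compatible(tokenizers_version, openvino_version):
    """Return True if the first three version numbers match (hypothetical helper)."""
    def first_three(version):
        # Drop an optional "-build" suffix, then keep "major.minor.patch".
        return version.split("-")[0].split(".")[:3]

    return first_three(tokenizers_version) == first_three(openvino_version)


def extension_library_name(system=None):
    """Return the extension file name listed above for the given OS (hypothetical helper)."""
    system = system or platform.system()
    names = {
        "Windows": "openvino_tokenizers.dll",
        "Darwin": "libopenvino_tokenizers.dylib",  # macOS
        "Linux": "libopenvino_tokenizers.so",
    }
    if system not in names:
        raise ValueError(f"unsupported platform: {system}")
    return names[system]


print(versions_compatible("2024.0.0.0", "2024.0.0"))
print(extension_library_name("Linux"))
```

The returned file name is what you would pass to `core.add_extension(...)` once the library is on the loader path.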
### C++ Build
To build OpenVINO Tokenizers binaries locally, use these commands:
```bash
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```
After that, you can transfer all binaries from `build/src` to the OpenVINO installation directory, as described in the C++ installation instructions above.
## Usage
:warning: OpenVINO Tokenizers models can run inference on the `CPU` device only.
### Convert HuggingFace tokenizer
OpenVINO Tokenizers ships with a CLI tool that can convert tokenizers from the Hugging Face Hub
or Hugging Face tokenizers saved on disk:
```shell
convert_tokenizer codellama/CodeLlama-7b-hf --with-detokenizer -o output_dir
```
There is also a `convert_tokenizer` function that can convert a tokenizer Python object:
```python
import numpy as np
from transformers import AutoTokenizer
from openvino import compile_model, save_model
from openvino_tokenizers import convert_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ov_tokenizer = convert_tokenizer(hf_tokenizer)

compiled_tokenizer = compile_model(ov_tokenizer)
text_input = ["Test string"]

hf_output = hf_tokenizer(text_input, return_tensors="np")
ov_output = compiled_tokenizer(text_input)

for output_name in hf_output:
    print(f"OpenVINO {output_name} = {ov_output[output_name]}")
    print(f"HuggingFace {output_name} = {hf_output[output_name]}")
# OpenVINO input_ids = [[ 101 3231 5164 102]]
# HuggingFace input_ids = [[ 101 3231 5164 102]]
# OpenVINO token_type_ids = [[0 0 0 0]]
# HuggingFace token_type_ids = [[0 0 0 0]]
# OpenVINO attention_mask = [[1 1 1 1]]
# HuggingFace attention_mask = [[1 1 1 1]]

# save tokenizer for later use
save_model(ov_tokenizer, "openvino_tokenizer.xml")

loaded_tokenizer = compile_model("openvino_tokenizer.xml")
loaded_ov_output = loaded_tokenizer(text_input)
for output_name in hf_output:
    assert np.all(loaded_ov_output[output_name] == ov_output[output_name])
```
### Connect Tokenizer to a Model
To infer and convert the original model, install torch or torch-cpu in the virtual environment.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from openvino import compile_model, convert_model
from openvino_tokenizers import convert_tokenizer, connect_models
checkpoint = "mrm8488/bert-tiny-finetuned-sms-spam-detection"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
text_input = ["Free money!!!"]
hf_input = hf_tokenizer(text_input, return_tensors="pt")
hf_output = hf_model(**hf_input)
ov_tokenizer = convert_tokenizer(hf_tokenizer)
ov_model = convert_model(hf_model, example_input=hf_input.data)
combined_model = connect_models(ov_tokenizer, ov_model)
compiled_combined_model = compile_model(combined_model)
openvino_output = compiled_combined_model(text_input)
print(f"OpenVINO logits: {openvino_output['logits']}")
# OpenVINO logits: [[ 1.2007061 -1.4698029]]
print(f"HuggingFace logits {hf_output.logits}")
# HuggingFace logits tensor([[ 1.2007, -1.4698]], grad_fn=)
```
### Use Extension With Converted (De)Tokenizer or Model With (De)Tokenizer
Importing `openvino_tokenizers` registers tokenizer-related operations with OpenVINO,
after which you can work with saved tokenizers and detokenizers.
```python
import numpy as np
import openvino_tokenizers
from openvino import Core
core = Core()
# detokenizer from codellama sentencepiece model
compiled_detokenizer = core.compile_model("detokenizer.xml")
token_ids = np.random.randint(100, 1000, size=(3, 5))
openvino_output = compiled_detokenizer(token_ids)
print(openvino_output["string_output"])
# ['sc�ouition�', 'intvenord hasient', 'g shouldwer M more']
```
### Text Generation Pipeline
```python
import numpy as np
from openvino import compile_model, convert_model
from openvino_tokenizers import add_greedy_decoding, convert_tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint, use_cache=False)
# convert hf tokenizer
text_input = ["Quick brown fox jumped "]
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
compiled_tokenizer = compile_model(ov_tokenizer)
# transform input text into tokens
ov_input = compiled_tokenizer(text_input)
hf_input = hf_tokenizer(text_input, return_tensors="pt")
# convert Pytorch model to OpenVINO IR and add greedy decoding pipeline to it
ov_model = convert_model(hf_model, example_input=hf_input.data)
ov_model_with_greedy_decoding = add_greedy_decoding(ov_model)
compiled_model = compile_model(ov_model_with_greedy_decoding)
# generate new tokens
new_tokens_size = 10
prompt_size = ov_input["input_ids"].shape[-1]
input_dict = {
    output.any_name: np.hstack([tensor, np.zeros(shape=(1, new_tokens_size), dtype=np.int_)])
    for output, tensor in ov_input.items()
}
for idx in range(prompt_size, prompt_size + new_tokens_size):
    output = compiled_model(input_dict)["token_ids"]
    input_dict["input_ids"][:, idx] = output[:, idx - 1]
    input_dict["attention_mask"][:, idx] = 1
ov_token_ids = input_dict["input_ids"]

hf_token_ids = hf_model.generate(
    **hf_input,
    min_new_tokens=new_tokens_size,
    max_new_tokens=new_tokens_size,
    temperature=0,  # greedy decoding
)
# decode model output
compiled_detokenizer = compile_model(ov_detokenizer)
ov_output = compiled_detokenizer(ov_token_ids)["string_output"]
hf_output = hf_tokenizer.batch_decode(hf_token_ids, skip_special_tokens=True)
print(f"OpenVINO output string: `{ov_output}`")
# OpenVINO output string: `['Quick brown fox was walking through the forest. He was looking for something']`
print(f"HuggingFace output string: `{hf_output}`")
# HuggingFace output string: `['Quick brown fox was walking through the forest. He was looking for something']`
```
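
To make the greedy decoding loop above easier to follow, here is a dependency-free sketch of the same idea: at every step, take the highest-scoring token at the last position and append it to the sequence. It is an illustration under simplifying assumptions, not how `add_greedy_decoding` is implemented; `greedy_next_tokens`, `generate`, and `toy_model` are hypothetical names, and the toy model stands in for a real language model.

```python
def greedy_next_tokens(logits):
    """For each position, return the index of the highest score (argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def generate(prompt_ids, model, new_tokens):
    """Append `new_tokens` greedily chosen token ids to the prompt."""
    ids = list(prompt_ids)
    for _ in range(new_tokens):
        logits = model(ids)  # one row of scores per input position
        ids.append(greedy_next_tokens(logits)[-1])  # prediction for the last position
    return ids


def toy_model(ids, vocab_size=5):
    """Hypothetical stand-in model: always favours (previous token + 1) % vocab_size."""
    return [
        [1.0 if token == (prev + 1) % vocab_size else 0.0 for token in range(vocab_size)]
        for prev in ids
    ]


print(generate([0, 1], toy_model, new_tokens=4))  # [0, 1, 2, 3, 4, 0]
```

The pipeline example differs in one practical detail: it pre-allocates a fixed-size input, writes each new token into the next free column, and extends the attention mask, instead of growing a Python list.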
### TensorFlow Text Integration
OpenVINO Tokenizers includes converters for certain TensorFlow Text operations.
Currently, only the MUSE model is supported.
Here is an example of model conversion and inference:
```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text # register tf text ops
from openvino import convert_model, compile_model
import openvino_tokenizers # register ov tokenizer ops and translators
sentences = ["dog", "I cuccioli sono carini.", "私は犬と一緒にビーチを散歩するのが好きです"]
tf_embed = hub.load(
    "https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/"
    "TensorFlow2/variations/multilingual/versions/2"
)
# convert model that uses Sentencepiece tokenizer op from TF Text
ov_model = convert_model(tf_embed)
ov_embed = compile_model(ov_model, "CPU")
ov_result = ov_embed(sentences)[ov_embed.output()]
tf_result = tf_embed(sentences)
assert np.all(np.isclose(ov_result, tf_result, atol=1e-4))
```
### RWKV Tokenizer
```python
from urllib.request import urlopen
from openvino import compile_model
from openvino_tokenizers import build_rwkv_tokenizer
rwkv_vocab_url = (
    "https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/tokenizer/rwkv_vocab_v20230424.txt"
)

with urlopen(rwkv_vocab_url) as vocab_file:
    vocab = map(bytes.decode, vocab_file)
    tokenizer, detokenizer = build_rwkv_tokenizer(vocab)
tokenizer, detokenizer = compile_model(tokenizer), compile_model(detokenizer)
print(tokenized := tokenizer(["Test string"])["input_ids"]) # [[24235 47429]]
print(detokenizer(tokenized)["string_output"]) # ['Test string']
```
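
The RWKV tokenizer is listed below as a trie-based model. As a rough illustration of what trie-based tokenization means, the following sketch does greedy longest-match lookup over a tiny made-up vocabulary. It is a conceptual sketch only: `build_trie`, `tokenize`, and `toy_vocab` are hypothetical, and the real tokenizer works on the RWKV vocabulary file and is implemented inside the OpenVINO extension.

```python
def build_trie(vocab):
    """Build a nested-dict trie from a {token string: token id} mapping."""
    root = {}
    for token, token_id in vocab.items():
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node[None] = token_id  # the None key marks the end of a complete token
    return root


def tokenize(text, trie):
    """Split text into token ids, always taking the longest matching token."""
    ids = []
    pos = 0
    while pos < len(text):
        node = trie
        best_id, best_end = None, pos
        for end in range(pos, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no vocabulary entry continues with this character
            if None in node:
                best_id, best_end = node[None], end + 1
        if best_id is None:
            raise ValueError(f"no token matches at position {pos}")
        ids.append(best_id)
        pos = best_end
    return ids


# Made-up vocabulary for illustration only.
toy_vocab = {"T": 0, "e": 1, "s": 2, "t": 3, "Te": 4, "Test": 5, " ": 6, "st": 7}
print(tokenize("Test st", build_trie(toy_vocab)))  # [5, 6, 7]
```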
### Tokenizer From GGUF Model
```python
from transformers import AutoTokenizer
import openvino as ov
from openvino_tokenizers import convert_tokenizer
model_id = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF"
filename = "DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf"
hf_tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
ov_tokenizer, ov_detokenizer = ov.compile_model(ov_tokenizer), ov.compile_model(ov_detokenizer)
print(ov_res := ov_tokenizer(["Test string"])["input_ids"]) # [[2271 914]]
print(ov_detokenizer(ov_res)["string_output"]) # ['Test string']
```
### C++ Usage Example
This example shows how to run inference with C++ on a text-classification model from Hugging Face. It
expects the path to a model directory as a parameter and prints the logits returned by model inference.
Export an example model by running the following command after `pip install optimum[openvino]`:
```sh
optimum-cli export openvino microsoft/deberta-base-mnli deberta-base-mnli-ov
```
```cpp
#include <filesystem>
#include <iostream>
#include <string>

#include <openvino/openvino.hpp>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <model_dir>" << std::endl;
        return 1;
    }
    std::string dirname = argv[1];
    std::filesystem::path dir_path(dirname);
    std::filesystem::path model_xml = dir_path / "openvino_model.xml";
    std::filesystem::path tokenizer_xml = dir_path / "openvino_tokenizer.xml";

    ov::Core core;
    // use "openvino_tokenizers.dll" on Windows, "libopenvino_tokenizers.dylib" on macOS
    core.add_extension("libopenvino_tokenizers.so");

    // tokenize the prompt
    ov::InferRequest tokenizer_request = core.compile_model(tokenizer_xml, "CPU").create_infer_request();
    std::string prompt = "Hello world!";
    tokenizer_request.set_input_tensor(ov::Tensor{ov::element::string, {1}, &prompt});
    tokenizer_request.infer();
    ov::Tensor input_ids = tokenizer_request.get_tensor("input_ids");
    ov::Tensor attention_mask = tokenizer_request.get_tensor("attention_mask");

    // run the classification model on the tokenizer outputs
    ov::InferRequest infer_request = core.compile_model(model_xml, "CPU").create_infer_request();
    infer_request.set_tensor("input_ids", input_ids);
    infer_request.set_tensor("attention_mask", attention_mask);
    infer_request.infer();

    // print the logits
    auto output = infer_request.get_tensor("logits");
    const float* output_buffer = output.data<float>();
    size_t num_elements = output.get_size();
    for (size_t i = 0; i < num_elements; i++) {
        std::cout << output_buffer[i] << " ";
    }
    std::cout << std::endl;
    return 0;
}
```
### Unicode Support
- OpenVINO Tokenizers support UTF-8 encoded inputs.
- Internal tokenizer vocabulary is stored in UTF-8 encoding:
  - Providing a tokenizer model with non-UTF-8 input may lead to unexpected outputs or errors.
  - Detokenizer output is UTF-8 encoded; if your terminal does not expect UTF-8, you might see garbage characters.
- By default, a detokenizer replaces invalid UTF-8 output with the � character. You can change this behavior during conversion.
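
The � replacement can be reproduced with plain Python, which helps explain outputs such as the random-token detokenizer example earlier. The snippet below is only an analogy using Python's own UTF-8 decoder, not the detokenizer's implementation: it cuts a multi-byte character in half, as can happen when token boundaries do not align with character boundaries.

```python
raw = "naïve".encode("utf-8")  # "ï" is encoded as the two bytes 0xC3 0xAF
truncated = raw[:3]            # b"na\xc3": the second byte of "ï" is missing

# Replace invalid sequences with U+FFFD, similar to the default described above.
decoded = truncated.decode("utf-8", errors="replace")
print(decoded)  # na�

# Strict decoding reports the same bytes as an error instead.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)  # True
```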
## Supported Tokenizer Types
| Huggingface Tokenizer Type      | Tokenizer Model Type | Tokenizer | Detokenizer |
|---------------------------------|----------------------|----------|-----------|
| Fast | WordPiece | ✅ | ✅ |
| | BPE | ✅ | ✅ |
| | Unigram | ✅ | ✅ |
| | WordLevel* | ✅ | ✅ |
| Legacy | SentencePiece .model | ✅ | ✅ |
| Custom | tiktoken | ✅ | ✅ |
| RWKV | Trie | ✅ | ✅ |
## Test Results
This report is autogenerated and includes tokenizer and detokenizer tests. The `Output Matched, %` column shows the percentage of test strings for which the results of OpenVINO and Huggingface Tokenizers are the same. To update the report, run `pytest --update_readme tokenizers_test.py` in the `tests` directory.
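
As a reading aid for the `Output Matched, %` column, the metric can be sketched as the share of test strings whose two outputs are identical. This is an assumption about the arithmetic based on the description above, not the test suite's actual code; `output_matched_percent` and the sample data are hypothetical.

```python
def output_matched_percent(ov_outputs, hf_outputs):
    """Percentage of test strings with identical OpenVINO and Hugging Face outputs."""
    if len(ov_outputs) != len(hf_outputs):
        raise ValueError("both lists must cover the same test strings")
    if not ov_outputs:
        raise ValueError("at least one test string is required")
    matched = sum(ov == hf for ov, hf in zip(ov_outputs, hf_outputs))
    return round(100 * matched / len(ov_outputs), 2)


# Made-up token id lists for four test strings; the third one differs.
ov = [[101, 7], [101, 8, 9], [101, 3], [101]]
hf = [[101, 7], [101, 8, 9], [101, 4], [101]]
print(output_matched_percent(ov, hf))  # 75.0
```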
### Output Match by Tokenizer Type
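| Tokenizer Type | Output Matched, % | Number of Tests |
|----------------|-------------------|-----------------|
| BPE            | 99.28             | 5827            |
| SentencePiece  | 89.82             | 5157            |
| Tiktoken       | 96.56             | 524             |
| Unigram        | 95.24             | 1470            |
| WordLevel      | 98.96             | 192             |
| WordPiece      | 99.07             | 1289            |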
### Output Match by Model
| Tokenizer Type | Model | Output Matched, % | Number of Tests |
|----------------|-------|-------------------|-----------------|
| BPE | NousResearch/Llama-2-13b-hf | 97.55 | 245 |
| BPE | NousResearch/Meta-Llama-3-8B-Instruct | 100.00 | 247 |
| BPE | Salesforce/codegen-16B-multi | 100.00 | 261 |
| BPE | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 100.00 | 247 |
| BPE | Xenova/gpt-4o | 100.00 | 261 |
| BPE | ai-forever/rugpt3large_based_on_gpt2 | 100.00 | 261 |
| BPE | allenai/OLMo-1B-hf | 100.00 | 245 |
| BPE | answerdotai/ModernBERT-base | 100.00 | 261 |
| BPE | bigscience/bloom | 97.55 | 245 |
| BPE | deepseek-ai/DeepSeek-V3-0324 | 99.24 | 263 |
| BPE | deepseek-ai/deepseek-coder-6.7b-instruct | 99.24 | 263 |
| BPE | facebook/galactica-120b | 100.00 | 245 |
| BPE | gpt2 | 100.00 | 261 |
| BPE | koalajun/Gemma-2-9b-it-Ko-Crypto-Translate | 100.00 | 247 |
| BPE | laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 100.00 | 261 |
| BPE | microsoft/Phi-3-mini-128k-instruct | 100.00 | 247 |
| BPE | microsoft/deberta-base | 100.00 | 245 |
| BPE | mlx-community/quantized-gemma-7b-it | 97.57 | 247 |
| BPE | roberta-base | 100.00 | 261 |
| BPE | stabilityai/stablecode-completion-alpha-3b-4k | 100.00 | 245 |
| BPE | stabilityai/stablelm-2-1_6b | 100.00 | 245 |
| BPE | tiiuae/Falcon3-7B-Instruct | 96.20 | 263 |
| BPE | tiiuae/falcon-7b | 96.17 | 261 |
| SentencePiece | BAAI/bge-reranker-v2-m3 | 96.73 | 245 |
| SentencePiece | BAAI/bge-reranker-v2-m3_legacy | 96.73 | 245 |
| SentencePiece | NousResearch/Llama-2-13b-hf | 94.29 | 245 |
| SentencePiece | NousResearch/Llama-2-13b-hf_legacy | 97.55 | 245 |
| SentencePiece | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 100.00 | 247 |
| SentencePiece | TinyLlama/TinyLlama-1.1B-Chat-v1.0_legacy | 98.38 | 247 |
| SentencePiece | baichuan-inc/Baichuan2-7B-Chat_legacy | 100.00 | 245 |
| SentencePiece | camembert-base | 55.10 | 245 |
| SentencePiece | camembert-base_legacy | 78.37 | 245 |
| SentencePiece | facebook/musicgen-small | 82.45 | 245 |
| SentencePiece | facebook/musicgen-small_legacy | 77.14 | 245 |
| SentencePiece | google/flan-t5-xxl | 75.92 | 245 |
| SentencePiece | google/flan-t5-xxl_legacy | 75.51 | 245 |
| SentencePiece | microsoft/Phi-3-mini-128k-instruct | 99.19 | 247 |
| SentencePiece | microsoft/Phi-3-mini-128k-instruct_legacy | 97.57 | 247 |
| SentencePiece | microsoft/deberta-v3-base | 95.10 | 245 |
| SentencePiece | microsoft/deberta-v3-base_legacy | 98.37 | 245 |
| SentencePiece | mlx-community/quantized-gemma-7b-it | 96.76 | 247 |
| SentencePiece | mlx-community/quantized-gemma-7b-it_legacy | 97.57 | 247 |
| SentencePiece | rinna/bilingual-gpt-neox-4b | 83.67 | 245 |
| SentencePiece | rinna/bilingual-gpt-neox-4b_legacy | 89.39 | 245 |
| Tiktoken | Qwen/Qwen-14B-Chat | 100.00 | 261 |
| Tiktoken | THUDM/glm-4-9b-chat | 93.16 | 263 |
| Unigram | BAAI/bge-reranker-v2-m3 | 98.37 | 245 |
| Unigram | camembert-base | 84.49 | 245 |
| Unigram | facebook/musicgen-small | 98.37 | 245 |
| Unigram | google/flan-t5-xxl | 91.84 | 245 |
| Unigram | microsoft/deberta-v3-base | 98.37 | 245 |
| Unigram | rinna/bilingual-gpt-neox-4b | 100.00 | 245 |
| WordLevel | cisco-ai/mini-bart-g2p | 98.96 | 192 |
| WordPiece | bert-base-multilingual-cased | 100.00 | 261 |
| WordPiece | cointegrated/rubert-tiny2 | 100.00 | 261 |
| WordPiece | google/mobilebert-uncased | 100.00 | 245 |
| WordPiece | rasa/LaBSE | 95.40 | 261 |
| WordPiece | sentence-transformers/all-MiniLM-L6-v2 | 100.00 | 261 |
### Recreating Tokenizers From Tests
For some tokenizers, you need to select certain settings so that their output is closer to that of the Huggingface tokenizers:
- The `THUDM/chatglm3-6b` detokenizer does not skip special tokens. Use `skip_special_tokens=False` during conversion.
- All tested tiktoken-based detokenizers leave extra spaces. Use `clean_up_tokenization_spaces=False` during conversion.