https://github.com/bluryar/tokenizers.cpp

Native C++ inference-only tokenizer runtime port
cpp ggml huggingface inference tokenizers

Host: GitHub
URL: https://github.com/bluryar/tokenizers.cpp
Owner: bluryar
License: apache-2.0
Created: 2026-05-07T04:55:46.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-05-15T03:47:59.000Z (about 2 months ago)
Last Synced: 2026-05-15T05:45:56.736Z (about 2 months ago)
Topics: cpp, ggml, huggingface, inference, tokenizers
Language: C++
Size: 42 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
- Roadmap: docs/roadmap.md
- Notice: NOTICE
- Agents: AGENTS.md

README

# tokenizers.cpp

Native C++ inference-only port of the Hugging Face `tokenizers` runtime surface
needed by downstream GGML/C++ projects.

This is not a Rust FFI wrapper. Runtime use does not require Rust, Python,
network access, trainers, HTTP/from-pretrained loading, or wrapper bindings.

## Repository Setup

This repository expects the Hugging Face Rust tokenizer reference as a git
submodule:

```sh
git submodule update --init --recursive
```

`third_party/tokenizers` is the read-only Hugging Face Rust reference used for
development parity.

ICU4C is vendored as pinned upstream release archives, not as a submodule:

- `third_party/icu4c-78.3/icu4c-78.3-sources.tgz`
- `third_party/icu4c-78.3/icu4c-78.3-data.zip`
- `third_party/icu4c-78.3/SHASUM512.txt`

The generated ICU install prefix is intentionally not committed:

```sh
scripts/dev/build_vendored_icu4c.sh
```

That script verifies the release inputs exist, extracts the source archive under
`build/`, and installs ICU under `third_party/icu4c-install`, which is the
default `TOKENIZERS_CPP_ICU_ROOT`.

## Build

```sh
cmake -S . -B build-icu \
  -DTOKENIZERS_CPP_BUILD_TESTS=ON \
  -DTOKENIZERS_CPP_FETCH_DEPS=OFF
cmake --build build-icu
ctest --test-dir build-icu --output-on-failure
```

The supported default build uses the vendored/static ICU4C install under
`third_party/icu4c-install` and includes a Linux audit that rejects accidental
shared `libicu*.so` runtime linkage.

Some parity tests use Hugging Face tokenizer JSON fixtures from the local
`hf-internal-testing/tokenizers-test-data` checkout under this project root.
Those tests are enabled by default only when `TOKENIZERS_CPP_HF_TEST_DATA_DIR`
exists. The self-contained tests, examples, install/export smoke tests, and
generated fixtures do not require that checkout.

## Unicode And Regex Backend

The default backend is vendored/static ICU4C. The upstream Rust tokenizer
runtime uses a combination of `onig`, `regex`, Unicode normalization/category
crates, and SentencePiece precompiled data, so the C++ port needs broad Unicode
normalization, lowercase/category checks, regex Unicode properties, lookahead,
and offset projection. ICU covers that full surface behind the private
`tokenizers_cpp::unicode` backend without depending on OS `libicu*.so`/`.dll`
packages at runtime.
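To make "offset projection" concrete, here is a minimal sketch of one form it can take: converting a byte span in a UTF-8 string into a code point span. It is not taken from the tokenizers.cpp sources. The names `byte_to_char_map`, `project_to_chars`, and `Span` are hypothetical, and the sketch assumes well-formed UTF-8 input with spans that fall on code point boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type, not part of the tokenizers.cpp public API.
struct Span {
    std::size_t begin;  // inclusive
    std::size_t end;    // exclusive
};

// Maps every UTF-8 byte index to the index of the code point containing it.
// The extra final entry maps the end-of-string byte offset to the total
// number of code points.
std::vector<std::size_t> byte_to_char_map(const std::string& utf8) {
    std::vector<std::size_t> map(utf8.size() + 1, 0);
    std::size_t next_char = 0;  // index the next lead byte will receive
    std::size_t current = 0;    // index of the code point being scanned
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((b & 0xC0) != 0x80) {
            current = next_char++;
        }
        map[i] = current;
    }
    map[utf8.size()] = next_char;
    return map;
}

// Projects a byte span onto code point offsets using a precomputed map.
Span project_to_chars(const std::vector<std::size_t>& map, Span bytes) {
    assert(bytes.begin <= bytes.end && bytes.end < map.size());
    return Span{map[bytes.begin], map[bytes.end]};
}
```

For the six-byte string `"h\xC3\xA9llo"` ("héllo"), the byte span `[1, 3)` that covers `é` projects to the code point span `[1, 2)`.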

`utf8proc` is a good future candidate for an optional lightweight Unicode
backend because it is small and covers UTF-8 normalization, case folding,
grapheme helpers, and Unicode categories. It is not a drop-in replacement for
the current default because it does not provide regex, and tokenizer parity
would still need a separate regex engine plus offset-projection glue.

RE2 may be useful later as a private optional fast path for RE2-compatible
serialized regexes, but it is not the default regex backend. It does not support
lookaround assertions, while common tokenizer patterns such as the GPT-2
ByteLevel regex require negative lookahead.
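The effect of that negative lookahead can be seen in a small standalone sketch. It uses `std::regex` rather than ICU or this library's backend, and an ASCII-only approximation of the GPT-2 pattern: the real pattern uses `\p{L}` and `\p{N}`, which `std::regex` does not support, so they are narrowed to `[A-Za-z]` and `[0-9]` here. The function name `split_gpt2_ascii` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits `text` with an ASCII-only approximation of the GPT-2 split pattern.
// When `use_lookahead` is false, the `\s+(?!\S)` alternative is dropped, which
// mimics what an engine without lookaround support would be limited to.
std::vector<std::string> split_gpt2_ascii(const std::string& text,
                                          bool use_lookahead = true) {
    static const std::regex with_lookahead(
        R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+)");
    static const std::regex without_lookahead(
        R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+)");
    const std::regex& pattern = use_lookahead ? with_lookahead : without_lookahead;

    std::vector<std::string> pieces;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern); it != end; ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}
```

For the input `"Hello  world"` (two spaces), the lookahead variant yields `"Hello"`, `" "`, `" world"`: the whitespace run stops one character early so the last space stays attached to the following word. Without the lookahead the pieces are `"Hello"`, `"  "`, `"world"`, which would change the resulting token IDs.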

## CMake Consumption

Source-tree use:

```cmake
set(TOKENIZERS_CPP_BUILD_TESTS OFF CACHE BOOL "" FORCE)
set(TOKENIZERS_CPP_BUILD_EXAMPLES OFF CACHE BOOL "" FORCE)
add_subdirectory(path/to/tokenizers.cpp)
target_link_libraries(app PRIVATE tokenizers_cpp::tokenizers_cpp)
```

Installed-package use:

```cmake
find_package(tokenizers_cpp CONFIG REQUIRED)
target_link_libraries(app PRIVATE tokenizers_cpp::tokenizers_cpp)
```

See `docs/integration.md` for the full consumer guide.

## Public API

The public API is tokenizer-centered:

- `Tokenizer::from_file`
- `Tokenizer::from_bpe_files`
- raw, pair, batch, pre-tokenized, and char-offset encode APIs
- `decode`, `decode_batch`, and `decode_stream`
- `Encoding`, `Offset`, `DecodeStream`, `AddedToken`, and `BpeOptions`

Internal component configs and model details are intentionally private. See
`docs/api-stability.md`.
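The signatures of these entry points are not reproduced here, so the following sketch does not call the library. It only illustrates one general reason a streaming decode interface tends to carry state: with byte-level tokens, a multi-byte UTF-8 character can be split across consecutive tokens, and emitting the partial bytes would produce invalid text. The class `Utf8ChunkBuffer` is hypothetical and is not the library's `DecodeStream`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration only; not the tokenizers.cpp DecodeStream.
class Utf8ChunkBuffer {
public:
    // Appends raw bytes and returns the longest prefix that ends on a complete
    // UTF-8 sequence. An incomplete trailing sequence stays buffered.
    std::string push(const std::string& bytes) {
        pending_ += bytes;
        const std::size_t cut = complete_prefix_length(pending_);
        std::string out = pending_.substr(0, cut);
        pending_.erase(0, cut);
        return out;
    }

    bool has_pending() const { return !pending_.empty(); }

private:
    // Number of bytes a sequence starting with `lead` is expected to occupy.
    static std::size_t expected_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 1;  // invalid lead byte: treat it as a single unit
    }

    static std::size_t complete_prefix_length(const std::string& s) {
        // Look back over at most four bytes to find the last lead byte.
        std::size_t i = s.size();
        std::size_t seen = 0;
        while (i > 0 && seen < 4) {
            --i;
            ++seen;
            const unsigned char b = static_cast<unsigned char>(s[i]);
            if ((b & 0xC0) != 0x80) {
                // Found a lead byte: keep it buffered if its sequence is short.
                return seen >= expected_length(b) ? s.size() : i;
            }
        }
        return s.size();  // empty input or malformed tail: pass through as-is
    }

    std::string pending_;
};
```

Pushing `"a\xE2\x82"` returns `"a"` and holds the two partial bytes; pushing `"\xAC"` afterwards returns the complete three-byte sequence for `€`.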

## Examples

Self-contained examples live under `examples/`:

- `basic_encode_decode.cpp`
- `batch_and_padding.cpp`
- `stream_decode.cpp`

These examples write temporary tokenizer JSON files and do not depend on the HF
test-data checkout.

## License

`tokenizers.cpp` is released under the Apache License 2.0. See `LICENSE`,
`NOTICE`, and `THIRD_PARTY_NOTICES.md` for upstream and vendored dependency
notices.
