{"id":51235776,"url":"https://github.com/bluryar/tokenizers.cpp","last_synced_at":"2026-06-28T20:02:49.524Z","repository":{"id":356233903,"uuid":"1231573916","full_name":"bluryar/tokenizers.cpp","owner":"bluryar","description":"Native C++ inference-only tokenizer runtime port","archived":false,"fork":false,"pushed_at":"2026-05-15T03:47:59.000Z","size":44004,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T05:45:56.736Z","etag":null,"topics":["cpp","ggml","huggingface","inference","tokenizers"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bluryar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-07T04:55:46.000Z","updated_at":"2026-05-15T03:48:04.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/bluryar/tokenizers.cpp","commit_stats":null,"previous_names":["bluryar/tokenizers.cpp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bluryar/tokenizers.cpp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluryar%2Ftokenizers.cpp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluryar%2Ftokenizers.cpp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluryar%2Ftokenizers.cpp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluryar%2Ftokenizers.cpp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bluryar","download_url":"https://codeload.github.com/bluryar/tokenizers.cpp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluryar%2Ftokenizers.cpp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34901959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","ggml","huggingface","inference","tokenizers"],"created_at":"2026-06-28T20:02:48.909Z","updated_at":"2026-06-28T20:02:49.513Z","avatar_url":"https://github.com/bluryar.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tokenizers.cpp\n\nNative C++ inference-only port of the Hugging Face `tokenizers` runtime surface\nneeded by downstream GGML/C++ projects.\n\nThis is not a Rust FFI wrapper. Runtime use does not require Rust, Python,\nnetwork access, trainers, HTTP/from-pretrained loading, or wrapper bindings.\n\n## Repository Setup\n\nThis repository expects the Hugging Face Rust tokenizer reference as a git\nsubmodule:\n\n```sh\ngit submodule update --init --recursive\n```\n\n`third_party/tokenizers` is the read-only Hugging Face Rust reference used for\ndevelopment parity.\n\nICU4C is vendored as pinned upstream release archives, not as a submodule:\n\n- `third_party/icu4c-78.3/icu4c-78.3-sources.tgz`\n- `third_party/icu4c-78.3/icu4c-78.3-data.zip`\n- `third_party/icu4c-78.3/SHASUM512.txt`\n\nThe generated ICU install prefix is intentionally not committed:\n\n```sh\nscripts/dev/build_vendored_icu4c.sh\n```\n\nThat script verifies the release inputs exist, extracts the source archive under\n`build/`, and installs ICU under `third_party/icu4c-install`, which is the\ndefault `TOKENIZERS_CPP_ICU_ROOT`.\n\n## Build\n\n```sh\ncmake -S . -B build-icu \\\n  -DTOKENIZERS_CPP_BUILD_TESTS=ON \\\n  -DTOKENIZERS_CPP_FETCH_DEPS=OFF\ncmake --build build-icu\nctest --test-dir build-icu --output-on-failure\n```\n\nThe supported default build uses the vendored/static ICU4C install under\n`third_party/icu4c-install` and includes a Linux audit that rejects accidental\nshared `libicu*.so` runtime linkage.\n\nSome parity tests use Hugging Face tokenizer JSON fixtures from the local\n`hf-internal-testing/tokenizers-test-data` checkout under this project root.\nThose tests are enabled by default only when `TOKENIZERS_CPP_HF_TEST_DATA_DIR`\nexists. The self-contained tests, examples, install/export smoke tests, and\ngenerated fixtures do not require that checkout.\n\n## Unicode And Regex Backend\n\nThe default backend is vendored/static ICU4C. The upstream Rust tokenizer\nruntime uses a combination of `onig`, `regex`, Unicode normalization/category\ncrates, and SentencePiece precompiled data, so the C++ port needs broad Unicode\nnormalization, lowercase/category checks, regex Unicode properties, lookahead,\nand offset projection. ICU covers that full surface behind the private\n`tokenizers_cpp::unicode` backend without depending on OS `libicu*.so`/`.dll`\npackages at runtime.\n\n`utf8proc` is a good future candidate for an optional lightweight Unicode\nbackend because it is small and covers UTF-8 normalization, case folding,\ngrapheme helpers, and Unicode categories. It is not a drop-in replacement for\nthe current default because it does not provide regex, and tokenizer parity\nwould still need a separate regex engine plus offset-projection glue.\n\nRE2 may be useful later as a private optional fast path for RE2-compatible\nserialized regexes, but it is not the default regex backend. It does not support\nlookaround assertions, while common tokenizer patterns such as the GPT-2\nByteLevel regex require negative lookahead.\n\n## CMake Consumption\n\nSource-tree use:\n\n```cmake\nset(TOKENIZERS_CPP_BUILD_TESTS OFF CACHE BOOL \"\" FORCE)\nset(TOKENIZERS_CPP_BUILD_EXAMPLES OFF CACHE BOOL \"\" FORCE)\nadd_subdirectory(path/to/tokenizers.cpp)\ntarget_link_libraries(app PRIVATE tokenizers_cpp::tokenizers_cpp)\n```\n\nInstalled-package use:\n\n```cmake\nfind_package(tokenizers_cpp CONFIG REQUIRED)\ntarget_link_libraries(app PRIVATE tokenizers_cpp::tokenizers_cpp)\n```\n\nSee `docs/integration.md` for the full consumer guide.\n\n## Public API\n\nThe public API is tokenizer-centered:\n\n- `Tokenizer::from_file`\n- `Tokenizer::from_bpe_files`\n- raw, pair, batch, pre-tokenized, and char-offset encode APIs\n- `decode`, `decode_batch`, and `decode_stream`\n- `Encoding`, `Offset`, `DecodeStream`, `AddedToken`, and `BpeOptions`\n\nInternal component configs and model details are intentionally private. See\n`docs/api-stability.md`.\n\n## Examples\n\nSelf-contained examples live under `examples/`:\n\n- `basic_encode_decode.cpp`\n- `batch_and_padding.cpp`\n- `stream_decode.cpp`\n\nThese examples write temporary tokenizer JSON files and do not depend on the HF\ntest-data checkout.\n\n## License\n\n`tokenizers.cpp` is released under the Apache License 2.0. See `LICENSE`,\n`NOTICE`, and `THIRD_PARTY_NOTICES.md` for upstream and vendored dependency\nnotices.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluryar%2Ftokenizers.cpp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbluryar%2Ftokenizers.cpp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluryar%2Ftokenizers.cpp/lists"}