{"id":36236331,"url":"https://github.com/ml-rust/splintr","last_synced_at":"2026-01-11T06:00:14.930Z","repository":{"id":326157209,"uuid":"1104242691","full_name":"ml-rust/splintr","owner":"ml-rust","description":"A high-performance BPE tokenizer built with Rust with Python bindings, focused on speed, safety, and resource optimization.","archived":false,"fork":false,"pushed_at":"2025-12-24T06:55:02.000Z","size":6450,"stargazers_count":55,"open_issues_count":1,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-25T19:34:55.653Z","etag":null,"topics":["huggingface","llm","machine-learning","openai","rust","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ml-rust.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-26T00:25:17.000Z","updated_at":"2025-12-24T06:53:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ml-rust/splintr","commit_stats":null,"previous_names":["farhan-syah/splintr"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/ml-rust/splintr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fsplintr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fsplintr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fsplintr/releases","manifests_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fsplintr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ml-rust","download_url":"https://codeload.github.com/ml-rust/splintr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Fsplintr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28293188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-11T04:44:51.577Z","status":"ssl_error","status_checked_at":"2026-01-11T04:44:44.232Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["huggingface","llm","machine-learning","openai","rust","tokenizer"],"created_at":"2026-01-11T06:00:13.751Z","updated_at":"2026-01-11T06:00:14.923Z","avatar_url":"https://github.com/ml-rust.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Splintr](images/splntr.png)\n\n[![Crates.io](https://img.shields.io/crates/v/splintr.svg)](https://crates.io/crates/splintr) [![PyPI](https://img.shields.io/pypi/v/splintr-rs.svg)](https://pypi.org/project/splintr-rs/) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n**A high-performance BPE tokenizer built with Rust with Python bindings, focused on speed, safety, and resource optimization.**\n\n## The 
Problem\n\nTokenization is everywhere in modern AI. Whether you're building LLM applications, training models, or processing data pipelines, you're tokenizing text constantly. But existing tokenizers have a problem: they're slow.\n\nWhen you need to tokenize batches of prompts, documents, or training data, you're stuck waiting. Python-based tokenizers can't fully leverage modern multi-core CPUs. You need something faster.\n\n## The Solution\n\nSplintr brings Rust performance to Python. Built from the ground up for speed and efficiency:\n\n![Batch Encoding Throughput](images/benchmark_batch.png)\n\n| Configuration | Splintr      | Tiktoken | HuggingFace | TokenDagger |\n| ------------- | ------------ | -------- | ----------- | ----------- |\n| 1,000 texts   | **111 MB/s** | 9 MB/s   | 28 MB/s     | 9 MB/s      |\n| 500 texts     | **107 MB/s** | 10 MB/s  | 27 MB/s     | 8 MB/s      |\n| 100 texts     | **69 MB/s**  | 7 MB/s   | 20 MB/s     | 6 MB/s      |\n\n**10-12x faster than tiktoken. 4x faster than HuggingFace. 
Built in Rust, accessible from Python.**\n\n## Quick Start\n\n### Python\n\n```bash\npip install splintr-rs\n```\n\n```python\nfrom splintr import Tokenizer\n\n# Load a pretrained vocabulary\ntokenizer = Tokenizer.from_pretrained(\"cl100k_base\")  # OpenAI GPT-4/3.5\n# tokenizer = Tokenizer.from_pretrained(\"llama3\")      # Meta Llama 3 family\n# tokenizer = Tokenizer.from_pretrained(\"deepseek_v3\") # DeepSeek V3/R1\n# tokenizer = Tokenizer.from_pretrained(\"mistral_v1\")  # Mistral 7B v0.1/v0.2\n# tokenizer = Tokenizer.from_pretrained(\"mistral_v2\")  # Mistral 7B v0.3, Codestral\n# tokenizer = Tokenizer.from_pretrained(\"mistral_v3\")  # Mistral NeMo, Large 2\n\n# Encode and decode\ntokens = tokenizer.encode(\"Hello, world!\")\ntext = tokenizer.decode(tokens)\n\n# Batch encode (10-12x faster)\ntexts = [\"Hello, world!\", \"How are you?\", \"Machine learning is fun!\"]\nbatch_tokens = tokenizer.encode_batch(texts)\n```\n\nSee the [API Guide](docs/api_guide.md) for complete documentation and examples.\n\n### Rust\n\n```toml\n[dependencies]\nsplintr = \"*\"  # or pin to a specific version\n```\n\n```rust\nuse splintr::{Tokenizer, CL100K_BASE_PATTERN};\n\nlet tokenizer = Tokenizer::new(encoder, special_tokens, CL100K_BASE_PATTERN)?;\nlet tokens = tokenizer.encode(\"Hello, world!\");\nlet batch_tokens = tokenizer.encode_batch(\u0026texts);\n```\n\nSee the [API Guide](docs/api_guide.md) and [docs.rs](https://docs.rs/splintr) for complete Rust documentation.\n\n## Key Features\n\n**Performance where it matters:**\n\n- **12x faster batch encoding** - Parallel processing across multiple texts using Rayon\n- **3-4x faster single text encoding** - Optimized sequential algorithm for typical use cases\n- **Smart parallelization** - Sequential for small texts (\u003c1MB), parallel for large datasets\n- **LRU caching** - Avoid redundant encoding of frequently seen text chunks\n\n**Built for production:**\n\n- **Compatible vocabularies** - Supports cl100k_base, o200k_base 
(OpenAI), Llama 3 family (Meta), DeepSeek V3 (DeepSeek), and Mistral V1/V2/V3 (Mistral AI)\n- **Streaming decoders** - Real-time LLM output display with proper UTF-8 handling ([guide](docs/api_guide.md#streaming-decoder))\n- **54 agent tokens** - Built-in support for chat, CoT reasoning, ReAct agents, tool calling, RAG citations ([docs](docs/special_tokens.md))\n- **Battle-tested algorithms** - Regexr with JIT (pure Rust), Aho-Corasick for special tokens, linked-list BPE\n\n**Cross-platform:**\n\n- Python bindings via PyO3 (Linux, macOS, Windows)\n- Native Rust library for maximum performance\n\n## Performance Deep Dive\n\nAll benchmarks were run on Linux (6.16.8-arch3-1) with 24 CPU cores, comparing against tiktoken (reference Python implementation), Hugging Face tokenizers, and TokenDagger.\n\n### Single Text Encoding\n\nFor single texts, splintr achieves **3-4x faster** encoding across various text sizes:\n\n![Single Text Encoding Comparison](images/benchmark_single.png)\n\n**Latency by content type:**\n\n![Latency Comparison](images/benchmark_single_latency.png)\n\nConsistent low latency across Python code, JSON, English prose, and Chinese text makes splintr ideal for interactive applications and real-time processing.\n\n### Batch Encoding\n\nThe real magic happens with batches. Splintr parallelizes across texts to achieve a **10-12x speedup**:\n\n![Batch Speedup vs Tiktoken](images/benchmark_batch_speedup.png)\n\nSpeedups are higher on larger batches, where parallelization overhead is amortized. 
Perfect for:\n\n- Training data preprocessing\n- Bulk document tokenization\n- API batch processing\n- Data pipeline throughput\n\n### Design Decision: Sequential by Default\n\nSplintr uses **sequential encoding for single texts** and **parallel encoding across batches** based on empirical benchmarking:\n\n![Sequential vs Rayon Internal Parallelization](images/benchmark_splintr.png)\n\n**Key findings:**\n\n- Sequential is faster for texts up to ~1MB (typical LLM prompts and documents)\n- Rayon's parallelization overhead only pays off at ~1MB+ text sizes\n- Most real-world inputs are well under 1MB\n- `encode()` uses sequential processing for optimal single-text performance\n- `encode_batch()` parallelizes across multiple texts for maximum throughput\n- `encode_rayon()` available for the rare cases where you have \u003e1MB single texts\n\nThis architecture ensures splintr is optimized for the most common tokenization patterns in LLM applications.\n\n### Running Benchmarks Yourself\n\n```bash\n# Clone and install\ngit clone https://github.com/ml-rust/splintr.git\ncd splintr\npip install -e .\npip install tiktoken\n\n# Run the benchmark suite\ncd benchmarks\npython benchmark.py --model cl100k_base --output results/my_benchmark.json\n\n# View results\ncat results/my_benchmark.md\n```\n\nThe benchmark suite tests single text encoding, batch encoding, streaming decoder performance, and special token handling across various content types.\n\n### Regex Backends\n\nSplintr uses a pure-Rust regex engine ([`regexr`](https://crates.io/crates/regexr)) by default, with optional PCRE2 support for compatibility.\n\n**Default Backend (regexr):**\n\n- Pure Rust implementation (no C dependencies)\n- JIT compilation and SIMD acceleration\n- Native UTF-8 and Unicode property support\n\n**Optional PCRE2 Backend:**\n\n```python\nfrom splintr import Tokenizer\n\n# Default: regexr backend (pure Rust)\ntokenizer = Tokenizer.from_pretrained(\"cl100k_base\")\n\n# Optional: switch to PCRE2 
(requires --features pcre2)\ntokenizer = Tokenizer.from_pretrained(\"cl100k_base\").pcre2(True)\n```\n\nTo enable PCRE2, build with the feature flag:\n\n```bash\nmaturin develop --release --features pcre2\n```\n\n**Benchmarking:**\n\n```bash\n# Compare backends (requires PCRE2 feature)\npython benchmarks/benchmark_regexr_comparison.py --model cl100k_base\n\n# Visual comparison with charts\npython benchmarks/benchmark_regexr_viz.py --model cl100k_base\n```\n\n## Streaming Decoders\n\nFor real-time LLM applications where tokens arrive one at a time, Splintr provides streaming decoders that handle UTF-8 boundary alignment:\n\n```python\n# Regular streaming decoder (cl100k_base, o200k_base, llama3)\ndecoder = tokenizer.streaming_decoder()\n\n# ByteLevel streaming decoder (deepseek_v3, GPT-2)\ndecoder = tokenizer.byte_level_streaming_decoder()\n\n# Process tokens as they arrive\nfor token_id in token_stream:\n    if text := decoder.add_token(token_id):\n        print(text, end=\"\", flush=True)\nprint(decoder.flush())\n```\n\n**Why streaming decoders?** BPE tokens don't align with UTF-8 character boundaries. A multi-byte character like \"世\" might split across tokens. 
The streaming decoder buffers incomplete sequences and only outputs complete characters.\n\nSee the [API Guide](docs/api_guide.md#streaming-decoder) for detailed usage, examples, and best practices.\n\n## Supported Vocabularies\n\n| Vocabulary    | Used By                             | Vocabulary Size | Special Tokens  | Import Constant            |\n| ------------- | ----------------------------------- | --------------- | --------------- | -------------------------- |\n| `cl100k_base` | GPT-4, GPT-3.5-turbo                | ~100,000        | 5 + 54 agent    | `CL100K_BASE_PATTERN`      |\n| `o200k_base`  | GPT-4o                              | ~200,000        | 2 + 54 agent    | `O200K_BASE_PATTERN`       |\n| `llama3`      | Llama 3, 3.1, 3.2, 3.3 (Meta)       | ~128,000        | 11 + 54 agent   | `LLAMA3_PATTERN`           |\n| `deepseek_v3` | DeepSeek V3, DeepSeek R1            | ~128,000        | 17 + 54 agent   | `LLAMA3_PATTERN`           |\n| `mistral_v1`  | Mistral 7B v0.1/v0.2, Mixtral 8x7B  | ~32,000         | 3 + 54 agent    | `SENTENCEPIECE_PATTERN`    |\n| `mistral_v2`  | Mistral 7B v0.3, Codestral, 8x22B   | ~32,768         | 10 + 54 agent   | `SENTENCEPIECE_PATTERN`    |\n| `mistral_v3`  | Mistral NeMo, Large 2, Pixtral      | ~131,000        | 10 + 54 agent   | `MISTRAL_V3_PATTERN`       |\n\n**OpenAI standard tokens:**\n\n- **cl100k_base**: `\u003c|endoftext|\u003e`, `\u003c|fim_prefix|\u003e`, `\u003c|fim_middle|\u003e`, `\u003c|fim_suffix|\u003e`, `\u003c|endofprompt|\u003e`\n- **o200k_base**: `\u003c|endoftext|\u003e`, `\u003c|endofprompt|\u003e`\n\n**Meta Llama 3 standard tokens:**\n\n- **llama3**: `\u003c|begin_of_text|\u003e`, `\u003c|end_of_text|\u003e`, `\u003c|start_header_id|\u003e`, `\u003c|end_header_id|\u003e`, `\u003c|eot_id|\u003e`, `\u003c|eom_id|\u003e` (3.1+), `\u003c|python_tag|\u003e` (3.1+), `\u003c|step_id|\u003e` (3.2-Vision), `\u003c|image|\u003e` (3.2-Vision)\n\n**DeepSeek V3 standard tokens:**\n\n- **deepseek_v3**: 
`\u003c｜begin▁of▁sentence｜\u003e`, `\u003c｜end▁of▁sentence｜\u003e`, `\u003cthink\u003e`, `\u003c/think\u003e`, `\u003c｜User｜\u003e`, `\u003c｜Assistant｜\u003e`, `\u003c|EOT|\u003e`, FIM tokens (`\u003c｜fim▁hole｜\u003e`, `\u003c｜fim▁begin｜\u003e`, `\u003c｜fim▁end｜\u003e`), tool calling tokens (`\u003c｜tool▁calls▁begin｜\u003e`, `\u003c｜tool▁call▁begin｜\u003e`, etc.)\n\n**Mistral standard tokens:**\n\n- **mistral_v1**: `\u003cunk\u003e`, `\u003cs\u003e`, `\u003c/s\u003e` (SentencePiece native)\n- **mistral_v2**: Same as V1 + control tokens: `[INST]`, `[/INST]`, `[TOOL_CALLS]`, `[AVAILABLE_TOOLS]`, `[/AVAILABLE_TOOLS]`, `[TOOL_RESULTS]`, `[/TOOL_RESULTS]`\n- **mistral_v3**: `\u003cunk\u003e`, `\u003cs\u003e`, `\u003c/s\u003e` + control tokens (Tekken/Tiktoken-based, NOT SentencePiece)\n\n### Agent Tokens (54 per model)\n\nSplintr extends all vocabularies with 54 specialized tokens for building agent systems:\n\n```python\nfrom splintr import Tokenizer, CL100K_AGENT_TOKENS\n\ntokenizer = Tokenizer.from_pretrained(\"cl100k_base\")\ntext = \"\u003c|think|\u003eLet me reason...\u003c|/think|\u003eThe answer is 42.\"\ntokens = tokenizer.encode_with_special(text)\nprint(CL100K_AGENT_TOKENS.THINK)      # 100282\nprint(CL100K_AGENT_TOKENS.FUNCTION)   # 100292\n```\n\n| Category     | Example Tokens                                      | Purpose                    |\n| ------------ | --------------------------------------------------- | -------------------------- |\n| Conversation | `system`, `user`, `assistant`, `im_start`, `im_end` | ChatML format              |\n| Thinking     | `think`                                             | Chain-of-Thought reasoning |\n| ReAct        | `plan`, `step`, `act`, `observe`                    | Agent action loops         |\n| Tools        | `function`, `result`, `error`                       | Function calling           |\n| RAG          | `context`, `quote`, `cite`, `source`                | Citations                  |\n\nSee 
[docs/special_tokens.md](docs/special_tokens.md) for the complete list and [API Guide](docs/api_guide.md#agent-tokens-usage) for usage examples.\n\n## How It Works\n\nSplintr implements several optimizations that make tokenization faster:\n\n- **Regexr with JIT compilation**: Pure Rust regex engine with SIMD acceleration\n- **Rayon parallelism**: Leverages multiple CPU cores for batch encoding\n- **Linked-list BPE algorithm**: Avoids O(N²) complexity on pathological inputs\n- **FxHashMap**: Faster lookups than default SipHash for non-adversarial contexts\n- **Aho-Corasick for special tokens**: Fast multi-pattern matching without regex alternation\n- **LRU cache**: Avoids redundant BPE encoding of frequently seen chunks\n\n## Use Cases\n\n**LLM Applications:**\n\n- Tokenizing prompts with 3-4x lower latency\n- Streaming decoder for real-time output display\n- Token counting for API cost estimation\n\n**Agent Systems:**\n\n- Building ReAct agents with structured reasoning tokens\n- Tool-calling systems with function tokens\n- Chain-of-Thought reasoning with thinking tokens\n\n**Training Pipelines:**\n\n- Fast batch encoding of large datasets (10-12x speedup)\n- Preprocessing millions of documents efficiently\n- Parallel tokenization across distributed systems\n\n**RAG Applications:**\n\n- Structured context injection with citation tokens\n- Document chunking with section markers\n- Source tracking through tokenization\n\n**Data Processing:**\n\n- Bulk document tokenization\n- Multi-language text processing\n- Real-time text preprocessing\n\n## Contributing\n\nContributions are welcome! Here's how you can help:\n\n1. **Report bugs**: Open an issue with a minimal reproduction case\n2. **Suggest features**: Describe your use case and why the feature would be helpful\n3. 
**Submit pull requests**:\n   - Add tests for new functionality\n   - Run `cargo test` and `cargo clippy` before submitting\n   - Update documentation as needed\n\n### Development Setup\n\n```bash\n# Clone the repository\ngit clone https://github.com/ml-rust/splintr.git\ncd splintr\n\n# Install pre-commit hook (recommended)\ncp hooks/pre-commit .git/hooks/pre-commit\nchmod +x .git/hooks/pre-commit\n\n# Build the Rust library\ncargo build --release\n\n# Build Python bindings\npip install maturin\nmaturin develop --release\n\n# Run tests\ncargo test                    # Rust tests\ncargo clippy --all-targets    # Linting\ncargo fmt --all --check       # Format check\n```\n\nThe pre-commit hook automatically runs formatting, clippy, and tests before each commit.\n\n## Acknowledgments\n\nSplintr builds upon concepts from:\n\n- [tiktoken](https://github.com/openai/tiktoken) - OpenAI's reference BPE tokenizer\n- [tokenizers](https://github.com/huggingface/tokenizers) - Hugging Face's tokenization library\n\nThe performance optimizations are informed by profiling real-world usage patterns in LLM applications.\n\n## Citation\n\nIf you use Splintr in your research, please cite:\n\n```bibtex\n@software{splintr,\n  author = {Farhan Syah},\n  title = {Splintr: High-Performance BPE Tokenizer},\n  year = {2025},\n  url = {https://github.com/ml-rust/splintr}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-rust%2Fsplintr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fml-rust%2Fsplintr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-rust%2Fsplintr/lists"}