{"id":27232969,"url":"https://github.com/gweidart/rs-bpe","last_synced_at":"2025-04-28T16:44:53.422Z","repository":{"id":283167376,"uuid":"950906564","full_name":"gweidart/rs-bpe","owner":"gweidart","description":"A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust","archived":false,"fork":false,"pushed_at":"2025-03-19T06:32:46.000Z","size":2590,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-10T14:30:36.650Z","etag":null,"topics":["bpe","bpe-tokenizer","byte-pair-encoding","byte-pair-tokenizer","huggingface","llm","openai","pypi-package","python","rust","tiktoken","tokenizers"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/rs-bpe/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gweidart.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-18T21:35:35.000Z","updated_at":"2025-04-09T02:13:02.000Z","dependencies_parsed_at":"2025-03-18T22:40:26.582Z","dependency_job_id":null,"html_url":"https://github.com/gweidart/rs-bpe","commit_stats":null,"previous_names":["gweidart/rs-bpe"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gweidart%2Frs-bpe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gweidart%2Frs-bpe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gweidart%2Frs-bpe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/gweidart%2Frs-bpe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gweidart","download_url":"https://codeload.github.com/gweidart/rs-bpe/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251347953,"owners_count":21575184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bpe","bpe-tokenizer","byte-pair-encoding","byte-pair-tokenizer","huggingface","llm","openai","pypi-package","python","rust","tiktoken","tokenizers"],"created_at":"2025-04-10T14:11:10.994Z","updated_at":"2025-04-28T16:44:53.403Z","avatar_url":"https://github.com/gweidart.png","language":"Python","funding_links":[],"categories":["🔹 **BPE (Byte Pair Encoding) Implementations**"],"sub_categories":[],"readme":"[![Build](https://github.com/gweidart/rs-bpe/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/gweidart/rs-bpe/actions/workflows/ci.yml) \u0026nbsp; [![GH Release](https://github.com/gweidart/rs-bpe/actions/workflows/gh_release.yml/badge.svg?branch=main)](https://github.com/gweidart/rs-bpe/actions/workflows/gh_release.yml) \u0026nbsp; [![PyPI Release](https://github.com/gweidart/rs-bpe/actions/workflows/release.yml/badge.svg?branch=main)](https://github.com/gweidart/rs-bpe/actions/workflows/release.yml)\n\nThe main purpose of this library is to provide fast and correct token counting for chunking algorithms with a focus on high performance. 
It implements novel algorithms for BPE tokenization that are both correct and significantly faster than existing solutions.\n\n#### Installation\n\n##### Python Package\n\n```\npip install rs-bpe\n```\n\n*rs_bpe consistently outperforms the latest release of both tiktoken and huggingface's tokenizers (March 18, 2025)*\n\n![rs_bpe throughput](benchmark/tokenizer_benchmark_results_throughput.svg)\n\n## Key Features\n\n* Efficient token counting with linear time complexity even for adversarial inputs\n* Split text at exact token boundaries while respecting UTF-8 character boundaries\n* Incrementally count tokens while appending text to a chunk\n* Calculate token counts for sub-ranges of text with constant-time complexity\n* Python bindings with OpenAI-compatible interface\n\nThese operations are particularly important for LLM applications but are challenging to implement efficiently for BPE tokenization.\n\n## Motivation *(problems rs-bpe aims to solve)*\n\nExisting BPE tokenizers often face performance and correctness issues when used for chunking operations:\n\n### Split-at-N-Tokens Problem\n\nNaively splitting text after N tokens by first encoding the entire text and then selecting a boundary often produces suboptimal results:\n\n* The split point might not align with a UTF-8 character boundary\n* Dropping tokens until a character boundary is reached might result in chunks much shorter than desired\n* The algorithm wastes resources by encoding more text than necessary\n\n### Incremental Counting Problem\n\nIncrementally counting tokens as text is appended is challenging with traditional implementations:\n\n* Recomputing the encoding after every append leads to quadratic complexity\n* Approximating counts by aggregating piece counts leads to incorrect results due to BPE's non-monotonic nature\n* Incorrect counting can cause problems when staying within token limits for LLM APIs\n\n### Interval Counting Problem\n\nCounting tokens for arbitrary subranges traditionally 
requires reprocessing the entire substring:\n\n* Leads to poor performance for applications that need to count many subranges\n* Makes operations like binary search for token boundaries inefficient\n\nOur library provides novel algorithms to solve these problems with superior performance characteristics.\n\n## Implementation\n\nThe rs-bpe library is written in Rust with Python bindings, designed for both speed and correctness. It implements several encoding strategies:\n\n### Core Algorithm\n\nOur novel algorithm achieves O(n) complexity while preserving the exact output of the original BPE algorithm. The key insight is tracking encodings of all text prefixes efficiently using mathematical properties of valid BPE encodings.\n\nInstead of storing full token sequences for each prefix, we only need to remember the last token of each prefix. This is possible because:\n\n1. There exists exactly one valid encoding sequence for any input text\n2. Any substring of a valid encoding sequence is itself a valid encoding sequence\n3. Knowing the last token of a valid encoding sequence uniquely determines the full sequence\n\nThe algorithm efficiently determines the correct last token for each prefix by checking token compatibility with the preceding token, leading to a linear-time solution.\n\n### Backtracking Optimization\n\nFor average-case performance improvement, the library implements a backtracking-based algorithm that:\n\n1. Tries the greedy approach first, using the longest matching token at each step\n2. Backtracks when necessary to ensure valid BPE encoding\n3. 
Uses a bitfield to make runtime linear in the input length\n\n### Special Purpose Encoders\n\nThe library provides specialized encoders for specific use cases:\n\n* **AppendableEncoder**: Maintains token count state while appending text character by character\n* **IntervalEncoding**: Preprocesses text once to enable constant-time token counting for any substring\n* **BacktrackEncoder**: Provides the fastest correct implementation for general encoding\n* **OpenAI-compatible Tokenizer**: Implements tiktoken-compatible interface with cl100k and o200k models\n\n## Performance\n\nOur benchmarks show significant performance improvements over existing implementations:\n\n\u003e **Note**: All benchmark results shown here were achieved using the Python bindings, not the direct Rust implementation. This provides a more realistic representation of the performance users will experience in Python applications. Many libraries release benchmarks based solely on their native implementation, which can be misleading as the language boundary crossing adds overhead.\n\n### Single-Text Tokenization\n\nInternal benchmarks show rs-bpe outperforms existing tokenizers across all text sizes:\n\n\n| Text Size | rs-bpe\\_cached vs tiktoken | rs-bpe\\_cached vs HuggingFace |\n| ----------- | ---------------------------- | ------------------------------- |\n| Small     | 15.1x faster               | 7.6x faster                   |\n| Medium    | 3.7x faster                | 8.8x faster                   |\n| Large     | 2.2x faster                | 14.0x faster                  |\n\n_Encoding speed (benchmark.py results):_\n\n![](assets/20250318_210013_tokenizer_benchmark_results_throughput.svg)\n\n```\nSMALL TEXT:\n  tiktoken: 0.000605s\n  tokenizers: 0.000305s\n  rs_bpe_basic: 0.000095s\n  rs_bpe_cached: 0.000040s\n\nMEDIUM TEXT:\n  tiktoken: 0.000287s\n  tokenizers: 0.000677s\n  rs_bpe_basic: 0.000096s\n  rs_bpe_cached: 0.000077s\n\nLARGE TEXT:\n  tiktoken: 0.003613s\n  tokenizers: 
0.023054s\n  rs_bpe_basic: 0.001438s\n  rs_bpe_cached: 0.001652s\n```\n\nThe rs-bpe library also provides significantly faster decoding and roundtrip operations:\n\n_Decoding speed:_\n\n![](assets/20250318_210117_tokenizer_benchmark_results_time.svg)\n\n```\nLARGE TEXT:\n  tiktoken: 0.000257s\n  tokenizers: 0.003614s\n  rs_bpe_basic: 0.000184s\n  rs_bpe_cached: 0.000158s\n```\n\n### Batch Processing Performance\n\nrs-bpe provides efficient batch processing that scales better with batch size:\n\n\n| Batch Size | Encoding Speedup | Decoding Speedup | Roundtrip Speedup |\n| ------------ | ------------------ | ------------------ | ------------------- |\n| 1          | 5.1x faster      | 2.6x faster      | 1.7x faster       |\n| 10         | 2.8x faster      | 1.6x faster      | 2.1x faster       |\n| 100        | 3.0x faster      | 1.3x faster      | 2.3x faster       |\n| 1000       | 3.1x faster      | 1.8x faster      | 2.5x faster       |\n\n_Throughput comparison (tokens per second):_\n\n![](assets/20250318_210645_batch_benchmark_results_throughput.svg)\n\n```\nBATCH SIZE 1000:\n  tiktoken: 0.032521s, 1,970,002 tokens/sec\n  rs_bpe_standard_batch: 0.010663s, 6,008,200 tokens/sec\n```\n\n### Worst-Case Performance\n\nWhile tiktoken shows quadratic growth for certain adversarial inputs, rs-bpe maintains linear scaling even in worst-case scenarios. This is critical for production systems that need consistent performance guarantees.\n\n### Key Performance Advantages\n\n1. **Memory Efficiency**: The implementation uses compact data structures and avoids redundant token storage\n2. **Thread Pool Optimization**: Batch processing uses an optimized thread pool with smart worker allocation\n3. **Caching**: The library includes intelligent state caching for repeated operations\n4. 
**No Correctness Trade-offs**: Unlike some implementations that sacrifice correctness for speed, rs-bpe is both fast and correct\n\nAll benchmarks were run on standard hardware and results may vary based on your specific environment.\n\n## Python Usage Examples\n\n### Basic Tokenization\n\n```python\nfrom rs_bpe.bpe import openai\n\n# Load OpenAI tokenizers (automatically caches for reuse)\ncl100k_tokenizer = openai.cl100k_base()  # GPT-3.5/4 tokenizer\no200k_tokenizer = openai.o200k_base()    # o200k tokenizer\n\n# Basic encoding\ntext = \"Hello, world! This is an example.\"\ntokens = cl100k_tokenizer.encode(text)\nprint(f\"Encoded tokens: {tokens}\")\n\n# Basic decoding\ndecoded_text = cl100k_tokenizer.decode(tokens)\nprint(f\"Decoded text: {decoded_text}\")\n\n# Simple token counting\ntoken_count = cl100k_tokenizer.count(text)\nprint(f\"Token count: {token_count}\")\n```\n\n### Efficient Token Limiting\n\nOne of the key features of rs-bpe is the ability to efficiently count tokens up to a limit, which is useful when you need to stay within token constraints:\n\n```python\nfrom rs_bpe.bpe import openai\n\ntokenizer = openai.cl100k_base()\nmax_tokens = 50\n\n# Count tokens until limit is reached\ntext = \"This is a long text that might exceed our token limit... 
\" * 20\nchar_position = tokenizer.count_till_limit(text, max_tokens)\n\nif char_position is not None:\n    # We reached the limit before the end of the text\n    truncated_text = text[:char_position]\n    print(f\"Truncated to {tokenizer.count(truncated_text)} tokens\")\n    print(f\"Truncated text: {truncated_text}\")\nelse:\n    # The entire text is within our token limit\n    print(f\"Text is within token limit: {tokenizer.count(text)} tokens\")\n```\n\n### Batch Processing\n\nrs-bpe excels at batch processing with automatic parallelization, which is perfect for processing large datasets:\n\n```python\nfrom rs_bpe.bpe import openai\nimport time\n\n# Load the tokenizer\ntokenizer = openai.cl100k_base()\n\n# Create a batch of texts\ntexts = [\n    \"This is the first document to encode.\",\n    \"Here's another one with different content.\",\n    \"A third document with some more text to process.\",\n    # Add more as needed...\n]\n\n# Configure parallel processing options (optional)\nparallel_options = openai.ParallelOptions(\n    min_batch_size=20,      # Minimum batch size to engage parallel processing\n    chunk_size=100,         # Number of texts to process in each thread\n    max_threads=0,          # 0 means use optimal thread count (based on CPU cores)\n    use_thread_pool=True    # Reuse thread pool for better performance\n)\n\n# Encode batch with performance metrics\nstart_time = time.time()\nresult = tokenizer.encode_batch(texts, parallel_options)\nend_time = time.time()\n\nprint(f\"Processed {len(texts)} texts in {result.time_taken:.6f}s\")\nprint(f\"Total tokens: {result.total_tokens}\")\nprint(f\"Throughput: {result.total_tokens / result.time_taken:.1f} tokens/second\")\n\n# Access individual token lists\nfor i, tokens in enumerate(result.tokens):\n    print(f\"Text {i} has {len(tokens)} tokens\")\n```\n\n### Advanced Usage: Checking Token Compatibility\n\nFor specialized applications, you might need to check if a text can be tokenized within a 
specific token limit:\n\n```python\nfrom rs_bpe.bpe import openai\n\ntokenizer = openai.cl100k_base()\nmax_tokens = 4096\n\ndef is_compatible(text, max_tokens):\n    \"\"\"Check if text can be tokenized within the token limit.\"\"\"\n    count = tokenizer.count(text)\n    compatible = count \u003c= max_tokens\n    return compatible, count\n\n# Example usage for verifying text compatibility\ntexts_to_check = [\n    \"Short text that's definitely within limits.\",\n    \"A\" * 20000  # A very long text that might exceed limits\n]\n\nfor i, text in enumerate(texts_to_check):\n    compatible, count = is_compatible(text, max_tokens)\n    status = \"compatible\" if compatible else \"too long\"\n    print(f\"Text {i}: {status} ({count} tokens)\")\n```\n\n### Text Chunking\n\nYou can use rs-bpe to efficiently chunk text based on token counts:\n\n```python\nfrom rs_bpe.bpe import openai\n\ntokenizer = openai.cl100k_base()\n\ndef chunk_text(text, max_chunk_tokens=1024, overlap_tokens=50):\n    \"\"\"Split text into chunks of approximately max_chunk_tokens.\"\"\"\n    chunks = []\n  \n    # Get the full text token count\n    total_tokens = tokenizer.count(text)\n  \n    if total_tokens \u003c= max_chunk_tokens:\n        return [text]\n  \n    # Keep track of where we are in the text\n    start_pos = 0\n  \n    while start_pos \u003c len(text):\n        # Find where to end this chunk\n        char_position = tokenizer.count_till_limit(text[start_pos:], max_chunk_tokens)\n      \n        if char_position is None:\n            # The rest of the text fits within our limit\n            chunks.append(text[start_pos:])\n            break\n      \n        # Add the chunk\n        end_pos = start_pos + char_position\n        chunks.append(text[start_pos:end_pos])\n      \n        # Move to the next chunk, considering overlap\n        if overlap_tokens \u003e 0 and end_pos \u003c len(text):\n            # Move back by overlap tokens\n            overlap_char_position = 
tokenizer.count_till_limit(\n                text[start_pos:end_pos], max_chunk_tokens - overlap_tokens\n            )\n            if overlap_char_position is not None:\n                start_pos += overlap_char_position\n            else:\n                start_pos = end_pos\n        else:\n            start_pos = end_pos\n  \n    return chunks\n\n# Example usage\nlong_text = \"This is a long document that needs to be split into chunks. \" * 100\nchunks = chunk_text(long_text, max_chunk_tokens=100, overlap_tokens=10)\n\nprint(f\"Split text into {len(chunks)} chunks:\")\nfor i, chunk in enumerate(chunks):\n    token_count = tokenizer.count(chunk)\n    print(f\"Chunk {i}: {token_count} tokens, {len(chunk)} chars\")\n```\n\n### Thread Pool Configuration\n\nFor high-volume applications, you can control how rs-bpe manages thread pools:\n\n```python\nfrom rs_bpe.bpe import openai\nimport multiprocessing\n\n# Get the number of CPU cores\ncpu_cores = multiprocessing.cpu_count()\nphysical_cores = cpu_cores // 2  # Approximation for physical cores\n\n# Configure parallel options based on workload needs\nlow_latency_options = openai.ParallelOptions(\n    min_batch_size=1,        # Parallelize even small batches\n    chunk_size=10,           # Process in smaller chunks\n    max_threads=2,           # Use fewer threads to minimize overhead\n    use_thread_pool=True\n)\n\nhigh_throughput_options = openai.ParallelOptions(\n    min_batch_size=50,                # Only parallelize large batches\n    chunk_size=200,                   # Larger chunks for better efficiency\n    max_threads=max(1, physical_cores - 1),  # Leave one core free for system, but never go below 1 thread\n    use_thread_pool=True\n)\n\n# Process batches with different settings based on priority\ntokenizer = openai.cl100k_base()\n\n# For interactive, latency-sensitive operations\nsmall_batch = [\"Quick response needed\"] * 5\nresult_small = tokenizer.encode_batch(small_batch, low_latency_options)\n\n# For background processing jobs\nlarge_batch 
= [\"Process in background\"] * 1000\nresult_large = tokenizer.encode_batch(large_batch, high_throughput_options)\n```\n\n### Building from Source\n\n```\ngit clone https://github.com/gweidart/rs-bpe.git\ncd rs-bpe\ncd python\nmaturin develop --release\n```\n\n## License\n\n[MIT License](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgweidart%2Frs-bpe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgweidart%2Frs-bpe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgweidart%2Frs-bpe/lists"}