{"id":22065418,"url":"https://github.com/tryagi/tiktoken","last_synced_at":"2026-04-01T23:49:53.146Z","repository":{"id":173822430,"uuid":"651349718","full_name":"tryAGI/Tiktoken","owner":"tryAGI","description":"High-performance .NET BPE tokenizer — up to 618 MiB/s, competitive with Rust. Zero-allocation counting, multilingual cache, o200k/cl100k/r50k/p50k encodings + HuggingFace tokenizer.json support.","archived":false,"fork":false,"pushed_at":"2026-03-24T00:56:23.000Z","size":11150,"stargazers_count":82,"open_issues_count":1,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-03-24T11:23:01.265Z","etag":null,"topics":["ai","bpe","cl100k-base","csharp","dotnet","gpt4o","high-performance","huggingface","o200k-base","openai","sdk","tiktoken","tokenizer","zero-allocation"],"latest_commit_sha":null,"homepage":"https://www.nuget.org/packages/Tiktoken/","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tryAGI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"HavenDV","patreon":"havendv","ko_fi":"havendv","custom":["https://www.paypal.me/havendv","https://www.buymeacoffee.com/havendv","https://donate.stripe.com/00gfZ19zkeKLh1eaEE","https://www.upwork.com/freelancers/~017b1ad6f6af9cc189"]}},"created_at":"2023-06-09T03:49:44.000Z","updated_at":"2026-03-24T00:56:27.000Z","dependencies_parsed_at":"2024-03-25T22:29:27.799Z","dependency_job_id":"5f601c75-ad69-4d46-8cfb-0146340a0007","html_u
rl":"https://github.com/tryAGI/Tiktoken","commit_stats":null,"previous_names":["tryagi/tiktoken"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/tryAGI/Tiktoken","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tryAGI%2FTiktoken","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tryAGI%2FTiktoken/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tryAGI%2FTiktoken/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tryAGI%2FTiktoken/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tryAGI","download_url":"https://codeload.github.com/tryAGI/Tiktoken/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tryAGI%2FTiktoken/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31293123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T21:15:39.731Z","status":"ssl_error","status_checked_at":"2026-04-01T21:15:34.046Z","response_time":53,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","bpe","cl100k-base","csharp","dotnet","gpt4o","high-performance","huggingface","o200k-base","openai","sdk","tiktoken","tokenizer","zero-allocation"],"created_at":"2024-11-30T19:17:30.690Z","updated_at":"2026-04-01T23:49:53.137Z","avatar_url":"https://github.com/tryAGI.png","language":"C#","readme":"# Tiktoken\n\n[![Nuget package](https://img.shields.io/nuget/vpre/Tiktoken)](https://www.nuget.org/packages/Tiktoken/)\n[![dotnet](https://github.com/tryAGI/Tiktoken/actions/workflows/dotnet.yml/badge.svg?branch=main)](https://github.com/tryAGI/Tiktoken/actions/workflows/dotnet.yml)\n[![License: MIT](https://img.shields.io/github/license/tryAGI/Tiktoken)](https://github.com/tryAGI/Tiktoken/blob/main/LICENSE.txt)\n[![Discord](https://img.shields.io/discord/1115206893015662663?label=Discord\u0026logo=discord\u0026logoColor=white\u0026color=d82679)](https://discord.gg/Ca2xhfBf3v)\n[![Throughput](https://img.shields.io/badge/throughput-618_MiB%2Fs-brightgreen)](https://github.com/tryAGI/Tiktoken#benchmarks)\n\nOne of the fastest BPE tokenizers in any language — the fastest in .NET, competitive with pure Rust implementations.\nZero-allocation token counting, built-in multilingual cache, and up to **42x faster** than other .NET tokenizers.\nPRs are welcome.\n\n### Implemented encodings\n- `o200k_base`\n- `cl100k_base`\n- `r50k_base`\n- `p50k_base`\n- `p50k_edit`\n\n### Usage\n```csharp\nusing Tiktoken;\n\nvar encoder = TikTokenEncoder.CreateForModel(Models.Gpt4o);\nvar tokens = 
encoder.Encode(\"hello world\"); // [15339, 1917]\nvar text = encoder.Decode(tokens); // hello world\nvar numberOfTokens = encoder.CountTokens(text); // 2\nvar stringTokens = encoder.Explore(text); // [\"hello\", \" world\"]\n\n// Alternative APIs:\nvar encoder = ModelToEncoder.For(\"gpt-4o\");\nvar encoder = new Encoder(new O200KBase());\n```\n\n### Load from HuggingFace tokenizer.json\n\nThe `Tiktoken.Encodings.Tokenizer` package enables loading any HuggingFace-format `tokenizer.json` file — supporting GPT-2, Llama 3, Qwen2, DeepSeek, and other BPE-based models.\n\n```csharp\nusing Tiktoken;\nusing Tiktoken.Encodings;\n\n// From a local file\nvar encoding = TokenizerJsonLoader.FromFile(\"path/to/tokenizer.json\");\nvar encoder = new Encoder(encoding);\n\n// From a stream (HTTP responses, embedded resources)\nusing var stream = File.OpenRead(\"tokenizer.json\");\nvar encoding = TokenizerJsonLoader.FromStream(stream);\n\n// From a URL (e.g., HuggingFace Hub)\nusing var httpClient = new HttpClient();\nvar encoding = await TokenizerJsonLoader.FromUrlAsync(\n    new Uri(\"https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json\"),\n    httpClient,\n    name: \"gpt2\");\n\n// Custom regex patterns (optional — auto-detected by default)\nvar encoding = TokenizerJsonLoader.FromFile(\"tokenizer.json\", patterns: myPatterns);\n```\n\n**Supported pre-tokenizer types:**\n- `ByteLevel` — GPT-2 and similar models\n- `Split` with regex pattern — direct regex-based splitting\n- `Sequence[Split, ByteLevel]` — Llama 3, Qwen2, DeepSeek, and other modern models\n\n### Count message tokens (OpenAI chat format)\n\nCount tokens for chat messages using OpenAI's official formula, including support for function/tool definitions:\n\n```csharp\nusing Tiktoken;\n\n// Simple message counting\nvar messages = new List\u003cChatMessage\u003e\n{\n    new(\"system\", \"You are a helpful assistant.\"),\n    new(\"user\", \"What is the weather in Paris?\"),\n};\nint count = 
TikTokenEncoder.CountMessageTokens(\"gpt-4o\", messages);\n\n// With tool/function definitions\nvar tools = new List\u003cChatFunction\u003e\n{\n    new(\"get_weather\", \"Get the current weather\", new List\u003cFunctionParameter\u003e\n    {\n        new(\"location\", \"string\", \"The city name\", isRequired: true),\n        new(\"unit\", \"string\", \"Temperature unit\",\n            enumValues: new[] { \"celsius\", \"fahrenheit\" }),\n    }),\n};\nint countWithTools = TikTokenEncoder.CountMessageTokens(\"gpt-4o\", messages, tools);\n\n// Or use the Encoder instance directly\nvar encoder = ModelToEncoder.For(\"gpt-4o\");\nint toolTokens = encoder.CountToolTokens(tools);\n```\n\nNested object parameters and array types are also supported:\n```csharp\nnew FunctionParameter(\"address\", \"object\", \"Mailing address\", properties: new List\u003cFunctionParameter\u003e\n{\n    new(\"street\", \"string\", \"Street address\", isRequired: true),\n    new(\"city\", \"string\", \"City name\", isRequired: true),\n});\n```\n\n### Custom encodings\n\nLoad encoding data from `.tiktoken` text files or `.ttkb` binary files:\n\n```csharp\nusing Tiktoken.Encodings;\n\n// Load from file (auto-detects format by extension)\nvar ranks = EncodingLoader.LoadEncodingFromFile(\"my_encoding.ttkb\");\nvar ranks = EncodingLoader.LoadEncodingFromFile(\"my_encoding.tiktoken\");\n\n// Load from binary byte array (e.g., from embedded resource or network)\nbyte[] binaryData = File.ReadAllBytes(\"my_encoding.ttkb\");\nvar ranks = EncodingLoader.LoadEncodingFromBinaryData(binaryData);\n\n// Convert text format to binary for faster loading\nvar textRanks = EncodingLoader.LoadEncodingFromFile(\"my_encoding.tiktoken\");\nusing var output = File.Create(\"my_encoding.ttkb\");\nEncodingLoader.WriteEncodingToBinaryStream(output, textRanks);\n```\n\nThe `.ttkb` binary format loads ~30% faster than `.tiktoken` text (no base64 decoding) and is 34% smaller. 
See [`data/README.md`](data/README.md) for the format specification and conversion tools.\n\n### Benchmarks\n\nBenchmarked on Apple M4 Max, .NET 10.0, o200k_base encoding. Tested with diverse inputs: short ASCII, multilingual (12 scripts + emoji), CJK-heavy, Python code, and long documents.\n\n#### CountTokens — zero allocation, fastest in class\n\n| Input | SharpToken | TiktokenSharp | Microsoft.ML | **Tiktoken** | **Throughput** | **Speedup** |\n|-------|-----------|---------------|-------------|-------------|:-----------:|:-----------:|\n| Hello, World! (13 B) | 217 ns | 164 ns | 319 ns | **88 ns** | 141 MiB/s | 1.9-3.6x |\n| Multilingual (382 B, 12 scripts) | 14.7 us | 9.5 us | 5.1 us | **1.1 us** | 339 MiB/s | 4.7-13.6x |\n| CJK-heavy (1,676 B, 6 scripts) | 109.4 us | 65.6 us | 37.0 us | **2.6 us** | 618 MiB/s | 14.3-42.3x |\n| Python code (879 B) | 13.1 us | 9.7 us | 21.6 us | **5.5 us** | 153 MiB/s | 1.8-4.0x |\n| Multilingual long (4,312 B) | 283.1 us | 155.7 us | 71.0 us | **9.0 us** | 458 MiB/s | 7.9-31.6x |\n| Bitcoin whitepaper (19,884 B) | 400.3 us | 255.4 us | 321.3 us | **105.1 us** | 180 MiB/s | 2.4-3.8x |\n\n\u003e **Zero allocation** across all inputs (0 B). Tiktoken's advantage is most pronounced on multilingual/CJK text — up to **42x faster** than competitors. Throughput on cached multilingual text reaches **618 MiB/s**, competitive with the fastest Rust tokenizers.\n\n#### Cache effect on CountTokens\n\nBuilt-in token cache dramatically accelerates repeated non-ASCII patterns:\n\n| Input | No cache | Cached | Cache speedup |\n|-------|---------|--------|:-------------:|\n| Hello, World! 
(13 B) | 88 ns | 86 ns | — |\n| Multilingual (382 B) | 5.4 us | 1.1 us | **4.9x** |\n| CJK-heavy (1,676 B) | 33.7 us | 2.6 us | **13.1x** |\n| Python code (879 B) | 5.6 us | 5.5 us | — |\n| Multilingual long (4,312 B) | 78.0 us | 9.0 us | **8.6x** |\n| Bitcoin whitepaper (19,884 B) | 122.7 us | 104.9 us | 1.2x |\n\n\u003e Cache has no effect on ASCII-dominant inputs (already on fast path). On multilingual/CJK text, cache provides **5-13x speedup** by skipping UTF-8 conversion and BPE on subsequent calls. Cold-path performance was significantly improved by the O(n log n) min-heap BPE merge optimization.\n\n#### Encode — returns token IDs\n\n| Input | SharpToken | TiktokenSharp | Microsoft.ML | **Tiktoken** | **Throughput** | **Speedup** |\n|-------|-----------|---------------|-------------|-------------|:-----------:|:-----------:|\n| Hello, World! (13 B) | 214 ns | 163 ns | 316 ns | **109 ns** | 114 MiB/s | 1.5-2.9x |\n| Multilingual (382 B, 12 scripts) | 14.5 us | 9.4 us | 5.2 us | **1.3 us** | 287 MiB/s | 4.1-11.4x |\n| CJK-heavy (1,676 B, 6 scripts) | 107.9 us | 64.7 us | 37.0 us | **3.3 us** | 484 MiB/s | 11.2-32.7x |\n| Python code (879 B) | 13.1 us | 9.7 us | 23.6 us | **5.8 us** | 145 MiB/s | 1.7-4.1x |\n| Multilingual long (4,312 B) | 276.4 us | 151.3 us | 70.7 us | **10.9 us** | 376 MiB/s | 6.5-25.2x |\n| Bitcoin whitepaper (19,884 B) | 366.1 us | 245.5 us | 317.7 us | **111.8 us** | 170 MiB/s | 2.2-3.3x |\n\n\u003e Same performance characteristics as CountTokens, with additional allocation for the output `int[]` array.\n\n#### Construction — encoder initialization\n\n| Encoding | Time | Description |\n|----------|------|-------------|\n| **o200k_base** | **0.78 ms** | GPT-4o (200K vocab, pre-computed hash table, lazy FastEncoder) |\n| **cl100k_base** | **0.46 ms** | GPT-3.5/4 (100K vocab) |\n\n\u003e Encoder construction includes loading embedded binary data, building hash tables, and compiling regex. 
FastEncoder and Decoder dictionaries are lazy-initialized on first use only. Reuse `Encoder` instances across calls for best performance.\n\n#### Cross-language context\n\nAll numbers below measured on **Apple M4 Max** with **identical inputs** and **o200k_base** encoding. See [`benchmarks/cross-language/results/`](benchmarks/cross-language/results/) for full reports.\n\n| Implementation | Language | Encode Throughput | CountTokens Throughput | Notes |\n|---------------|----------|:-----------------:|:----------------------:|-------|\n| **Tiktoken** (cached) | **.NET/C#** | **114-484 MiB/s** | **141-618 MiB/s** | **Zero-alloc counting; cache gives 5-13x on multilingual** |\n| **Tiktoken** (no cache) | **.NET/C#** | **44-145 MiB/s** | **47-155 MiB/s** | **Cold/first-call with O(n log n) min-heap BPE merge** |\n| [`tiktoken`](https://lib.rs/crates/tiktoken) v3 | Rust | 34-88 MiB/s | — | Pure Rust, arena-based |\n| GitHub [`bpe`](https://github.com/github/rust-gems) v0.3 | Rust | 33-64 MiB/s | 29-66 MiB/s | Aho-Corasick, O(n) worst case |\n| [OpenAI tiktoken](https://github.com/openai/tiktoken) 0.12 | Python | 7-20 MiB/s | — | Rust core, but Python FFI overhead |\n\n\u003e .NET Tiktoken's token cache makes it dramatically faster than native Rust on repeated/multilingual text — up to **7x faster** than the fastest Rust tokenizer on CJK text. Even without the cache, .NET is competitive with or faster than both Rust crates on most inputs thanks to the O(n log n) min-heap BPE merge optimization.\n\nYou can view the full raw BenchmarkDotNet reports for each version [here](benchmarks).\n\n## CLI Tool\n\n**ttok** is a standalone CLI for counting, encoding, decoding, and exploring tokens — powered by this library. 
NativeAOT-compiled for instant startup.\n\n```bash\n# Install (macOS)\nbrew install tryAGI/tap/ttok\n\n# Install (macOS/Linux)\ncurl -fsSL https://raw.githubusercontent.com/tryAGI/Tiktoken/main/install.sh | sh\n\n# Install (.NET global tool)\ndotnet tool install -g Tiktoken.Cli\n\n# Usage\necho \"Hello world\" | ttok        # 3\nttok src/ --include \"*.cs\"       # count tokens in files\n```\n\nSee the full [CLI documentation](src/cli/Tiktoken.Cli/README.md) for all options and install methods.\n\n## Support\n\nPriority place for bugs: https://github.com/tryAGI/LangChain/issues  \nPriority place for ideas and general questions: https://github.com/tryAGI/LangChain/discussions  \nDiscord: https://discord.gg/Ca2xhfBf3v  ","funding_links":["https://github.com/sponsors/HavenDV","https://patreon.com/havendv","https://ko-fi.com/havendv","https://www.paypal.me/havendv","https://www.buymeacoffee.com/havendv","https://donate.stripe.com/00gfZ19zkeKLh1eaEE","https://www.upwork.com/freelancers/~017b1ad6f6af9cc189"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftryagi%2Ftiktoken","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftryagi%2Ftiktoken","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftryagi%2Ftiktoken/lists"}