https://github.com/yuniko-software/qwen3-tokenizer-dotnet
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
https://github.com/yuniko-software/qwen3-tokenizer-dotnet
bpe-tokenizer csharp dotnet embedding-models huggingface inference llm machine-learning onnx qwen vector-database
Last synced: 2 months ago
JSON representation
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
- Host: GitHub
- URL: https://github.com/yuniko-software/qwen3-tokenizer-dotnet
- Owner: yuniko-software
- License: apache-2.0
- Created: 2025-10-23T20:28:46.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-11-12T23:05:36.000Z (7 months ago)
- Last Synced: 2025-11-13T00:22:11.341Z (7 months ago)
- Topics: bpe-tokenizer, csharp, dotnet, embedding-models, huggingface, inference, llm, machine-learning, onnx, qwen, vector-database
- Language: C#
- Homepage:
- Size: 102 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Yuniko.Software.Qwen3Tokenizer
[](https://github.com/yuniko-software/qwen3-tokenizer/actions/workflows/ci.yml)
[](https://www.nuget.org/packages/Yuniko.Software.Qwen3Tokenizer)
[](https://www.nuget.org/packages/Yuniko.Software.Qwen3Tokenizer)
Native .NET tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding with HuggingFace integration.
## Features
- **Native .NET Implementation**: Qwen3 tokenizer built specifically for .NET applications
- **Identical to HuggingFace**: Produces the same token IDs and outputs as the official HuggingFace tokenizer
- **Qwen Model Family Support**: Compatible with all Qwen3 variants (LLM, Embedding, Reranker, and Vision-Language models)
- **HuggingFace Integration**: Load tokenizer files directly from HuggingFace model repositories
- **Configurable**: Customize tokenization behavior through options and custom file providers
- **No External Dependencies**: Does not require Python or other runtime dependencies
## Installation
```bash
dotnet add package Yuniko.Software.Qwen3Tokenizer
```
Or via Package Manager:
```powershell
Install-Package Yuniko.Software.Qwen3Tokenizer
```
## Quick Start
```csharp
using Yuniko.Software.Qwen3Tokenizer;
// Load from HuggingFace model (specify if it's for an embedding model)
var tokenizer = await Qwen3Tokenizer.FromHuggingFaceAsync(
"Qwen/Qwen3-0.6B",
isForEmbeddingModel: false
);
// Encode text
int[] tokenIds = tokenizer.Encode("Hello, world!");
Console.WriteLine($"Token IDs: [{string.Join(", ", tokenIds)}]");
Console.WriteLine($"Token count: {tokenIds.Length}");
// Decode tokens
string decodedText = tokenizer.Decode(tokenIds);
Console.WriteLine($"Decoded: {decodedText}");
```
## Usage Examples
### Basic Tokenization
```csharp
// For regular LLM models
var tokenizer = Qwen3Tokenizer.FromHuggingFace("Qwen/Qwen3-0.6B", isForEmbeddingModel: false);
// Encode text into token IDs
int[] ids = tokenizer.Encode("The quick brown fox jumps over the lazy dog");
// Decode token IDs back to text
string text = tokenizer.Decode(ids);
// Count tokens without full encoding
int tokenCount = tokenizer.CountTokens("Some text to count");
```
### Working with Embedding Models
```csharp
// For embedding models - adds pad token at the end when addSpecialTokens=true
var embeddingTokenizer = Qwen3Tokenizer.FromHuggingFace(
"Qwen/Qwen3-Embedding-0.6B",
isForEmbeddingModel: true
);
// With special tokens (includes pad token at the end)
int[] withSpecial = embeddingTokenizer.Encode("Your text here", addSpecialTokens: true);
// Without special tokens
int[] withoutSpecial = embeddingTokenizer.Encode("Your text here", addSpecialTokens: false);
Console.WriteLine($"With special tokens: {withSpecial.Length} tokens");
Console.WriteLine($"Without special tokens: {withoutSpecial.Length} tokens");
```
### Decoding with Special Tokens
```csharp
// Encode text with special tokens
string chatMessage = "<|im_start|>user\nHello!<|im_end|>";
int[] tokens = tokenizer.Encode(chatMessage);
// Decode with special tokens preserved
string withSpecial = tokenizer.Decode(tokens, skipSpecialTokens: false);
// Decode with special tokens removed (default behavior)
string withoutSpecial = tokenizer.Decode(tokens, skipSpecialTokens: true);
Console.WriteLine($"With special tokens: {withSpecial}");
Console.WriteLine($"Without special tokens: {withoutSpecial}");
```
### Detailed Encoding Information
```csharp
// Get detailed information about tokens
var result = tokenizer.EncodeDetailed("Hello, world!");
for (int i = 0; i < result.Ids.Length; i++)
{
Console.WriteLine($"Token: '{result.Tokens[i]}' | ID: {result.Ids[i]} | " +
$"Offset: {result.Offsets[i].Index}, Length: {result.Offsets[i].Length}");
}
```
### ONNX Runtime Integration
```csharp
// Prepare inputs for ONNX Runtime inference (dynamic length, no padding)
var inputs = tokenizer.PrepareForOnnx("Your input text here");
// Use with ONNX Runtime
// Note: Some models (e.g., embedding models) may not require position_ids
long[] inputIds = inputs.InputIds;
long[] attentionMask = inputs.AttentionMask;
long[] positionIds = inputs.PositionIds;
```
### Loading from Local Files
```csharp
// Load tokenizer from local vocabulary and merges files
var tokenizer = Qwen3Tokenizer.FromFiles(
vocabPath: "/path/to/vocab.json",
mergesPath: "/path/to/merges.txt",
isForEmbeddingModel: false
);
```
### Custom File Provider
```csharp
// Implement custom file provider for advanced scenarios
public class CustomFileProvider : ITokenizerFileProvider
{
public (string VocabPath, string MergesPath) GetFiles()
{
// Custom logic to provide tokenizer files
return ("/path/to/vocab.json", "/path/to/merges.txt");
}
public Task<(string VocabPath, string MergesPath)> GetFilesAsync(
CancellationToken cancellationToken = default)
{
return Task.FromResult(GetFiles());
}
}
// Use custom provider
var tokenizer = Qwen3Tokenizer.FromProvider(new CustomFileProvider(), isForEmbeddingModel: false);
```
### Accessing Vocabulary and Token Information
```csharp
// Get vocabulary size
int vocabSize = tokenizer.VocabularySize;
// Access full vocabulary
IReadOnlyDictionary vocab = tokenizer.Vocabulary;
// Access added tokens (special tokens and others)
IReadOnlyDictionary addedTokens = tokenizer.AddedTokens;
// Access special token IDs
IReadOnlySet specialTokenIds = tokenizer.SpecialTokenIds;
// Use predefined token constants
Console.WriteLine($"IM_END token: {Qwen3Tokens.ImEnd} (ID: {Qwen3Tokens.ImEndTokenId})");
Console.WriteLine($"PAD token: {Qwen3Tokens.EndOfText} (ID: {Qwen3Tokens.EndOfTextTokenId})");
Console.WriteLine($"IM_START token: {Qwen3Tokens.ImStart} (ID: {Qwen3Tokens.ImStartTokenId})");
```
For more examples, see the [sample project](samples/Yuniko.Software.Qwen3Tokenizer.Sample).
## Supported Models
Works with all Qwen3 model variants:
- Qwen3 LLM models (text generation)
- Qwen3-Embedding models (text embeddings)
- Qwen3-Reranker models (document reranking)
- Qwen3-VL models (vision-language)
## Requirements
- .NET 10.0 or later
## License
This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/yuniko-software/qwen3-tokenizer/blob/main/LICENSE) file for details.
## Contributing
Contributions are welcome! Please visit the [GitHub repository](https://github.com/yuniko-software/qwen3-tokenizer) for more information.
## Support
For issues, questions, or suggestions, please open an issue on [GitHub](https://github.com/yuniko-software/qwen3-tokenizer/issues).
---
⭐ If you find this project useful, please consider giving it a star on GitHub! ⭐
Your support helps make this project more visible to other developers who might benefit from a native .NET implementation of the Qwen3 tokenizer.