https://github.com/zerfoo/ztoken
BPE tokenizer library for Go with HuggingFace tokenizer.json and GGUF tokenizer extraction support. Zero external dependencies.
- Host: GitHub
- URL: https://github.com/zerfoo/ztoken
- Owner: zerfoo
- License: apache-2.0
- Created: 2026-03-16T01:39:03.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-26T18:32:08.000Z (about 2 months ago)
- Last Synced: 2026-03-27T06:46:29.790Z (about 2 months ago)
- Language: Go
- Size: 97.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# ztoken
[CI](https://github.com/zerfoo/ztoken/actions/workflows/ci.yml)
[Go Reference](https://pkg.go.dev/github.com/zerfoo/ztoken)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
BPE tokenizer library for Go with HuggingFace compatibility.
Part of the [Zerfoo](https://github.com/zerfoo) ML ecosystem.
## Features
- **Byte-Pair Encoding (BPE)** tokenizer with full merge-based encoding/decoding
- **HuggingFace `tokenizer.json`** loading — compatible with GPT-2, Llama, Gemma, Mistral, and other models
- **GGUF tokenizer extraction** — extract tokenizer data directly from GGUF model files via `ztoken/gguf`
- **SentencePiece compatibility** — handles U+2581 space markers used by Llama-family models
- **Special token handling** — BOS, EOS, PAD, UNK with exact-match encoding for control tokens
- **Byte-level BPE** — GPT-2 style byte-to-Unicode encoding for full UTF-8 coverage
- **Text normalization** — configurable normalizer pipeline (NFC, NFD, NFKC, lowercase, etc.)
- **Minimal dependencies**: Go standard library plus `golang.org/x/text` for Unicode normalization, with no CGo
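To make the "Byte-level BPE" bullet concrete, here is a minimal sketch of the byte-to-Unicode table published with GPT-2. It is a standalone illustration using only the standard library. The names `byteToUnicode` and `encodeBytes` are hypothetical and are not part of the ztoken API, and whether ztoken builds its table exactly this way is an assumption.

```go
package main

import "fmt"

// byteToUnicode builds a GPT-2 style table that maps every byte value to a
// printable Unicode code point. Hypothetical helper, not a ztoken function.
func byteToUnicode() [256]rune {
	var table [256]rune
	var printable [256]bool

	// Bytes in these ranges are already printable and map to themselves.
	mark := func(lo, hi int) {
		for b := lo; b <= hi; b++ {
			printable[b] = true
			table[b] = rune(b)
		}
	}
	mark('!', '~')
	mark(0xA1, 0xAC)
	mark(0xAE, 0xFF)

	// Every remaining byte is assigned the next unused code point from U+0100 up.
	next := rune(256)
	for b := 0; b < 256; b++ {
		if !printable[b] {
			table[b] = next
			next++
		}
	}
	return table
}

// encodeBytes rewrites s so that each input byte becomes one printable rune.
func encodeBytes(s string) string {
	table := byteToUnicode()
	out := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, table[s[i]])
	}
	return string(out)
}

func main() {
	// The space byte (0x20) maps to U+0120, the familiar GPT-2 space marker.
	fmt.Println(encodeBytes("Hello world")) // HelloĠworld
}
```

Because every byte has a printable stand-in, any UTF-8 input can be represented without an unknown token, which is the "full UTF-8 coverage" the bullet refers to.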
## Installation
```bash
go get github.com/zerfoo/ztoken
```
## Quick Start
### Load from HuggingFace tokenizer.json
```go
package main

import (
	"fmt"

	"github.com/zerfoo/ztoken"
)

func main() {
	// Load a HuggingFace tokenizer.json file
	tok, err := ztoken.LoadFromJSON("tokenizer.json")
	if err != nil {
		panic(err)
	}

	// Encode text to token IDs
	ids, _ := tok.Encode("Hello, world!")
	fmt.Println(ids)

	// Decode token IDs back to text
	text, _ := tok.Decode(ids)
	fmt.Println(text) // Hello, world!

	// Inspect vocabulary
	fmt.Println(tok.VocabSize())

	// Access special tokens
	special := tok.SpecialTokens()
	fmt.Printf("BOS=%d EOS=%d PAD=%d UNK=%d\n",
		special.BOS, special.EOS, special.PAD, special.UNK)
}
```
### Extract Tokenizer from GGUF Model Files
The `ztoken/gguf` sub-package extracts tokenizer data directly from GGUF model files, so you don't need a separate `tokenizer.json`:
```go
package main

import (
	"fmt"

	"github.com/zerfoo/ztoken/gguf"
)

func main() {
	// metadata can be any type implementing the gguf.Metadata interface:
	//   GetString(key string) (string, bool)
	//   GetStringArray(key string) ([]string, bool)
	//   GetUint32(key string) (uint32, bool)
	//   GetInt32Array(key string) ([]int32, bool)
	// Placeholder: replace this nil value with one supplied by your GGUF reader.
	var metadata gguf.Metadata

	tok, err := gguf.ExtractTokenizer(metadata)
	if err != nil {
		panic(err)
	}

	ids, _ := tok.Encode("Hello from GGUF!")
	fmt.Println(ids)
}
```
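If you do not already have a GGUF reader, the interface described above is small enough to back with plain maps, for example in tests. The sketch below redeclares the interface locally so it compiles without the ztoken module. `mapMetadata` and `newExampleMetadata` are hypothetical names, and the sketch assumes the four methods listed above are the complete interface. The `tokenizer.ggml.*` keys follow the GGUF metadata naming convention, but which of them ztoken actually reads is an assumption.

```go
package main

import "fmt"

// Metadata mirrors the four methods listed in the example above. It is
// redeclared here only so the sketch is self-contained.
type Metadata interface {
	GetString(key string) (string, bool)
	GetStringArray(key string) ([]string, bool)
	GetUint32(key string) (uint32, bool)
	GetInt32Array(key string) ([]int32, bool)
}

// mapMetadata is a hypothetical in-memory implementation backed by maps.
type mapMetadata struct {
	strings      map[string]string
	stringArrays map[string][]string
	uint32s      map[string]uint32
	int32Arrays  map[string][]int32
}

func (m mapMetadata) GetString(key string) (string, bool) {
	v, ok := m.strings[key]
	return v, ok
}

func (m mapMetadata) GetStringArray(key string) ([]string, bool) {
	v, ok := m.stringArrays[key]
	return v, ok
}

func (m mapMetadata) GetUint32(key string) (uint32, bool) {
	v, ok := m.uint32s[key]
	return v, ok
}

func (m mapMetadata) GetInt32Array(key string) ([]int32, bool) {
	v, ok := m.int32Arrays[key]
	return v, ok
}

// newExampleMetadata returns a tiny, made-up metadata set for illustration.
func newExampleMetadata() mapMetadata {
	return mapMetadata{
		strings: map[string]string{"tokenizer.ggml.model": "llama"},
		stringArrays: map[string][]string{
			"tokenizer.ggml.tokens": {"<unk>", "<s>", "</s>", "▁hello"},
		},
		uint32s: map[string]uint32{
			"tokenizer.ggml.bos_token_id": 1,
			"tokenizer.ggml.eos_token_id": 2,
		},
		int32Arrays: map[string][]int32{},
	}
}

func main() {
	var md Metadata = newExampleMetadata()
	model, _ := md.GetString("tokenizer.ggml.model")
	tokens, _ := md.GetStringArray("tokenizer.ggml.tokens")
	fmt.Println(model, len(tokens))
}
```

A value like this could then be passed to `gguf.ExtractTokenizer` in place of the placeholder, provided the real interface matches the method list above.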
### Build a Tokenizer Programmatically
```go
package main

import (
	"fmt"

	"github.com/zerfoo/ztoken"
)

func main() {
	vocab := map[string]int{
		"hello": 0, "world": 1, " ": 2,
		// Special tokens; IDs match the SpecialTokens struct below
		"<unk>": 3, "<s>": 4, "</s>": 5, "<pad>": 6,
	}
	merges := []ztoken.MergePair{
		{Left: "hel", Right: "lo"},
		{Left: "wor", Right: "ld"},
	}
	special := ztoken.SpecialTokens{BOS: 4, EOS: 5, PAD: 6, UNK: 3}

	tok := ztoken.NewBPETokenizer(vocab, merges, special, false)
	ids, _ := tok.Encode("hello")
	fmt.Println(ids) // [0]
}
```
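The merge list is what drives BPE encoding: a word starts as individual symbols, and the highest-priority adjacent pair is merged repeatedly until no listed pair remains. The following is a minimal sketch of that general procedure using only the standard library. It is not ztoken's implementation. The `pair` type, `applyMerges`, and `demoMerges` are hypothetical names, and the merge table is invented for the example.

```go
package main

import "fmt"

// pair is a local stand-in for a merge rule (left symbol + right symbol).
type pair struct{ left, right string }

// demoMerges is an invented merge table, ordered from highest to lowest priority.
var demoMerges = []pair{
	{"h", "e"},
	{"l", "l"},
	{"he", "ll"},
	{"hell", "o"},
}

// applyMerges splits word into single-character symbols, then repeatedly
// merges the leftmost adjacent pair that has the lowest rank in merges.
func applyMerges(word string, merges []pair) []string {
	rank := make(map[pair]int, len(merges))
	for i, m := range merges {
		rank[m] = i
	}

	symbols := make([]string, 0, len(word))
	for _, r := range word {
		symbols = append(symbols, string(r))
	}

	for len(symbols) > 1 {
		best, bestRank := -1, len(merges)
		for i := 0; i+1 < len(symbols); i++ {
			if r, ok := rank[pair{symbols[i], symbols[i+1]}]; ok && r < bestRank {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			break // no adjacent pair appears in the merge table
		}
		// Replace symbols[best] and symbols[best+1] with their concatenation.
		merged := symbols[best] + symbols[best+1]
		symbols = append(symbols[:best+1], symbols[best+2:]...)
		symbols[best] = merged
	}
	return symbols
}

func main() {
	fmt.Println(applyMerges("hello", demoMerges)) // [hello]
	fmt.Println(applyMerges("help", demoMerges))  // [he l p]
}
```

In a full tokenizer, each resulting symbol is then looked up in the vocabulary to obtain its ID, with the UNK token as the fallback for symbols that are missing.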
## SentencePiece Compatibility
Models using SentencePiece tokenization (Llama, Gemma) encode spaces as U+2581 (`▁`, LOWER ONE EIGHTH BLOCK). ztoken handles this automatically when loading from GGUF files whose metadata sets `tokenizer.ggml.model = "llama"`, or you can enable it manually:
```go
tok := ztoken.NewBPETokenizer(vocab, merges, special, false)
tok.SetSentencePiece(true)
```
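To show what the space marker does to text, here is a small standalone sketch of the substitution in both directions. `markSpaces` and `unmarkSpaces` are hypothetical helpers, not ztoken functions. Real SentencePiece models typically also prepend a marker to the start of the input, which this sketch leaves out, and how ztoken handles that prefix is not covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceMarker is U+2581, the character SentencePiece uses in place of a space.
const spaceMarker = "\u2581"

// markSpaces replaces every ASCII space with the U+2581 marker before encoding.
func markSpaces(text string) string {
	return strings.ReplaceAll(text, " ", spaceMarker)
}

// unmarkSpaces restores ordinary spaces after decoding.
func unmarkSpaces(piece string) string {
	return strings.ReplaceAll(piece, spaceMarker, " ")
}

func main() {
	marked := markSpaces("Hello from GGUF")
	fmt.Println(marked)               // Hello▁from▁GGUF
	fmt.Println(unmarkSpaces(marked)) // Hello from GGUF
}
```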
## Use Cases
- **ML inference preprocessing** — tokenize prompts before feeding them to transformer models via [zerfoo](https://github.com/zerfoo/zerfoo)
- **Text processing pipelines** — encode/decode text with production-grade BPE
- **Model tooling** — extract and inspect tokenizers from GGUF and HuggingFace model files
- **Embedding in Go services** — zero-CGo tokenization that compiles with `go build`
## Package Structure
| Package | Description |
|---------|-------------|
| `ztoken` | Core tokenizer interface, BPE implementation, HuggingFace JSON loader |
| `ztoken/gguf` | GGUF metadata-based tokenizer extraction |
## Dependencies
ztoken depends only on the Go standard library and `golang.org/x/text`, which it uses for Unicode normalization.
ztoken is used by:
- [zerfoo](https://github.com/zerfoo/zerfoo) — ML inference, training, and serving framework
## License
Apache 2.0