https://github.com/eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer
https://github.com/eliben/go-sentencepiece

encoding go golang language-model llm sentencepiece tokenization

Last synced: 11 months ago
JSON representation

Go implementation of the SentencePiece tokenizer

Host: GitHub
URL: https://github.com/eliben/go-sentencepiece
Owner: eliben
License: apache-2.0
Created: 2024-08-05T21:59:49.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-09-05T13:15:58.000Z (almost 2 years ago)
Last Synced: 2024-12-03T10:23:19.519Z (over 1 year ago)
Topics: encoding, go, golang, language-model, llm, sentencepiece, tokenization
Language: Go
Homepage:
Size: 200 KB
Stars: 22
Watchers: 1
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# go-sentencepiece

Logo

----

[![Go Reference](https://pkg.go.dev/badge/github.com/eliben/go-sentencepiece.svg)](https://pkg.go.dev/github.com/eliben/go-sentencepiece)

This is a pure Go implementation of encoding and decoding text with
the [SentencePiece tokenizer](https://github.com/google/sentencepiece).

"Encoding" is the operation used to split text into tokens, using
a trained tokenizer model. "Decoding" is the reverse process - converting
a list of tokens into the original text.

SentencePiece is a general family of tokenizers that is configured
by a protobuf configuration file. This repository currently focuses
on implementing just the functionality required to reproduce the
tokenization of [Gemma models](https://ai.google.dev/gemma) (the same
tokenizer is used for Google's proprietary Gemini family of models).
Specifically, it only implements BPE tokenization since this is what
Gemma uses.

## Current status

This package should be ready to use for encoding text into tokens
using the Gemma tokenizer; it's been reasonably optimized and extensively
tested vs. the [SentencePiece Python bindings](https://pypi.org/project/sentencepiece/)
(see `system_test.go` in this repository).

If you find any problems or discrepancies, please open an issue.

## Tokenizer configuration

The configuration file for the tokenizer is a protobuf (structured
data, serialized in the [protocol buffer format](https://protobuf.dev/))
that describes a trained tokenizer model; it includes
the complete learned vocabulary used for tokenization, as well as
other configuration information.

It is not part of this repository. Please fetch it from the
[official Gemma implementation repository](https://github.com/google/gemma_pytorch/tree/main/tokenizer).
`NewProcessor*` constructors will expect to read this file.

## Developing

A protobuf is used to configure the tokenizer. The structure of the
protobuf is described by the `internal/model/sentencepiece_model.proto` file,
which is vendored from https://github.com/google/sentencepiece

To re-generate the `*.pb.go` file from it:

```
$ cd internal/model
$ ./gen.sh
```

The configuration protobuf itself is obtained as described in the
[Tokenizer configuration](#tokenizer-configuration) section. All
tests require the `MODELPATH` env var to point to a local
copy of the tokenizer configuration file.

## Online demo

To see an in-browser demo of this tokenizer in action, visit
https://eliben.github.io/go-sentencepiece/

The Go code is compiled to WebAssembly and loaded from a small
JS program to allow interactive encoding of text.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/eliben/go-sentencepiece

Awesome Lists containing this project

README