https://github.com/samber/go-gpt-3-encoder
Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3
- Host: GitHub
- URL: https://github.com/samber/go-gpt-3-encoder
- Owner: samber
- License: mit
- Created: 2022-12-21T22:53:13.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-12-02T07:56:39.000Z (10 months ago)
- Last Synced: 2025-03-29T02:05:51.751Z (6 months ago)
- Topics: bpe, byte-pair-encoding, codex, decoder, encoder, go, gpt-2, gpt-3, openai, token, tokenizer, transformer
- Language: Go
- Homepage: https://pkg.go.dev/github.com/samber/go-gpt-3-encoder
- Size: 558 KB
- Stars: 80
- Watchers: 2
- Forks: 21
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# go-gpt-3-encoder
Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3.
## About
GPT2 and GPT3 use byte pair encoding to turn text into a series of integers to feed into the model. This is a Go implementation of OpenAI's original Python encoder/decoder which can be found [here](https://github.com/openai/gpt-2/blob/master/src/encoder.py).
This code was inspired by the [JavaScript implementation](https://github.com/latitudegames/GPT-3-Encoder) and was partially generated by OpenAI itself!
> [!WARNING]
> This implementation of the BPE tokenizer is not valid. See [https://x.com/burkov/status/1863007688470241336](https://x.com/burkov/status/1863007688470241336).

## Install
```bash
go get github.com/samber/go-gpt-3-encoder
```

## Usage
```go
import (
	"fmt"
	"log"

	tokenizer "github.com/samber/go-gpt-3-encoder"
)

encoder, err := tokenizer.NewEncoder()
if err != nil {
	log.Fatal(err)
}

str := "This is an example sentence to try encoding out on!"

encoded, err := encoder.Encode(str)
if err != nil {
	log.Fatal(err)
}

fmt.Println("We can look at each token and what it represents:")
for _, token := range encoded {
	fmt.Printf("%d -- %s\n", token, encoder.Decode([]int{token}))
}

decoded := encoder.Decode(encoded)
fmt.Printf("We can decode it back into: %s\n", decoded)
```

## Contribute
Some corner cases are not covered by this library. See `@TODO` in tests.