https://github.com/milansuk/token_go

Simple & fast Encoder/Decoder for tiktoken vocabulary.

## Token_go
Simple & fast Encoder/Decoder for tiktoken vocabulary.
Implemented from scratch (no regex library). The tokenizer is in `vocab.go`, which has about 120 lines of code.

## Performance
p50k_base.tiktoken:
- Encoder: 4.625M toks/sec, 19.143 MB/sec, 1 thread
- Decoder: 37.817M toks/sec, 156.516 MB/sec, 1 thread

cl100k_base.tiktoken:
- Encoder: 3.949M toks/sec, 16.748 MB/sec, 1 thread
- Decoder: 35.825M toks/sec, 151.952 MB/sec, 1 thread

Server (p50k_base):
- 8 clients each call `Encode("Hi there!" + index)` 100K times.
- 800K total requests in 26.7 sec => about 30K req/sec.

## Examples
Encode/Decode:

```go
vb, err := NewVocab("p50k_base.tiktoken", true)
if err != nil {
	log.Fatal(err)
}

toks := vb.Encode("Hi there!") // text -> token ids
fmt.Println(toks)

str := vb.Decode(toks) // token ids -> text
fmt.Println(str)
```
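
To give a feel for what happens behind `Encode` and `Decode`, here is a minimal sketch of the byte-pair merge used with tiktoken-style vocabularies: start from single bytes and repeatedly merge the adjacent pair whose concatenation has the lowest rank. It is not the code in `vocab.go`. `bpeEncode` and `bpeDecode` are hypothetical names, the vocabulary is a toy one, the pre-splitting of text into chunks is left out, and every input byte is assumed to have its own rank.

```go
package main

import "fmt"

// bpeEncode greedily merges adjacent pieces, always choosing the pair whose
// concatenation has the lowest rank, until no mergeable pair remains.
// Assumes every single byte of text is present in ranks.
func bpeEncode(ranks map[string]int, text string) []int {
	parts := make([]string, 0, len(text))
	for i := 0; i < len(text); i++ {
		parts = append(parts, text[i:i+1]) // one piece per byte
	}
	for len(parts) > 1 {
		best, bestRank := -1, 0
		for i := 0; i+1 < len(parts); i++ {
			r, ok := ranks[parts[i]+parts[i+1]]
			if ok && (best < 0 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			break // nothing left to merge
		}
		parts[best] += parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
	}
	toks := make([]int, len(parts))
	for i, p := range parts {
		toks[i] = ranks[p]
	}
	return toks
}

// bpeDecode concatenates the byte string behind each token id.
func bpeDecode(ranks map[string]int, toks []int) string {
	byID := make(map[int]string, len(ranks))
	for s, r := range ranks {
		byID[r] = s
	}
	var out []byte
	for _, t := range toks {
		out = append(out, byID[t]...)
	}
	return string(out)
}

func main() {
	// Toy vocabulary, not real p50k_base ranks.
	ranks := map[string]int{"H": 0, "i": 1, "!": 2, "Hi": 3}
	toks := bpeEncode(ranks, "Hi!")
	fmt.Println(toks, bpeDecode(ranks, toks)) // prints "[3 2] Hi!"
}
```

Decoding is just a lookup and concatenation per token, while encoding has to search for merges.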

Client/Server:

```go
go NewServer("8090", true) // run the server in a separate goroutine

client := NewClient("localhost:8090", "p50k_base")

toks, err := client.Encode([]byte("Hi there!"))
if err != nil {
	log.Fatal(err)
}
fmt.Println(toks)

text, err := client.Decode([]int{17250, 612, 0})
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)
```

## Build
Written in [Go](https://go.dev/doc/install). No dependencies.

```sh
git clone https://github.com/milansuk/token_go
cd token_go
go build
./token_go
```

## Author
Milan Suk

Email: milan@skyalt.com

Twitter: https://twitter.com/milansuk/

**Sponsor**: https://github.com/sponsors/MilanSuk

*Feel free to follow or contact me with any ideas, questions, or problems.*

## Contributing
Your feedback and code are welcome!

For bug reports or questions, please use [GitHub Issues](https://github.com/skyaltlabs/skyalt/issues).

SkyAlt is licensed under the **Apache v2.0** license. This repository includes 100% of the code.