# Tokenizer [![License](https://img.shields.io/:license-apache-blue.svg)](https://opensource.org/licenses/Apache-2.0)[![Go.Dev reference](https://img.shields.io/badge/go.dev-reference-007d9c?logo=go&logoColor=white&style=flat-square)](https://pkg.go.dev/github.com/sugarme/tokenizer?tab=doc)[![Travis CI](https://api.travis-ci.org/sugarme/tokenizer.svg?branch=master)](https://travis-ci.org/sugarme/tokenizer)[![Go Report Card](https://goreportcard.com/badge/github.com/sugarme/tokenizer)](https://goreportcard.com/report/github.com/sugarme/tokenizer)

## Overview

`tokenizer` is a pure Go package for training, testing, and running inference with Natural Language Processing (NLP) models in Go.

It is heavily inspired by and based on the popular [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers).

`tokenizer` is part of an ambitious goal (together with [**transformer**](https://github.com/sugarme/transformer) and [**gotch**](https://github.com/sugarme/gotch)) to bring more AI/deep-learning tools to Gophers so that they can stick to the language they love and build faster software in production.

## Features

`tokenizer` is built from modules located in sub-packages; a sketch of wiring them together by hand follows the list.
1. Normalizer
2. Pretokenizer
3. Tokenizer
4. Post-processing
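
Below is a minimal sketch of assembling these four stages by hand into a BERT-style tokenizer. It assumes the constructor names in the `model/wordpiece`, `normalizer`, `pretokenizer`, and `processor` sub-packages and a local BERT `vocab.txt`; verify the exact signatures on [pkg.go.dev](https://pkg.go.dev/github.com/sugarme/tokenizer?tab=doc).

```go
package main

import (
    "fmt"
    "log"

    "github.com/sugarme/tokenizer"
    "github.com/sugarme/tokenizer/model/wordpiece"
    "github.com/sugarme/tokenizer/normalizer"
    "github.com/sugarme/tokenizer/pretokenizer"
    "github.com/sugarme/tokenizer/processor"
)

func main() {
    // Tokenizer model (module 3): WordPiece built from a BERT-style
    // vocab file. "vocab.txt" is a placeholder path.
    model, err := wordpiece.NewWordPieceFromFile("vocab.txt", "[UNK]")
    if err != nil {
        log.Fatal(err)
    }
    tk := tokenizer.NewTokenizer(model)

    // Normalizer (module 1): BERT-style cleanup (clean text, handle
    // Chinese characters, strip accents, lowercase).
    tk.WithNormalizer(normalizer.NewBertNormalizer(true, true, true, true))

    // Pretokenizer (module 2): split on whitespace and punctuation.
    tk.WithPreTokenizer(pretokenizer.NewBertPreTokenizer())

    // Post-processing (module 4): wrap encodings in [CLS] ... [SEP].
    sepId, ok := tk.TokenToId("[SEP]")
    if !ok {
        log.Fatal("cannot find ID for [SEP]")
    }
    clsId, ok := tk.TokenToId("[CLS]")
    if !ok {
        log.Fatal("cannot find ID for [CLS]")
    }
    sep := processor.PostToken{Id: sepId, Value: "[SEP]"}
    cls := processor.PostToken{Id: clsId, Value: "[CLS]"}
    tk.WithPostProcessor(processor.NewBertProcessing(sep, cls))

    en, err := tk.EncodeSingle("Gophers love Go.")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("tokens: %q\n", en.Tokens)
}
```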

It implements various tokenizer models:
- [x] Word level model
- [x] Wordpiece model
- [x] Byte Pair Encoding (BPE)

It can be used both for **training** new models from scratch and for **fine-tuning** existing models. See the [examples](./example) for details. As a starting point, the sketch below constructs one of the models listed above by hand.
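
This is a minimal sketch, assuming a `NewBpeFromFiles` constructor in the `model/bpe` sub-package and local GPT-2-style `vocab.json`/`merges.txt` files; check [pkg.go.dev](https://pkg.go.dev/github.com/sugarme/tokenizer?tab=doc) for the exact signatures.

```go
package main

import (
    "fmt"
    "log"

    "github.com/sugarme/tokenizer"
    "github.com/sugarme/tokenizer/model/bpe"
)

func main() {
    // "vocab.json" and "merges.txt" are placeholder paths to a
    // GPT-2-style BPE vocabulary and merge rules.
    model, err := bpe.NewBpeFromFiles("vocab.json", "merges.txt")
    if err != nil {
        log.Fatal(err)
    }

    tk := tokenizer.NewTokenizer(model)

    en, err := tk.EncodeSingle("Gophers build fast software.")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("tokens: %q\n", en.Tokens)
}
```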

## Basic example

This package can load pretrained tokenizer models from HuggingFace. Some of them can be loaded directly using the `pretrained` subpackage.

```go
package main

import (
    "fmt"
    "log"

    "github.com/sugarme/tokenizer"
    "github.com/sugarme/tokenizer/pretrained"
)

func main() {
    // Download and cache a pretrained tokenizer; here `bert-base-uncased`
    // from HuggingFace. This can be any model that publishes a
    // `tokenizer.json`, e.g. `tiiuae/falcon-7b`.
    configFile, err := tokenizer.CachedPath("bert-base-uncased", "tokenizer.json")
    if err != nil {
        panic(err)
    }

    tk, err := pretrained.FromFile(configFile)
    if err != nil {
        panic(err)
    }

    sentence := `The Gophers craft code using [MASK] language.`
    en, err := tk.EncodeSingle(sentence)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("tokens: %q\n", en.Tokens)
    fmt.Printf("offsets: %v\n", en.Offsets)

    // Output:
    // tokens: ["the" "go" "##pher" "##s" "craft" "code" "using" "[MASK]" "language" "."]
    // offsets: [[0 3] [4 6] [6 10] [10 11] [12 17] [18 22] [23 28] [29 35] [36 44] [44 45]]
}
```
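
Beyond tokens and offsets, the returned encoding also carries the numeric ids a downstream model consumes. Continuing from the example above (a sketch: the `Ids` field and the HuggingFace-style `Decode` method are assumed here; verify them on [pkg.go.dev](https://pkg.go.dev/github.com/sugarme/tokenizer?tab=doc)):

```go
// Numeric ids for the tokens printed above.
fmt.Printf("ids: %v\n", en.Ids)

// Round-trip the ids back to text; `true` skips special tokens.
fmt.Println(tk.Decode(en.Ids, true))
```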

All models can also be loaded manually from files. See [pkg.go.dev](https://pkg.go.dev/github.com/sugarme/tokenizer?tab=doc) for the detailed APIs.

## Getting Started

- See [pkg.go.dev](https://pkg.go.dev/github.com/sugarme/tokenizer?tab=doc) for detailed API documentation

## License

`tokenizer` is Apache 2.0 licensed.

## Acknowledgement

- This project was inspired by and borrows many concepts from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers).