https://github.com/nlpodyssey/gotokenizers
Go implementation of today's most used tokenizers
- Host: GitHub
- URL: https://github.com/nlpodyssey/gotokenizers
- Owner: nlpodyssey
- License: bsd-2-clause
- Created: 2020-03-14T15:36:20.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-12-12T23:29:39.000Z (almost 5 years ago)
- Last Synced: 2025-03-26T19:03:37.166Z (7 months ago)
- Topics: bert, language-model, natural-language-processing, natural-language-understanding, nlp, transformers
- Language: Go
- Homepage:
- Size: 224 KB
- Stars: 41
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# GoTokenizers
[Go Report Card](https://goreportcard.com/report/github.com/nlpodyssey/gotokenizers)
[Maintainability](https://codeclimate.com/github/nlpodyssey/gotokenizers/maintainability)
[BetterCodeHub](https://bettercodehub.com/)
[Codecov](https://codecov.io/gh/nlpodyssey/gotokenizers)

GoTokenizers is a Go module implementing some of the most used
state-of-the-art natural language tokenizers.

This is a **work-in-progress** port of Hugging Face's original
[tokenizers](https://github.com/huggingface/tokenizers), written in Rust.

The project's goal is the creation of a similar tokenization library entirely
written in Go - that is, without requiring foreign language integrations.
Without any competitive performance ambition, we simply aim to provide a good
alternative for the Go community.

**Warning: the development is currently in its very early stage.** We are
progressively porting the original Rust code to Go, while the Rust code
itself is still under active development.
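
For readers new to the topic, here is what a typical pre-tokenization step does. The sketch below is a minimal, self-contained illustration in plain Go of BERT-style basic pre-tokenization (lower-casing, then splitting on whitespace and punctuation). It deliberately does not call the gotokenizers API, whose package layout and signatures are still evolving and may differ from what is shown here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// preTokenize lower-cases the input and splits it into tokens on whitespace,
// treating every punctuation rune as a token of its own. This mirrors the
// kind of transformation a "basic" tokenizer component performs, but it is
// only an illustrative stand-in, not the gotokenizers implementation.
func preTokenize(text string) []string {
	var tokens []string
	var current strings.Builder

	flush := func() {
		if current.Len() > 0 {
			tokens = append(tokens, current.String())
			current.Reset()
		}
	}

	for _, r := range strings.ToLower(text) {
		switch {
		case unicode.IsSpace(r):
			flush() // whitespace only separates tokens
		case unicode.IsPunct(r):
			flush()
			tokens = append(tokens, string(r)) // punctuation becomes its own token
		default:
			current.WriteRune(r)
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(preTokenize("Hello, world! GoTokenizers is a WIP."))
	// Prints: [hello , world ! gotokenizers is a wip .]
}
```

In the Hugging Face design that this project ports, such a step is only one stage of a pipeline (normalization, pre-tokenization, a model such as BPE or WordPiece, post-processing); the real library composes these stages rather than hard-coding them as above.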
We want to reach a stable and complete port of the library quickly, so we are
taking a quantity-over-quality approach: we consciously sacrifice code
elegance and performance, keeping instead a strict parallelism between the
original Rust code and the newly written Go code.

Only in a second phase, once we have a first complete port of the library, do
we expect to focus on cleaning, refactoring, testing, and maybe even
re-engineering some components. Eventually, we will evaluate further
expansions, and a possible integration or combination with the
[spago](https://github.com/nlpodyssey/spago) library. However, as you can
guess, it's too soon to tell!

## License
GoTokenizers is licensed under the
[BSD 2-Clause "Simplified" License](https://github.com/nlpodyssey/gotokenizers/blob/master/LICENSE).

This project is strongly based on
[Hugging Face Tokenizers](https://github.com/huggingface/tokenizers),
which is licensed under the
[Apache License 2.0](https://github.com/huggingface/tokenizers/blob/master/LICENSE).