https://github.com/nlpoptimize/awesome-tokenizers

A curated list of tokenizer libraries for blazing-fast NLP processing.

List: awesome-tokenizers



Host: GitHub
URL: https://github.com/nlpoptimize/awesome-tokenizers
Owner: NLPOptimize
Created: 2025-04-02T22:53:07.000Z (3 months ago)
Default Branch: main
Last Pushed: 2025-04-09T06:44:59.000Z (3 months ago)
Last Synced: 2025-04-09T07:42:39.678Z (3 months ago)
Topics: awesome, collections, python, python-library, pythonframework, sentence, tokenizer, wordpiece
Homepage:
Size: 12.7 KB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md

Awesome Lists containing this project

ultimate-awesome - awesome-tokenizers - A curated list of tokenizer libraries for blazing-fast NLP processing. (Other Lists / TeX Lists)

README

# Awesome-tokenizers  [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

Repositories marked with the 🔥 symbol are tokenizers that are significantly faster than the others in their section.

## 🔹 **WordPiece Tokenizer Implementations**

- 🔥 **[FlashTokenizer](https://github.com/NLPOptimize/flash-tokenizer)** *(C++/Python)*

  - Billed by its authors as the world's fastest CPU tokenizer library.

- **[huggingface/tokenizers](https://github.com/huggingface/tokenizers)** *(Rust/Python)*

  - Official Hugging Face tokenizer, fast Rust implementation with Python bindings.

- 🔥 **[FastBertTokenizer](https://github.com/kekyo/FastBertTokenizer)** *(C#)*

  - Tokenizer highly optimized for speed, at the cost of reduced accuracy on non-English inputs.

- **[BertTokenizers](https://github.com/microsoft/BertTokenizers)** *(C#)*

  - Microsoft's original C# tokenizer implementation (slower than FastBertTokenizer).

- 🔥 **[rust-tokenizers](https://github.com/guillaume-be/rust-tokenizers)** *(Rust/Python)*

  - Rust tokenizer library; faster than pure Python but slower than BlingFire or FlashTokenizer.

- **[tokenizers-cpp](https://github.com/monologg/tokenizers-cpp)** *(C++)*

  - Wrapper around SentencePiece and Hugging Face’s tokenizers; not a standalone implementation.

- **[bertTokenizer (Java)](https://github.com/robrua/easy-bert)** *(Java)*

  - Java-based BERT tokenizer implementation.

- **[ZhuoruLin/fast-wordpiece](https://github.com/ZhuoruLin/fast-wordpiece)** *(Rust)*

  - Rust implementation using LinMaxMatching; likely comparable to or slower than optimized C++ versions.

- **[huggingface_tokenizer_cpp](https://github.com/BlinkDL/huggingface_tokenizer_cpp)** *(C++)*

  - Naive pure C++ implementation; slow performance.

- **[SeanLee97/BertWordPieceTokenizer.jl](https://github.com/SeanLee97/BertWordPieceTokenizer.jl)** *(Julia)*

  - Julia implementation, not widely benchmarked.

- 🔥 **[BlingFire](https://github.com/microsoft/BlingFire)** *(C++/Python)*

  - Microsoft's high-speed tokenizer, optimized for batch processing and available through Python bindings.

- **[tensorflow-text WordpieceTokenizer](https://github.com/tensorflow/text)** *(C++/Python)*

  - Google's TensorFlow-integrated tokenizer, optimized for use in TensorFlow pipelines.

- **[transformers BertTokenizer](https://github.com/huggingface/transformers)** *(Python)*

  - Hugging Face's Python implementation; easy to use but slower because it is pure Python.

- **[Deep Java Library (DJL) BertTokenizer](https://github.com/deepjavalibrary/djl)** *(Java)*

  - Amazon’s Java implementation, integrated within the DJL framework.

- **[tokenizers.net](https://github.com/ScottLogic/tokenizers.net)** *(C#)*

  - .NET/C# binding of Hugging Face tokenizers optimized for .NET runtimes.

- **[Tokenizers.jl](https://github.com/JuliaText/Tokenizers.jl)** *(Julia)*

  - Julia tokenizer library inspired by Hugging Face implementations.

- **[fast-bert-tokenizer-py](https://github.com/kakaobrain/fast-bert-tokenizer-py)** *(Python/Cython)*

  - Python tokenizer accelerated with Cython.

- **[ml-commons/tokenizer](https://github.com/mlcommons/tokenizer)** *(C++)*

  - High-performance C++ tokenizer supporting WordPiece and other algorithms.
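
For readers comparing the libraries above, it may help to see what they all compute. The following is a minimal sketch of the greedy longest-match-first lookup commonly used for BERT-style WordPiece, written in plain Python. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and are not taken from any listed library; the fast implementations reach the same result with much more efficient data structures.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry fits here: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "un", "##aff", "##able"}

print(wordpiece_tokenize("tokenizers", toy_vocab))
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

This naive version rescans substrings and is quadratic in the word length in the worst case, which is the cost that approaches such as LinMaxMatching are designed to avoid.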

------

## 🔹 **BPE (Byte Pair Encoding) Implementations**

- **[OpenAI TikToken](https://github.com/openai/tiktoken)** *(Rust/Python)*

  - Official BPE tokenizer from OpenAI (used in GPT models), highly optimized.

- **[huggingface/tokenizers](https://github.com/huggingface/tokenizers)** *(Rust/Python)*

  - General-purpose tokenizer supporting BPE, from Hugging Face.

- **[bpe-tokenizer (Rust)](https://docs.rs/bpe-tokenizer/latest/bpe_tokenizer/)** *(Rust)*

  - Rust BPE tokenizer library that builds its vocabulary by merging frequent symbol pairs.

- **[YouTokenToMe](https://github.com/VKCOM/YouTokenToMe)** *(C++/Python)*

  - Efficient BPE tokenizer with fast training and inference, developed by VK.com.

- 🔥 **[fastBPE](https://github.com/glample/fastBPE)** *(C++/Python)*

  - Facebook’s fast and memory-efficient BPE tokenizer, widely used in NLP research.

- 🔥 **[sentencepiece](https://github.com/google/sentencepiece)** *(C++/Python)*

  - Google's SentencePiece also provides BPE as one of its algorithms.

- **[Subword-nmt](https://github.com/rsennrich/subword-nmt)** *(Python)*

  - Python implementation commonly used in MT research, simple but slower.

- 🔥 **[rs-bpe](https://github.com/gweidart/rs-bpe)** *(Rust)*

  - A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust.
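
As background for the entries above, here is a minimal sketch of how BPE merges are learned: count adjacent symbol pairs across a word-frequency table, merge the most frequent pair, and repeat. The helper names (`count_pairs`, `merge_pair`, `learn_bpe`) and the toy corpus are hypothetical; none of the listed libraries is claimed to expose this API, and production implementations use far faster bookkeeping.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge list and the final segmentation of each word."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Encoding new text then amounts to applying the learned merges in order, which is the step the faster libraries in this section optimize most heavily.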

------

## 🔹 **SentencePiece Implementations**

- **[google/sentencepiece](https://github.com/google/sentencepiece)** *(C++/Python)*

  - Google's official language-independent subword tokenizer and detokenizer for neural network-based text processing.

- **[sentencepiece-rs](https://github.com/finalfusion/sentencepiece)** *(Rust)*

  - Rust binding for Google's SentencePiece.

- **[huggingface/tokenizers](https://github.com/huggingface/tokenizers)** *(Rust/Python)*

  - Hugging Face tokenizer library supporting SentencePiece.

- **[TensorFlow Text SentencepieceTokenizer](https://github.com/tensorflow/text)** *(C++/Python)*

  - Google's TensorFlow Text includes SentencePiece tokenizer optimized for TF environments.

- **[sentencepiece.NET](https://github.com/Curiosity-ai/sentencepiece.NET)** *(C#)*

  - .NET binding for SentencePiece tokenizer.

- **[sentencepiece-jni](https://github.com/go-skynet/sentencepiece-jni)** *(Java)*

  - JNI bindings for Google's SentencePiece tokenizer for Java applications.

- **[sentencepiece-swift](https://github.com/xenova/sentencepiece-swift)** *(Swift)*

  - Swift bindings for Google's SentencePiece tokenizer.

------

## Contributing

Your contributions are always welcome! Please take a look at the [contribution guidelines](./CONTRIBUTING.md) first.

## Questions

If you have any questions, please send a message directly via LINE, Telegram, or WeChat below.

💬 LINE

💬 Telegram

💬 WeChat
