https://github.com/nlpoptimize/awesome-tokenizers
A curated list of tokenizer libraries for blazing-fast NLP processing.
https://github.com/nlpoptimize/awesome-tokenizers
List: awesome-tokenizers
awesome collections python python-library pythonframework sentence tokenizer wordpiece
Last synced: about 2 months ago
JSON representation
A curated list of tokenizer libraries for blazing-fast NLP processing.
- Host: GitHub
- URL: https://github.com/nlpoptimize/awesome-tokenizers
- Owner: NLPOptimize
- Created: 2025-04-02T22:53:07.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-04-09T06:44:59.000Z (about 2 months ago)
- Last Synced: 2025-04-09T07:42:39.678Z (about 2 months ago)
- Topics: awesome, collections, python, python-library, pythonframework, sentence, tokenizer, wordpiece
- Homepage:
- Size: 12.7 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
Awesome Lists containing this project
- ultimate-awesome - awesome-tokenizers - A curated list of tokenizer libraries for blazing-fast NLP processing. (Other Lists / Julia Lists)
README
# Awesome-tokenizer [](https://github.com/sindresorhus/awesome)
A repository with the 🔥 symbol is a tokenizer that is significantly faster than other tokenizers.
## 🔹 **WordPiece Tokenizer Implementations**
* 🔥 **[FlashTokenizer](https://github.com/NLPOptimize/flash-tokenizer)** (C++/Python)
* The world's fastest CPU tokenizer library!
- **[huggingface/tokenizers](https://github.com/huggingface/tokenizers)** *(Rust/Python)*
- Official Hugging Face tokenizer, fast Rust implementation with Python bindings.
- 🔥 **[FastBertTokenizer](https://github.com/kekyo/FastBertTokenizer)** *(C#)*
- Highly optimized tokenizer for speed, reduced accuracy on non-English inputs.
- **[BertTokenizers](https://github.com/microsoft/BertTokenizers)** *(C#)*
- Microsoft's original C# tokenizer implementation (slower than FastBertTokenizer).
- 🔥 **[rust-tokenizers](https://github.com/guillaume-be/rust-tokenizers)** *(Rust/Python)*
- Rust tokenizer library; faster than pure Python but slower than BlingFire or Flash.
- **[tokenizers-cpp](https://github.com/monologg/tokenizers-cpp)** *(C++)*
- Wrapper around SentencePiece and Hugging Face’s tokenizers; not a standalone implementation.
- **[bertTokenizer (Java)](https://github.com/robrua/easy-bert)** *(Java)*
- Java-based Bert tokenizer implementation.
- **[ZhuoruLin/fast-wordpiece](https://github.com/ZhuoruLin/fast-wordpiece)** *(Rust)*
- Rust implementation using LinMaxMatching; likely comparable or slower than optimized C++ versions.
- **[huggingface_tokenizer_cpp](https://github.com/BlinkDL/huggingface_tokenizer_cpp)** *(C++)*
- Naive pure C++ implementation; slow performance.
- **[SeanLee97/BertWordPieceTokenizer.jl](https://github.com/SeanLee97/BertWordPieceTokenizer.jl)** *(Julia)*
- Julia implementation, not widely benchmarked.
- 🔥 **[BlingFire](https://github.com/microsoft/BlingFire)** *(C++/Python)*
- Microsoft's high-speed tokenizer optimized for batch processing, available as Python bindings.
- **[tensorflow-text WordpieceTokenizer](https://github.com/tensorflow/text)** *(C++/Python)*
- TensorFlow-integrated Google's tokenizer optimized for use in TensorFlow pipelines.
- **[transformers BertTokenizer](https://github.com/huggingface/transformers)** *(Python)*
- Hugging Face's Python implementation; easy to use but slower due to pure Python nature.
- **[Deep Java Library (DJL) BertTokenizer](https://github.com/deepjavalibrary/djl)** *(Java)*
- Amazon’s Java implementation, integrated within DJL framework.
- **[tokenizers.net](https://github.com/ScottLogic/tokenizers.net)** *(C#)*
- .NET/C# binding of Hugging Face tokenizers optimized for .NET runtimes.
- **[Tokenizers.jl](https://github.com/JuliaText/Tokenizers.jl)** *(Julia)*
- Julia tokenizer library inspired by Hugging Face implementations.
- **[fast-bert-tokenizer-py](https://github.com/kakaobrain/fast-bert-tokenizer-py)** *(Python/Cython)*
- Python tokenizer accelerated with Cython.
- **[ml-commons/tokenizer](https://github.com/mlcommons/tokenizer)** *(C++)*
- High-performance C++ tokenizer supporting WordPiece and other algorithms.------
## 🔹 **BPE (Byte Pair Encoding) Implementations**
- **[OpenAI TikToken](https://github.com/openai/tiktoken)** *(Rust/Python)*
- Official BPE tokenizer from OpenAI (used in GPT models), highly optimized.
- **[huggingface/tokenizers](https://github.com/huggingface/tokenizers)** *(Rust/Python)*
- General-purpose tokenizer supporting BPE, from Hugging Face.
- **[bpe-tokenizer (Rust)](https://docs.rs/bpe-tokenizer/latest/bpe_tokenizer/)** *(Rust)*
- Rust BPE tokenizer library, identifying frequent pairs effectively.
- **[YouTokenToMe](https://github.com/VKCOM/YouTokenToMe)** *(C++/Python)*
- Efficient BPE tokenizer with fast training and inference, developed by VK.com.
- 🔥 /**[fastBPE](https://github.com/glample/fastBPE)** *(C++/Python)*
- Facebook’s fast and memory-efficient BPE tokenizer, widely used in NLP research.
- 🔥 **[sentencepiece](https://github.com/google/sentencepiece)** *(C++/Python)*
- Google's SentencePiece implementation also provides BPE as one of the algorithms.
- **[Subword-nmt](https://github.com/rsennrich/subword-nmt)** *(Python)*
- Python implementation commonly used in MT research, simple but slower.
- 🔥 **[rs-bpe](https://github.com/gweidart/rs-bpe)** (Rust)
- A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
------## 🔹 **SentencePiece Implementations**
- **[google/sentencepiece](https://github.com/google/sentencepiece)** *(C++/Python)*
- Google's official, language-independent, neural-based subword tokenizer.
- **[sentencepiece-rs](https://github.com/finalfusion/sentencepiece)** *(Rust)*
- Rust binding for Google's SentencePiece.
- **[huggingface/tokenizers](https://github.com/huggingface/tokenizers)** *(Rust/Python)*
- Hugging Face tokenizer library supporting SentencePiece.
- **[TensorFlow Text SentencepieceTokenizer](https://github.com/tensorflow/text)** *(C++/Python)*
- Google's TensorFlow Text includes SentencePiece tokenizer optimized for TF environments.
- **[sentencepiece.NET](https://github.com/Curiosity-ai/sentencepiece.NET)** *(C#)*
- .NET binding for SentencePiece tokenizer.
- **[sentencepiece-jni](https://github.com/go-skynet/sentencepiece-jni)** *(Java)*
- JNI bindings for Google's SentencePiece tokenizer for Java applications.
- **[sentencepiece-swift](https://github.com/xenova/sentencepiece-swift)** *(Swift)*
- Swift bindings for Google's SentencePiece tokenizer.------
## Contributing
Your contributions are always welcome! Please take a look at the [contribution guidelines](./CONTRIBUTING.md) first.
## Question
Also, if you have any questions, please send a message directly to WeChat, Line, or Telegram below.
💬 LINE
![]()
💬 Telegram
![]()
💬 WeChat
![]()