https://github.com/opennmt/tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support
bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode

Host: GitHub
URL: https://github.com/opennmt/tokenizer
Owner: OpenNMT
License: mit
Created: 2017-02-14T15:23:39.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2025-04-15T15:40:45.000Z (about 2 months ago)
Last Synced: 2025-05-16T05:07:28.823Z (24 days ago)
Topics: bpe, cpp, icu, machine-translation, natural-language-processing, python, sentencepiece, tokenization, tokenizer, unicode
Language: C++
Homepage: https://opennmt.net/
Size: 1.69 MB
Stars: 304
Watchers: 19
Forks: 74
Open Issues: 10
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md

[![CI](https://github.com/OpenNMT/Tokenizer/workflows/CI/badge.svg)](https://github.com/OpenNMT/Tokenizer/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/pyonmttok.svg)](https://badge.fury.io/py/pyonmttok) [![Forum](https://img.shields.io/discourse/status?server=https%3A%2F%2Fforum.opennmt.net%2F)](https://forum.opennmt.net/)

# Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

## Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

* **Reversible tokenization**
Marking joints or spaces by annotating tokens or injecting modifier characters.

* **Subword tokenization**
Support for training and using BPE and SentencePiece models.

* **Advanced text segmentation**
Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.

* **Case management**
Lowercase text and return case information as a separate feature or inject case modifier tokens.

* **Protected sequences**
Sequences can be protected against tokenization with the special characters ｟ and ｠.

See the [available options](docs/options.md) for an overview of supported features.
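
To make the idea of reversible tokenization concrete, here is a minimal pure-Python sketch of joiner annotation. It is an illustration under simplifying assumptions, not the library's implementation: the `tokenize` and `detokenize` helpers are hypothetical, the splitting rule is a plain regular expression, and the joiner marker `￭` is always attached to the right-hand piece.

```python
import re

JOINER = "￭"  # marker character shown in the examples of this README


def tokenize(text):
    """Split text into tokens and mark pieces that were glued to the previous one."""
    tokens = []
    for chunk in text.split():
        # Each whitespace-delimited chunk becomes runs of word characters
        # and single punctuation/symbol characters.
        pieces = re.findall(r"\w+|[^\w\s]", chunk)
        for index, piece in enumerate(pieces):
            # The first piece followed a space; later pieces did not,
            # so they carry the joiner marker.
            tokens.append(piece if index == 0 else JOINER + piece)
    return tokens


def detokenize(tokens):
    """Rebuild the text: joiner-marked tokens attach without a space."""
    text = ""
    for token in tokens:
        if token.startswith(JOINER):
            text += token[len(JOINER):]
        elif text:
            text += " " + token
        else:
            text += token
    return text


tokens = tokenize("Hello World!")
print(tokens)
print(detokenize(tokens))
```

Because the marker records where no space was present, detokenization can restore the input. In this sketch that holds only up to whitespace normalization, since runs of spaces collapse to a single space.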

## Using

The Tokenizer can be used from Python, C++, or the command line. Each mode exposes the same set of options.

### Python API

```bash
pip install pyonmttok
```

```python
>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '￭!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
```

See the [Python API description](bindings/python) for more details.
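
As a small usage sketch, the round trip shown above can be wrapped in a helper that reports which sentences are not restored exactly by detokenization. The helper name `round_trip_failures` is hypothetical. It relies only on the two operations shown in this README (calling the tokenizer and calling `detokenize`), so any object with that interface works.

```python
def round_trip_failures(tokenizer, sentences):
    """Return the sentences that tokenize + detokenize does not restore exactly."""
    failures = []
    for sentence in sentences:
        tokens = tokenizer(sentence)
        if tokenizer.detokenize(tokens) != sentence:
            failures.append(sentence)
    return failures


if __name__ == "__main__":
    try:
        import pyonmttok
    except ImportError:
        print("pyonmttok is not installed; run `pip install pyonmttok` first")
    else:
        # Same configuration as the example above.
        tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
        print(round_trip_failures(tokenizer, ["Hello World!"]))
```

A check like this can be handy when trying out different options, since not every configuration is guaranteed to be reversible.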

### C++ API

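```cpp
#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode with joiner annotation, as in the Python example.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  // The resulting tokens are written to the output vector.
  tokenizer.tokenize("Hello World!", tokens);
}
```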

See the [Tokenizer class](include/onmt/Tokenizer.h) for more details.

### Command line clients

```bash
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ￭!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!
```

Run a client with the `-h` flag to list the available options.
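
If the command line client needs to be driven from a script, one option is a thin `subprocess` wrapper. The sketch below assumes the client was built at `cli/tokenize` and uses only the `--mode` and `--joiner_annotate` flags shown above. The helper names `tokenize_cli_command` and `run_tokenize` are hypothetical.

```python
import subprocess


def tokenize_cli_command(mode="conservative", joiner_annotate=True, binary="cli/tokenize"):
    """Build the argument list for the tokenize client."""
    command = [binary, "--mode", mode]
    if joiner_annotate:
        command.append("--joiner_annotate")
    return command


def run_tokenize(text, **options):
    """Pipe text through the client and return its output without the trailing newline."""
    result = subprocess.run(
        tokenize_cli_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",  # the joiner marker is non-ASCII, so be explicit
        check=True,        # raise if the client exits with an error
    )
    return result.stdout.rstrip("\n")
```

Calling `run_tokenize("Hello World!\n")` then corresponds to the first shell command above, provided the binary exists at the assumed path.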

## Development

### Dependencies

* [ICU](http://site.icu-project.org/)

### Compiling

*CMake and a compiler that supports the C++11 standard are required to compile the project.*

```
git submodule update --init
mkdir build
cd build
cmake ..
make
```

This produces the dynamic library `libOpenNMTTokenizer` and the tokenization clients in `cli/`.

* To compile only the library, use the `-DLIB_ONLY=ON` flag.

### Testing

The tests use [Google Test](https://github.com/google/googletest), which is included as a Git submodule. Run the tests with:

```
mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data
```
