https://github.com/alula/tokenizers

A fast and easy to use implementation of today's most used tokenizers.

Host: GitHub
URL: https://github.com/alula/tokenizers
Owner: alula
Created: 2020-01-01T22:47:44.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-01-01T22:48:42.000Z (over 5 years ago)
Last Synced: 2025-03-24T15:57:44.811Z (2 months ago)
Language: Rust
Size: 35.2 KB
Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

[![PyPI version](https://badge.fury.io/py/tokenizers.svg)](https://badge.fury.io/py/tokenizers)

# Tokenizers

A fast and easy to use implementation of today's most used tokenizers.

 - High Level design: [master](https://github.com/huggingface/tokenizers)

The API is still being stabilized. Breaking changes may land frequently in the coming days and weeks, so use it at your own risk.

### Installation

#### With pip:

```bash
pip install tokenizers
```

#### From sources:

To use this method, you need to have the Rust nightly toolchain installed.

```bash
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"

# Or select the right toolchain:
rustup default nightly-2019-11-01
```

Once Rust is installed and using the right toolchain you can do the following.

```bash
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
```
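
After `maturin develop --release` finishes, one quick sanity check is to confirm that the module can be found from the active virtual env. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, not part of `tokenizers`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located in the current environment."""
    # find_spec returns None when no finder can locate the module.
    return importlib.util.find_spec(module_name) is not None


# Reports whether the freshly built package is visible to this interpreter.
print("tokenizers importable:", is_importable("tokenizers"))
```

If this prints `False`, the build most likely went into a different environment than the one currently activated.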

### Usage

#### Use a pre-trained tokenizer

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
```
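
To give a feel for what a BPE model does with its vocabulary and merges files, here is a simplified pure-Python sketch. It is not the library's implementation: the real model works on byte-level symbols and is written in Rust. The helper `apply_merges` and the toy data are hypothetical stand-ins for `merges.txt` and `vocab.json`.

```python
from typing import Dict, List, Tuple


def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split `word` into characters, then apply merge rules in priority order."""
    # A lower rank means the merge was learned earlier and takes priority.
    ranks: Dict[Tuple[str, str], int] = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Pick the adjacent pair with the best (lowest) rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        # Merge every occurrence of the chosen pair, scanning left to right.
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical toy data (assumption: not taken from any real model).
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
toy_vocab = {"magic": 0, "ma": 1, "gi": 2, "c": 3, "s": 4}

tokens = apply_merges("magics", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```

The merges list decides how characters are glued together, and the vocabulary maps each resulting token to an integer id, which is why both files are needed to load a pre-trained model.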

#### Train a new tokenizer

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())

# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
```
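
The roles of `vocab_size` and `min_frequency` can be illustrated with a toy BPE training loop. This is a simplified sketch under stated assumptions, not the trainer's actual code: it works on an in-memory list of words instead of files, uses characters rather than bytes, and `train_bpe` is a hypothetical function name. Only the two parameter names mirror the `BpeTrainer` call above.

```python
from collections import Counter
from typing import List, Tuple


def train_bpe(words: List[str], vocab_size: int, min_frequency: int) -> List[Tuple[str, str]]:
    """Learn merge rules until the vocabulary is full or pairs become too rare."""
    # Each distinct word is stored as a tuple of symbols with its corpus count.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for word in corpus for ch in word}
    merges: List[Tuple[str, str]] = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for word, count in corpus.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best, freq = pair_counts.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are too rare to be worth merging

        merges.append(best)
        vocab.add(best[0] + best[1])

        # Rewrite the corpus with the new merged symbol.
        new_corpus: Counter = Counter()
        for word, count in corpus.items():
            out: List[str] = []
            i = 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus

    return merges


# Hypothetical toy corpus (assumption: stands in for the dataset files).
toy_words = ["low", "low", "lower", "lowest"]
merges = train_bpe(toy_words, vocab_size=20, min_frequency=2)
print(merges)
```

A larger `vocab_size` allows more merges and therefore longer tokens, while a higher `min_frequency` stops training early once the remaining pairs are rare.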
