https://github.com/lucidrains/charformer-pytorch
Implementation of the GBST block from the Charformer paper, in Pytorch
artificial-intelligence deep-learning tokenization transformer
- Host: GitHub
- URL: https://github.com/lucidrains/charformer-pytorch
- Owner: lucidrains
- License: mit
- Created: 2021-06-30T16:32:13.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-07-15T01:20:40.000Z (over 4 years ago)
- Last Synced: 2025-07-03T08:37:51.031Z (7 months ago)
- Topics: artificial-intelligence, deep-learning, tokenization, transformer
- Language: Python
- Homepage:
- Size: 77.1 KB
- Stars: 117
- Watchers: 5
- Forks: 11
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README

## Charformer - Pytorch
Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.
See also: the AI Coffee Break with Letitia video covering the paper.
## Install
```bash
$ pip install charformer-pytorch
```
## Usage
```python
import torch
from charformer_pytorch import GBST
tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)
tokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)
mask = torch.ones(1, 1023).bool()
# both tokens and mask will be appropriately downsampled
tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)
# now pass this on to your transformer
```
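As the last comment suggests, the downsampled embeddings and mask can be handed to any encoder that accepts a padding mask. Below is a minimal sketch using PyTorch's built-in `nn.TransformerEncoder` purely for illustration; it is not part of this repository. Note that `src_key_padding_mask` expects `True` at positions to be ignored, so the GBST mask (where `True` means keep) is inverted.
```python
import torch
from torch import nn
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,
    dim = 512,
    max_block_size = 4,
    downsample_factor = 4,
    score_consensus_attn = True
)

# any standard encoder works downstream; nn.TransformerEncoder is just one option
encoder_layer = nn.TransformerEncoderLayer(d_model = 512, nhead = 8, batch_first = True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers = 6)

tokens = torch.randint(0, 257, (1, 1023))
mask = torch.ones(1, 1023).bool()

embeds, mask = tokenizer(tokens, mask = mask)   # (1, 256, 512), (1, 256)

# nn.TransformerEncoder treats True in src_key_padding_mask as "ignore this position",
# so the boolean mask returned by GBST (True = keep) is inverted here
out = encoder(embeds, src_key_padding_mask = ~mask)   # (1, 256, 512)
```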
Deviating from the paper, you can also specify block size(s) with different offsets. This covers a potential use case in genomics pre-training, where the tokenizer should be able to learn the correct reading frame. Simply omit `max_block_size` and pass in `blocks` as a tuple of `(block size, offset)` pairs. Offsets must be less than the block size.
```python
import torch
from charformer_pytorch import GBST
tokenizer = GBST(
    num_tokens = 4 + 1,                   # 4 base pairs, plus 1 token for padding
    dim = 512,
    blocks = ((3, 0), (3, 1), (3, 2)),    # block size of 3, with offsets of 0, 1, 2
    downsample_factor = 3,
    score_consensus_attn = True
).cuda()
basepairs = torch.randint(0, 4, (1, 1023)).cuda()
mask = torch.ones(1, 1023).bool().cuda()
# both basepairs and mask will be appropriately downsampled
basepairs, mask = tokenizer(basepairs, mask = mask)
```
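For reference, since the sequence length of 1023 divides evenly by the downsample factor of 3, the outputs above should come back with a sequence length of 341. This is an expectation based on the stated downsampling behavior, not an API guarantee; a quick sanity check:
```python
# sequence length 1023 downsampled by a factor of 3 -> 341
print(basepairs.shape)   # expected: torch.Size([1, 341, 512])
print(mask.shape)        # expected: torch.Size([1, 341])
```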
## Citations
```bibtex
@misc{tay2021charformer,
    title         = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization},
    author        = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year          = {2021},
    eprint        = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```