https://github.com/lucidrains/charformer-pytorch

## Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.
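To make the idea concrete, here is a framework-agnostic NumPy sketch of gradient-based subword tokenization, not the library's actual implementation: embed each byte, mean-pool candidate blocks of sizes 1 through `max_block_size`, score each candidate with a learned vector, softmax over block sizes per position, mix the candidates by those scores, then downsample. All weights below are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def gbst_sketch(byte_ids, dim=8, max_block_size=4, downsample_factor=4):
    """Illustrative GBST-style forward pass (toy weights, no gradients)."""
    n = len(byte_ids)
    emb = rng.normal(size=(257, dim))      # byte embedding table (stand-in)
    score_w = rng.normal(size=dim)         # block-scoring vector (stand-in)
    x = emb[byte_ids]                      # (n, dim)

    # candidate block representations: each position repeats its block's mean
    cands, scores = [], []
    for b in range(1, max_block_size + 1):
        pad = (-n) % b
        xp = np.pad(x, ((0, pad), (0, 0)))
        pooled = xp.reshape(-1, b, dim).mean(axis=1)   # one vector per block
        upsampled = np.repeat(pooled, b, axis=0)[:n]   # back to length n
        cands.append(upsampled)
        scores.append(upsampled @ score_w)             # scalar score per position

    cands = np.stack(cands)                            # (B, n, dim)
    scores = np.stack(scores)                          # (B, n)
    weights = np.exp(scores - scores.max(0))
    weights /= weights.sum(0)                          # softmax over block sizes
    mixed = (weights[..., None] * cands).sum(0)        # (n, dim)

    # downsample by mean-pooling non-overlapping windows
    pad = (-n) % downsample_factor
    mixed = np.pad(mixed, ((0, pad), (0, 0)))
    return mixed.reshape(-1, downsample_factor, dim).mean(axis=1)

out = gbst_sketch(rng.integers(0, 257, size=1023))
print(out.shape)  # (256, 8)
```

Because the block scores are differentiable, the scoring parameters can be trained end-to-end with the rest of the network, which is what makes the "tokenization" learnable.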

AI Coffee Break with Letitia video

## Install

```bash
$ pip install charformer-pytorch
```

## Usage

```python
import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # factor by which the output sequence length is reduced
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # sequence length (1023) need not be divisible by the downsample factor
mask = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer
```
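The shape comment above shows 1023 tokens downsampled to 256. Assuming the module pads the sequence before pooling (consistent with the example), the output length is the ceiling of the input length divided by `downsample_factor`:

```python
import math

seq_len, downsample_factor = 1023, 4
out_len = math.ceil(seq_len / downsample_factor)
print(out_len)  # 256
```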

Deviating from the paper, you can also specify block sizes with different offsets. This covers a potential use case in genomics pre-training, where the tokenizer should be able to learn the correct reading frame. Simply omit `max_block_size` and pass in `blocks` as a tuple of `(block size, offset)` tuples. Offsets must be less than the block size.

```python
import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 4 + 1,
    dim = 512,
    blocks = ((3, 0), (3, 1), (3, 2)),   # block size of 3, with offsets of 0, 1, 2
    downsample_factor = 3,
    score_consensus_attn = True
).cuda()

basepairs = torch.randint(0, 4, (1, 1023)).cuda()
mask = torch.ones(1, 1023).bool().cuda()

# both basepairs and mask will be appropriately downsampled

basepairs, mask = tokenizer(basepairs, mask = mask)
```
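To illustrate what the offsets do, here is a small pure-Python sketch of block boundaries for block size 3 at offsets 0, 1, and 2, assuming an offset shifts the frame so a shorter partial block appears at the front. This is an illustration of the framing idea, not the library's internal padding scheme.

```python
def blocks_with_offset(seq, block_size, offset):
    """Split seq into blocks of block_size starting at `offset`;
    a shorter partial block may appear at the front."""
    head = [seq[:offset]] if offset else []
    tail = [seq[i:i + block_size] for i in range(offset, len(seq), block_size)]
    return head + tail

dna = "ACGTACGTA"
for off in (0, 1, 2):
    print(off, blocks_with_offset(dna, 3, off))
# 0 ['ACG', 'TAC', 'GTA']
# 1 ['A', 'CGT', 'ACG', 'TA']
# 2 ['AC', 'GTA', 'CGT', 'A']
```

With all three offsets scored jointly, the model can learn which frame yields the most useful codon-like blocks.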

## Citations

```bibtex
@misc{tay2021charformer,
    title         = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization},
    author        = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year          = {2021},
    eprint        = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```