https://github.com/lucidrains/charformer-pytorch
Implementation of the GBST block from the Charformer paper, in Pytorch
artificial-intelligence deep-learning tokenization transformer
- Host: GitHub
- URL: https://github.com/lucidrains/charformer-pytorch
- Owner: lucidrains
- License: mit
- Created: 2021-06-30T16:32:13.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-07-15T01:20:40.000Z (over 4 years ago)
- Last Synced: 2025-07-03T08:37:51.031Z (7 months ago)
- Topics: artificial-intelligence, deep-learning, tokenization, transformer
- Language: Python
- Homepage:
- Size: 77.1 KB
- Stars: 117
- Watchers: 5
- Forks: 11
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README

## Charformer - Pytorch
Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.
See also: the AI Coffee Break with Letitia video covering the paper.
## Install
```bash
$ pip install charformer-pytorch
```
## Usage
```python
import torch
from charformer_pytorch import GBST
tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)
tokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)
mask = torch.ones(1, 1023).bool()
# both tokens and mask will be appropriately downsampled
tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)
# now pass this on to your transformer
```
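As the last comment suggests, the downsampled embeddings and mask can be handed to any encoder that accepts a padding mask. Below is a minimal sketch using PyTorch's built-in `nn.TransformerEncoder` purely for illustration; it is not part of this repository. Note that `src_key_padding_mask` expects `True` at positions to be ignored, so the GBST mask (where `True` means keep) is inverted.
```python
import torch
from torch import nn
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,
    dim = 512,
    max_block_size = 4,
    downsample_factor = 4,
    score_consensus_attn = True
)

# any standard encoder works downstream; nn.TransformerEncoder is just one option
encoder_layer = nn.TransformerEncoderLayer(d_model = 512, nhead = 8, batch_first = True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers = 6)

tokens = torch.randint(0, 257, (1, 1023))
mask = torch.ones(1, 1023).bool()

embeds, mask = tokenizer(tokens, mask = mask)   # (1, 256, 512), (1, 256)

# nn.TransformerEncoder treats True in src_key_padding_mask as "ignore this position",
# so the boolean mask returned by GBST (True = keep) is inverted here
out = encoder(embeds, src_key_padding_mask = ~mask)   # (1, 256, 512)
```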
Deviating from the paper, you can also specify block size(s) with different offsets. This covers a potential use case in genomics pre-training, where the tokenizer should be able to learn the correct reading frame. Simply omit `max_block_size` and pass in `blocks` as a tuple of `(block size, offset)` pairs. Offsets must be less than the block size.
```python
import torch
from charformer_pytorch import GBST
tokenizer = GBST(
    num_tokens = 4 + 1,                   # 4 base pairs, plus 1 token for padding
    dim = 512,
    blocks = ((3, 0), (3, 1), (3, 2)),    # block size of 3, with offsets of 0, 1, 2
    downsample_factor = 3,
    score_consensus_attn = True
).cuda()
basepairs = torch.randint(0, 4, (1, 1023)).cuda()
mask = torch.ones(1, 1023).bool().cuda()
# both basepairs and mask will be appropriately downsampled
basepairs, mask = tokenizer(basepairs, mask = mask)
```
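For reference, since the sequence length of 1023 divides evenly by the downsample factor of 3, the outputs above should come back with a sequence length of 341. This is an expectation based on the stated downsampling behavior, not an API guarantee; a quick sanity check:
```python
# sequence length 1023 downsampled by a factor of 3 -> 341
print(basepairs.shape)   # expected: torch.Size([1, 341, 512])
print(mask.shape)        # expected: torch.Size([1, 341])
```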
## Citations
```bibtex
@misc{tay2021charformer,
    title         = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization},
    author        = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year          = {2021},
    eprint        = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```