Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shjwudp/megabyte
A PyTorch implementation of MEGABYTE, a multi-scale transformer architecture that is tokenization-free and uses sub-quadratic attention. Paper: https://arxiv.org/abs/2305.07185
deep-learning language-model sub-quadratic-attention tokenization-free
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/shjwudp/megabyte
- Owner: shjwudp
- License: mit
- Created: 2023-06-01T14:08:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-06T06:36:30.000Z (11 months ago)
- Last Synced: 2024-11-19T01:54:04.001Z (about 2 months ago)
- Topics: deep-learning, language-model, sub-quadratic-attention, tokenization-free
- Language: Python
- Homepage:
- Size: 61.5 KB
- Stars: 4
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Megabyte
This repository implements [MEGABYTE](https://arxiv.org/abs/2305.07185) in PyTorch and explores best practices for the Megabyte architecture. The original architecture described in the paper is implemented in [megabyte.py](./model/megabyte.py), and the best practices are implemented in [megabyte_in_action.py](./model/megabyte_in_action.py).
Megabyte is a new architecture that overcomes the performance drawbacks of end-to-end training on raw bytes and makes tokenization-free autoregressive sequence modeling possible.
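Conceptually, Megabyte splits a byte sequence of length T into K = T/P patches: a global transformer attends over the K patch positions while a small local transformer predicts the bytes inside each patch, which keeps attention sub-quadratic in T. The following is a minimal sketch of the patching step using `einops.rearrange`, with illustrative tensor names and the same shapes as the training example below (it is not the repository's internal API):

```python
import torch
from einops import rearrange

# Illustrative shapes: B=2 sequences of T=1024 bytes, patch size P=4.
B, T, P = 2, 1024, 4
byte_ids = torch.randint(0, 256, (B, T))

# Group the byte stream into K = T // P patches of P bytes each.
# The global model runs over the K patch positions; the local model runs within each patch.
patches = rearrange(byte_ids, "b (k p) -> b k p", p=P)
print(patches.shape)  # torch.Size([2, 256, 4])
```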
## Megabyte in autoregressive training
```python
import torch
import torch.nn.functional as F
from einops import rearrange
from model import MegabyteConfig, Megabyte

V = 512       # vocabulary size: input bytes cover 256 values, and the extra 256 are reserved for special tokens
P = 4 # patch size
D_G = 512 # global model dimension
D_L = 128 # local model dimension
T = 1024 # sequence length
B = 2 # batch size
K = T//P # number of patches
PAD_ID = 257 # padding token id
EOS_ID = 258  # end of sequence token id

config = MegabyteConfig(
    V=V,
    P=P,
    D_G=D_G,
    D_L=D_L,
    T_MAX=T,
    initializer_range=0.02,  # parameter initialization value range
    g_nlayers=4,             # number of global model layers
    g_nheads=32,             # number of global model attention heads
    l_nlayers=2,             # number of local model layers
    l_nheads=2,              # number of local model attention heads
    pad_id=PAD_ID,
    eos_id=EOS_ID,
)
megabyte = Megabyte(config)
input_ids = torch.randint(0, 255, (B, T))
# Autoregressive training: Megabyte takes input_ids[:, :-1] as inputs and input_ids[:, :] as labels, learning to predict the next token.
loss = megabyte(input_ids, return_loss=True).loss
loss.backward()
print(loss.norm())
```
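The snippet above computes a single loss and backward pass. A complete optimization step just wraps this in a standard PyTorch optimizer; the following is a minimal sketch that reuses `megabyte`, `B`, and `T` from above (the optimizer choice and hyperparameters are illustrative, not the repository's training setup):

```python
import torch

optimizer = torch.optim.AdamW(megabyte.parameters(), lr=3e-4)

for step in range(100):
    input_ids = torch.randint(0, 255, (B, T))  # substitute real byte data here
    loss = megabyte(input_ids, return_loss=True).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```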
## Megabyte in generation

```python
...
from model.megabyte_transformers import MegabyteLMHeadModel, MegabyteTokenizer
lm_head_megabyte = MegabyteLMHeadModel.from_native_megabyte(megabyte)
tokenizer = MegabyteTokenizer(
    eos_token_id=lm_head_megabyte.config.eos_token_id,
)

inputs = tokenizer("Today is", return_tensors="pt")
outputs = lm_head_megabyte.generate(
    **inputs,
    max_new_tokens=5,
    return_dict_in_generate=True,
    output_scores=True,
)

texts = tokenizer.decode(outputs.sequences)
print(texts)
```
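Because the model operates directly on bytes, mapping text to model inputs and back requires no learned vocabulary. The repository's `MegabyteTokenizer` handles this above; purely to illustrate what tokenization-free means, here is how text round-trips through raw byte IDs:

```python
# Byte-level "tokenization": text maps to byte IDs and back without a learned vocabulary.
text = "Today is"
byte_ids = list(text.encode("utf-8"))       # [84, 111, 100, 97, 121, 32, 105, 115]
restored = bytes(byte_ids).decode("utf-8")  # "Today is"
assert restored == text
```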
## Benchmark

You can use the [benchmark.py](https://github.com/shjwudp/megabyte/blob/main/benchmark.py) script to measure Megabyte's performance. The following table compares Megabyte and GPT-2 trained on wikitext-103-v1 at the same parameter scale; a quick way to check the parameter counts is sketched after the table.
| model | # of parameters (M) | training speed (KB/s) | GPU memory allocated (%) | eval loss ↓ | eval bpc ↓ |
| :---- | :------------------ | :-------------------- | :----------------------- | :---------- | :--------- |
| gpt2 | 119 | 143.68 | 42.97 | 5.06 | 1.10 |
| megabyte(P=8) | 126 | 189.13 | 17.62 | 1.13 | 1.13 |
| megabyte_in_action(P=8) | 126 | 197.47 | 18.69 | 1.09 | 1.09 |
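As a sanity check on the "same parameter scale" claim, parameter counts can be compared directly in PyTorch. This is a minimal sketch; it assumes a `megabyte` model instantiated as in the training example and that the `transformers` package is available for the GPT-2 baseline:

```python
import torch
from transformers import GPT2LMHeadModel  # assumes transformers is installed

def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameter count, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

print(f"megabyte: {count_parameters_m(megabyte):.0f}M parameters")
print(f"gpt2:     {count_parameters_m(GPT2LMHeadModel.from_pretrained('gpt2')):.0f}M parameters")
```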
## Citation

```text
@misc{yu2023megabyte,
title={MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers},
author={Lili Yu and Dániel Simig and Colin Flaherty and Armen Aghajanyan and Luke Zettlemoyer and Mike Lewis},
year={2023},
eprint={2305.07185},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```