Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shjwudp/megabyte
A PyTorch implementation of MEGABYTE, a multi-scale transformer architecture that is tokenization-free and uses sub-quadratic attention. Paper: https://arxiv.org/abs/2305.07185
deep-learning language-model sub-quadratic-attention tokenization-free
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/shjwudp/megabyte
- Owner: shjwudp
- License: mit
- Created: 2023-06-01T14:08:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-06T06:36:30.000Z (11 months ago)
- Last Synced: 2024-11-19T01:54:04.001Z (about 2 months ago)
- Topics: deep-learning, language-model, sub-quadratic-attention, tokenization-free
- Language: Python
- Homepage:
- Size: 61.5 KB
- Stars: 4
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Megabyte
This repository implements [MEGABYTE](https://arxiv.org/abs/2305.07185) in PyTorch and explores best practices for the Megabyte architecture. The original architecture described in the paper is implemented in [megabyte.py](./model/megabyte.py), and the best practices are implemented in [megabyte_in_action.py](./model/megabyte_in_action.py).
Megabyte is a new architecture that overcomes the performance drawbacks of end-to-end training on raw bytes and makes tokenization-free autoregressive sequence modeling possible.
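Conceptually, Megabyte splits a byte sequence of length T into K = T/P patches: a global transformer attends over the K patch positions while a small local transformer predicts the bytes inside each patch, which keeps attention sub-quadratic in T. The following is a minimal sketch of the patching step using `einops.rearrange`, with illustrative tensor names and the same shapes as the training example below (it is not the repository's internal API):

```python
import torch
from einops import rearrange

# Illustrative shapes: B=2 sequences of T=1024 bytes, patch size P=4.
B, T, P = 2, 1024, 4
byte_ids = torch.randint(0, 256, (B, T))

# Group the byte stream into K = T // P patches of P bytes each.
# The global model runs over the K patch positions; the local model runs within each patch.
patches = rearrange(byte_ids, "b (k p) -> b k p", p=P)
print(patches.shape)  # torch.Size([2, 256, 4])
```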
## Megabyte in autoregressive training
```python
import torch
import torch.nn.functional as F
from einops import rearrange
from model import MegabyteConfig, Megabyte

V = 512       # vocabulary size: input bytes cover 256 values, and the extra 256 are reserved for special tokens
P = 4 # patch size
D_G = 512 # global model dimension
D_L = 128 # local model dimension
T = 1024 # sequence length
B = 2 # batch size
K = T//P # number of patches
PAD_ID = 257 # padding token id
EOS_ID = 258  # end of sequence token id

config = MegabyteConfig(
    V=V,
    P=P,
    D_G=D_G,
    D_L=D_L,
    T_MAX=T,
    initializer_range=0.02,  # parameter initialization value range
    g_nlayers=4,             # number of global model layers
    g_nheads=32,             # number of global model attention heads
    l_nlayers=2,             # number of local model layers
    l_nheads=2,              # number of local model attention heads
    pad_id=PAD_ID,
    eos_id=EOS_ID,
)
megabyte = Megabyte(config)
input_ids = torch.randint(0, 255, (B, T))
# Autoregressive training: Megabyte takes input_ids[:, :-1] as inputs and input_ids[:, :] as labels, learning to predict the next token.
loss = megabyte(input_ids, return_loss=True).loss
loss.backward()
print(loss.norm())
```
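The snippet above computes a single loss and backward pass. A complete optimization step just wraps this in a standard PyTorch optimizer; the following is a minimal sketch that reuses `megabyte`, `B`, and `T` from above (the optimizer choice and hyperparameters are illustrative, not the repository's training setup):

```python
import torch

optimizer = torch.optim.AdamW(megabyte.parameters(), lr=3e-4)

for step in range(100):
    input_ids = torch.randint(0, 255, (B, T))  # substitute real byte data here
    loss = megabyte(input_ids, return_loss=True).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```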
## Megabyte in generation

```python
...
from model.megabyte_transformers import MegabyteLMHeadModel, MegabyteTokenizer
lm_head_megabyte = MegabyteLMHeadModel.from_native_megabyte(megabyte)
tokenizer = MegabyteTokenizer(
    eos_token_id=lm_head_megabyte.config.eos_token_id,
)

inputs = tokenizer("Today is", return_tensors="pt")
outputs = lm_head_megabyte.generate(
    **inputs,
    max_new_tokens=5,
    return_dict_in_generate=True,
    output_scores=True,
)

texts = tokenizer.decode(outputs.sequences)
print(texts)
```
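Because the model operates directly on bytes, mapping text to model inputs and back requires no learned vocabulary. The repository's `MegabyteTokenizer` handles this above; purely to illustrate what tokenization-free means, here is how text round-trips through raw byte IDs:

```python
# Byte-level "tokenization": text maps to byte IDs and back without a learned vocabulary.
text = "Today is"
byte_ids = list(text.encode("utf-8"))       # [84, 111, 100, 97, 121, 32, 105, 115]
restored = bytes(byte_ids).decode("utf-8")  # "Today is"
assert restored == text
```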
## Benchmark

You can use the [benchmark.py](https://github.com/shjwudp/megabyte/blob/main/benchmark.py) script to measure Megabyte's performance. The following table compares Megabyte and GPT-2 trained on wikitext-103-v1 at the same parameter scale; a quick way to check the parameter counts is sketched after the table.
| model | # of parameters (M) | training speed (KB/s) | GPU memory allocated (%) | eval loss ↓ | eval bpc ↓ |
| :---- | :------------------ | :-------------------- | :----------------------- | :---------- | :--------- |
| gpt2 | 119 | 143.68 | 42.97 | 5.06 | 1.10 |
| megabyte(P=8) | 126 | 189.13 | 17.62 | 1.13 | 1.13 |
| megabyte_in_action(P=8) | 126 | 197.47 | 18.69 | 1.09 | 1.09 |
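As a sanity check on the "same parameter scale" claim, parameter counts can be compared directly in PyTorch. This is a minimal sketch; it assumes a `megabyte` model instantiated as in the training example and that the `transformers` package is available for the GPT-2 baseline:

```python
import torch
from transformers import GPT2LMHeadModel  # assumes transformers is installed

def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameter count, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

print(f"megabyte: {count_parameters_m(megabyte):.0f}M parameters")
print(f"gpt2:     {count_parameters_m(GPT2LMHeadModel.from_pretrained('gpt2')):.0f}M parameters")
```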
## Citation

```text
@misc{yu2023megabyte,
title={MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers},
author={Lili Yu and Dániel Simig and Colin Flaherty and Armen Aghajanyan and Luke Zettlemoyer and Mike Lewis},
year={2023},
eprint={2305.07185},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```