https://github.com/lucidrains/mega-pytorch

Implementation of Mega, the Single-head Attention with Multi-headed EMA architecture that currently holds SOTA on Long Range Arena
artificial-intelligence attention-mechanisms deep-learning exponential-moving-average long-range-arena

Host: GitHub
URL: https://github.com/lucidrains/mega-pytorch
Owner: lucidrains
License: mit
Created: 2022-09-23T20:40:57.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2023-08-26T17:59:00.000Z (almost 3 years ago)
Last Synced: 2025-03-31T04:06:50.982Z (over 1 year ago)
Topics: artificial-intelligence, attention-mechanisms, deep-learning, exponential-moving-average, long-range-arena
Language: Python
Homepage:
Size: 34.2 MB
Stars: 204
Watchers: 8
Forks: 11
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


## Mega - Moving Average Equipped Gated Attention - Pytorch

Implementation of the Mega layer, which combines single-head attention with a multi-headed EMA. It is the building block of the architecture that currently holds SOTA on Long Range Arena, beating S4 on Pathfinder-X and on every other task except audio.

## Install

```bash
$ pip install mega-pytorch
```

## Usage

The Mega layer, which combines attention with a learned EMA

```python
import torch
from mega_pytorch import MegaLayer

layer = MegaLayer(
    dim = 128,                   # model dimensions
    ema_heads = 16,              # number of EMA heads
    attn_dim_qk = 64,            # dimension of queries / keys in attention
    attn_dim_value = 256,        # dimension of values in attention
    laplacian_attn_fn = False,   # whether to use softmax (false) or laplacian attention activation fn (true)
)

x = torch.randn(1, 1024, 128)     # (batch, seq, dim)

out = layer(x) # (1, 1024, 128)
```
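To give a feel for what the "learned EMA" part computes, here is a minimal sketch of a damped exponential moving average with several heads, in plain Python. It assumes the recurrence `y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}` as I understand it from the Mega paper. The helper names `damped_ema` and `multi_head_ema` are hypothetical and are not part of `mega_pytorch`; the real layer learns `alpha` and `delta` per head, adds input and output projections, and may compute the result differently (for example as a convolution).

```python
def damped_ema(xs, alpha, delta):
    """Damped EMA over a sequence of scalars (hypothetical helper).

    Assumed recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},
    starting from y_{-1} = 0.
    """
    out, prev = [], 0.0
    for x in xs:
        prev = alpha * x + (1.0 - alpha * delta) * prev
        out.append(prev)
    return out


def multi_head_ema(xs, alphas, deltas):
    # One independent EMA per head; each head smooths at its own time scale.
    # Fixed scalars stand in for the parameters the real layer would learn.
    return [damped_ema(xs, a, d) for a, d in zip(alphas, deltas)]


signal = [1.0, 1.0, 1.0]
heads = multi_head_ema(signal, alphas=[0.5, 0.25], deltas=[1.0, 1.0])
print(heads[0])  # fast head:  [0.5, 0.75, 0.875]
print(heads[1])  # slow head:  [0.25, 0.4375, 0.578125]
```

A larger `alpha` tracks the input quickly, while a smaller one averages over a longer history, which is why several heads with different decay rates are useful.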

Full Mega (with layernorm for now)

```python
import torch
from mega_pytorch import Mega

mega = Mega(
    num_tokens = 256,            # number of tokens
    dim = 128,                   # model dimensions
    depth = 6,                   # depth
    ema_heads = 16,              # number of EMA heads
    attn_dim_qk = 64,            # dimension of queries / keys in attention
    attn_dim_value = 256,        # dimension of values in attention
    laplacian_attn_fn = True,    # whether to use softmax (false) or laplacian attention activation fn (true)
)

x = torch.randint(0, 256, (1, 1024))

logits = mega(x) # (1, 1024, 256)
```
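The `laplacian_attn_fn` flag switches the attention activation between softmax and a Laplace function. The sketch below contrasts the two on raw scores in plain Python. It assumes the Laplace function is `0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))` with `mu = sqrt(1/2)` and `sigma = sqrt(1/(4*pi))`, which is how I recall it from the Mega paper; this README does not confirm those constants. The names `laplace_attn_fn` and `softmax` here are illustrative only and do not refer to anything in `mega_pytorch`.

```python
import math

# Assumed constants, recalled from the Mega paper (not stated in this README).
MU = math.sqrt(0.5)
SIGMA = math.sqrt(1.0 / (4.0 * math.pi))


def laplace_attn_fn(x, mu=MU, sigma=SIGMA):
    # Applied to each score on its own, so a row of outputs
    # does not have to sum to one.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def softmax(scores):
    # Normalizes across all scores, so the outputs always sum to one.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [-1.0, 0.0, MU, 2.0]
laplace_weights = [laplace_attn_fn(s) for s in scores]
softmax_weights = softmax(scores)
```

The practical difference is that softmax couples every score in a row through the normalization, whereas the Laplace function maps each score to a value between 0 and 1 independently.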

## Todo

- [ ] add dynamic positional bias for best length extrapolation arch

## Citations

```bibtex
@inproceedings{Ma2022MegaMA,
    title   = {Mega: Moving Average Equipped Gated Attention},
    author  = {Xuezhe Ma and Chunting Zhou and Xiang Kong and Junxian He and Liangke Gui and Graham Neubig and Jonathan May and Luke Zettlemoyer},
    year    = {2022}
}
```
