https://github.com/Separius/awesome-fast-attention

list of efficient attention modules

attention attention-is-all-you-need awesome linformer longformer multihead-attention reformer self-attention transformer transformer-network

# awesome-fast-attention [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)

## Table of Contents

* **[Efficient Attention](#efficient-attention)**
* **[Articles/Surveys/Benchmarks](#articlessurveysbenchmarks)**

## Efficient Attention

|Paper (citations)|Implementation|Computational Complexity|AutoRegressive|Main Idea|
|:---:|:---:|:---:|:---:|:---:|
|[Generating Wikipedia by Summarizing Long Sequences](https://arxiv.org/abs/1801.10198v1 ) (282)|[memory-compressed-attention](https://github.com/lucidrains/memory-compressed-attention ) ![](https://img.shields.io/github/stars/lucidrains/memory-compressed-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D}))|:heavy_check_mark:|compresses key and value + blocked attention|
|[CBAM: Convolutional Block Attention Module](https://arxiv.org/abs/1807.06521v2 ) (999+)|[attention-module](https://github.com/Jongchan/attention-module ) ![](https://img.shields.io/github/stars/Jongchan/attention-module.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({N}\cdot{D}%2b\frac{{D}^2}{r})%2b({N}\cdot{D}\cdot{k}^2)))|:x:|combines the SE attention with a per-pixel (local) weight|
|[Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks](https://arxiv.org/abs/1810.00825v3 ) (16)|[set_transformer](https://github.com/juho-lee/set_transformer ) ![](https://img.shields.io/github/stars/juho-lee/set_transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{K}\cdot{D}))|:x:|uses K relay nodes|
|[CCNet: Criss-Cross Attention for Semantic Segmentation](https://arxiv.org/abs/1811.11721v2 ) (296)|[CCNet](https://github.com/speedinghzl/CCNet ) ![](https://img.shields.io/github/stars/speedinghzl/CCNet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({H}%2b{W})\cdot{D}))|:x:|each pixel attends to its row and column simultaneously|
|[Efficient Attention: Attention with Linear Complexities](https://arxiv.org/abs/1812.01243v9 ) (16)|[efficient-attention](https://github.com/cmsflash/efficient-attention ) ![](https://img.shields.io/github/stars/cmsflash/efficient-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|`Softmax(Q)*(Softmax(K^T)*V)`|
|[Star-Transformer](https://arxiv.org/abs/1902.09113v2 ) (40)|[fastNLP](https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py ) ![](https://img.shields.io/github/stars/fastnlp/fastNLP.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:x:|uses a relay (global) node and attends to/from that node|
|[GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond](https://arxiv.org/abs/1904.11492v1 ) (199)|[GCNet](https://github.com/xvjiarui/GCNet ) ![](https://img.shields.io/github/stars/xvjiarui/GCNet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|squeeze and excitation with an attention pooling (instead of a GAP)|
|[Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509v1 ) (257)|[DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py ) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|sparse block-based attention|
|[SCRAM: Spatially Coherent Randomized Attention Maps](https://arxiv.org/abs/1905.10308v1 ) (1)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:heavy_check_mark:|uses PatchMatch to find close keys|
|[Interlaced Sparse Self-Attention for Semantic Segmentation](https://arxiv.org/abs/1907.12273v2 ) (24)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2%2b{N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|combines short-range attention with a subsequent long-range (dilated) attention|
|[Permutohedral Attention Module for Efficient Non-Local Neural Networks](https://arxiv.org/abs/1907.00641v2 ) (3)|[Permutohedral_attention_module](https://github.com/SamuelJoutard/Permutohedral_attention_module ) ![](https://img.shields.io/github/stars/SamuelJoutard/Permutohedral_attention_module.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|uses the permutohedral lattice approximation algorithm to approximate the attention output|
|[Large Memory Layers with Product Keys](https://arxiv.org/abs/1907.05242v2 ) (43)|[XLM](https://github.com/facebookresearch/XLM ) ![](https://img.shields.io/github/stars/facebookresearch/XLM.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({Q}\cdot({K}%2b{k}^2)\cdot{D}))|:heavy_check_mark:|searches for nearest-neighbor keys|
|[Expectation-Maximization Attention Networks for Semantic Segmentation](https://arxiv.org/abs/1907.13426v2 ) (79)|[EMANet](https://github.com/XiaLiPKU/EMANet ) ![](https://img.shields.io/github/stars/XiaLiPKU/EMANet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|applies expectation maximization to cluster keys into k clusters|
|[BP-Transformer: Modelling Long-Range Context via Binary Partitioning](https://arxiv.org/abs/1911.04070v1 ) (15)|[BPT](https://github.com/yzh119/BPT ) ![](https://img.shields.io/github/stars/yzh119/BPT.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D}))|:heavy_check_mark:|attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner|
|[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507v1 ) (48)|[compressive-transformer-pytorch](https://github.com/lucidrains/compressive-transformer-pytorch ) ![](https://img.shields.io/github/stars/lucidrains/compressive-transformer-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|compresses distant tokens instead of just applying `stop_grad()` to them; a more efficient version of Transformer-XL|
|[Axial Attention in Multidimensional Transformers](https://arxiv.org/abs/1912.12180v1 ) (36)|[axial-attention](https://github.com/lucidrains/axial-attention ) ![](https://img.shields.io/github/stars/lucidrains/axial-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({H}%2b{W})\cdot{D}))|:heavy_check_mark:|applies attention on each axis separately|
|[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451v2 ) (216)|[trax](https://github.com/google/trax/tree/master/trax/models/reformer ) ![](https://img.shields.io/github/stars/google/trax.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}^2))|:heavy_check_mark:|uses LSH to find close keys|
|[Sparse Sinkhorn Attention](https://arxiv.org/abs/2002.11296v1 ) (16)|[sinkhorn-transformer](https://github.com/lucidrains/sinkhorn-transformer ) ![](https://img.shields.io/github/stars/lucidrains/sinkhorn-transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(\frac{{N}^2}{n_b}%2b{n_b}^2))|:heavy_check_mark:|uses a cost matrix to limit attention between buckets|
|[Transformer on a Diet](https://arxiv.org/abs/2002.06170v1 ) (2)|[transformer-on-diet](https://github.com/cgraywang/transformer-on-diet ) ![](https://img.shields.io/github/stars/cgraywang/transformer-on-diet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:heavy_check_mark:|dilated transformer, like WaveNet|
|[Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184v2 ) (9)|[TaLKConvolutions](https://github.com/lioutasb/TaLKConvolutions ) ![](https://img.shields.io/github/stars/lioutasb/TaLKConvolutions.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|calculates the mean over a dynamic subsequence around each token with the help of a summed-area table|
|[SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection](https://arxiv.org/abs/2003.09833v3 ) (2)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:heavy_check_mark:|learns the q, k connections, i.e. dynamically creates a sparse attention matrix|
|[Efficient Content-Based Sparse Attention with Routing Transformers](https://arxiv.org/abs/2003.05997v5 ) (38)|[routing-transformer](https://github.com/lucidrains/routing-transformer ) ![](https://img.shields.io/github/stars/lucidrains/routing-transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|computes attention with same-cluster tokens (computed by online k-means)|
|[Neural Architecture Search for Lightweight Non-Local Networks](https://arxiv.org/abs/2004.01961v1 ) (11)|[AutoNL](https://github.com/LiYingwei/AutoNL ) ![](https://img.shields.io/github/stars/LiYingwei/AutoNL.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2))|:x:|computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions|
|[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150v2 ) (159)|[longformer](https://github.com/allenai/longformer ) ![](https://img.shields.io/github/stars/allenai/longformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({k}%2b{g})\cdot{D}))|:heavy_check_mark:|global + blocked attention|
|[ETC: Encoding Long and Structured Inputs in Transformers](https://arxiv.org/abs/2004.08483v5 ) (16)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({N}\cdot{g}%2b{g}^2%2b{N}\cdot{k})\cdot{D}))|:x:|combines global attention (star transformer with multiple global tokens) with local attention|
|[Multi-scale Transformer Language Models](https://arxiv.org/abs/2005.00581v1 ) (2)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|UNet-like + retina attention, which is something close to BP-Transformer|
|[Synthesizer: Rethinking Self-Attention in Transformer Models](https://arxiv.org/abs/2005.00743v2 ) (26)|[Synthesizer-Rethinking-Self-Attention-Transformer-Models](https://github.com/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models ) ![](https://img.shields.io/github/stars/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|does not compute pairwise interactions|
|[Jukebox: A Generative Model for Music](https://arxiv.org/abs/2005.00341v1 ) (45)|[jukebox](https://github.com/openai/jukebox ) ![](https://img.shields.io/github/stars/openai/jukebox.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|better attention patterns from Sparse Transformer|
|[Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers](https://arxiv.org/abs/2006.05174v2 ) (0)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|does not compute pairwise interactions and uses fixed mask patterns|
|[GMAT: Global Memory Augmentation for Transformers](https://arxiv.org/abs/2006.03274v1 ) (2)|[gmat](https://github.com/ag1988/gmat ) ![](https://img.shields.io/github/stars/ag1988/gmat.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({m}\cdot({N}%2b{m})\cdot{D}))|:x:|adds global tokens|
|[Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236v3 ) (45)|[fast-transformers](https://github.com/idiap/fast-transformers ) ![](https://img.shields.io/github/stars/idiap/fast-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:heavy_check_mark:|uses `phi(q)(phi(k)v)` and also improves the sequential sampling step|
|[Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768v3 ) (47)|[linformer-pytorch](https://github.com/tatp22/linformer-pytorch ) ![](https://img.shields.io/github/stars/tatp22/linformer-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|projects key and value from `n*d` to `k*d`|
|[Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers](https://arxiv.org/abs/2006.03555v3 ) (8)|[google-research](https://github.com/google-research/google-research/tree/master/performer/fast_attention ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2\cdot\log({D})))|:heavy_check_mark:|calculates an unbiased stochastic approximation of the attention matrix|
|[Kronecker Attention Networks](https://arxiv.org/abs/2007.08442v1 ) (1)|[kronecker-attention-pytorch](https://github.com/lucidrains/kronecker-attention-pytorch ) ![](https://img.shields.io/github/stars/lucidrains/kronecker-attention-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({H}%2b{W})^2\cdot{D}))|:x:|uses horizontal and lateral average matrices|
|[Real-time Semantic Segmentation with Fast Attention](https://arxiv.org/abs/2007.03815v2 ) (5)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|`l2_norm(q)*(l2_norm(k)*v)`|
|[Fast Transformers with Clustered Attention](https://arxiv.org/abs/2007.04825v2 ) (6)|[fast-transformers](https://github.com/idiap/fast-transformers ) ![](https://img.shields.io/github/stars/idiap/fast-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|groups queries together with LSH|
|[Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062v2 ) (60)|[DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py ) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({g}^2%2b{N}\cdot({k}%2b{g}%2b{r}))\cdot{D}))|:x:|ETC with random connections|
|[Tensor Low-Rank Reconstruction for Semantic Segmentation](https://arxiv.org/abs/2008.00490v1 ) (3)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({D}\cdot{H}\cdot{W}%2b{D}^2%2b{H}^2%2b{W}^2)\cdot{r}))|:x:|decomposes the full attention tensor into rank-one tensors (CP decomposition)|
|[Looking for change? Roll the Dice and demand Attention](https://arxiv.org/abs/2009.02062v1 ) (0)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({H}\cdot{W}\cdot{D}))|:x:|uses the fractal Tanimoto similarity to compare queries with keys inside the attention module|
|[Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794v3 ) (30)|[google-research](https://github.com/google-research/google-research/tree/master/performer ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|unbiased approximation of the attention matrix with softmax kernel|
|[Memformer: The Memory-Augmented Transformer](https://arxiv.org/abs/2010.06891v1 ) (0)|[memformer](https://github.com/lucidrains/memformer ) ![](https://img.shields.io/github/stars/lucidrains/memformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|attends to memory slots + Memory-Replay BackPropagation|
|[SMYRF: Efficient Attention using Asymmetric Clustering](https://arxiv.org/abs/2010.05315v1 ) (1)|[smyrf](https://github.com/giannisdaras/smyrf ) ![](https://img.shields.io/github/stars/giannisdaras/smyrf.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:x:|LSH with balanced clusters|
|[Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436v2 ) (0)|[Informer2020](https://github.com/zhouhaoyi/Informer2020 ) ![](https://img.shields.io/github/stars/zhouhaoyi/Informer2020.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:heavy_check_mark:|sparse attention + funnel-like encoder|
|[Sub-Linear Memory: How to Make Performers SLiM](https://arxiv.org/abs/2012.11346v1 ) (0)|[google-research](https://github.com/google-research/google-research/tree/master/performer/models/slim_performer ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|Performer but with sublinear memory usage|
|[Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902v2 ) (0)|[Nystromformer](https://github.com/mlpen/Nystromformer ) ![](https://img.shields.io/github/stars/mlpen/Nystromformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:x:|uses the Nyström method to approximate the attention matrix|
|[Linear Transformers Are Secretly Fast Weight Memory Systems](https://arxiv.org/abs/2102.11174v2 ) (0)|[fast-weight-transformers](https://github.com/ischlag/fast-weight-transformers ) ![](https://img.shields.io/github/stars/ischlag/fast-weight-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|shows that linear transformers are basically fast weight networks + proposes a new kernel function to linearise attention, balancing simplicity and effectiveness|
|[LambdaNetworks: Modeling Long-Range Interactions Without Attention](https://arxiv.org/abs/2102.08602v1 ) (6)|[lambda-networks](https://github.com/lucidrains/lambda-networks ) ![](https://img.shields.io/github/stars/lucidrains/lambda-networks.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{k}\cdot\frac{v}{h}))|:heavy_check_mark:|generates a linear layer based on context + decouples position and context|
|[Random Feature Attention](https://arxiv.org/abs/2103.02143v1 ) (2)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|kernel approximation, and also transformers are RNNs|
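
Several rows in the table (for example Efficient Attention, Transformers are RNNs, and Real-time Semantic Segmentation with Fast Attention) describe the same reordering trick: multiply keys with values first, so the N x N attention matrix is never built and the cost is O(N·D²) rather than O(N²·D). The following is a minimal NumPy sketch of that reordering, not code from any of the listed repositories. The feature map `elu(x) + 1`, the function names, and the tensor shapes are assumptions chosen for illustration.

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map phi(x) = elu(x) + 1 (an illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_kernel_attention(q, k, v):
    # Reference version: builds the full (N, N) similarity matrix, O(N^2 * D).
    scores = elu_plus_one(q) @ elu_plus_one(k).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

def linear_kernel_attention(q, k, v):
    # Same result via associativity, phi(Q) @ (phi(K).T @ V), O(N * D^2).
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    kv = phi_k.T @ v                # (D, D) summary of all keys and values
    z = phi_q @ phi_k.sum(axis=0)   # (N,) per-query normaliser
    return (phi_q @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
same = np.allclose(quadratic_kernel_attention(q, k, v),
                   linear_kernel_attention(q, k, v))
print("outputs match:", same)
```

Both functions compute the same non-causal attention; only the order of the matrix products differs. An autoregressive variant would additionally need prefix sums over `kv` and `z`, which this sketch leaves out.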

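The Linformer row's main idea, projecting key and value along the sequence axis from n rows down to k rows, can be sketched in a similar way. This is only an illustration under stated assumptions: the projection matrices `e_proj` and `f_proj` are random stand-ins here (the paper learns them), and every function and variable name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def projected_attention(q, k, v, e_proj, f_proj):
    # Shrink the sequence axis of keys and values from n to k_dim rows.
    k_small = e_proj @ k   # (k_dim, d)
    v_small = f_proj @ v   # (k_dim, d)
    # The attention weights are now (n, k_dim) instead of (n, n).
    weights = softmax(q @ k_small.T / np.sqrt(q.shape[-1]))
    return weights @ v_small, weights

rng = np.random.default_rng(0)
n, d, k_dim = 256, 32, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
# Random stand-ins for the projections (an assumption for illustration only).
e_proj = rng.standard_normal((k_dim, n)) / np.sqrt(n)
f_proj = rng.standard_normal((k_dim, n)) / np.sqrt(n)
out, weights = projected_attention(q, k, v, e_proj, f_proj)
print(out.shape, weights.shape)
```

Because the projection mixes all positions, this form has no straightforward causal mask, which is consistent with the row being marked as not autoregressive.
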
## Articles/Surveys/Benchmarks

* [A Survey of Long-Term Context in Transformers](https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/)
* [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)

* [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)