Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
List: awesome-fast-attention
list of efficient attention modules
- Host: GitHub
- URL: https://github.com/Separius/awesome-fast-attention
- Owner: Separius
- License: gpl-3.0
- Archived: true
- Created: 2020-07-31T08:08:37.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-08-23T09:58:15.000Z (about 3 years ago)
- Last Synced: 2024-05-23T06:07:14.373Z (6 months ago)
- Topics: attention, attention-is-all-you-need, awesome, linformer, longformer, multihead-attention, reformer, self-attention, transformer, transformer-network
- Language: Python
- Homepage:
- Size: 156 KB
- Stars: 978
- Watchers: 32
- Forks: 108
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-ai-list-guide - awesome-fast-attention
README
# awesome-fast-attention [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)
## Table of Contents
* **[Efficient Attention](#efficient-attention)**
* **[Articles/Surveys/Benchmarks](#articlessurveysbenchmarks)**

## Efficient Attention
|Paper (citations)|Implementation|Computational Complexity|AutoRegressive|Main Idea|
|:---:|:---:|:---:|:---:|:---:|
|[Generating Wikipedia by Summarizing Long Sequences](https://arxiv.org/abs/1801.10198v1 ) (282)|[memory-compressed-attention](https://github.com/lucidrains/memory-compressed-attention ) ![](https://img.shields.io/github/stars/lucidrains/memory-compressed-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D}))|:heavy_check_mark:|compresses keys and values + blocked attention|
|[CBAM: Convolutional Block Attention Module](https://arxiv.org/abs/1807.06521v2 ) (999+)|[attention-module](https://github.com/Jongchan/attention-module ) ![](https://img.shields.io/github/stars/Jongchan/attention-module.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({N}\cdot{D}%2b\frac{{D}^2}{r})%2b({N}\cdot{D}\cdot{k}^2)))|:x:|combines SE attention with a per-pixel (local) weight|
|[Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks](https://arxiv.org/abs/1810.00825v3 ) (16)|[set_transformer](https://github.com/juho-lee/set_transformer ) ![](https://img.shields.io/github/stars/juho-lee/set_transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{K}\cdot{D}))|:x:|uses K relay nodes|
|[CCNet: Criss-Cross Attention for Semantic Segmentation](https://arxiv.org/abs/1811.11721v2 ) (296)|[CCNet](https://github.com/speedinghzl/CCNet ) ![](https://img.shields.io/github/stars/speedinghzl/CCNet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({H}%2b{W})\cdot{D}))|:x:|each pixel attends to its row and column simultaneously|
|[Efficient Attention: Attention with Linear Complexities](https://arxiv.org/abs/1812.01243v9 ) (16)|[efficient-attention](https://github.com/cmsflash/efficient-attention ) ![](https://img.shields.io/github/stars/cmsflash/efficient-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|Softmax(Q)*(Softmax(K^T)*V) (see the sketch after the table)|
|[Star-Transformer](https://arxiv.org/abs/1902.09113v2 ) (40)|[fastNLP](https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py ) ![](https://img.shields.io/github/stars/fastnlp/fastNLP.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:x:|uses a relay (global) node and attends to/from that node|
|[GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond](https://arxiv.org/abs/1904.11492v1 ) (199)|[GCNet](https://github.com/xvjiarui/GCNet ) ![](https://img.shields.io/github/stars/xvjiarui/GCNet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|squeeze-and-excitation with attention pooling (instead of a GAP)|
|[Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509v1 ) (257)|[DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py ) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|sparse block-based attention|
|[SCRAM: Spatially Coherent Randomized Attention Maps](https://arxiv.org/abs/1905.10308v1 ) (1)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:heavy_check_mark:|uses PatchMatch to find close keys|
|[Interlaced Sparse Self-Attention for Semantic Segmentation](https://arxiv.org/abs/1907.12273v2 ) (24)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2%2b{N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|combines short-range attention with subsequent long-range (dilated) attention|
|[Permutohedral Attention Module for Efficient Non-Local Neural Networks](https://arxiv.org/abs/1907.00641v2 ) (3)|[Permutohedral_attention_module](https://github.com/SamuelJoutard/Permutohedral_attention_module ) ![](https://img.shields.io/github/stars/SamuelJoutard/Permutohedral_attention_module.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|uses the permutohedral lattice approximation algorithm to approximate the attention output|
|[Large Memory Layers with Product Keys](https://arxiv.org/abs/1907.05242v2 ) (43)|[XLM](https://github.com/facebookresearch/XLM ) ![](https://img.shields.io/github/stars/facebookresearch/XLM.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({Q}\cdot({K}%2b{k}^2)\cdot{D}))|:heavy_check_mark:|searches for nearest-neighbor keys|
|[Expectation-Maximization Attention Networks for Semantic Segmentation](https://arxiv.org/abs/1907.13426v2 ) (79)|[EMANet](https://github.com/XiaLiPKU/EMANet ) ![](https://img.shields.io/github/stars/XiaLiPKU/EMANet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|applies expectation maximization to cluster the keys into k clusters|
|[BP-Transformer: Modelling Long-Range Context via Binary Partitioning](https://arxiv.org/abs/1911.04070v1 ) (15)|[BPT](https://github.com/yzh119/BPT ) ![](https://img.shields.io/github/stars/yzh119/BPT.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D}))|:heavy_check_mark:|attends to distant tokens coarsely and to close tokens in a more fine-grained manner|
|[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507v1 ) (48)|[compressive-transformer-pytorch](https://github.com/lucidrains/compressive-transformer-pytorch ) ![](https://img.shields.io/github/stars/lucidrains/compressive-transformer-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|compresses distant tokens instead of just applying stop_grad() to them; a more efficient version of Transformer-XL|
|[Axial Attention in Multidimensional Transformers](https://arxiv.org/abs/1912.12180v1 ) (36)|[axial-attention](https://github.com/lucidrains/axial-attention ) ![](https://img.shields.io/github/stars/lucidrains/axial-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({H}%2b{W})\cdot{D}))|:heavy_check_mark:|applies attention along each axis separately|
|[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451v2 ) (216)|[trax](https://github.com/google/trax/tree/master/trax/models/reformer ) ![](https://img.shields.io/github/stars/google/trax.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}^2))|:heavy_check_mark:|uses LSH to find close keys|
|[Sparse Sinkhorn Attention](https://arxiv.org/abs/2002.11296v1 ) (16)|[sinkhorn-transformer](https://github.com/lucidrains/sinkhorn-transformer ) ![](https://img.shields.io/github/stars/lucidrains/sinkhorn-transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(\frac{{N}^2}{n_b}%2b{n_b}^2))|:heavy_check_mark:|uses a cost matrix to limit attention between buckets|
|[Transformer on a Diet](https://arxiv.org/abs/2002.06170v1 ) (2)|[transformer-on-diet](https://github.com/cgraywang/transformer-on-diet ) ![](https://img.shields.io/github/stars/cgraywang/transformer-on-diet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:heavy_check_mark:|dilated transformer, like WaveNet|
|[Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184v2 ) (9)|[TaLKConvolutions](https://github.com/lioutasb/TaLKConvolutions ) ![](https://img.shields.io/github/stars/lioutasb/TaLKConvolutions.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|computes the mean over a dynamic subsequence around each token with the help of a summed-area table|
|[SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection](https://arxiv.org/abs/2003.09833v3 ) (2)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:heavy_check_mark:|learns the q-k connections, i.e., dynamically creates a sparse attention matrix|
|[Efficient Content-Based Sparse Attention with Routing Transformers](https://arxiv.org/abs/2003.05997v5 ) (38)|[routing-transformer](https://github.com/lucidrains/routing-transformer ) ![](https://img.shields.io/github/stars/lucidrains/routing-transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|computes attention with same-cluster tokens (computed by online k-means)|
|[Neural Architecture Search for Lightweight Non-Local Networks](https://arxiv.org/abs/2004.01961v1 ) (11)|[AutoNL](https://github.com/LiYingwei/AutoNL ) ![](https://img.shields.io/github/stars/LiYingwei/AutoNL.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2))|:x:|computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions|
|[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150v2 ) (159)|[longformer](https://github.com/allenai/longformer ) ![](https://img.shields.io/github/stars/allenai/longformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({k}%2b{g})\cdot{D}))|:heavy_check_mark:|global + blocked attention|
|[ETC: Encoding Long and Structured Inputs in Transformers](https://arxiv.org/abs/2004.08483v5 ) (16)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({N}\cdot{g}%2b{g}^2%2b{N}\cdot{k})\cdot{D}))|:x:|combines global attention (star transformer with multiple global tokens) with local attention|
|[Multi-scale Transformer Language Models](https://arxiv.org/abs/2005.00581v1 ) (2)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|UNet-like architecture; its retina attention is close to BP-Transformer|
|[Synthesizer: Rethinking Self-Attention in Transformer Models](https://arxiv.org/abs/2005.00743v2 ) (26)|[Synthesizer-Rethinking-Self-Attention-Transformer-Models](https://github.com/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models ) ![](https://img.shields.io/github/stars/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|does not compute pairwise interactions|
|[Jukebox: A Generative Model for Music](https://arxiv.org/abs/2005.00341v1 ) (45)|[jukebox](https://github.com/openai/jukebox ) ![](https://img.shields.io/github/stars/openai/jukebox.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|improved attention patterns based on the Sparse Transformer|
|[Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers](https://arxiv.org/abs/2006.05174v2 ) (0)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|does not compute pairwise interactions and uses fixed mask patterns|
|[GMAT: Global Memory Augmentation for Transformers](https://arxiv.org/abs/2006.03274v1 ) (2)|[gmat](https://github.com/ag1988/gmat ) ![](https://img.shields.io/github/stars/ag1988/gmat.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({m}\cdot({N}%2b{m})\cdot{D}))|:x:|adds global tokens|
|[Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236v3 ) (45)|[fast-transformers](https://github.com/idiap/fast-transformers ) ![](https://img.shields.io/github/stars/idiap/fast-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:heavy_check_mark:|uses phi(q)(phi(k)v) and also improves the sequential sampling step (see the sketch after the table)|
|[Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768v3 ) (47)|[linformer-pytorch](https://github.com/tatp22/linformer-pytorch ) ![](https://img.shields.io/github/stars/tatp22/linformer-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|projects keys and values from n×d down to k×d (see the sketch at the end of this page)|
|[Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers](https://arxiv.org/abs/2006.03555v3 ) (8)|[google-research](https://github.com/google-research/google-research/tree/master/performer/fast_attention ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2\cdot\log({D})))|:heavy_check_mark:|calculates an unbiased stochastic approximation of the attention matrix|
|[Kronecker Attention Networks](https://arxiv.org/abs/2007.08442v1 ) (1)|[kronecker-attention-pytorch](https://github.com/lucidrains/kronecker-attention-pytorch ) ![](https://img.shields.io/github/stars/lucidrains/kronecker-attention-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({H}%2b{W})^2\cdot{D}))|:x:|uses horizontal and lateral average matrices|
|[Real-time Semantic Segmentation with Fast Attention](https://arxiv.org/abs/2007.03815v2 ) (5)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|l2_norm(q)*(l2_norm(k)*v)|
|[Fast Transformers with Clustered Attention](https://arxiv.org/abs/2007.04825v2 ) (6)|[fast-transformers](https://github.com/idiap/fast-transformers ) ![](https://img.shields.io/github/stars/idiap/fast-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|groups queries together with LSH|
|[Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062v2 ) (60)|[DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py ) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({g}^2%2b{N}\cdot({k}%2b{g}%2b{r}))\cdot{D}))|:x:|ETC with random connections|
|[Tensor Low-Rank Reconstruction for Semantic Segmentation](https://arxiv.org/abs/2008.00490v1 ) (3)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({D}\cdot{H}\cdot{W}%2b{D}^2%2b{H}^2%2b{W}^2)\cdot{r}))|:x:|decomposes the full attention tensor into rank-one tensors (CP decomposition)|
|[Looking for change? Roll the Dice and demand Attention](https://arxiv.org/abs/2009.02062v1 ) (0)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({H}\cdot{W}\cdot{D}))|:x:|uses the fractal Tanimoto similarity to compare queries with keys inside the attention module|
|[Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794v3 ) (30)|[google-research](https://github.com/google-research/google-research/tree/master/performer ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|unbiased approximation of the attention matrix with a softmax kernel|
|[Memformer: The Memory-Augmented Transformer](https://arxiv.org/abs/2010.06891v1 ) (0)|[memformer](https://github.com/lucidrains/memformer ) ![](https://img.shields.io/github/stars/lucidrains/memformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|attends to memory slots + Memory-Replay Back-Propagation|
|[SMYRF: Efficient Attention using Asymmetric Clustering](https://arxiv.org/abs/2010.05315v1 ) (1)|[smyrf](https://github.com/giannisdaras/smyrf ) ![](https://img.shields.io/github/stars/giannisdaras/smyrf.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:x:|LSH with balanced clusters|
|[Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436v2 ) (0)|[Informer2020](https://github.com/zhouhaoyi/Informer2020 ) ![](https://img.shields.io/github/stars/zhouhaoyi/Informer2020.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:heavy_check_mark:|sparse attention + funnel-like encoder|
|[Sub-Linear Memory: How to Make Performers SLiM](https://arxiv.org/abs/2012.11346v1 ) (0)|[google-research](https://github.com/google-research/google-research/tree/master/performer/models/slim_performer ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|Performer, but with sublinear memory usage|
|[Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902v2 ) (0)|[Nystromformer](https://github.com/mlpen/Nystromformer ) ![](https://img.shields.io/github/stars/mlpen/Nystromformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:x:|uses the Nyström method to approximate the attention matrix|
|[Linear Transformers Are Secretly Fast Weight Memory Systems](https://arxiv.org/abs/2102.11174v2 ) (0)|[fast-weight-transformers](https://github.com/ischlag/fast-weight-transformers ) ![](https://img.shields.io/github/stars/ischlag/fast-weight-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|shows that linear transformers are basically fast-weight networks + proposes a new kernel function to linearize attention, balancing simplicity and effectiveness|
|[LambdaNetworks: Modeling Long-Range Interactions Without Attention](https://arxiv.org/abs/2102.08602v1 ) (6)|[lambda-networks](https://github.com/lucidrains/lambda-networks ) ![](https://img.shields.io/github/stars/lucidrains/lambda-networks.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{k}\cdot\frac{v}{h}))|:heavy_check_mark:|generates a linear layer based on context + decouples position/context|
|[Random Feature Attention](https://arxiv.org/abs/2103.02143v1 ) (2)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|kernel approximation; also builds on the transformers-are-RNNs view|
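
Several rows above ("Efficient Attention", "Transformers are RNNs", "Real-time Semantic Segmentation with Fast Attention") share one trick: reassociate the matrix product so the N×N attention matrix is never materialized. Below is a minimal NumPy sketch of the "Efficient Attention" variant, Softmax(Q)*(Softmax(K^T)*V); this is an illustration written for this list, not the authors' code, and the function names are our own:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def factorized_attention(Q, K, V):
    """Softmax(Q) @ (Softmax(K)^T @ V) in O(N * D^2),
    versus Softmax(Q @ K^T) @ V in O(N^2 * D). Q, K: (N, D); V: (N, D_v)."""
    q = softmax(Q, axis=-1)  # normalize each query over the feature axis
    k = softmax(K, axis=0)   # normalize each key feature over the N positions
    return q @ (k.T @ V)     # (N, D) @ ((D, N) @ (N, D_v)) -> (N, D_v)

# Toy check: same output shape as standard attention, no N x N matrix formed.
rng = np.random.default_rng(0)
N, D = 1024, 64
out = factorized_attention(rng.normal(size=(N, D)),
                           rng.normal(size=(N, D)),
                           rng.normal(size=(N, D)))
assert out.shape == (N, D)
```

The "Transformers are RNNs" row replaces the two softmaxes with a feature map phi applied to Q and K; the same reassociation then also admits a recurrent formulation for autoregressive decoding.
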
## Articles/Surveys/Benchmarks
* [A Survey of Long-Term Context in Transformers](https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/)
* [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
* [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)
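
As a companion to the sketch after the table, here is the Linformer row's low-rank idea in the same hedged NumPy style. In the paper, E and F are learned k×n projection matrices; random placeholders stand in for them here, so this is a sketch of the shape of the computation, not the official implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """softmax(Q (E K)^T / sqrt(d)) @ (F V): attention over k projected
    positions instead of n, so the score matrix is (n, k), never (n, n)."""
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)  # (n, k)
    return softmax(scores) @ (F @ V)     # (n, k) @ (k, d) -> (n, d)

n, d, k = 2048, 64, 256
rng = np.random.default_rng(0)
E = rng.normal(size=(k, n)) / np.sqrt(n)  # learned in the paper; random here
F = rng.normal(size=(k, n)) / np.sqrt(n)
out = linformer_attention(rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)), E, F)
assert out.shape == (n, d)
```

Because the projection is along the sequence axis, the cost is O(n·k·d), matching the complexity column in the table.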