Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
List: awesome-fast-attention
list of efficient attention modules
- Host: GitHub
- URL: https://github.com/Separius/awesome-fast-attention
- Owner: Separius
- License: gpl-3.0
- Archived: true
- Created: 2020-07-31T08:08:37.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-08-23T09:58:15.000Z (about 3 years ago)
- Last Synced: 2024-05-23T06:07:14.373Z (6 months ago)
- Topics: attention, attention-is-all-you-need, awesome, linformer, longformer, multihead-attention, reformer, self-attention, transformer, transformer-network
- Language: Python
- Homepage:
- Size: 156 KB
- Stars: 978
- Watchers: 32
- Forks: 108
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-ai-list-guide - awesome-fast-attention
README
# awesome-fast-attention [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)
## Table of Contents
* **[Efficient Attention](#efficient-attention)**
* **[Articles/Surveys/Benchmarks](#articlessurveysbenchmarks)**

## Efficient Attention
|Paper (citations)|Implementation|Computational Complexity|AutoRegressive|Main Idea|
|:---:|:---:|:---:|:---:|:---:|
|[Generating Wikipedia by Summarizing Long Sequences](https://arxiv.org/abs/1801.10198v1 ) (282)|[memory-compressed-attention](https://github.com/lucidrains/memory-compressed-attention ) ![](https://img.shields.io/github/stars/lucidrains/memory-compressed-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({b}\cdot\frac{N}{b}\cdot\frac{N}{{b}\cdot{k}}\cdot{D}))|:heavy_check_mark:|compresses keys and values + blocked attention|
|[CBAM: Convolutional Block Attention Module](https://arxiv.org/abs/1807.06521v2 ) (999+)|[attention-module](https://github.com/Jongchan/attention-module ) ![](https://img.shields.io/github/stars/Jongchan/attention-module.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({N}\cdot{D}%2b\frac{{D}^2}{r})%2b({N}\cdot{D}\cdot{k}^2)))|:x:|combines SE attention with a per-pixel (local) weight|
|[Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks](https://arxiv.org/abs/1810.00825v3 ) (16)|[set_transformer](https://github.com/juho-lee/set_transformer ) ![](https://img.shields.io/github/stars/juho-lee/set_transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{K}\cdot{D}))|:x:|uses K relay nodes|
|[CCNet: Criss-Cross Attention for Semantic Segmentation](https://arxiv.org/abs/1811.11721v2 ) (296)|[CCNet](https://github.com/speedinghzl/CCNet ) ![](https://img.shields.io/github/stars/speedinghzl/CCNet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({H}%2b{W})\cdot{D}))|:x:|each pixel attends to its row and column simultaneously|
|[Efficient Attention: Attention with Linear Complexities](https://arxiv.org/abs/1812.01243v9 ) (16)|[efficient-attention](https://github.com/cmsflash/efficient-attention ) ![](https://img.shields.io/github/stars/cmsflash/efficient-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|Softmax(Q)*(Softmax(K^T)*V) (see the sketch after the table)|
|[Star-Transformer](https://arxiv.org/abs/1902.09113v2 ) (40)|[fastNLP](https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py ) ![](https://img.shields.io/github/stars/fastnlp/fastNLP.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:x:|uses a relay (global) node and attends to/from that node|
|[GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond](https://arxiv.org/abs/1904.11492v1 ) (199)|[GCNet](https://github.com/xvjiarui/GCNet ) ![](https://img.shields.io/github/stars/xvjiarui/GCNet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|squeeze-and-excitation with attention pooling (instead of a GAP)|
|[Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509v1 ) (257)|[DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py ) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|sparse block-based attention|
|[SCRAM: Spatially Coherent Randomized Attention Maps](https://arxiv.org/abs/1905.10308v1 ) (1)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:heavy_check_mark:|uses PatchMatch to find close keys|
|[Interlaced Sparse Self-Attention for Semantic Segmentation](https://arxiv.org/abs/1907.12273v2 ) (24)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2%2b{N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|combines short-range attention with subsequent long-range (dilated) attention|
|[Permutohedral Attention Module for Efficient Non-Local Neural Networks](https://arxiv.org/abs/1907.00641v2 ) (3)|[Permutohedral_attention_module](https://github.com/SamuelJoutard/Permutohedral_attention_module ) ![](https://img.shields.io/github/stars/SamuelJoutard/Permutohedral_attention_module.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|uses the permutohedral lattice approximation algorithm to approximate the attention output|
|[Large Memory Layers with Product Keys](https://arxiv.org/abs/1907.05242v2 ) (43)|[XLM](https://github.com/facebookresearch/XLM ) ![](https://img.shields.io/github/stars/facebookresearch/XLM.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({Q}\cdot({K}%2b{k}^2)\cdot{D}))|:heavy_check_mark:|searches for nearest-neighbor keys|
|[Expectation-Maximization Attention Networks for Semantic Segmentation](https://arxiv.org/abs/1907.13426v2 ) (79)|[EMANet](https://github.com/XiaLiPKU/EMANet ) ![](https://img.shields.io/github/stars/XiaLiPKU/EMANet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|applies expectation maximization to cluster the keys into k clusters|
|[BP-Transformer: Modelling Long-Range Context via Binary Partitioning](https://arxiv.org/abs/1911.04070v1 ) (15)|[BPT](https://github.com/yzh119/BPT ) ![](https://img.shields.io/github/stars/yzh119/BPT.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot\log(\frac{N}{k})\cdot{D}))|:heavy_check_mark:|attends to distant tokens coarsely and to close tokens in a more fine-grained manner|
|[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507v1 ) (48)|[compressive-transformer-pytorch](https://github.com/lucidrains/compressive-transformer-pytorch ) ![](https://img.shields.io/github/stars/lucidrains/compressive-transformer-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|compresses distant tokens instead of just applying stop_grad() to them; a more efficient version of Transformer-XL|
|[Axial Attention in Multidimensional Transformers](https://arxiv.org/abs/1912.12180v1 ) (36)|[axial-attention](https://github.com/lucidrains/axial-attention ) ![](https://img.shields.io/github/stars/lucidrains/axial-attention.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({H}%2b{W})\cdot{D}))|:heavy_check_mark:|applies attention along each axis separately|
|[Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451v2 ) (216)|[trax](https://github.com/google/trax/tree/master/trax/models/reformer ) ![](https://img.shields.io/github/stars/google/trax.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}^2))|:heavy_check_mark:|uses LSH to find close keys|
|[Sparse Sinkhorn Attention](https://arxiv.org/abs/2002.11296v1 ) (16)|[sinkhorn-transformer](https://github.com/lucidrains/sinkhorn-transformer ) ![](https://img.shields.io/github/stars/lucidrains/sinkhorn-transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(\frac{{N}^2}{n_b}%2b{n_b}^2))|:heavy_check_mark:|uses a cost matrix to limit attention between buckets|
|[Transformer on a Diet](https://arxiv.org/abs/2002.06170v1 ) (2)|[transformer-on-diet](https://github.com/cgraywang/transformer-on-diet ) ![](https://img.shields.io/github/stars/cgraywang/transformer-on-diet.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:heavy_check_mark:|dilated transformer, like WaveNet|
|[Time-aware Large Kernel Convolutions](https://arxiv.org/abs/2002.03184v2 ) (9)|[TaLKConvolutions](https://github.com/lioutasb/TaLKConvolutions ) ![](https://img.shields.io/github/stars/lioutasb/TaLKConvolutions.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|computes the mean over a dynamic subsequence around each token with the help of a summed-area table|
|[SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection](https://arxiv.org/abs/2003.09833v3 ) (2)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:heavy_check_mark:|learns the q-k connections, i.e., dynamically creates a sparse attention matrix|
|[Efficient Content-Based Sparse Attention with Routing Transformers](https://arxiv.org/abs/2003.05997v5 ) (38)|[routing-transformer](https://github.com/lucidrains/routing-transformer ) ![](https://img.shields.io/github/stars/lucidrains/routing-transformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|computes attention with same-cluster tokens (computed by online k-means)|
|[Neural Architecture Search for Lightweight Non-Local Networks](https://arxiv.org/abs/2004.01961v1 ) (11)|[AutoNL](https://github.com/LiYingwei/AutoNL ) ![](https://img.shields.io/github/stars/LiYingwei/AutoNL.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}((\frac{H}{h}\cdot\frac{W}{w})\cdot(\frac{D}{k})^2))|:x:|computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions|
|[Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150v2 ) (159)|[longformer](https://github.com/allenai/longformer ) ![](https://img.shields.io/github/stars/allenai/longformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot({k}%2b{g})\cdot{D}))|:heavy_check_mark:|global + blocked attention|
|[ETC: Encoding Long and Structured Inputs in Transformers](https://arxiv.org/abs/2004.08483v5 ) (16)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({N}\cdot{g}%2b{g}^2%2b{N}\cdot{k})\cdot{D}))|:x:|combines global attention (star transformer with multiple global tokens) with local attention|
|[Multi-scale Transformer Language Models](https://arxiv.org/abs/2005.00581v1 ) (2)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|UNet-like architecture; its retina attention is close to BP-Transformer|
|[Synthesizer: Rethinking Self-Attention in Transformer Models](https://arxiv.org/abs/2005.00743v2 ) (26)|[Synthesizer-Rethinking-Self-Attention-Transformer-Models](https://github.com/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models ) ![](https://img.shields.io/github/stars/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|does not compute pairwise interactions|
|[Jukebox: A Generative Model for Music](https://arxiv.org/abs/2005.00341v1 ) (45)|[jukebox](https://github.com/openai/jukebox ) ![](https://img.shields.io/github/stars/openai/jukebox.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\sqrt{N}\cdot{D}))|:heavy_check_mark:|improved attention patterns based on the Sparse Transformer|
|[Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers](https://arxiv.org/abs/2006.05174v2 ) (0)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{D}))|:heavy_check_mark:|does not compute pairwise interactions and uses fixed mask patterns|
|[GMAT: Global Memory Augmentation for Transformers](https://arxiv.org/abs/2006.03274v1 ) (2)|[gmat](https://github.com/ag1988/gmat ) ![](https://img.shields.io/github/stars/ag1988/gmat.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({m}\cdot({N}%2b{m})\cdot{D}))|:x:|adds global tokens|
|[Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention](https://arxiv.org/abs/2006.16236v3 ) (45)|[fast-transformers](https://github.com/idiap/fast-transformers ) ![](https://img.shields.io/github/stars/idiap/fast-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:heavy_check_mark:|uses phi(q)(phi(k)v) and also improves the sequential sampling step (see the sketch after the table)|
|[Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768v3 ) (47)|[linformer-pytorch](https://github.com/tatp22/linformer-pytorch ) ![](https://img.shields.io/github/stars/tatp22/linformer-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|projects keys and values from n×d down to k×d (see the sketch at the end of this page)|
|[Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers](https://arxiv.org/abs/2006.03555v3 ) (8)|[google-research](https://github.com/google-research/google-research/tree/master/performer/fast_attention ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2\cdot\log({D})))|:heavy_check_mark:|calculates an unbiased stochastic approximation of the attention matrix|
|[Kronecker Attention Networks](https://arxiv.org/abs/2007.08442v1 ) (1)|[kronecker-attention-pytorch](https://github.com/lucidrains/kronecker-attention-pytorch ) ![](https://img.shields.io/github/stars/lucidrains/kronecker-attention-pytorch.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({H}%2b{W})^2\cdot{D}))|:x:|uses horizontal and lateral average matrices|
|[Real-time Semantic Segmentation with Fast Attention](https://arxiv.org/abs/2007.03815v2 ) (5)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}^2))|:x:|l2_norm(q)*(l2_norm(k)*v)|
|[Fast Transformers with Clustered Attention](https://arxiv.org/abs/2007.04825v2 ) (6)|[fast-transformers](https://github.com/idiap/fast-transformers ) ![](https://img.shields.io/github/stars/idiap/fast-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{k}\cdot{D}))|:x:|groups queries together with LSH|
|[Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062v2 ) (60)|[DeepSpeed](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/ops/sparse_attention/sparse_self_attention.py ) ![](https://img.shields.io/github/stars/microsoft/DeepSpeed.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({g}^2%2b{N}\cdot({k}%2b{g}%2b{r}))\cdot{D}))|:x:|ETC with random connections|
|[Tensor Low-Rank Reconstruction for Semantic Segmentation](https://arxiv.org/abs/2008.00490v1 ) (3)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}(({D}\cdot{H}\cdot{W}%2b{D}^2%2b{H}^2%2b{W}^2)\cdot{r}))|:x:|decomposes the full attention tensor into rank-one tensors (CP decomposition)|
|[Looking for change? Roll the Dice and demand Attention](https://arxiv.org/abs/2009.02062v1 ) (0)|IN_PAPER|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({H}\cdot{W}\cdot{D}))|:x:|uses the fractal Tanimoto similarity to compare queries with keys inside the attention module|
|[Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794v3 ) (30)|[google-research](https://github.com/google-research/google-research/tree/master/performer ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|unbiased approximation of the attention matrix with a softmax kernel|
|[Memformer: The Memory-Augmented Transformer](https://arxiv.org/abs/2010.06891v1 ) (0)|[memformer](https://github.com/lucidrains/memformer ) ![](https://img.shields.io/github/stars/lucidrains/memformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|attends to memory slots + Memory-Replay Back-Propagation|
|[SMYRF: Efficient Attention using Asymmetric Clustering](https://arxiv.org/abs/2010.05315v1 ) (1)|[smyrf](https://github.com/giannisdaras/smyrf ) ![](https://img.shields.io/github/stars/giannisdaras/smyrf.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:x:|LSH with balanced clusters|
|[Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436v2 ) (0)|[Informer2020](https://github.com/zhouhaoyi/Informer2020 ) ![](https://img.shields.io/github/stars/zhouhaoyi/Informer2020.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot\log({N})\cdot{D}))|:heavy_check_mark:|sparse attention + funnel-like encoder|
|[Sub-Linear Memory: How to Make Performers SLiM](https://arxiv.org/abs/2012.11346v1 ) (0)|[google-research](https://github.com/google-research/google-research/tree/master/performer/models/slim_performer ) ![](https://img.shields.io/github/stars/google-research/google-research.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|Performer, but with sublinear memory usage|
|[Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902v2 ) (0)|[Nystromformer](https://github.com/mlpen/Nystromformer ) ![](https://img.shields.io/github/stars/mlpen/Nystromformer.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:x:|uses the Nyström method to approximate the attention matrix|
|[Linear Transformers Are Secretly Fast Weight Memory Systems](https://arxiv.org/abs/2102.11174v2 ) (0)|[fast-weight-transformers](https://github.com/ischlag/fast-weight-transformers ) ![](https://img.shields.io/github/stars/ischlag/fast-weight-transformers.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{m}\cdot{D}))|:heavy_check_mark:|shows that linear transformers are basically fast-weight networks + proposes a new kernel function to linearize attention, balancing simplicity and effectiveness|
|[LambdaNetworks: Modeling Long-Range Interactions Without Attention](https://arxiv.org/abs/2102.08602v1 ) (6)|[lambda-networks](https://github.com/lucidrains/lambda-networks ) ![](https://img.shields.io/github/stars/lucidrains/lambda-networks.svg?style=social )|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}^2\cdot{k}\cdot\frac{v}{h}))|:heavy_check_mark:|generates a linear layer based on context + decouples position/context|
|[Random Feature Attention](https://arxiv.org/abs/2103.02143v1 ) (2)|-|![formula](https://render.githubusercontent.com/render/math?math=\mathcal{O}({N}\cdot{D}))|:heavy_check_mark:|kernel approximation; also builds on the transformers-are-RNNs view|
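
Several rows above ("Efficient Attention", "Transformers are RNNs", "Real-time Semantic Segmentation with Fast Attention") share one trick: reassociate the matrix product so the N×N attention matrix is never materialized. Below is a minimal NumPy sketch of the "Efficient Attention" variant, Softmax(Q)*(Softmax(K^T)*V); this is an illustration written for this list, not the authors' code, and the function names are our own:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def factorized_attention(Q, K, V):
    """Softmax(Q) @ (Softmax(K)^T @ V) in O(N * D^2),
    versus Softmax(Q @ K^T) @ V in O(N^2 * D). Q, K: (N, D); V: (N, D_v)."""
    q = softmax(Q, axis=-1)  # normalize each query over the feature axis
    k = softmax(K, axis=0)   # normalize each key feature over the N positions
    return q @ (k.T @ V)     # (N, D) @ ((D, N) @ (N, D_v)) -> (N, D_v)

# Toy check: same output shape as standard attention, no N x N matrix formed.
rng = np.random.default_rng(0)
N, D = 1024, 64
out = factorized_attention(rng.normal(size=(N, D)),
                           rng.normal(size=(N, D)),
                           rng.normal(size=(N, D)))
assert out.shape == (N, D)
```

The "Transformers are RNNs" row replaces the two softmaxes with a feature map phi applied to Q and K; the same reassociation then also admits a recurrent formulation for autoregressive decoding.
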
## Articles/Surveys/Benchmarks
* [A Survey of Long-Term Context in Transformers](https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/)
* [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
* [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)
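
As a companion to the sketch after the table, here is the Linformer row's low-rank idea in the same hedged NumPy style. In the paper, E and F are learned k×n projection matrices; random placeholders stand in for them here, so this is a sketch of the shape of the computation, not the official implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """softmax(Q (E K)^T / sqrt(d)) @ (F V): attention over k projected
    positions instead of n, so the score matrix is (n, k), never (n, n)."""
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)  # (n, k)
    return softmax(scores) @ (F @ V)     # (n, k) @ (k, d) -> (n, d)

n, d, k = 2048, 64, 256
rng = np.random.default_rng(0)
E = rng.normal(size=(k, n)) / np.sqrt(n)  # learned in the paper; random here
F = rng.normal(size=(k, n)) / np.sqrt(n)
out = linformer_attention(rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)), E, F)
assert out.shape == (n, d)
```

Because the projection is along the sequence axis, the cost is O(n·k·d), matching the complexity column in the table.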