awesome-fast-attention
list of efficient attention modules
https://github.com/Separius/awesome-fast-attention
Efficient Attention
| Paper | Implementation | AutoRegressive | Main Idea |
|---|---|---|---|
| Interlaced Sparse Self-Attention for Semantic Segmentation | - | - | - |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks | - | - | - |
| CBAM: Convolutional Block Attention Module | [attention-module](https://github.com/Jongchan/attention-module) | :x: | combines SE attention with a per-pixel (local) weight |
| Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks | set_transformer | :x: | uses K relay nodes |
| CCNet: Criss-Cross Attention for Semantic Segmentation | - | - | - |
| Efficient Attention: Attention with Linear Complexities | [efficient-attention](https://github.com/cmsflash/efficient-attention) | :x: | `Softmax(Q)*(Softmax(K^T)*V)` (see the linear-attention sketch below the table) |
| Star-Transformer | - | - | - |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | - | - | - |
| Generating Long Sequences with Sparse Transformers | - | - | - |
| SCRAM: Spatially Coherent Randomized Attention Maps | - | :heavy_check_mark: | uses PatchMatch to find close keys |
| Large Memory Layers with Product Keys | - | - | - |
| Expectation-Maximization Attention Networks for Semantic Segmentation | - | - | - |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning | - | - | attends to distant tokens coarsely and to nearby tokens in a fine-grained manner |
| Compressive Transformers for Long-Range Sequence Modelling | [compressive-transformer-pytorch](https://github.com/lucidrains/compressive-transformer-pytorch) | :heavy_check_mark: | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL |
| Axial Attention in Multidimensional Transformers | [axial-attention](https://github.com/lucidrains/axial-attention) | :heavy_check_mark: | applies attention along each axis separately |
| Reformer: The Efficient Transformer | - | - | - |
| Sparse Sinkhorn Attention | [sinkhorn-transformer](https://github.com/lucidrains/sinkhorn-transformer) | :heavy_check_mark: | uses a cost matrix to limit attention between buckets |
| Transformer on a Diet | [transformer-on-diet](https://github.com/cgraywang/transformer-on-diet) | :heavy_check_mark: | dilated transformer, similar to WaveNet |
| Time-aware Large Kernel Convolutions | - | - | uses a summed-area table |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection | - | :heavy_check_mark: | learns the query-key connections, i.e. dynamically creates a sparse attention matrix |
| Jukebox: A Generative Model for Music | - | - | - |
| Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers | - | :heavy_check_mark: | does not compute pairwise interactions and uses fixed mask patterns |
| GMAT: Global Memory Augmentation for Transformers | - | - | - |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | [fast-transformers](https://github.com/idiap/fast-transformers) | :heavy_check_mark: | uses `phi(q)(phi(k)v)` and also improves the sequential sampling step (see the linear-attention sketch below the table) |
| Linformer: Self-Attention with Linear Complexity | [linformer-pytorch](https://github.com/tatp22/linformer-pytorch) | :x: | projects keys and values from n×d down to k×d (see the projection sketch below the table) |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers | [google-research](https://github.com/google-research/google-research/tree/master/performer/fast_attention) | :heavy_check_mark: | calculates an unbiased stochastic approximation of the attention matrix |
| Kronecker Attention Networks | [kronecker-attention-pytorch](https://github.com/lucidrains/kronecker-attention-pytorch) | :x: | uses horizontal and lateral average matrices |
| Real-time Semantic Segmentation with Fast Attention | - | :x: | `l2_norm(q)*(l2_norm(k)*v)` |
| Fast Transformers with Clustered Attention | [fast-transformers](https://github.com/idiap/fast-transformers) | :x: | groups queries together with LSH |
| Big Bird: Transformers for Longer Sequences | - | - | - |
| Efficient Content-Based Sparse Attention with Routing Transformers | [routing-transformer](https://github.com/lucidrains/routing-transformer) | :heavy_check_mark: | computes attention with same-cluster tokens (clusters found by online k-means) |
| Neural Architecture Search for Lightweight Non-Local Networks | - | - | - |
| Longformer: The Long-Document Transformer | - | - | - |
| ETC: Encoding Long and Structured Inputs in Transformers | - | :x: | combines global attention (Star-Transformer with multiple global tokens) with local attention |
| Multi-scale Transformer Language Models | - | - | - |
| Synthesizer: Rethinking Self-Attention in Transformer Models | [Synthesizer-Rethinking-Self-Attention-Transformer-Models](https://github.com/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models) | :heavy_check_mark: | does not compute pairwise interactions |
| Rethinking Attention with Performers | [google-research](https://github.com/google-research/google-research/tree/master/performer) | :heavy_check_mark: | unbiased approximation of the attention matrix with a softmax kernel (see the random-feature sketch below the table) |
| Memformer: The Memory-Augmented Transformer | - | - | Memory-Replay BackPropagation |
| SMYRF: Efficient Attention using Asymmetric Clustering | - | - | - |
| Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting | - | - | - |
| Sub-Linear Memory: How to Make Performers SLiM | [google-research](https://github.com/google-research/google-research/tree/master/performer/models/slim_performer) | :heavy_check_mark: | Performer, but with sub-linear memory usage |
| Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention | - | - | - |
| Linear Transformers Are Secretly Fast Weight Memory Systems | [fast-weight-transformers](https://github.com/ischlag/fast-weight-transformers) | :heavy_check_mark: | shows that linear transformers are essentially fast weight networks and proposes a new kernel function for linearising attention, balancing simplicity and effectiveness |
| LambdaNetworks: Modeling Long-Range Interactions Without Attention | [lambda-networks](https://github.com/lucidrains/lambda-networks) | :heavy_check_mark: | generates a linear layer based on context and decouples position from content |
| Random Feature Attention | - | :heavy_check_mark: | kernel approximation of softmax attention; also casts transformers as RNNs |
| Tensor Low-Rank Reconstruction for Semantic Segmentation | - | :x: | decomposes the full attention tensor into rank-one tensors (CP decomposition) |
| Looking for change? Roll the Dice and demand Attention | - | - | - |
| Generating Wikipedia by Summarizing Long Sequences | [memory-compressed-attention](https://github.com/lucidrains/memory-compressed-attention) | :heavy_check_mark: | compresses keys and values + blocked attention |
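
The `Softmax(Q)*(Softmax(K^T)*V)` and `phi(q)(phi(k)v)` rows above all exploit the same associativity trick: apply a feature map to queries and keys, then compute `phi(K)^T V` first so the cost is linear in sequence length. The sketch below is a minimal, non-causal, single-head illustration rather than the API of any listed repo; the `elu(x)+1` feature map follows "Transformers are RNNs", and all tensor names and shapes are assumptions.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """phi(Q) @ (phi(K)^T @ V), computed in O(N * D^2) instead of O(N^2 * D).

    q, k, v: (batch, seq_len, dim). Non-causal, single head.
    phi(x) = elu(x) + 1 keeps the features positive; other entries in the
    table swap in softmax (Efficient Attention) or l2 normalization
    (Real-time Semantic Segmentation) as the feature map.
    """
    phi_q = torch.nn.functional.elu(q) + 1                              # (B, N, D)
    phi_k = torch.nn.functional.elu(k) + 1                              # (B, N, D)
    kv = torch.einsum('bnd,bne->bde', phi_k, v)                         # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(1)) + eps)   # per-row normalizer
    return torch.einsum('bnd,bde,bn->bne', phi_q, kv, z)                # (B, N, D)
```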
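
The Linformer row instead projects the sequence dimension of K and V down to a fixed length before ordinary softmax attention. A hedged sketch of that idea follows (single head; `proj_k`, `proj_v`, and the fixed length `k_proj` are illustrative names, not the linformer-pytorch API):

```python
import torch
import torch.nn.functional as F

def low_rank_attention(q, k, v, proj_k, proj_v):
    """Linformer-style attention: K and V are projected from length n to k_proj.

    q, k, v: (batch, n, d); proj_k, proj_v: (k_proj, n) learned matrices.
    The score matrix is (n x k_proj) instead of (n x n).
    """
    k_low = torch.einsum('kn,bnd->bkd', proj_k, k)                 # (B, k_proj, D)
    v_low = torch.einsum('kn,bnd->bkd', proj_v, v)                 # (B, k_proj, D)
    scores = q @ k_low.transpose(1, 2) / q.shape[-1] ** 0.5        # (B, N, k_proj)
    return F.softmax(scores, dim=-1) @ v_low                       # (B, N, D)
```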
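
The Performer-related rows (Masked Language Modeling for Proteins, Rethinking Attention with Performers, Random Feature Attention) approximate the softmax kernel with random features so attention again factorizes as `phi(Q)(phi(K)^T V)`. Below is a simplified, non-causal sketch using plain Gaussian (not orthogonal) random features; it illustrates the idea and is not the google-research implementation.

```python
import torch

def softmax_kernel_features(x, w):
    """Positive random features with E[phi(q) . phi(k)] ~ exp(q . k / sqrt(d)).

    x: (batch, n, d); w: (m, d) random Gaussian projection, w ~ N(0, I).
    """
    d, m = x.shape[-1], w.shape[0]
    x = x / d ** 0.25                                   # fold in the 1/sqrt(d) scaling
    wx = torch.einsum('bnd,md->bnm', x, w)              # (B, N, M)
    sq = (x ** 2).sum(-1, keepdim=True) / 2             # ||x||^2 / 2
    return torch.exp(wx - sq) / m ** 0.5

def performer_style_attention(q, k, v, w, eps=1e-6):
    """Unbiased softmax-attention approximation via phi(Q) (phi(K)^T V)."""
    phi_q, phi_k = softmax_kernel_features(q, w), softmax_kernel_features(k, w)
    kv = torch.einsum('bnm,bnd->bmd', phi_k, v)                          # (B, M, D)
    z = 1.0 / (torch.einsum('bnm,bm->bn', phi_q, phi_k.sum(1)) + eps)    # normalizer
    return torch.einsum('bnm,bmd,bn->bnd', phi_q, kv, z)                 # (B, N, D)
```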
Articles/Surveys/Benchmarks