awesome-fast-attention
list of efficient attention modules
https://github.com/Separius/awesome-fast-attention
Efficient Attention
| Paper | Implementation | AutoRegressive | Main Idea |
|---|---|---|---|
| Interlaced Sparse Self-Attention for Semantic Segmentation | - | - | - |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks | - | - | - |
| CBAM: Convolutional Block Attention Module | [attention-module](https://github.com/Jongchan/attention-module) | :x: | combines SE attention with a per-pixel (local) weight |
| Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks | set_transformer | :x: | uses K relay nodes |
| CCNet: Criss-Cross Attention for Semantic Segmentation | - | - | - |
| Efficient Attention: Attention with Linear Complexities | [efficient-attention](https://github.com/cmsflash/efficient-attention) | :x: | `Softmax(Q)*(Softmax(K^T)*V)` (see the linear-attention sketch below the table) |
| Star-Transformer | - | - | - |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond | - | - | - |
| Generating Long Sequences with Sparse Transformers | - | - | - |
| SCRAM: Spatially Coherent Randomized Attention Maps | - | :heavy_check_mark: | uses PatchMatch to find close keys |
| Large Memory Layers with Product Keys | - | - | - |
| Expectation-Maximization Attention Networks for Semantic Segmentation | - | - | - |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning | - | - | attends to distant tokens coarsely and to nearby tokens in a fine-grained manner |
| Compressive Transformers for Long-Range Sequence Modelling | [compressive-transformer-pytorch](https://github.com/lucidrains/compressive-transformer-pytorch) | :heavy_check_mark: | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL |
| Axial Attention in Multidimensional Transformers | [axial-attention](https://github.com/lucidrains/axial-attention) | :heavy_check_mark: | applies attention along each axis separately |
| Reformer: The Efficient Transformer | - | - | - |
| Sparse Sinkhorn Attention | [sinkhorn-transformer](https://github.com/lucidrains/sinkhorn-transformer) | :heavy_check_mark: | uses a cost matrix to limit attention between buckets |
| Transformer on a Diet | [transformer-on-diet](https://github.com/cgraywang/transformer-on-diet) | :heavy_check_mark: | dilated transformer, similar to WaveNet |
| Time-aware Large Kernel Convolutions | - | - | uses a summed-area table |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection | - | :heavy_check_mark: | learns the query-key connections, i.e. dynamically creates a sparse attention matrix |
| Jukebox: A Generative Model for Music | - | - | - |
| Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers | - | :heavy_check_mark: | does not compute pairwise interactions and uses fixed mask patterns |
| GMAT: Global Memory Augmentation for Transformers | - | - | - |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | [fast-transformers](https://github.com/idiap/fast-transformers) | :heavy_check_mark: | uses `phi(q)(phi(k)v)` and also improves the sequential sampling step (see the linear-attention sketch below the table) |
| Linformer: Self-Attention with Linear Complexity | [linformer-pytorch](https://github.com/tatp22/linformer-pytorch) | :x: | projects keys and values from n×d down to k×d (see the projection sketch below the table) |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers | [google-research](https://github.com/google-research/google-research/tree/master/performer/fast_attention) | :heavy_check_mark: | calculates an unbiased stochastic approximation of the attention matrix |
| Kronecker Attention Networks | [kronecker-attention-pytorch](https://github.com/lucidrains/kronecker-attention-pytorch) | :x: | uses horizontal and lateral average matrices |
| Real-time Semantic Segmentation with Fast Attention | - | :x: | `l2_norm(q)*(l2_norm(k)*v)` |
| Fast Transformers with Clustered Attention | [fast-transformers](https://github.com/idiap/fast-transformers) | :x: | groups queries together with LSH |
| Big Bird: Transformers for Longer Sequences | - | - | - |
| Efficient Content-Based Sparse Attention with Routing Transformers | [routing-transformer](https://github.com/lucidrains/routing-transformer) | :heavy_check_mark: | computes attention with same-cluster tokens (clusters found by online k-means) |
| Neural Architecture Search for Lightweight Non-Local Networks | - | - | - |
| Longformer: The Long-Document Transformer | - | - | - |
| ETC: Encoding Long and Structured Inputs in Transformers | - | :x: | combines global attention (Star-Transformer with multiple global tokens) with local attention |
| Multi-scale Transformer Language Models | - | - | - |
| Synthesizer: Rethinking Self-Attention in Transformer Models | [Synthesizer-Rethinking-Self-Attention-Transformer-Models](https://github.com/leaderj1001/Synthesizer-Rethinking-Self-Attention-Transformer-Models) | :heavy_check_mark: | does not compute pairwise interactions |
| Rethinking Attention with Performers | [google-research](https://github.com/google-research/google-research/tree/master/performer) | :heavy_check_mark: | unbiased approximation of the attention matrix with a softmax kernel (see the random-feature sketch below the table) |
| Memformer: The Memory-Augmented Transformer | - | - | Memory-Replay BackPropagation |
| SMYRF: Efficient Attention using Asymmetric Clustering | - | - | - |
| Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting | - | - | - |
| Sub-Linear Memory: How to Make Performers SLiM | [google-research](https://github.com/google-research/google-research/tree/master/performer/models/slim_performer) | :heavy_check_mark: | Performer, but with sub-linear memory usage |
| Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention | - | - | - |
| Linear Transformers Are Secretly Fast Weight Memory Systems | [fast-weight-transformers](https://github.com/ischlag/fast-weight-transformers) | :heavy_check_mark: | shows that linear transformers are essentially fast weight networks and proposes a new kernel function for linearising attention, balancing simplicity and effectiveness |
| LambdaNetworks: Modeling Long-Range Interactions Without Attention | [lambda-networks](https://github.com/lucidrains/lambda-networks) | :heavy_check_mark: | generates a linear layer based on context and decouples position from content |
| Random Feature Attention | - | :heavy_check_mark: | kernel approximation of softmax attention; also casts transformers as RNNs |
| Tensor Low-Rank Reconstruction for Semantic Segmentation | - | :x: | decomposes the full attention tensor into rank-one tensors (CP decomposition) |
| Looking for change? Roll the Dice and demand Attention | - | - | - |
| Generating Wikipedia by Summarizing Long Sequences | [memory-compressed-attention](https://github.com/lucidrains/memory-compressed-attention) | :heavy_check_mark: | compresses keys and values + blocked attention |
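
The `Softmax(Q)*(Softmax(K^T)*V)` and `phi(q)(phi(k)v)` rows above all exploit the same associativity trick: apply a feature map to queries and keys, then compute `phi(K)^T V` first so the cost is linear in sequence length. The sketch below is a minimal, non-causal, single-head illustration rather than the API of any listed repo; the `elu(x)+1` feature map follows "Transformers are RNNs", and all tensor names and shapes are assumptions.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """phi(Q) @ (phi(K)^T @ V), computed in O(N * D^2) instead of O(N^2 * D).

    q, k, v: (batch, seq_len, dim). Non-causal, single head.
    phi(x) = elu(x) + 1 keeps the features positive; other entries in the
    table swap in softmax (Efficient Attention) or l2 normalization
    (Real-time Semantic Segmentation) as the feature map.
    """
    phi_q = torch.nn.functional.elu(q) + 1                              # (B, N, D)
    phi_k = torch.nn.functional.elu(k) + 1                              # (B, N, D)
    kv = torch.einsum('bnd,bne->bde', phi_k, v)                         # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(1)) + eps)   # per-row normalizer
    return torch.einsum('bnd,bde,bn->bne', phi_q, kv, z)                # (B, N, D)
```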
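
The Linformer row instead projects the sequence dimension of K and V down to a fixed length before ordinary softmax attention. A hedged sketch of that idea follows (single head; `proj_k`, `proj_v`, and the fixed length `k_proj` are illustrative names, not the linformer-pytorch API):

```python
import torch
import torch.nn.functional as F

def low_rank_attention(q, k, v, proj_k, proj_v):
    """Linformer-style attention: K and V are projected from length n to k_proj.

    q, k, v: (batch, n, d); proj_k, proj_v: (k_proj, n) learned matrices.
    The score matrix is (n x k_proj) instead of (n x n).
    """
    k_low = torch.einsum('kn,bnd->bkd', proj_k, k)                 # (B, k_proj, D)
    v_low = torch.einsum('kn,bnd->bkd', proj_v, v)                 # (B, k_proj, D)
    scores = q @ k_low.transpose(1, 2) / q.shape[-1] ** 0.5        # (B, N, k_proj)
    return F.softmax(scores, dim=-1) @ v_low                       # (B, N, D)
```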
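
The Performer-related rows (Masked Language Modeling for Proteins, Rethinking Attention with Performers, Random Feature Attention) approximate the softmax kernel with random features so attention again factorizes as `phi(Q)(phi(K)^T V)`. Below is a simplified, non-causal sketch using plain Gaussian (not orthogonal) random features; it illustrates the idea and is not the google-research implementation.

```python
import torch

def softmax_kernel_features(x, w):
    """Positive random features with E[phi(q) . phi(k)] ~ exp(q . k / sqrt(d)).

    x: (batch, n, d); w: (m, d) random Gaussian projection, w ~ N(0, I).
    """
    d, m = x.shape[-1], w.shape[0]
    x = x / d ** 0.25                                   # fold in the 1/sqrt(d) scaling
    wx = torch.einsum('bnd,md->bnm', x, w)              # (B, N, M)
    sq = (x ** 2).sum(-1, keepdim=True) / 2             # ||x||^2 / 2
    return torch.exp(wx - sq) / m ** 0.5

def performer_style_attention(q, k, v, w, eps=1e-6):
    """Unbiased softmax-attention approximation via phi(Q) (phi(K)^T V)."""
    phi_q, phi_k = softmax_kernel_features(q, w), softmax_kernel_features(k, w)
    kv = torch.einsum('bnm,bnd->bmd', phi_k, v)                          # (B, M, D)
    z = 1.0 / (torch.einsum('bnm,bm->bn', phi_q, phi_k.sum(1)) + eps)    # normalizer
    return torch.einsum('bnm,bmd,bn->bnd', phi_q, kv, z)                 # (B, N, D)
```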
Articles/Surveys/Benchmarks