# awesome-vlp-and-efficient-transformer

Vision-Language Pretraining & Efficient Transformer Papers.

https://github.com/zh-plus/awesome-vlp-and-efficient-transformer

- [Vision Language Models](#vision-language-models)
- [Vision-Language Pretraining](#vision-language-pretraining)
- [ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [NeurIPS 2019]](#vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks-neurips-2019)
- [LXMERT: Learning Cross-Modality Encoder Representations from Transformers [EMNLP 2019]](#lxmert-learning-cross-modality-encoder-representations-from-transformers-emnlp-2019)
- [VisualBERT: A Simple and Performant Baseline for Vision and Language [arXiv 2019/08, ACL 2020]](#visualbert-a-simple-and-performant-baseline-for-vision-and-language-arxiv-201908-acl-2020)
- [VL-BERT: Pre-training of Generic Visual-Linguistic Representations [ICLR 2020]](#vl-bert-pre-training-of-generic-visual-linguistic-representations-iclr-2020)
- [Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training [AAAI 2020]](#unicoder-vl-a-universal-encoder-for-vision-and-language-by-cross-modal-pre-training-aaai-2020)
- [Unified Vision-Language Pre-Training for Image Captioning and VQA [AAAI 2020]](#unified-vision-language-pre-training-for-image-captioning-and-vqa-aaai-2020)
- [UNITER: Learning Universal Image-text Representations [ECCV 2020]](#uniter-learning-universal-image-text-representations-eccv-2020)
- [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [arXiv 2020/04, ECCV 2020]](#oscar-object-semantics-aligned-pre-training-for-vision-language-tasks-arxiv-202004-eccv-2020)
- [Learning Transferable Visual Models From Natural Language Supervision [OpenAI papers 2021/01]](#learning-transferable-visual-models-from-natural-language-supervision-openai-papers-202101)
- [Video-Language Pretraining](#video-language-pretraining)
- [VideoBERT: A Joint Model for Video and Language Representation Learning [ICCV 2019]](#videobert-a-joint-model-for-video-and-language-representation-learning-iccv-2019)
- [Multi-modal Transformer for Video Retrieval [ECCV 2020]](#multi-modal-transformer-for-video-retrieval-eccv-2020)
- [HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [EMNLP 2020]](#hero-hierarchical-encoder-for-videolanguage-omni-representation-pre-training-emnlp-2020)
- [UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation](#univl-a-unified-video-and-language-pre-training-model-for-multimodal-understanding-and-generation)
- [Image-Text Retrieval & Matching](#image-text-retrieval--matching)
- [ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data](#imagebert-cross-modal-pre-training-with-large-scale-weak-supervised-image-text-data)
- [Cross-Probe BERT for Efficient and Effective Cross-Modal Search](#cross-probe-bert-for-efficient-and-effective-cross-modal-search)
- [Multi-Modality Cross Attention Network for Image and Sentence Matching [CVPR 2020]](#multi-modality-cross-attention-network-for-image-and-sentence-matching-cvpr-2020)
- [Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [CVPR 2021]](#thinking-fast-and-slow-efficient-text-to-visual-retrieval-with-transformers-cvpr-2021)
- [Analysis](#analysis)
- [12-in-1: Multi-Task Vision and Language Representation Learning [CVPR 2020]](#12-in-1-multi-task-vision-and-language-representation-learning-cvpr-2020)
- [Are we pretraining it right? Digging deeper into visio-linguistic pretraining](#are-we-pretraining-it-right-digging-deeper-into-visio-linguistic-pretraining)
- [Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models](#behind-the-scene-revealing-the-secrets-of-pre-trained-vision-and-language-models)
- [Adaptive Transformers for Learning Multimodal Representations [ACL 2020]](#adaptive-transformers-for-learning-multimodal-representations-acl-2020)
- [Data, Architecture, or Losses: What Contributes Most to Multimodal Transformer Success? [TACL 2021]](#data-architecture-or-losses-what-contributes-most-to-multimodal-transformer-success-tacl-2021)
- [Survey](#survey)
- [Pre-trained Models for Natural Language Processing: A Survey [arXiv 2020/03]](#pre-trained-models-for-natural-language-processing-a-survey-arxiv-202003)
- [A Survey on Contextual Embeddings [arXiv 2020/03]](#a-survey-on-contextual-embeddings-arxiv-202003)
- [Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [arXiv 2019]](#trends-in-integration-of-vision-and-language-research-a-survey-of-tasks-datasets-and-methods-arxiv-2019)
- [Deep Multimodal Representation Learning: A Survey [arXiv 2019]](#deep-multimodal-representation-learning-a-survey-arxiv-2019)
- [A Survey on Visual Transformer [arXiv 2020/12]](#a-survey-on-visual-transformer-arxiv-202012)
- [Platforms](#platforms)
- [facebook MMF](#facebook-mmf)
- [Transformer](#transformer)
- [Efficient Transformers](#efficient-transformers)
- [Performer: Rethinking Attention with Performers [arXiv 2020/09, Under review of ICLR 2021]](#performer-rethinking-attention-with-performers-arxiv-202009-under-review-of-iclr-2021)
- [Linformer: Self-Attention with Linear Complexity [arXiv 2020/06]](#linformer-self-attention-with-linear-complexity-arxiv-202006)
- [Linear Transformer: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [ICML 2020]](#linear-transformer-transformers-are-rnns-fast-autoregressive-transformers-with-linear-attention-icml-2020)
- [Synthesizer: Neural Speech Synthesis with Transformer Network [AAAI 2019]](#synthesizer-neural-speech-synthesis-with-transformer-network-aaai-2019)
- [Sinkhorn Transformer: Sparse Sinkhorn Attention [ICML 2020]](#sinkhorn-transformer-sparse-sinkhorn-attention-icml-2020)
- [Reformer: The Efficient Transformer [ICLR 2020]](#reformer-the-efficient-transformer-iclr-2020)
- [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [arXiv 2019/06]](#transformer-xl-attentive-language-models-beyond-a-fixed-length-context-arxiv-201906)
- [Compressive Transformers for Long-Range Sequence Modelling [ICLR 2020]](#compressive-transformers-for-long-range-sequence-modelling-iclr-2020)
- [Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks [ICML 2019]](#set-transformer-a-framework-for-attention-based-permutation-invariant-neural-networks-icml-2019)
- [Longformer: The Long-Document Transformer [arXiv 2020/04]](#longformer-the-long-document-transformer-arxiv-202004)
- [Routing Transformer: Efficient Content-Based Sparse Attention with Routing Transformers [arXiv 2020/10]](#routing-transformer-efficient-content-based-sparse-attention-with-routing-transformers-arxiv-202010)
- [Big Bird: Transformers for Longer Sequences [NeurIPS 2020]](#big-bird-transformers-for-longer-sequences-neurips-2020)
- [Etc: Encoding long and structured data in transformers [EMNLP 2020]](#etc-encoding-long-and-structured-data-in-transformers-emnlp-2020)
- [Memory Compressed: Generating Wikipedia by Summarizing Long Sequences [ICLR 2018]](#memory-compressed-generating-wikipedia-by-summarizing-long-sequences-iclr-2018)
- [Blockwise Transformer: Blockwise Self-Attention for Long Document Understanding [arXiv 2020/10]](#blockwise-transformer-blockwise-self-attention-for-long-document-understanding-arxiv-202010)
- [Image Transformer [ICML 2018]](#image-transformer-icml-2018)
- [Sparse Transformer: Generating Long Sequences with Sparse Transformers [arXiv 2019/04]](#sparse-transformer-generating-long-sequences-with-sparse-transformers-arxiv-201904)
- [Axial Transformer: Axial Attention in Multidimensional Transformers [arXiv 2019/12]](#axial-transformer-axial-attention-in-multidimensional-transformers-arxiv-201912)
- [Fastformer: Additive Attention Can Be All You Need [arXiv 2021/08]](#fastformer-additive-attention-can-be-all-you-need-arxiv-202108)
- [Image Transformers](#image-transformers)
- [ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Under review of ICLR 2021]](#vit-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-under-review-of-iclr-2021)
- [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [Arxiv 2021/03]](#swin-transformer-hierarchical-vision-transformer-using-shifted-windows-arxiv-202103)
- [Unsupervised Pretraining of Vision Transformer](#unsupervised-pretraining-of-vision-transformer)
- [SiT: Self-supervised vIsion Transformer [arXiv 2021/04]](#sit-self-supervised-vision-transformer-arxiv-202104)
- [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [CVPR 2021]](#up-detr-unsupervised-pre-training-for-object-detection-with-transformers-cvpr-2021)
- [BEIT: BERT Pre-Training of Image Transformers [arXiv 2021/06]](#beit-bert-pre-training-of-image-transformers-arxiv-202106)
- [Transformer GAN](#transformer-gan)
- [TransGAN: Two Transformers Can Make One Strong GAN [Arxiv 2021/02]](#transgan-two-transformers-can-make-one-strong-gan-arxiv-202102)
- [GANsformer: Generative Adversarial Transformers [Arxiv 2021/03]](#gansformer-generative-adversarial-transformers-arxiv-202103)
- [TFill: Image Completion via a Transformer-Based Architecture [Arxiv 2021/04]](#tfill-image-completion-via-a-transformer-based-architecture-arxiv-202104)
- [Transformer Visualizations](#transformer-visualizations)
- [Transformer Interpretability Beyond Attention Visualization [Arxiv 2020/12]](#transformer-interpretability-beyond-attention-visualization-arxiv-202012)
- [Transformer Internal Essence](#transformer-internal-essence)
- [Pretrained Transformers as Universal Computation Engines [Arxiv 2021/03]](#pretrained-transformers-as-universal-computation-engines-arxiv-202103)
- [Survey](#survey-1)
- [Pre-trained Models for Natural Language Processing: A Survey](#pre-trained-models-for-natural-language-processing-a-survey)
- [A Survey on Contextual Embeddings](#a-survey-on-contextual-embeddings)
- [Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods](#trends-in-integration-of-vision-and-language-research-a-survey-of-tasks-datasets-and-methods)
- [Deep Multimodal Representation Learning: A Survey](#deep-multimodal-representation-learning-a-survey)
- [Multimodal Machine Learning: A Survey and Taxonomy](#multimodal-machine-learning-a-survey-and-taxonomy)

## Vision Language Models

### Vision-Language Pretraining

#### ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [NeurIPS 2019]

[[paper]](https://arxiv.org/abs/1908.02265) [[code]](https://github.com/jiasenlu/vilbert_beta) *Facebook AI Research*

- Architecture: Two stream :arrows_clockwise: **co-attentional transformer layers**

- Pretrain dataset: Conceptual Captions (~3.3M)
- Pretrain Tasks
- Masked Multi-modal Modelling: predict the semantics of masked words and image regions given the unmasked inputs (see the sketch below)

**image**: predict a distribution over semantic classes for each masked region, then minimize the KL divergence between this prediction and the class distribution output by the pretrained detection model.

**text**: same as BERT.
- Multi-modal Alignment: predict whether an image and a text segment correspond, **using the [IMG] and [CLS] outputs**
- Image feature (Faster R-CNN)
- region features from a pretrained object detection network
- spatial location encoding, projected to match the visual feature dimension
- Text feature
- Google's WordPiece tokenizer
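
A minimal sketch of the masked-region objective above, assuming hypothetical tensors `region_logits` (the model's predictions for masked regions) and `detector_probs` (the class distribution from the frozen detector); the actual ViLBERT implementation differs in detail.

```python
import torch
import torch.nn.functional as F

def masked_region_loss(region_logits, detector_probs):
    """KL divergence between the model's predicted class distribution for
    masked image regions and the distribution given by the frozen detector.

    region_logits:  (num_masked_regions, num_classes) raw model outputs
    detector_probs: (num_masked_regions, num_classes) detector distribution
    """
    log_pred = F.log_softmax(region_logits, dim=-1)
    # KL(detector || model), averaged over masked regions
    return F.kl_div(log_pred, detector_probs, reduction="batchmean")

# toy usage with illustrative shapes
logits = torch.randn(4, 1601)
target = torch.softmax(torch.randn(4, 1601), dim=-1)
loss = masked_region_loss(logits, target)
```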

#### LXMERT: Learning Cross-Modality Encoder Representations from Transformers [EMNLP 2019]

[[paper]](https://arxiv.org/abs/1908.07490) [[code]](https://github.com/airsplay/lxmert) *The University of North Carolina*

- Architecture: Two stream --- Object relationship encoder (**Image**), language encoder (**Text**), cross-modality encoder.

- Pretrain dataset: COCO + Visual Genome (9.18 M)
- Pretrain Tasks

- MLM, Masked Object Prediction (MOP) [**feature regression** and **label classification**], Cross-modality Matching **with only the [CLS] output**, Image Question Answering
- Image feature (Faster R-CNN)
- region features and bounding-box positions from a pretrained object detector
- each is projected into the embedding space and then combined
- Text feature
- WordPiece embeddings, as in BERT

#### VisualBERT: A Simple and Performant Baseline for Vision and Language [arXiv 2019/08, ACL 2020]

[[paper]](https://arxiv.org/abs/1908.03557) [[code]](https://github.com/uclanlp/visualbert)

- Architecture: Single stream BERT

- Pretrain dataset: COCO (100k)
- Pretrain tasks:
- Task-Agnostic Pretraining
- MLM with only text masked
- Sentence-image matching (Cross-modality Matching) **with only [CLS] output**
- Task-Specific Pretraining
MLM on the task-specific dataset, which helps the model adapt to the new target domain.
- Features
- Image feature (Faster R-CNN)
visual feature representation: bounding region feature + segment embedding + position embedding
- Text feature: same as BERT

#### VL-BERT: Pre-training of Generic Visual-Linguistic Representations [ICLR 2020]

[[paper]](https://arxiv.org/abs/1908.08530) [[code]](https://github.com/jackroos/VL-BERT) *USTC & MSRA*

- Architecture: Single stream BERT

- Pretrain dataset: Conceptual Captions (3.3M) for visual-linguistic & BooksCorpus, English Wikipedia for pure text corpus

- Pretrain Tasks

- MLM, Masked RoI Classification with Linguistic Clues
- They find that the **Cross-modality Matching** (sentence-image relationship prediction) task does not help, so it is dropped.

- Features

- Visual Feature Embedding (Faster R-CNN)
- visual appearance embedding: 2048-d RoI feature
For **non-visual elements** (text tokens), the feature is extracted from an RoI covering the whole input image.
- visual geometry embedding: the bounding-box geometry is mapped to a 2048-d representation by computing sine and cosine functions of different wavelengths, following "Relation Networks for Object Detection"
- Token Embedding
- WordPiece Embedding
For **visual elements**, a special [IMG] token is assigned.
- Segment Embedding: learnable
- Sequence Position Embedding: learnable

#### Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training [AAAI 2020]

[[paper]](https://arxiv.org/abs/1908.06066)

- Architecture: Single stream BERT

- Pretrain dataset: Conceptual Captions (3M) + SBU Captions (0.8M)

- Pretrain tasks

MLM + Masked Object Classification + Visual-linguistic Matching (Cross-modality Matching) **with only the [CLS] output**

- Features
- Image feature (Faster R-CNN)
- [IMG] token + segment embedding + position embedding, plus the region embedding below
- the region's visual feature and position feature are projected separately into the embedding space with FC layers, then added up
- Text feature: same as BERT

#### Unified Vision-Language Pre-Training for Image Captioning and VQA [AAAI 2020]

[[code]](https://github.com/LuoweiZhou/VLP) (**VLP**)

#### UNITER: Learning Universal Image-text Representations [ECCV 2020]

[[paper]](https://arxiv.org/abs/1909.11740) [[code]](https://github.com/ChenRocks/UNITER)

#### Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [arXiv 2020/04, ECCV 2020]

[[paper]](https://arxiv.org/pdf/2004.06165.pdf) [[code]](https://github.com/microsoft/Oscar)

#### Learning Transferable Visual Models From Natural Language Supervision [OpenAI papers 2021/01]

[[paper]](https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf) [[blog]](https://openai.com/blog/clip/) [[code]](https://github.com/openai/CLIP)

---

### Video-Language Pretraining

#### VideoBERT: A Joint Model for Video and Language Representation Learning [ICCV 2019]

[[paper]](https://openaccess.thecvf.com/content_ICCV_2019/papers/Sun_VideoBERT_A_Joint_Model_for_Video_and_Language_Representation_Learning_ICCV_2019_paper.pdf)

#### Multi-modal Transformer for Video Retrieval [ECCV 2020]

[[paper]](https://arxiv.org/abs/2007.10639) [[code]](https://github.com/gabeur/mmt)

#### HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [EMNLP 2020]

[[paper]](https://www.aclweb.org/anthology/2020.emnlp-main.161.pdf) [[code]](https://github.com/linjieli222/HERO)

#### UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

[[paper]](https://arxiv.org/abs/2002.06353)

---

### Image-Text Retrieval & Matching

#### ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data

arXiv 2020/01 [[paper]](https://arxiv.org/abs/2001.07966)

#### Cross-Probe BERT for Efficient and Effective Cross-Modal Search

ICLR 2021 submission. [[paper]](https://openreview.net/forum?id=bW9SYKHcZiz)

#### Multi-Modality Cross Attention Network for Image and Sentence Matching [CVPR 2020]

[[paper]](http://openaccess.thecvf.com/content_CVPR_2020/html/Wei_Multi-Modality_Cross_Attention_Network_for_Image_and_Sentence_Matching_CVPR_2020_paper.html)

#### Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [CVPR 2021]
[[paper]](https://arxiv.org/abs/2103.16553)

---

### Analysis

#### 12-in-1: Multi-Task Vision and Language Representation Learning [CVPR 2020]

[[paper]](https://arxiv.org/abs/1912.02315) [[code]](https://github.com/facebookresearch/vilbert-multi-task) Multi-task Learning

#### Are we pretraining it right? Digging deeper into visio-linguistic pretraining

arXiv 2020/04 [[paper]](https://arxiv.org/abs/2004.08744) In-depth Analysis

#### Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
arXiv 2020/05, ECCV 2020 Spotlight [[paper]](https://arxiv.org/abs/2005.07310) In-depth Analysis

#### Adaptive Transformers for Learning Multimodal Representations [ACL 2020]

[[paper]](https://arxiv.org/abs/2005.07486) Adaptive Transformer Analysis

#### Data, Architecture, or Losses: What Contributes Most to Multimodal Transformer Success? [TACL 2021]

[[paper]](https://arxiv.org/pdf/2102.00529.pdf)

---

### Survey

#### Pre-trained Models for Natural Language Processing: A Survey [arXiv 2020/03]

[[paper]](https://arxiv.org/abs/2003.08271)

#### A Survey on Contextual Embeddings [arXiv 2020/03]

[[paper]](https://arxiv.org/abs/2003.07278)

#### Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods [arXiv 2019]

[[paper]](https://arxiv.org/abs/1907.09358)

#### Deep Multimodal Representation Learning: A Survey [arXiv 2019]

[[paper]](https://ieeexplore.ieee.org/abstract/document/8715409)


#### A Survey on Visual Transformer [arXiv 2020/12]

[[paper]](https://arxiv.org/abs/2012.12556)

---

### Platforms

#### facebook MMF

https://github.com/facebookresearch/mmf

## Transformer

### Efficient Transformers

1. Fixed Patterns

[Image Transformer [ICML 2018]](#image-transformer-icml-2018)
- Blockwise Patterns
- Strided Patterns
- Compressed Patterns
2. Combination of Patterns

Combining two or more distinct access patterns.
3. Learnable Patterns

[Reformer: The Efficient Transformer [ICLR 2020]](#reformer-the-efficient-transformer-iclr-2020)

In contrast to fixed patterns, learnable patterns aim to learn the access pattern in a data-driven fashion.
4. Memory

[Longformer: The Long-Document Transformer [arXiv 2020/04]](#longformer-the-long-document-transformer-arxiv-202004)

Leverage a side memory module to access multiple tokens at once.
5. Low-Rank Methods

[Linformer: Self-Attention with Linear Complexity [arXiv 2020/06]](#linformer-self-attention-with-linear-complexity-arxiv-202006)

Leverage low-rank approximations of the self-attention matrix.
6. Kernels

[Linear Transformer: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [ICML 2020]](#linear-transformer-transformers-are-rnns-fast-autoregressive-transformers-with-linear-attention-icml-2020)

View the attention mechanism through kernelization, which enables a clever mathematical re-writing of the self-attention mechanism that avoids explicitly computing the N×N attention matrix (see the sketch after this list). Kernel methods can also be viewed as a form of low-rank method.
7. Recurrence

[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [arXiv 2019/06]](#transformer-xl-attentive-language-models-beyond-a-fixed-length-context-arxiv-201906)

A natural extension of the blockwise method is to connect these blocks via recurrence.
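
A minimal sketch of the kernel idea from item 6, using the simple elu(x)+1 feature map of the Linear Transformer rather than Performer's random features; function and tensor names are illustrative.

```python
import torch

def kernel_attention(q, k, v, eps=1e-6):
    """Linearized attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), so the N x N attention matrix is never formed.

    q, k, v: (batch, seq_len, dim)
    """
    phi_q = torch.nn.functional.elu(q) + 1            # feature map phi
    phi_k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", phi_k, v)        # (batch, dim, dim_v)
    z = 1.0 / (torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", phi_q, kv, z)

out = kernel_attention(torch.randn(2, 128, 64), torch.randn(2, 128, 64), torch.randn(2, 128, 64))
```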

#### Performer: Rethinking Attention with Performers [arXiv 2020/09, Under review of ICLR 2021]

[[paper]](https://arxiv.org/abs/2009.14794) [[code]](https://github.com/lucidrains/performer-pytorch) *Google & University of Cambridge & DeepMind & Alan Turing Institute*

#### Linformer: Self-Attention with Linear Complexity [arXiv 2020/06]

[[paper]](https://arxiv.org/pdf/2006.04768.pdf) [[code]](https://github.com/tatp22/linformer-pytorch) *FAIR*

![](images/Linformer.png)

- Tasks: Natural language understanding and downstream tasks.
- Contribution: projecting the (N, d) key and value matrices down to (k, d) with learned projections, where k << N (see the sketch below).
- Complexity: O(n)
- Restrictions:
- The projection mixes information across sequence positions, which makes it non-trivial to maintain causal masking or prevent past-future information mixing.
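
A minimal single-head sketch of the projection described above (illustrative class and parameter names; the official Linformer adds multi-head attention, projection sharing, and proper initialization):

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Project keys and values from length n down to k before attention,
    so the attention matrix is (n x k) instead of (n x n)."""

    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj_k = nn.Parameter(torch.randn(seq_len, k))  # E: (n, k)
        self.proj_v = nn.Parameter(torch.randn(seq_len, k))  # F: (n, k)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (batch, n, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        k = torch.einsum("bnd,nk->bkd", k, self.proj_k)      # (batch, k, dim)
        v = torch.einsum("bnd,nk->bkd", v, self.proj_v)      # (batch, k, dim)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)  # (batch, n, k)
        return attn @ v                                      # (batch, n, dim)

attn = LinformerSelfAttention(dim=64, seq_len=512, k=64)
out = attn(torch.randn(2, 512, 64))
```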

#### Linear Transformer: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [ICML 2020]

[[paper]](https://arxiv.org/abs/2006.16236) [[code]](https://github.com/lucidrains/linear-attention-transformer) *Idiap Research Institute*

#### Synthesizer: Neural Speech Synthesis with Transformer Network [AAAI 2019]

[[paper]](https://arxiv.org/abs/1809.08895) [[code]](https://github.com/soobinseo/Transformer-TTS) *UESTC, MSRA*

#### Sinkhorn Transformer: Sparse Sinkhorn Attention [ICML 2020]

[[paper]](https://arxiv.org/abs/2002.11296) [[code]](https://github.com/lucidrains/sinkhorn-transformer) *Google AI*

#### Reformer: The Efficient Transformer [ICLR 2020]

[[paper]](https://openreview.net/pdf?id=rkgNKkHtvB) [[code]](https://github.com/lucidrains/reformer-pytorch) *UCB & Google Research*

![](images/Reformer.png)

- Tasks: Machine translation
- Contribution:
- Locality Sensitive Hashing Attention (LSHA)
- Weight approximation

For each query $q_i$, the attention output is $o_i = \sum_j \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) v_j$.

Since the softmax is dominated by its largest inputs, the few largest terms roughly approximate the full sum:

\[10, 7, 1, 0, 2\] ---softmax---> [95%, 4.7%, 0.012%, 0.0043%, 0.032%]
- Shared-QK Transformer

For each position $j$, the key is tied to the query: $k_j = \frac{q_j}{\|q_j\|}$.

> Sharing Q and K does not affect transformer performance.

- LSH bucket

Split queries and keys into buckets with locality-sensitive hashing. Each query only attends to keys in the same bucket (see the sketch below).

- Complexity: see Table 3 of the paper.
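
A minimal sketch of the angular-LSH bucketing step (random rotations, as in the paper), omitting Reformer's multi-round hashing, sorting/chunking, and causal masking; names are illustrative.

```python
import torch

def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each (unit-normalized) vector to one of n_buckets via a random
    rotation: argmax over [xR; -xR] gives an angular LSH bucket id.

    vectors: (seq_len, dim) shared queries/keys (Reformer ties k_j = q_j / ||q_j||)
    """
    torch.manual_seed(seed)
    dim = vectors.shape[-1]
    r = torch.randn(dim, n_buckets // 2)                  # random rotation matrix
    rotated = vectors @ r                                 # (seq_len, n_buckets // 2)
    return torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)  # bucket ids

qk = torch.nn.functional.normalize(torch.randn(1024, 64), dim=-1)
buckets = lsh_buckets(qk, n_buckets=32)
# attention is then restricted to pairs whose bucket ids match (after sorting/chunking)
```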

#### Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [arXiv 2019/06]

[[paper]](http://arxiv.org/abs/1901.02860) [[code]](https://github.com/kimiyoung/transformer-xl) *CMU, Google Brain*

#### Compressive Transformers for Long-Range Sequence Modelling [ICLR 2020]

[[paper]](https://openreview.net/forum?id=SylKikSYDH) [[code]](https://github.com/lucidrains/compressive-transformer-pytorch) *Deep Mind, UCL*

#### Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks [ICML 2019]

[[paper]](http://proceedings.mlr.press/v97/lee19d.html) *University of Oxford*

#### Longformer: The Long-Document Transformer [arXiv 2020/04]

[[paper]](https://arxiv.org/abs/2004.05150) [[code]](https://github.com/allenai/longformer) *Allen Institute for Artificial Intelligence*

![](images/Longformer.png)

- Tasks: long-document NLP tasks such as language modeling, summarization, and question answering.
- Contribution:
- Sliding window attention

A different window size w is used at different layers, increasing the receptive field as the model goes deeper (see the sketch below).
- Dilated sliding window

In multi-head attention, they mix window types: dilated sliding windows attend to longer context, while un-dilated windows focus on local context.
- Global+sliding window

Global attention is added at a few task-specific positions: \[CLS\] for classification, the whole question sentence for QA.
- Complexity: O(kn) for local attention, where k is the window size.
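
A minimal sketch of a sliding-window mask with optional global positions, built as a dense (n, n) boolean mask for clarity; the real Longformer uses banded custom kernels to keep memory at O(n·w). Names are illustrative.

```python
import torch

def longformer_mask(seq_len, window, global_idx=()):
    """Boolean (seq_len, seq_len) mask: True = attention allowed.

    Each token attends to neighbors within +/- window//2; tokens listed in
    global_idx (e.g. [CLS], question tokens) attend to and are attended by all.
    """
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() <= window // 2   # sliding window
    for g in global_idx:
        mask[g, :] = True    # global token attends everywhere
        mask[:, g] = True    # everyone attends to the global token
    return mask

mask = longformer_mask(seq_len=16, window=4, global_idx=[0])    # position 0 ~ [CLS]
# scores.masked_fill(~mask, float('-inf')) before the softmax
```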

#### Routing Transformer: Efficient Content-Based Sparse Attention with Routing Transformers [arXiv 2020/10]

[[paper]](https://arxiv.org/pdf/2003.05997.pdf) [[code]](https://github.com/lucidrains/routing-transformer) *Google Research*

#### Big Bird: Transformers for Longer Sequences [NeurIPS 2020]

[[paper]](https://proceedings.neurips.cc//paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf) *Google Research*

#### Etc: Encoding long and structured data in transformers [EMNLP 2020]

[[paper]](https://arxiv.org/pdf/2004.08483.pdf) [[code]](https://github.com/google-research/google-research/tree/master/etcmodel) *Google Research*

#### Memory Compressed: Generating Wikipedia by Summarizing Long Sequences [ICLR 2018]

[[paper]](https://arxiv.org/abs/1801.10198) [[code]](https://github.com/lucidrains/memory-compressed-attention) *Google Brain*

![](images/MemoryCompressed.png)

- Tasks: text generation (Wikipedia articles generated by summarizing long source documents).
- Contribution:
- Local Attention
- Memory-compressed Attention (see the sketch below)
- Complexity: O(bn) for Local Attention, where b is the block size. O(n*n/k) for Memory-compressed Attention, where k is the kernel size and stride of the `nn.Conv1d` that compresses keys and values.
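
A minimal sketch of memory-compressed attention: a strided `nn.Conv1d` (kernel size = stride = k) shortens the key/value sequence before ordinary attention. Single head, no causal masking; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class MemoryCompressedAttention(nn.Module):
    def __init__(self, dim, compress=3):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3)
        # strided convolutions compress the sequence length of K and V by `compress`
        self.compress_k = nn.Conv1d(dim, dim, kernel_size=compress, stride=compress)
        self.compress_v = nn.Conv1d(dim, dim, kernel_size=compress, stride=compress)
        self.scale = dim ** -0.5

    def forward(self, x):                                    # x: (batch, n, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        k = self.compress_k(k.transpose(1, 2)).transpose(1, 2)   # (batch, n/k, dim)
        v = self.compress_v(v.transpose(1, 2)).transpose(1, 2)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)  # (batch, n, n/k)
        return attn @ v

layer = MemoryCompressedAttention(dim=64, compress=4)
out = layer(torch.randn(2, 128, 64))                         # 128 keys compressed to 32
```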

#### Blockwise Transformer: Blockwise Self-Attention for Long Document Understanding [arXiv 2020/10]

[[paper]](https://arxiv.org/abs/1911.02972) [[code]](https://github.com/xptree/BlockBERT) *Tsinghua University, FAIR*

#### Image Transformer [ICML 2018]

[[paper]](https://arxiv.org/abs/1802.05751) [[code1]](https://github.com/sahajgarg/image_transformer) [[code2]](https://github.com/tensorflow/tensor2tensor/) *Google Brain, UCB, Google AI*

![](images/ImageTransformer.png)

- Tasks: image generation and super-resolution
- Contribution:
- Query block split & two local attention schemes (1D and 2D; see the sketch below)
- Complexity: O(nm), where n is the length of the flattened image and m is the memory length.
- Restrictions
- Only attends to a local neighborhood, which can be an issue when global information is required to solve a task.
- The local context size (query block length plus memory length) becomes an extra hyper-parameter.
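
A minimal sketch of 1D block-local attention in the spirit of the query-block scheme above: each query block attends to itself plus a fixed-length memory of preceding positions. Causal masking inside the block is omitted and names are illustrative.

```python
import torch

def block_local_attention(q, k, v, block=64, memory=64):
    """q, k, v: (seq_len, dim) for a flattened image; returns (seq_len, dim).

    Each query block of length `block` attends to keys in
    [block_start - memory, block_end), giving O(n * (block + memory)) cost.
    """
    seq_len, dim = q.shape
    out = torch.zeros_like(q)
    for start in range(0, seq_len, block):
        end = min(start + block, seq_len)
        ctx_start = max(0, start - memory)                     # preceding memory
        scores = q[start:end] @ k[ctx_start:end].T / dim ** 0.5
        out[start:end] = scores.softmax(dim=-1) @ v[ctx_start:end]
    return out

x = torch.randn(1024, 64)
out = block_local_attention(x, x, x)
```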

#### Sparse Transformer: Generating Long Sequences with Sparse Transformers [arXiv 2019/04]

[[paper]](https://arxiv.org/abs/1904.10509) [[code]](https://github.com/openai/sparse_attention) *OpenAI*

#### Axial Transformer: Axial Attention in Multidimensional Transformers [arXiv 2019/12]

[[paper]](https://arxiv.org/abs/1912.12180) [[code]](https://github.com/lucidrains/axial-attention) *UCB, Google Brain*

#### Fastformer: Additive Attention Can Be All You Need [arXiv 2021/08]

[[paper]](https://arxiv.org/abs/2108.09084) [[code]](https://github.com/wilile26811249/Fastformer-PyTorch) *Tsinghua University, MSRA*

---

### Image Transformers

#### ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Under review of ICLR 2021]

[[paper]](https://arxiv.org/abs/2010.11929) [[code1]](https://github.com/google-research/vision_transformer) [[code2]](https://github.com/lucidrains/vit-pytorch) *Google*

#### Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [Arxiv 2021/03]

[[paper]](https://arxiv.org/abs/2103.14030) [[code]](https://github.com/microsoft/Swin-Transformer) *MSRA*

### Unsupervised Pretraining of Vision Transformer

#### SiT: Self-supervised vIsion Transformer [arXiv 2021/04]

[[paper]](https://arxiv.org/abs/2104.03602) [[code]](https://github.com/Sara-Ahmed/SiT) *University of Surrey*

#### UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [CVPR 2021]

[[paper]](https://arxiv.org/abs/2011.09094) [[code]](https://github.com/dddzg/up-detr) *SCUT & Tencent Wechat AI*

#### BEIT: BERT Pre-Training of Image Transformers [arXiv 2021/06]

[[paper]](https://arxiv.org/pdf/2106.08254.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/beit) *Microsoft*

---

### Transformer GAN

#### TransGAN: Two Transformers Can Make One Strong GAN [Arxiv 2021/02]

[[paper]](https://arxiv.org/abs/2102.07074) [[code]](https://github.com/VITA-Group/TransGAN)

1. Architecture: Transformer-only

- Up-sampling in Generator: **pixelshuffle** module from "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network" (CVPR 2016); see the sketch after this list.

2. Tricks
- Data augmentation
- Super-resolution co-training
- Locality-Aware Initialization for Self-Attention
- Ablations

3. Results
- Scaling up the model
- Comparison with other models
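
A minimal sketch of the pixel-shuffle up-sampling step mentioned under Architecture: tokens are reshaped to a 2D feature map, `nn.PixelShuffle` trades channels for spatial resolution, and the map is flattened back into tokens. Shapes and names are illustrative, not TransGAN's exact code.

```python
import torch
import torch.nn as nn

def pixelshuffle_upsample(tokens, height, width, factor=2):
    """tokens: (batch, height*width, channels), channels divisible by factor**2.

    Returns (batch, (height*factor)*(width*factor), channels // factor**2).
    """
    b, n, c = tokens.shape
    fmap = tokens.transpose(1, 2).reshape(b, c, height, width)   # to (B, C, H, W)
    fmap = nn.PixelShuffle(factor)(fmap)                         # (B, C/f^2, H*f, W*f)
    return fmap.flatten(2).transpose(1, 2)                       # back to tokens

x = torch.randn(4, 8 * 8, 256)                   # 8x8 token grid, 256 channels
y = pixelshuffle_upsample(x, height=8, width=8)  # -> (4, 256, 64): 16x16 grid, 64 channels
```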

#### GANsformer: Generative Adversarial Transformers [Arxiv 2021/03]

[[paper]](https://arxiv.org/abs/2103.01209) [[code]](https://github.com/dorarad/gansformer) *Stanford* & *Facebook*

#### TFill: Image Completion via a Transformer-Based Architecture [Arxiv 2021/04]
[[paper]](https://arxiv.org/abs/2104.00845) [[code]](https://github.com/lyndonzheng/TFill) **Code will be released in July** *MIT*

---

### Transformer Visualizations

#### Transformer Interpretability Beyond Attention Visualization [Arxiv 2020/12]

[[paper]](https://arxiv.org/abs/2012.09838) [[code]](https://github.com/hila-chefer/Transformer-Explainability) *FAIR*

---

### Transformer Internal Essence

#### Pretrained Transformers as Universal Computation Engines [Arxiv 2021/03]

[[paper]](https://arxiv.org/abs/2103.05247) [[code]](https://github.com/kzl/universal-computation) *Google Brain*

---

### Survey

#### Pre-trained Models for Natural Language Processing: A Survey

arXiv 2020/03 [[paper]](https://arxiv.org/abs/2003.08271)

#### A Survey on Contextual Embeddings

arXiv 2020/03 [[paper]](https://arxiv.org/abs/2003.07278)

#### Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

arXiv 2019 [[paper]](https://arxiv.org/abs/1907.09358)

#### Deep Multimodal Representation Learning: A Survey

arXiv 2019 [[paper]](https://ieeexplore.ieee.org/abstract/document/8715409)

#### Multimodal Machine Learning: A Survey and Taxonomy

TPAMI 2018 [[paper]](https://arxiv.org/abs/1705.09406)