Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/dk-liang/Awesome-Visual-Transformer

Collect some papers about transformer with vision. Awesome Transformer with Computer Vision (CV)

List: Awesome-Visual-Transformer

detr transformer transformer-awesome transformer-cv transformer-with-cv visual-transformer

Last synced: about 2 months ago

# Awesome Visual-Transformer [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A collection of papers on Transformers for computer vision (CV).

If you find an overlooked paper, please open an issue or a pull request (preferred).

## Papers

### Transformer original paper

- [Attention is All You Need](https://arxiv.org/abs/1706.03762) (NIPS 2017)
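
For quick reference, the core of this paper is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A minimal sketch of that formula; shapes and names are illustrative, not tied to any repo listed here:

```python
# Minimal scaled dot-product attention, as defined in
# "Attention is All You Need". Illustrative shapes only.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    # scores = Q K^T / sqrt(d_k), one weight per query-key pair
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(1, 8, 196, 64)  # e.g. 196 patch tokens, 8 heads
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 196, 64])
```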

### Technical blog

- [English Blog] Transformers in Vision [[Link](https://davide-coccomini.medium.com/)]
- [Chinese Blog] A 30,000-word article to get you started with vision Transformers [[Link](https://zhuanlan.zhihu.com/p/308301901)]
- [Chinese Blog] Vision Transformer explained in detail (principle analysis + code walkthrough) [[Link](https://zhuanlan.zhihu.com/p/348593638)]
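
The introductory posts above all walk through the same patch-embedding pipeline: split the image into fixed-size patches, linearly project each patch, prepend a class token, and add positional embeddings. A minimal sketch, assuming ViT-Base-style sizes (224x224 input, 16x16 patches, width 768); the class name is hypothetical:

```python
# Minimal ViT-style patch embedding: patchify + project + [CLS] + positions.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # A strided conv is the usual trick for patchify + linear projection.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                            # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls.expand(x.size(0), -1, -1)     # one [CLS] per image
        return torch.cat([cls, x], dim=1) + self.pos # (B, 197, dim)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```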

### Survey
- Multimodal learning with transformers: A survey (IEEE TPAMI) [[paper](https://arxiv.org/abs/2206.06488)] - 2023.05.11
- A Survey of Visual Transformers [[paper](https://arxiv.org/abs/2111.06091)] - 2021.11.30
- Transformers in Vision: A Survey [[paper](https://arxiv.org/abs/2101.01169)] - 2021.02.22
- A Survey on Visual Transformer [[paper](https://arxiv.org/abs/2012.12556)] - 2021.01.30
- A Survey of Transformers [[paper](https://arxiv.org/abs/2106.04554)] - 2021.06.09

### arXiv papers
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [[paper](https://arxiv.org/abs/2305.04722)]
- **[FocusedDecoder]** Focused Decoding Enables 3D Anatomical Detection by Transformers [[paper](https://arxiv.org/abs/2207.10774v4)] [[code](https://github.com/bwittmann/transoar)]
- **[TAG]** TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [[paper](https://arxiv.org/abs/2208.01813)] [[code](https://github.com/HenryJunW/TAG)]
- **[FastMETRO]** Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [[paper](https://arxiv.org/abs/2207.13820)] [[code](https://github.com/postech-ami/FastMETRO)]
- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [[paper](https://arxiv.org/abs/2203.01522)] [[code](https://github.com/zhihou7/BatchFormer)]
- **[RelViT]** RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [[paper]](https://arxiv.org/pdf/2204.11167.pdf) [[code]](https://github.com/NVlabs/RelViT)
- **[MViTv2]** Improved Multiscale Vision Transformers for Classification and Detection [[paper](https://arxiv.org/pdf/2112.01526.pdf)] [[code](https://github.com/facebookresearch/mvit)]
- DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [[paper](https://arxiv.org/pdf/2203.03605.pdf)] [[code](https://github.com/IDEACVR/DINO)]
- Three things everyone should know about Vision Transformers [[paper](https://arxiv.org/pdf/2203.09795.pdf)]
- **[DeiT III]** DeiT III: Revenge of the ViT [[paper](https://arxiv.org/pdf/2204.07118.pdf)]
- **[DaViT]** DaViT: Dual Attention Vision Transformers [[paper](https://arxiv.org/pdf/2204.03645.pdf)] [[code](https://github.com/dingmyu/davit)]
- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](https://arxiv.org/abs/2203.16518)] [[code](https://github.com/jhcho99/CoFormer)]
- **[GSRTR]** Grounded Situation Recognition with Transformers [[paper](https://arxiv.org/abs/2111.10135)] [[code](https://github.com/jhcho99/gsrtr)]
- **[MaxViT]** MaxViT: Multi-Axis Vision Transformer [[paper]](https://arxiv.org/abs/2204.01697)
- **[V2X-ViT]** V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [[paper]](https://arxiv.org/abs/2203.10638)
- **[MemMC-MAE]** Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder [[paper](https://arxiv.org/abs/2203.11725)] [[code](https://github.com/tianyu0207/MemMC-MAE)]
- Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [[paper](https://arxiv.org/abs/2203.12121)] [[code](https://github.com/tianyu0207/weakly-polyp)]
- **[VideoMAE]** VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [[paper](https://arxiv.org/abs/2203.12602)] [[code](https://github.com/MCG-NJU/VideoMAE)]
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [[paper](https://arxiv.org/pdf/2111.12710.pdf)]
- ResViT: Residual vision transformers for multi-modal medical image synthesis [[paper](https://arxiv.org/abs/2106.16031)]
- **[CrossEfficientViT]** Combining EfficientNet and Vision Transformers for Video Deepfake Detection [[paper](https://arxiv.org/abs/2107.02612)] [[code](https://github.com/davide-coccomini/Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection)]
- **[Discrete ViT]** Discrete Representations Strengthen Vision Transformer Robustness [[paper](https://arxiv.org/abs/2111.10493)]
- **[StyleSwin]** StyleSwin: Transformer-based GAN for High-resolution Image Generation [[paper](https://arxiv.org/abs/2112.10762)] [[code](https://github.com/microsoft/StyleSwin)]
- **[SReT]** Sliced Recursive Transformer [[paper](https://arxiv.org/abs/2111.05297)] [[code](https://github.com/szq0214/SReT)]
- Dynamic Token Normalization Improves Vision Transformer [[paper](https://arxiv.org/abs/2112.02624)]
- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [[paper](https://arxiv.org/abs/2106.11297)] [[code](https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner)]
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [[paper](https://arxiv.org/abs/2111.08413)]
- **[ORViT]** Object-Region Video Transformers [[paper](https://arxiv.org/abs/2110.06915)] [[code](https://roeiherz.github.io/ORViT/)]
- Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [[paper](https://arxiv.org/abs/2110.05092)] [[code](https://github.com/lelexx/MTF-Transformer)]
- **[NViT]** NViT: Vision Transformer Compression and Parameter Redistribution [[paper](https://arxiv.org/abs/2110.04869)]
- 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [[paper](https://arxiv.org/abs/2110.04792)]
- Adversarial Token Attacks on Vision Transformers [[paper](https://arxiv.org/abs/2110.04337)]
- Contextual Transformer Networks for Visual Recognition [[paper](https://arxiv.org/pdf/2107.12292.pdf)] [[code](https://github.com/JDAI-CV/CoTNet)]
- **[TranSalNet]** TranSalNet: Visual saliency prediction using transformers [[paper](https://arxiv.org/abs/2110.03593)]
- **[MobileViT]** MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [[paper](https://arxiv.org/abs/2110.02178)]
- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [[paper](https://arxiv.org/abs/2110.01240)]
- **[3D-Transformer]** 3D-Transformer: Molecular Representation with Transformer in 3D Space [[paper](https://arxiv.org/abs/2110.01191)]
- **[CCTrans]** CCTrans: Simplifying and Improving Crowd Counting with Transformer [[paper](https://arxiv.org/abs/2109.14483)]
- **[UFO-ViT]** UFO-ViT: High Performance Linear Vision Transformer without Softmax [[paper](https://arxiv.org/abs/2109.14382)]
- Sparse Spatial Transformers for Few-Shot Learning [[paper](https://arxiv.org/abs/2109.12932)]
- Vision Transformer Hashing for Image Retrieval [[paper](https://arxiv.org/abs/2109.12564)]
- **[OH-Former]** OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [[paper](https://arxiv.org/abs/2109.11159)]
- **[Pix2seq]** Pix2seq: A Language Modeling Framework for Object Detection [[paper](https://arxiv.org/abs/2109.10852)]
- **[CoAtNet]** CoAtNet: Marrying Convolution and Attention for All Data Sizes [[paper](https://arxiv.org/pdf/2106.04803.pdf)]
- **[LOTR]** LOTR: Face Landmark Localization Using Localization Transformer [[paper](https://arxiv.org/abs/2109.10057)]
- Transformer-Unet: Raw Image Processing with Unet [[paper](https://arxiv.org/abs/2109.08417)]
- **[GraFormer]** GraFormer: Graph Convolution Transformer for 3D Pose Estimation [[paper](https://arxiv.org/abs/2109.08364)]
- **[CDTrans]** CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [[paper](https://arxiv.org/abs/2109.06165)]
- PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [[paper](https://arxiv.org/abs/2109.05566)] [[code](https://github.com/OPEN-AIR-SUN/PQ-Transformer)]
- Anchor DETR: Query Design for Transformer-Based Detector [[paper](https://arxiv.org/abs/2109.07107)] [[code](https://github.com/megvii-model/AnchorDETR)]
- **[DAB-DETR]** DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [[paper](https://arxiv.org/abs/2201.12329)] [[code](https://github.com/IDEA-opensource/DAB-DETR)]
- **[ESRT]** Efficient Transformer for Single Image Super-Resolution [[paper](https://arxiv.org/abs/2108.11084)]
- **[MaskFormer]** MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [[paper](http://arxiv.org/abs/2107.06278)] [[code](https://github.com/facebookresearch/MaskFormer)]
- **[SwinIR]** SwinIR: Image Restoration Using Swin Transformer [[paper](https://arxiv.org/abs/2108.10257)] [[code](https://github.com/JingyunLiang/SwinIR)]
- **[Trans4Trans]** Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [[paper](https://arxiv.org/abs/2108.09174)]
- Do Vision Transformers See Like Convolutional Neural Networks? [[paper](https://arxiv.org/abs/2108.08810)]
- Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [[paper](https://arxiv.org/abs/2108.07851)]
- Light Field Image Super-Resolution with Transformers [[paper](https://arxiv.org/abs/2108.07597)] [[code](https://github.com/ZhengyuLiang24/LFT)]
- Focal Self-attention for Local-Global Interactions in Vision Transformers [[paper](https://arxiv.org/abs/2107.00641)] [[code](https://github.com/microsoft/Focal-Transformer)]
- Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [[paper](https://arxiv.org/abs/2108.06932)] [[code](https://github.com/DengPingFan/Polyp-PVT)]
- Mobile-Former: Bridging MobileNet and Transformer [[paper](https://arxiv.org/abs/2108.05895)]
- **[TriTransNet]** TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [[paper](https://arxiv.org/abs/2108.03798)]
- **[PSViT]** PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [[paper](https://arxiv.org/abs/2108.03428)]
- Boosting Few-shot Semantic Segmentation with Transformers [[paper](https://arxiv.org/abs/2108.02266)] [[code](https://github.com/GuoleiSun/TRFS)]
- Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [[paper](https://arxiv.org/abs/2108.00584)]
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [[paper](https://arxiv.org/abs/2108.01390)]
- **[Styleformer]** Styleformer: Transformer based Generative Adversarial Networks with Style Vector [[paper](https://arxiv.org/abs/2106.07023)] [[code](https://github.com/Jeeseung-Park/Styleformer)]
- **[CMT]** CMT: Convolutional Neural Networks Meet Vision Transformers [[paper](https://arxiv.org/abs/2107.06263)]
- **[TransAttUnet]** TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2107.05274)]
- TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [[paper](https://arxiv.org/abs/2107.05188)]
- **[ViTGAN]** ViTGAN: Training GANs with Vision Transformers [[paper](https://arxiv.org/abs/2107.04589)]
- What Makes for Hierarchical Vision Transformer? [[paper](https://arxiv.org/abs/2107.02174)]
- **[Trans4Trans]** Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [[paper](https://arxiv.org/abs/2107.03172)]
- **[FFVT]** Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [[paper](https://arxiv.org/abs/2107.02341)]
- **[TransformerFusion]** TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [[paper](https://arxiv.org/abs/2107.02191)]
- Escaping the Big Data Paradigm with Compact Transformers [[paper](https://arxiv.org/pdf/2104.05704.pdf)]
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [[paper](https://arxiv.org/pdf/2105.02358.pdf)]
- **[XCiT]** XCiT: Cross-Covariance Image Transformers [[paper](https://arxiv.org/pdf/2106.09681.pdf)] [[code](https://github.com/facebookresearch/xcit)]
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [[paper](https://arxiv.org/abs/2106.03650)] [[code](https://github.com/mulinmeng/Shuffle-Transformer)]
- Video Swin Transformer [[paper](https://arxiv.org/abs/2106.13230)] [[code](https://github.com/SwinTransformer/Video-Swin-Transformer)]
- **[VOLO]** VOLO: Vision Outlooker for Visual Recognition [[paper](https://arxiv.org/abs/2106.13112)] [[code](https://github.com/sail-sg/volo)]
- Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [[paper](https://arxiv.org/abs/2106.12413)]
- End-to-end Temporal Action Detection with Transformer [[paper](https://arxiv.org/abs/2106.10271)] [[code](https://github.com/xlliu7/TadTR)]
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [[paper](https://arxiv.org/abs/2106.10270)]
- Efficient Self-supervised Vision Transformers for Representation Learning [[paper](https://arxiv.org/abs/2106.09785)]
- Space-time Mixing Attention for Video Transformer [[paper](https://arxiv.org/abs/2106.05968)]
- Transformed CNNs: recasting pre-trained convolutional layers with self-attention [[paper](https://arxiv.org/abs/2106.05795)]
- **[CAT]** CAT: Cross Attention in Vision Transformer [[paper](https://arxiv.org/abs/2106.05786)]
- Scaling Vision Transformers [[paper](https://arxiv.org/abs/2106.04560)]
- **[DETReg]** DETReg: Unsupervised Pretraining with Region Priors for Object Detection [[paper](https://arxiv.org/abs/2106.04550)] [[code](https://amirbar.net/detreg)]
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [[paper](https://arxiv.org/abs/2106.04533)]
- **[MViT]** MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [[paper](https://arxiv.org/abs/2106.04520)]
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [[paper](https://arxiv.org/abs/2106.04263)]
- On Improving Adversarial Transferability of Vision Transformers [[paper](https://arxiv.org/abs/2106.04169)]
- Fully Transformer Networks for Semantic Image Segmentation [[paper](https://arxiv.org/abs/2106.04108)]
- Visual Transformer for Task-aware Active Learning [[paper](https://arxiv.org/abs/2106.03801)] [[code](https://github.com/razvancaramalau/Visual-Transformer-for-Task-aware-Active-Learning)]
- Efficient Training of Visual Transformers with Small-Size Datasets [[paper](https://arxiv.org/abs/2106.03746)]
- Reveal of Vision Transformers Robustness against Adversarial Attacks [[paper](https://arxiv.org/abs/2106.03734)]
- Person Re-Identification with a Locally Aware Transformer [[paper](https://arxiv.org/abs/2106.03720)]
- **[Refiner]** Refiner: Refining Self-attention for Vision Transformers [[paper](https://arxiv.org/abs/2106.03714)]
- **[ViTAE]** ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [[paper](https://arxiv.org/abs/2106.03348)]
- Video Instance Segmentation using Inter-Frame Communication Transformers [[paper](https://arxiv.org/abs/2106.03299)]
- Transformer in Convolutional Neural Networks [[paper](https://arxiv.org/abs/2106.03180)] [[code](https://github.com/yun-liu/TransCNN)]
- **[Uformer]** Uformer: A General U-Shaped Transformer for Image Restoration [[paper](https://arxiv.org/abs/2106.03106)] [[code](https://github.com/ZhendongWang6/Uformer)]
- Patch Slimming for Efficient Vision Transformers [[paper](https://arxiv.org/abs/2106.02852)]
- **[RegionViT]** RegionViT: Regional-to-Local Attention for Vision Transformers [[paper](https://arxiv.org/abs/2106.02689)]
- Associating Objects with Transformers for Video Object Segmentation [[paper](https://arxiv.org/abs/2106.02638)] [[code](https://github.com/z-x-yang/AOT)]
- Few-Shot Segmentation via Cycle-Consistent Transformer [[paper](https://arxiv.org/abs/2106.02320)]
- Glance-and-Gaze Vision Transformer [[paper](https://arxiv.org/abs/2106.02277)] [[code](https://github.com/yucornetto/GG-Transformer)]
- Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [[paper](https://arxiv.org/pdf/2105.08059.pdf)]
- **[DynamicViT]** DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [[paper](https://arxiv.org/abs/2106.02034)] [[code](https://dynamicvit.ivg-research.xyz/)]
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [[paper](https://arxiv.org/abs/2106.01548)]
- Unsupervised Out-of-Domain Detection via Pre-trained Transformers [[paper](https://arxiv.org/abs/2106.00948)]
- **[TransMIL]** TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [[paper](https://arxiv.org/abs/2106.00908)]
- **[TransVOS]** TransVOS: Video Object Segmentation with Transformers [[paper](https://arxiv.org/abs/2106.00588)]
- **[KVT]** KVT: k-NN Attention for Boosting Vision Transformers [[paper](https://arxiv.org/abs/2106.00515)]
- **[MSG-Transformer]** MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [[paper](https://arxiv.org/abs/2105.15168)] [[code](https://github.com/hustvl/MSG-Transformer)]
- **[SegFormer]** SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [[paper](https://arxiv.org/abs/2105.15203)] [[code](https://github.com/NVlabs/SegFormer)]
- **[SDNet]** SDNet: mutil-branch for single image deraining using swin [[paper](https://arxiv.org/abs/2105.15077)] [[code](https://github.com/H-tfx/SDNet)]
- **[DVT]** Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [[paper](https://arxiv.org/abs/2105.15075)]
- **[GazeTR]** Gaze Estimation using Transformer [[paper](https://arxiv.org/abs/2105.14424)] [[code](https://github.com/yihuacheng/GazeTR)]
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [[paper](https://arxiv.org/abs/2105.14432)]
- Less is More: Pay Less Attention in Vision Transformers [[paper](https://arxiv.org/abs/2105.14217)]
- **[FoveaTer]** FoveaTer: Foveated Transformer for Image Classification [[paper](https://arxiv.org/abs/2105.14173)]
- **[TransDA]** Transformer-Based Source-Free Domain Adaptation [[paper](https://arxiv.org/abs/2105.14138)] [[code](https://github.com/ygjwd12345/TransDA)]
- An Attention Free Transformer [[paper](https://arxiv.org/abs/2105.14103)]
- **[PTNet]** PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [[paper](https://arxiv.org/abs/2105.13993)]
- **[ResT]** ResT: An Efficient Transformer for Visual Recognition [[paper](https://arxiv.org/abs/2105.13677)] [[code](https://github.com/wofmanaf/ResT)]
- **[CogView]** CogView: Mastering Text-to-Image Generation via Transformers [[paper](https://arxiv.org/abs/2105.13290)]
- **[NesT]** Aggregating Nested Transformers [[paper](https://arxiv.org/abs/2105.12723)]
- **[TAPG]** Temporal Action Proposal Generation with Transformers [[paper](https://arxiv.org/abs/2105.12043)]
- Boosting Crowd Counting with Transformers [[paper](https://arxiv.org/abs/2105.10926)]
- **[COTR]** COTR: Convolution in Transformer Network for End to End Polyp Detection [[paper](https://arxiv.org/abs/2105.10925)]
- **[TransVOD]** End-to-End Video Object Detection with Spatial-Temporal Transformers [[paper](https://arxiv.org/abs/2105.10920)] [[code](https://github.com/SJTU-LuHe/TransVOD)]
- Intriguing Properties of Vision Transformers [[paper](https://arxiv.org/abs/2105.10497)] [[code](https://git.io/Js15X)]
- Combining Transformer Generators with Convolutional Discriminators [[paper](https://arxiv.org/abs/2105.10189)]
- Rethinking the Design Principles of Robust Vision Transformer [[paper](https://arxiv.org/abs/2105.07926)]
- Vision Transformers are Robust Learners [[paper](https://arxiv.org/abs/2105.07581)] [[code](https://git.io/J3VO0)]
- Manipulation Detection in Satellite Images Using Vision Transformer [[paper](https://arxiv.org/abs/2105.06373)]
- **[Swin-Unet]** Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2105.05537)] [[code](https://github.com/HuCaoFighting/Swin-Unet)]
- Self-Supervised Learning with Swin Transformers [[paper](https://arxiv.org/abs/2105.04553)] [[code](https://github.com/SwinTransformer/Transformer-SSL)]
- **[SCTN]** SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [[paper](https://arxiv.org/abs/2105.04447)]
- **[RelationTrack]** RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [[paper](https://arxiv.org/abs/2105.04322)]
- **[VGTR]** Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2105.04281)]
- **[PST]** Visual Composite Set Detection Using Part-and-Sum Transformers [[paper](https://arxiv.org/abs/2105.02170)]
- **[TrTr]** TrTr: Visual Tracking with Transformer [[paper](https://arxiv.org/abs/2105.03817)] [[code](https://github.com/tongtybj/TrTr)]
- **[MOTR]** MOTR: End-to-End Multiple-Object Tracking with TRansformer [[paper](https://arxiv.org/abs/2105.03247)] [[code](https://github.com/megvii-model/MOTR)]
- Attention for Image Registration (AiR): an unsupervised Transformer approach [[paper](https://arxiv.org/abs/2105.02282)]
- **[TransHash]** TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [[paper](https://arxiv.org/abs/2105.01823)]
- **[ISTR]** ISTR: End-to-End Instance Segmentation with Transformers [[paper](https://arxiv.org/abs/2105.00637)] [[code](https://github.com/hujiecpp/ISTR)]
- **[CAT]** CAT: Cross-Attention Transformer for One-Shot Object Detection [[paper](https://arxiv.org/abs/2104.14984)]
- **[CoSformer]** CoSformer: Detecting Co-Salient Object with Transformers [[paper](https://arxiv.org/abs/2104.14729)]
- End-to-End Attention-based Image Captioning [[paper](https://arxiv.org/abs/2104.14721)]
- **[PMTrans]** Pyramid Medical Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2104.14702)]
- **[HandsFormer]** HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction [[paper](https://arxiv.org/abs/2104.14639)]
- **[GasHis-Transformer]** GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [[paper](https://arxiv.org/abs/2104.14528)]
- Emerging Properties in Self-Supervised Vision Transformers [[paper](https://arxiv.org/abs/2104.14294)]
- **[InTra]** Inpainting Transformer for Anomaly Detection [[paper](https://arxiv.org/abs/2104.13897)]
- **[Twins]** Twins: Revisiting Spatial Attention Design in Vision Transformers [[paper](https://arxiv.org/abs/2104.13840)] [[code](https://github.com/Meituan-AutoML/Twins)]
- **[MLMSPT]** Point Cloud Learning with Transformer [[paper](https://arxiv.org/abs/2104.13636)]
- Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [[paper](https://arxiv.org/abs/2104.13633)]
- **[ConTNet]** ConTNet: Why not use convolution and transformer at the same time? [[paper](https://arxiv.org/abs/2104.13497)] [[code](https://github.com/yan-hao-tian/ConTNet)]
- **[DTNet]** Dual Transformer for Point Cloud Analysis [[paper](https://arxiv.org/abs/2104.13044)]
- Improve Vision Transformers Training by Suppressing Over-smoothing [[paper](https://arxiv.org/abs/2104.12753)] [[code](https://github.com/ChengyueGongR/PatchVisionTransformer)]
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [[paper](https://arxiv.org/abs/2104.12137)]
- **[M3DeTR]** M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [[paper](https://arxiv.org/abs/2104.11896)] [[code](https://github.com/rayguan97/M3DeTR)]
- **[Skeletor]** Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [[paper](https://arxiv.org/abs/2104.11712)]
- **[FaceT]** Learning to Cluster Faces via Transformer [[paper](https://arxiv.org/abs/2104.11502)]
- **[MViT]** Multiscale Vision Transformers [[paper](https://arxiv.org/abs/2104.11227)] [[code](https://github.com/facebookresearch/SlowFast)]
- **[VATT]** VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [[paper](https://arxiv.org/abs/2104.11178)]
- **[So-ViT]** So-ViT: Mind Visual Tokens for Vision Transformer [[paper](https://arxiv.org/abs/2104.10935)] [[code](https://github.com/jiangtaoxie/So-ViT)]
- Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [[paper](https://arxiv.org/abs/2104.10858)] [[code](https://github.com/zihangJiang/TokenLabeling)]
- **[TransRPPG]** TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [[paper](https://arxiv.org/abs/2104.07419)]
- **[VideoGPT]** VideoGPT: Video Generation using VQ-VAE and Transformers [[paper](https://arxiv.org/abs/2104.10157)]
- **[M2TR]** M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [[paper](https://arxiv.org/abs/2104.09770)]
- Transformer Transforms Salient Object Detection and Camouflaged Object Detection [[paper](https://arxiv.org/abs/2104.10127)]
- **[TransCrowd]** TransCrowd: Weakly-Supervised Crowd Counting with Transformer [[paper](https://arxiv.org/abs/2104.09116)] [[code](https://github.com/dk-liang/TransCrowd)]
- Visual Transformer Pruning [[paper](https://arxiv.org/abs/2104.08500)]
- Self-supervised Video Retrieval Transformer Network [[paper](https://arxiv.org/abs/2104.07993)]
- Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [[paper](https://arxiv.org/abs/2104.07235)]
- **[TransGAN]** TransGAN: Two Transformers Can Make One Strong GAN [[paper](https://arxiv.org/abs/2102.07074)] [[code](https://github.com/VITA-Group/TransGAN)]
- Geometry-Free View Synthesis: Transformers and no 3D Priors [[paper](https://arxiv.org/abs/2104.07652)] [[code](https://git.io/JOnwn)]
- **[CoaT]** Co-Scale Conv-Attentional Image Transformers [[paper](https://arxiv.org/abs/2104.06399)] [[code](https://github.com/mlpc-ucsd/CoaT)]
- **[LocalViT]** LocalViT: Bringing Locality to Vision Transformers [[paper](https://arxiv.org/abs/2104.05707)] [[code](https://github.com/ofsoundof/LocalViT)]
- **[CIT]** Cloth Interactive Transformer for Virtual Try-On [[paper](https://arxiv.org/abs/2104.05519)]
- Handwriting Transformers [[paper](https://arxiv.org/abs/2104.03964)]
- **[SiT]** SiT: Self-supervised vIsion Transformer [[paper](https://arxiv.org/abs/2104.03602)] [[code](https://github.com/Sara-Ahmed/SiT)]
- On the Robustness of Vision Transformers to Adversarial Examples [[paper](https://arxiv.org/abs/2104.02610)]
- An Empirical Study of Training Self-Supervised Visual Transformers [[paper](https://arxiv.org/abs/2104.02057)]
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2104.01745)]
- **[AOT-GAN]** Aggregated Contextual Transformations for High-Resolution Image Inpainting [[paper](https://arxiv.org/abs/2104.01431)] [[code](https://github.com/researchmm/AOT-GAN-for-Inpainting)]
- Deepfake Detection Scheme Based on Vision Transformer and Distillation [[paper](https://arxiv.org/abs/2104.01353)]
- **[ATAG]** Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [[paper](https://arxiv.org/pdf/2103.16024)]
- **[TubeR]** TubeR: Tube-Transformer for Action Detection [[paper](https://arxiv.org/abs/2104.00969)]
- **[AAformer]** AAformer: Auto-Aligned Transformer for Person Re-Identification [[paper](https://arxiv.org/abs/2104.00921)]
- **[TFill]** TFill: Image Completion via a Transformer-Based Architecture [[paper](https://arxiv.org/abs/2104.00845)]
- Group-Free 3D Object Detection via Transformers [[paper](https://arxiv.org/abs/2104.00678)] [[code](https://github.com/zeliu98/Group-Free-3D)]
- **[STGT]** Spatial-Temporal Graph Transformer for Multiple Object Tracking [[paper](https://arxiv.org/abs/2104.00194)]
- Going deeper with Image Transformers [[paper](https://arxiv.org/abs/2103.17239)]
- **[Meta-DETR]** Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [[paper](https://arxiv.org/abs/2103.11731)] [[code](https://github.com/ZhangGongjie/Meta-DETR)]
- **[DA-DETR]** DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [[paper](https://arxiv.org/abs/2103.17084)]
- Robust Facial Expression Recognition with Convolutional Visual Transformers [[paper](https://arxiv.org/abs/2103.16854)]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](https://arxiv.org/abs/2103.16553)]
- Spatiotemporal Transformer for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2103.16469)]
- **[TransUNet]** TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.04306)] [[code](https://github.com/Beckschen/TransUNet)]
- **[CvT]** CvT: Introducing Convolutions to Vision Transformers [[paper](https://arxiv.org/abs/2103.15808)] [[code](https://github.com/leoxiaobin/CvT)]
- **[TFPose]** TFPose: Direct Human Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.15320)]
- **[TransCenter]** TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [[paper](https://arxiv.org/abs/2103.15145)]
- Face Transformer for Recognition [[paper](https://arxiv.org/abs/2103.14803)]
- On the Adversarial Robustness of Visual Transformers [[paper](https://arxiv.org/abs/2103.15670)]
- Understanding Robustness of Transformers for Image Classification [[paper](https://arxiv.org/abs/2103.14586)]
- Lifting Transformer for 3D Human Pose Estimation in Video [[paper](https://arxiv.org/abs/2103.14304)]
- **[GSA-Net]** Global Self-Attention Networks for Image Recognition [[paper](https://arxiv.org/abs/2010.03019)]
- High-Fidelity Pluralistic Image Completion with Transformers [[paper](https://arxiv.org/abs/2103.14031)] [[code](http://raywzy.com/ICT)]
- **[DPT]** Vision Transformers for Dense Prediction [[paper](https://arxiv.org/abs/2103.13413)] [[code](https://github.com/intel-isl/DPT)]
- **[TransFG]** TransFG: A Transformer Architecture for Fine-grained Recognition [[paper](https://arxiv.org/abs/2103.07976)]
- **[TimeSformer]** Is Space-Time Attention All You Need for Video Understanding? [[paper](https://arxiv.org/abs/2102.05095)]
- Multi-view 3D Reconstruction with Transformer [[paper](https://arxiv.org/abs/2103.12957)]
- Can Vision Transformers Learn without Natural Images? [[paper](https://arxiv.org/abs/2103.13023)] [[code](https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/)]
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.12115)]
- Instance-level Image Retrieval using Reranking Transformers [[paper](https://arxiv.org/abs/2103.12236)]
- **[BossNAS]** BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [[paper](https://arxiv.org/abs/2103.12424)] [[code](https://github.com/changlin31/BossNAS)]
- **[CeiT]** Incorporating Convolution Designs into Visual Transformers [[paper](https://arxiv.org/abs/2103.11816)]
- **[DeepViT]** DeepViT: Towards Deeper Vision Transformer [[paper](https://arxiv.org/abs/2103.11886)]
- Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [[paper](https://arxiv.org/abs/2103.10043)]
- 3D Human Pose Estimation with Spatial and Temporal Transformers [[paper](https://arxiv.org/abs/2103.10455)] [[code](https://github.com/zczcwh/PoseFormer)]
- **[UNETR]** UNETR: Transformers for 3D Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.10504)]
- Scalable Visual Transformers with Hierarchical Pooling [[paper](https://arxiv.org/abs/2103.10619)]
- **[ConViT]** ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [[paper](https://arxiv.org/abs/2103.10697)]
- **[TransMed]** TransMed: Transformers Advance Multi-modal Medical Image Classification [[paper](https://arxiv.org/abs/2103.05940)]
- **[U-Transformer]** U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.06104)]
- **[SpecTr]** SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [[paper](https://arxiv.org/abs/2103.03604)] [[code](https://github.com/hfut-xc-yun/SpecTr)]
- **[TransBTS]** TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [[paper](https://arxiv.org/abs/2103.04430)] [[code](https://github.com/Wenxuan-1119/TransBTS)]
- **[SSTN]** SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [[paper](https://arxiv.org/abs/2103.03150)]
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [[paper](https://arxiv.org/abs/2102.10772)] [[code](https://mmf.sh/)]
- **[CPVT]** Do We Really Need Explicit Position Encodings for Vision Transformers? [[paper](https://arxiv.org/abs/2102.10882)] [[code](https://github.com/Meituan-AutoML/CPVT)]
- Deepfake Video Detection Using Convolutional Vision Transformer [[paper](https://arxiv.org/abs/2102.11126)]
- Training Vision Transformers for Image Retrieval [[paper](https://arxiv.org/abs/2102.05644)]
- **[VTN]** Video Transformer Network [[paper](https://arxiv.org/abs/2102.00719)]
- **[BoTNet]** Bottleneck Transformers for Visual Recognition [[paper](https://arxiv.org/abs/2101.11605)]
- **[CPTR]** CPTR: Full Transformer Network for Image Captioning [[paper](https://arxiv.org/abs/2101.10804)]
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [[paper](https://arxiv.org/abs/2101.08779)] [[code](https://google.github.io/aichoreographer/)]
- **[Trans2Seg]** Segmenting Transparent Object in the Wild with Transformer [[paper](https://arxiv.org/abs/2101.08461)] [[code](https://github.com/xieenze/Trans2Seg)]
- Investigating the Vision Transformer Model for Image Retrieval Tasks [[paper](https://arxiv.org/abs/2101.03771)]
- **[Trear]** Trear: Transformer-based RGB-D Egocentric Action Recognition [[paper](https://arxiv.org/abs/2101.03904)]
- **[VisualSparta]** VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [[paper](https://arxiv.org/abs/2101.00265)]
- **[TrackFormer]** TrackFormer: Multi-Object Tracking with Transformers [[paper](https://arxiv.org/abs/2101.02702)]
- **[TAPE]** Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [[paper](https://arxiv.org/abs/2101.02143)]
- **[TRIQ]** Transformer for Image Quality Assessment [[paper](https://arxiv.org/abs/2101.01097)] [[code](https://github.com/junyongyou/triq)]
- **[TransTrack]** TransTrack: Multiple-Object Tracking with Transformer [[paper](https://arxiv.org/abs/2012.15460)] [[code](https://github.com/PeizeSun/TransTrack)]
- **[DeiT]** Training data-efficient image transformers & distillation through attention [[paper](https://arxiv.org/abs/2012.12877)] [[code](https://github.com/facebookresearch/deit)]
- **[Pointformer]** 3D Object Detection with Pointformer [[paper](https://arxiv.org/abs/2012.11409)]
- **[ViT-FRCNN]** Toward Transformer-Based Object Detection [[paper](https://arxiv.org/abs/2012.09958)]
- **[Taming-transformers]** Taming Transformers for High-Resolution Image Synthesis [[paper](https://arxiv.org/abs/2012.09841)] [[code](https://compvis.github.io/taming-transformers/)]
- **[SceneFormer]** SceneFormer: Indoor Scene Generation with Transformers [[paper](https://arxiv.org/abs/2012.09793)]
- **[PCT]** PCT: Point Cloud Transformer [[paper](https://arxiv.org/abs/2012.09688)]
- **[PED]** DETR for Pedestrian Detection [[paper](https://arxiv.org/abs/2012.06785)]
- **[C-Tran]** General Multi-label Image Classification with Transformers [[paper](https://arxiv.org/abs/2011.14027)]
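
Many entries above are DETR variants (Anchor DETR, DAB-DETR, Meta-DETR, DETR for Pedestrian Detection). They share DETR's set-prediction decoder: a fixed set of learned object queries cross-attends to encoded image features, and each query is read out as one class plus one box. A hedged sketch of that decoder head, with illustrative sizes (100 queries, 91 classes as in COCO), not any specific paper's implementation:

```python
# DETR-style decoding: learned queries attend to image features,
# then linear heads predict a class and a normalized box per query.
import torch
import torch.nn as nn

class TinyDETRHead(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=91):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                # (cx, cy, w, h)

    def forward(self, memory):        # memory: (B, HW, dim) encoder output
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)  # queries cross-attend to the features
        return self.cls_head(hs), self.box_head(hs).sigmoid()

logits, boxes = TinyDETRHead()(torch.randn(2, 196, 256))  # 14x14 feature map
print(logits.shape, boxes.shape)  # (2, 100, 92) (2, 100, 4)
```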

### 2022

**TPAMI**

- **[P2T]** P2T: Pyramid Pooling Transformer for Scene Understanding [[paper](https://ieeexplore.ieee.org/document/9870559)]

**ECCV**

- **[X-CLIP]** Expanding Language-Image Pretrained Models for General Video Recognition [[paper](https://arxiv.org/abs/2208.02816)] [[code](https://aka.ms/X-CLIP)]
- **[TinyViT]** TinyViT: Fast Pretraining Distillation for Small Vision Transformers [[paper](https://arxiv.org/abs/2207.10666)] [[code](https://github.com/microsoft/Cream/tree/main/TinyViT)]
- **[FastMETRO]** Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [[paper](https://arxiv.org/abs/2207.13820)] [[code](https://github.com/postech-ami/FastMETRO)]
- **[AiATrack]** AiATrack: Attention in Attention for Transformer Visual Tracking [[paper](https://arxiv.org/abs/2207.09603)] [[code](https://github.com/Little-Podi/AiATrack)]
- **[OSTrack]** Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [[paper](https://arxiv.org/abs/2203.11991)] [[code](https://github.com/botaoye/OSTrack)]
- **[Unicorn]** Towards Grand Unification of Object Tracking [[paper](https://arxiv.org/abs/2207.07078)] [[code](https://github.com/MasterBin-IIAU/Unicorn)]
- **[P3AFormer]** Tracking Objects as Pixel-wise Distributions [[paper](https://arxiv.org/abs/2207.05518)] [[code](https://github.com/dvlab-research/ECCV22-P3AFormer-Tracking-Objects-as-Pixel-wise-Distributions)]

**CVPR**
- **[MAE]** Masked Autoencoders Are Scalable Vision Learners [[paper](https://arxiv.org/abs/2111.06377)] [[code](https://github.com/facebookresearch/mae)]
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [[paper](https://arxiv.org/abs/2107.00652)] [[code](https://github.com/microsoft/CSWin-Transformer)]
- Fast Point Transformer [[paper](https://arxiv.org/abs/2112.04702)]
- EDTER: Edge Detection With Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Pu_EDTER_Edge_Detection_With_Transformer_CVPR_2022_paper.html)] [[code](https://github.com/MengyangPu/EDTER)]
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Bridged_Transformer_for_Vision_and_Point_Cloud_3D_Object_Detection_CVPR_2022_paper.html)]
- MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Xie_MNSRNet_Multimodal_Transformer_Network_for_3D_Surface_Super-Resolution_CVPR_2022_paper.html)]
- HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Bandara_HyperTransformer_A_Textural_and_Spectral_Feature_Fusion_Transformer_for_Pansharpening_CVPR_2022_paper.html)] [[code](https://github.com/wgcban/HyperTransformer)]
- Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Hampali_Keypoint_Transformer_Solving_Joint_Identification_in_Challenging_Hands_and_Object_CVPR_2022_paper.html)]
- MPViT: Multi-Path Vision Transformer for Dense Prediction [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Lee_MPViT_Multi-Path_Vision_Transformer_for_Dense_Prediction_CVPR_2022_paper.html)] [[code](https://github.com/youngwanLEE/MPViT)]
- A-ViT: Adaptive Tokens for Efficient Vision Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Yin_A-ViT_Adaptive_Tokens_for_Efficient_Vision_Transformer_CVPR_2022_paper.html)]
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zhang_TopFormer_Token_Pyramid_Transformer_for_Mobile_Semantic_Segmentation_CVPR_2022_paper.html)] [[code](https://github.com/hustvl/TopFormer)]
- Continual Learning With Lifelong Vision Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Continual_Learning_With_Lifelong_Vision_Transformer_CVPR_2022_paper.html)]
- Swin Transformer V2: Scaling Up Capacity and Resolution [[paper](https://arxiv.org/abs/2111.09883)] [[code](https://github.com/microsoft/Swin-Transformer)]
- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds [[paper](https://arxiv.org/abs/2203.10314)] [[code](https://github.com/skyhehe123/VoxSeT)]
- Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [[paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Xu_Multi-Class_Token_Transformer_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2022_paper.pdf)]
- Human-Object Interaction Detection via Disentangled Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zhou_Human-Object_Interaction_Detection_via_Disentangled_Transformer_CVPR_2022_paper.html)]
- LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Jiang_LGT-Net_Indoor_Panoramic_Room_Layout_Estimation_With_Geometry-Aware_Transformer_Network_CVPR_2022_paper.html)]
- Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Xia_Sparse_Local_Patch_Transformer_for_Robust_Face_Alignment_and_Landmarks_CVPR_2022_paper.html)]
- Vision Transformer With Deformable Attention [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Xia_Vision_Transformer_With_Deformable_Attention_CVPR_2022_paper.html)]
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [[paper](https://arxiv.org/abs/2204.12997)]
- **[Restormer]** Restormer: Efficient Transformer for High-Resolution Image Restoration [[paper](https://arxiv.org/abs/2111.09881)] [[code](https://github.com/swz30/Restormer)]
- **[SAM-DETR]** Accelerating DETR Convergence via Semantic-Aligned Matching [[paper](https://arxiv.org/abs/2203.06883)] [[code](https://github.com/ZhangGongjie/SAM-DETR)]
- **[BEVT]** BEVT: BERT Pretraining of Video Transformers [[paper](https://arxiv.org/pdf/2112.01529.pdf)] [[code](https://github.com/xyzforever/BEVT)]
- **[MobileFormer]** Mobile-Former: Bridging MobileNet and Transformer [[paper](https://arxiv.org/pdf/2108.05895.pdf)]
- **[STRM]** Spatio-temporal Relation Modeling for Few-shot Action Recognition [[paper](https://arxiv.org/pdf/2112.05132.pdf)] [[code](https://github.com/Anirudh257/strm)]
- **[MiniViT]** MiniViT: Compressing Vision Transformers with Weight Multiplexing [[paper](https://arxiv.org/abs/2204.07154)] [[code](https://github.com/microsoft/Cream/tree/main/MiniViT)]
- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](https://arxiv.org/abs/2203.16518)] [[code](https://github.com/jhcho99/CoFormer)]
- **[DW-ViT]** Beyond Fixation: Dynamic Window Visual Transformer [[paper](https://arxiv.org/pdf/2203.12856.pdf)] [[code](https://github.com/pzhren/DW-ViT)]
- **[TokenFusion]** Multimodal Token Fusion for Vision Transformers [[paper](https://arxiv.org/pdf/2204.08721.pdf)]
- **[CMT]** Convolutional Neural Networks Meet Vision Transformers [[paper](https://arxiv.org/pdf/2107.06263.pdf)]
- Fine-tuning Image Transformers using Learnable Memory [[paper](https://arxiv.org/pdf/2203.15243.pdf)]
- **[TransMix]** Attend to Mix for Vision Transformers [[paper](https://arxiv.org/pdf/2111.09833.pdf)] [[code](https://github.com/Beckschen/TransMix)]
- **[NomMer]** Nominate Synergistic Context in Vision Transformer for Visual Recognition [[paper](https://arxiv.org/pdf/2111.12994.pdf)] [[code](https://github.com/TencentYoutuResearch/VisualRecognition-NomMer)]
- **[SSA]** Shunted Self-Attention via Multi-Scale Token Aggregation [[paper](https://arxiv.org/pdf/2111.15193.pdf)] [[code](https://github.com/OliverRensu/Shunted-Transformer)]
- **[RVT]** Towards Robust Vision Transformer [[paper](https://arxiv.org/pdf/2105.07926.pdf)] [[code](https://github.com/vtddggg/Robust-Vision-Transformer)]
- **[LVT]** Lite Vision Transformer with Enhanced Self-Attention [[paper](https://arxiv.org/pdf/2112.10809.pdf)] [[code](https://github.com/Chenglin-Yang/LVT)]
- **[StyTr2]** StyTr2: Image Style Transfer with Transformers [[paper](https://arxiv.org/pdf/2105.14576.pdf)] [[code](https://github.com/diyiiyiii/StyTR-2)]
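
Several papers in this section (MAE, BEVT, plus VideoMAE in the arXiv list above) build on masked image modeling, whose key step is hiding a large random subset of patch tokens and encoding only the rest. A minimal sketch of that random masking, assuming a 75% mask ratio; `random_masking` is a hypothetical helper, not the authors' code:

```python
# MAE-style random masking: keep a random 25% of patch tokens,
# so the encoder only ever sees the visible subset.
import torch

def random_masking(tokens, ratio=0.75):
    # tokens: (B, N, D) patch embeddings
    B, N, D = tokens.shape
    keep = int(N * (1 - ratio))
    noise = torch.rand(B, N)              # one random score per token
    ids = noise.argsort(dim=1)[:, :keep]  # indices of the kept tokens
    kept = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids                      # `ids` locates tokens for the decoder

kept, ids = random_masking(torch.randn(2, 196, 768))
print(kept.shape)  # torch.Size([2, 49, 768])
```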

**WACV**
- Image-Adaptive Hint Generation via Vision Transformer for Outpainting [[paper](https://openaccess.thecvf.com/content/WACV2022/papers/Kong_Image-Adaptive_Hint_Generation_via_Vision_Transformer_for_Outpainting_WACV_2022_paper.pdf)] [[code](https://github.com/kdh4672/hgonet)]

**ICLR**
- **[RelViT]** RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [[paper](https://arxiv.org/pdf/2204.11167.pdf)] [[code](https://github.com/NVlabs/RelViT)]
- **[CrossFormer]** CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [[paper](https://arxiv.org/abs/2108.00154)] [[code](https://github.com/cheerss/CrossFormer)]
- Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [[paper](https://arxiv.org/abs/2201.04676)] [[code](https://github.com/Sense-X/UniFormer)]
- **[DAB-DETR]** DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [[paper](https://arxiv.org/abs/2201.12329)] [[code](https://github.com/IDEA-opensource/DAB-DETR)]

### 2021
**NeurIPS**

- ProTo: Program-Guided Transformer for Program-Guided Tasks [[paper](https://arxiv.org/abs/2110.00804)] [[code](https://github.com/sjtuytc/Neurips21-ProTo-Program-guided-Transformers-for-Program-guided-Tasks)]
- **[Augvit]** Augmented Shortcuts for Vision Transformers [[paper](https://arxiv.org/abs/2106.15941)] [[code](https://github.com/huawei-noah/CV-Backbones/tree/master/augvit_pytorch)]
- **[YOLOS]** You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [[paper](https://arxiv.org/abs/2106.00666)] [[code](https://github.com/hustvl/YOLOS)]
- **[CATs]** Semantic Correspondence with Transformers [[paper](https://arxiv.org/abs/2106.02520)] [[code](https://github.com/SunghwanHong/CATs)]
- **[Moment-DETR]** QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [[paper](https://arxiv.org/abs/2107.09609)] [[code](https://github.com/jayleicn/moment_detr)]
- Dual-stream Network for Visual Recognition [[paper](https://arxiv.org/abs/2105.14734)] [[code](https://github.com/gaopengcuhk/DSNet)]
- **[Container]** Container: Context Aggregation Network [[paper](https://arxiv.org/abs/2106.01401)] [[code](https://github.com/gaopengcuhk/Container)]
- **[TNT]** Transformer in Transformer [[paper](https://arxiv.org/abs/2103.00112)] [[code](https://github.com/huawei-noah/noah-research/tree/master/TNT)]
- T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [[paper](https://arxiv.org/abs/2109.10948)]
- Long Short-Term Transformer for Online Action Detection [[paper](https://papers.nips.cc/paper/2021/hash/08b255a5d42b89b0585260b6f2360bdd-Abstract.html)]
- TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [[paper](https://papers.nips.cc/paper/2021/hash/0a87257e5308197df43230edf4ad1dae-Abstract.html)]
- TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [[paper](https://papers.nips.cc/paper/2021/hash/0f49c89d1e7298bb9930789c8ed59d48-Abstract.html)]
- TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [[paper](https://papers.nips.cc/paper/2021/hash/10c272d06794d3e5785d5e7c5356e9ff-Abstract.html)]
- Associating Objects with Transformers for Video Object Segmentation [[paper](https://papers.nips.cc/paper/2021/hash/147702db07145348245dc5a2f2fe5683-Abstract.html)]
- Test-Time Personalization with a Transformer for Human Pose Estimation [[paper](https://papers.nips.cc/paper/2021/hash/1517c8664be296f0d87d9e5fc54fdd60-Abstract.html)]
- Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [[paper](https://papers.nips.cc/paper/2021/hash/21be992eb8016e541a15953eee90760e-Abstract.html)]
- Dynamic Grained Encoder for Vision Transformers [[paper](https://papers.nips.cc/paper/2021/hash/2d969e2cee8cfa07ce7ca0bb13c7a36d-Abstract.html)]
- HRFormer: High-Resolution Vision Transformer for Dense Predict [[paper](https://papers.nips.cc/paper/2021/hash/3bbfdde8842a5c44a0323518eec97cbe-Abstract.html)]
- Searching the Search Space of Vision Transformer [[paper](https://papers.nips.cc/paper/2021/hash/48e95c45c8217961bf6cd7696d80d238-Abstract.html)]
- Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [[paper](https://papers.nips.cc/paper/2021/hash/64517d8435994992e682b3e4aa0a0661-Abstract.html)]
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [[paper](https://papers.nips.cc/paper/2021/hash/64f1f27bf1b4ec22924fd0acb550c235-Abstract.html)]
- Do Vision Transformers See Like Convolutional Neural Networks? [[paper](https://papers.nips.cc/paper/2021/hash/652cf38361a209088302ba2b8b7f51e0-Abstract.html)]
- Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [[paper](https://papers.nips.cc/paper/2021/hash/67f7fb873eaf29526a11a9b7ac33bfac-Abstract.html)]
- Glance-and-Gaze Vision Transformer [[paper](https://papers.nips.cc/paper/2021/hash/6c524f9d5d7027454a783c841250ba71-Abstract.html)]
- MST: Masked Self-Supervised Transformer for Visual Representation [[paper](https://papers.nips.cc/paper/2021/hash/6dbbe6abe5f14af882ff977fc3f35501-Abstract.html)]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [[paper](https://papers.nips.cc/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html)]
- TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [[paper](https://papers.nips.cc/paper/2021/hash/7c220a2091c26a7f5e9f1cfb099511e3-Abstract.html)]
- Augmented Shortcuts for Vision Transformers [[paper](https://papers.nips.cc/paper/2021/hash/818f4654ed39a1c147d1e51a00ffb4cb-Abstract.html)]
- Improved Transformer for High-Resolution GANs [[paper](https://papers.nips.cc/paper/2021/hash/98dce83da57b0395e163467c9dae521b-Abstract.html)]
- All Tokens Matter: Token Labeling for Training Better Vision Transformers [[paper](https://papers.nips.cc/paper/2021/hash/9a49a25d845a483fae4be7e341368e36-Abstract.html)]
- XCiT: Cross-Covariance Image Transformers [[paper](https://papers.nips.cc/paper/2021/hash/a655fbe4b8d7439994aa37ddad80de56-Abstract.html)]
- Efficient Training of Visual Transformers with Small Datasets [[paper](https://papers.nips.cc/paper/2021/hash/c81e155d85dae5430a8cee6f2242e82c-Abstract.html)]

**ICCV**

- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (**Marr Prize**) [[paper](https://arxiv.org/abs/2103.14030)] [[code](https://github.com/microsoft/Swin-Transformer)]
- **[ICT]** High-Fidelity Pluralistic Image Completion with Transformers [[paper](https://arxiv.org/pdf/2103.14031.pdf)] [[code](https://github.com/raywzy/ICT)]
- **[PoinTr]** PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (**oral**) [[paper](https://arxiv.org/abs/2108.08839)] [[code](https://github.com/yuxumin/PoinTr)]
- **[STTR]** Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [[paper](https://arxiv.org/abs/2011.02910v2)] [[code](https://github.com/mli0603/stereo-transformer)]
- **[TSP-FCOS]** Rethinking Transformer-based Set Prediction for Object Detection [[paper](https://arxiv.org/abs/2011.10881)]
- Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (**oral**) [[paper](https://arxiv.org/abs/2108.03798)] [[code](https://github.com/Huage001/PaintTransformer)]
- 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_3DVG-Transformer_Relation_Modeling_for_Visual_Grounding_on_Point_Clouds_ICCV_2021_paper.pdf)]
- **[T2T-ViT]** Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [[paper](https://arxiv.org/abs/2101.11986)] [[code](https://github.com/yitu-opensource/T2T-ViT)]
- **[THUNDR]** THUNDR: Transformer-Based 3D Human Reconstruction With Markers [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zanfir_THUNDR_Transformer-Based_3D_Human_Reconstruction_With_Markers_ICCV_2021_paper.html)]
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [[paper](https://arxiv.org/abs/2103.15358)]
- **[PVT]** Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [[paper](https://arxiv.org/abs/2102.12122)] [[code](https://github.com/whai362/PVT)]
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Cong_Spatial-Temporal_Transformer_for_Dynamic_Scene_Graph_Generation_ICCV_2021_paper.pdf)]
- **[GLiT]** GLiT: Neural Architecture Search for Global and Local Image Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_GLiT_Neural_Architecture_Search_for_Global_and_Local_Image_Transformer_ICCV_2021_paper.pdf)]
- **[TRAR]** TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_TRAR_Routing_the_Attention_Spans_in_Transformer_for_Visual_Question_ICCV_2021_paper.pdf)]
- **[UniT]** UniT: Multimodal Multitask Learning With a Unified Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Hu_UniT_Multimodal_Multitask_Learning_With_a_Unified_Transformer_ICCV_2021_paper.html)] [[code](https://mmf.sh)]
- Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Voskou_Stochastic_Transformer_Networks_With_Linear_Competing_Units_Application_To_End-to-End_ICCV_2021_paper.pdf)]
- Transformer-Based Dual Relation Graph for Multi-Label Image Recognition [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Transformer-Based_Dual_Relation_Graph_for_Multi-Label_Image_Recognition_ICCV_2021_paper.pdf)]
- **[LocalTrans]** LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Shao_LocalTrans_A_Multiscale_Local_Transformer_Network_for_Cross-Resolution_Homography_Estimation_ICCV_2021_paper.pdf)]
- Improving 3D Object Detection With Channel-Wise Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sheng_Improving_3D_Object_Detection_With_Channel-Wise_Transformer_ICCV_2021_paper.html)]
- A Latent Transformer for Disentangled Face Editing in Images and Videos [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yao_A_Latent_Transformer_for_Disentangled_Face_Editing_in_Images_and_ICCV_2021_paper.pdf)] [[code](https://github.com/InterDigitalInc/latent-transformer)]
- **[GroupFormer]** GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Li_GroupFormer_Group_Activity_Recognition_With_Clustered_Spatial-Temporal_Transformer_ICCV_2021_paper.html)]
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Matsumori_Unified_Questioner_Transformer_for_Descriptive_Question_Generation_in_Goal-Oriented_Visual_ICCV_2021_paper.pdf)]
- **[WB-DETR]** WB-DETR: Transformer-Based Detector Without Backbone [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_WB-DETR_Transformer-Based_Detector_Without_Backbone_ICCV_2021_paper.pdf)]
- The Animation Transformer: Visual Correspondence via Segment Matching [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Casey_The_Animation_Transformer_Visual_Correspondence_via_Segment_Matching_ICCV_2021_paper.pdf)]
- Relaxed Transformer Decoders for Direct Action Proposal Generation [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Tan_Relaxed_Transformer_Decoders_for_Direct_Action_Proposal_Generation_ICCV_2021_paper.html)]
- **[PPT-Net]** Pyramid Point Cloud Transformer for Large-Scale Place Recognition [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Hui_Pyramid_Point_Cloud_Transformer_for_Large-Scale_Place_Recognition_ICCV_2021_paper.pdf)] [[code](https://github.com/fpthink/PPT-Net)]
- Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Multimodal_Co-Attention_Transformer_for_Survival_Prediction_in_Gigapixel_Whole_Slide_ICCV_2021_paper.pdf)]
- Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Uncertainty-Guided_Transformer_Reasoning_for_Camouflaged_Object_Detection_ICCV_2021_paper.pdf)]
- Image Harmonization With Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Guo_Image_Harmonization_With_Transformer_ICCV_2021_paper.html)] [[code](https://github.com/zhenglab/HarmonyTransformer)]
- **[COTR]** COTR: Correspondence Transformer for Matching Across Images [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Jiang_COTR_Correspondence_Transformer_for_Matching_Across_Images_ICCV_2021_paper.pdf)]
- **[MUSIQ]** MUSIQ: Multi-Scale Image Quality Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Ke_MUSIQ_Multi-Scale_Image_Quality_Transformer_ICCV_2021_paper.pdf)]
- Episodic Transformer for Vision-and-Language Navigation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Pashevich_Episodic_Transformer_for_Vision-and-Language_Navigation_ICCV_2021_paper.pdf)]
- Action-Conditioned 3D Human Motion Synthesis With Transformer VAE [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Petrovich_Action-Conditioned_3D_Human_Motion_Synthesis_With_Transformer_VAE_ICCV_2021_paper.html)]
- **[CrackFormer]** CrackFormer: Transformer Network for Fine-Grained Crack Detection [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_CrackFormer_Transformer_Network_for_Fine-Grained_Crack_Detection_ICCV_2021_paper.pdf)]
- **[HiT]** HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_HiT_Hierarchical_Transformer_With_Momentum_Contrast_for_Video-Text_Retrieval_ICCV_2021_paper.pdf)]
- Event-Based Video Reconstruction Using Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Weng_Event-Based_Video_Reconstruction_Using_Transformer_ICCV_2021_paper.pdf)]
- **[STVGBert]** STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Su_STVGBert_A_Visual-Linguistic_Transformer_Based_Framework_for_Spatio-Temporal_Video_Grounding_ICCV_2021_paper.pdf)]
- **[HiFT]** HiFT: Hierarchical Feature Transformer for Aerial Tracking [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Cao_HiFT_Hierarchical_Feature_Transformer_for_Aerial_Tracking_ICCV_2021_paper.pdf)] [[code](https://github.com/vision4robotics/HiFT)]
- **[DocFormer]** DocFormer: End-to-End Transformer for Document Understanding [[paper](https://arxiv.org/abs/2106.11539)]
- **[LeViT]** LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [[paper](https://arxiv.org/abs/2104.01136)] [[code](https://github.com/facebookresearch/LeViT)]
- **[SignBERT]** SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [[paper](https://arxiv.org/abs/2110.05382)]
- **[VidTr]** VidTr: Video Transformer Without Convolutions [[paper](https://arxiv.org/abs/2104.11746)]
- **[ACTOR]** Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [[paper](https://arxiv.org/abs/2104.05670)]
- **[Segmenter]** Segmenter: Transformer for Semantic Segmentation [[paper](https://arxiv.org/abs/2105.05633)] [[code](https://github.com/rstrudel/segmenter)]
- **[Visformer]** Visformer: The Vision-friendly Transformer [[paper](https://arxiv.org/abs/2104.12533)] [[code](https://github.com/danczs/Visformer)]
- **[PnP-DETR]** PnP-DETR: Towards Efficient Visual Analysis with Transformers (**ICCV**) [[paper](https://arxiv.org/abs/2109.07036)] [[code](https://github.com/twangnh/pnp-detr)]
- **[VoTr]** Voxel Transformer for 3D Object Detection [[paper](https://arxiv.org/abs/2109.02497)]
- **[TransVG]** TransVG: End-to-End Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2104.08541)]
- **[3DETR]** An End-to-End Transformer Model for 3D Object Detection [[paper](https://arxiv.org/abs/2109.08141)] [[code](https://github.com/facebookresearch/3detr)]
- **[Eformer]** Eformer: Edge Enhancement based Transformer for Medical Image Denoising [[paper](https://arxiv.org/abs/2109.08044)]
- **[TransFER]** TransFER: Learning Relation-aware Facial Expression Representations with Transformers [[paper](https://arxiv.org/abs/2108.11116)]
- **[Oriented RCNN]** Oriented Object Detection with Transformer [[paper](https://arxiv.org/abs/2106.03146)]
- **[ViViT]** ViViT: A Video Vision Transformer [[paper](https://arxiv.org/abs/2103.15691)]
- **[Stark]** Learning Spatio-Temporal Transformer for Visual Tracking [[paper](https://arxiv.org/abs/2103.17154)] [[code](https://github.com/researchmm/Stark)]
- **[CT3D]** Improving 3D Object Detection with Channel-wise Transformer [[paper](https://arxiv.org/abs/2108.10723)]
- **[VST]** Visual Saliency Transformer [[paper](https://arxiv.org/abs/2104.12099)]
- **[PiT]** Rethinking Spatial Dimensions of Vision Transformers [[paper](https://arxiv.org/abs/2103.16302)] [[code](https://github.com/naver-ai/pit)] (see the timm loading sketch after this list)
- **[CrossViT]** CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [[paper](https://arxiv.org/abs/2103.14899)] [[code](https://github.com/IBM/CrossViT)]
- **[PointTransformer]** Point Transformer [[paper](https://arxiv.org/abs/2012.09164)]
- **[TS-CAM]** TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [[paper](https://arxiv.org/abs/2103.14862)] [[code](https://github.com/vasgaowei/TS-CAM.git)]
- **[VTs]** Visual Transformers: Token-based Image Representation and Processing for Computer Vision [[paper](https://arxiv.org/abs/2006.03677)]
- **[TransDepth]** Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction [[paper](https://arxiv.org/pdf/2103.12091.pdf)] [[code](https://github.com/ygjwd12345/TransDepth)]
- **[Conditional DETR]** Conditional DETR for Fast Training Convergence [[paper](https://arxiv.org/abs/2108.06152)] [[code](https://github.com/Atten4Vis/ConditionalDETR)]
- **[PIT]** PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [[paper](https://arxiv.org/abs/2108.07142)] [[code](https://github.com/sheepooo/PIT-Position-Invariant-Transform)]
- **[SOTR]** SOTR: Segmenting Objects with Transformers [[paper](https://arxiv.org/abs/2108.06747)] [[code](https://github.com/easton-cau/SOTR)]
- **[SnowflakeNet]** SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [[paper](https://arxiv.org/abs/2108.04444)] [[code](https://github.com/AllenXiangX/SnowflakeNet)]
- **[TransPose]** TransPose: Keypoint Localization via Transformer [[paper](https://arxiv.org/abs/2012.14214)] [[code](https://github.com/yangsenius/TransPose)]
- **[TransReID]** TransReID: Transformer-based Object Re-Identification [[paper](https://arxiv.org/abs/2102.04378)] [[code](https://github.com/heshuting555/TransReID)]
- **[CWT]** Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [[paper](https://arxiv.org/abs/2108.03032)] [[code](https://github.com/zhiheLu/CWT-for-FSS)]
- Anticipative Video Transformer [[paper](https://arxiv.org/abs/2106.02036)] [[code](http://facebookresearch.github.io/AVT)]
- Rethinking and Improving Relative Position Encoding for Vision Transformer [[paper](https://arxiv.org/abs/2107.14222)] [[code](https://github.com/microsoft/Cream/tree/main/iRPE)]
- Vision Transformer with Progressive Sampling [[paper](https://arxiv.org/abs/2108.01684)] [[code](https://github.com/yuexy/PS-ViT)]
- **[SMCA]** Fast Convergence of DETR with Spatially Modulated Co-Attention [[paper](https://arxiv.org/abs/2101.07448)] [[code](https://github.com/abc403/SMCA-replication)]
- **[AutoFormer]** AutoFormer: Searching Transformers for Visual Recognition [[paper](https://arxiv.org/pdf/2107.00651.pdf)] [[code](https://github.com/microsoft/AutoML)]
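
Many of the classification backbones listed above (PiT, LeViT, CrossViT, Visformer, among others) have pretrained weights registered in the [timm](https://github.com/huggingface/pytorch-image-models) library. A minimal sketch of trying one out; the model name `pit_b_224` comes from timm's registry, not from the paper's own repository:

```python
# Minimal sketch: run a pretrained PiT through timm (assumes `pip install timm`).
import timm
import torch

model = timm.create_model('pit_b_224', pretrained=True).eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # (1, 1000) ImageNet logits
print(logits.argmax(dim=1))
```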

**CVPR**
- Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [[paper](https://arxiv.org/abs/2106.04095)]
- **[HOTR]** HOTR: End-to-End Human-Object Interaction Detection with Transformers (**oral**) [[paper](https://arxiv.org/abs/2104.13682)]
- **[METRO]** End-to-End Human Pose and Mesh Reconstruction with Transformers [[paper](https://arxiv.org/abs/2012.09760)]
- **[LETR]** Line Segment Detection Using Transformers without Edges [[paper](https://arxiv.org/abs/2101.01909)]
- **[TransFuser]** Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [[paper](https://arxiv.org/abs/2104.09224)] [[code](https://github.com/autonomousvision/transfuser)]
- Pose Recognition with Cascade Transformers [[paper](https://arxiv.org/abs/2104.06976)]
- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [[paper](https://arxiv.org/abs/2104.03135)]
- **[LoFTR]** LoFTR: Detector-Free Local Feature Matching with Transformers [[paper](https://arxiv.org/abs/2104.00680)] [[code](https://zju3dv.github.io/loftr/)]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](https://arxiv.org/abs/2103.16553)]
- **[SETR]** Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [[paper](https://arxiv.org/abs/2012.15840)] [[code](https://fudan-zvg.github.io/SETR/)]
- **[TransT]** Transformer Tracking [[paper](https://arxiv.org/abs/2103.15436)] [[code](https://github.com/chenxin-dlut/TransT)]
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (**oral**) [[paper](https://arxiv.org/abs/2103.11681)]
- **[VisTR]** End-to-End Video Instance Segmentation with Transformers [[paper](https://arxiv.org/abs/2011.14503)]
- Transformer Interpretability Beyond Attention Visualization [[paper](https://arxiv.org/abs/2012.09838)] [[code](https://github.com/hila-chefer/Transformer-Explainability)]
- **[IPT]** Pre-Trained Image Processing Transformer [[paper](https://arxiv.org/abs/2012.00364)]
- **[UP-DETR]** UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [[paper](https://arxiv.org/abs/2011.09094)]
- **[IQT]** Perceptual Image Quality Assessment with Transformers (**workshop**) [[paper](https://arxiv.org/abs/2104.14730)]
- High-Resolution Complex Scene Synthesis with Transformers (**workshop**) [[paper](https://arxiv.org/abs/2105.06458)]
- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](https://arxiv.org/abs/2203.16518)] [[code](https://github.com/jhcho99/CoFormer)]

**ICML**
- Generative Video Transformer: Can Objects be the Words? [[paper](https://arxiv.org/abs/2107.09240)]
- **[GANsformer]** Generative Adversarial Transformers [[paper](https://arxiv.org/abs/2103.01209)] [[code](https://github.com/dorarad/gansformer)]

**ICRA**
- **[NDT-Transformer]** NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [[paper](https://arxiv.org/abs/2103.12292)]

**ICLR**
- **[VTNet]** VTNet: Visual Transformer Network for Object Goal Navigation [[paper](https://arxiv.org/abs/2105.09447)]
- **[Vision Transformer]** An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[paper](https://arxiv.org/abs/2010.11929)] [[code](https://github.com/google-research/vision_transformer)] (a minimal forward-pass sketch follows this list)
- **[Deformable DETR]** Deformable DETR: Deformable Transformers for End-to-End Object Detection [[paper](https://arxiv.org/abs/2010.04159)] [[code](https://github.com/fundamentalvision/Deformable-DETR)]
- **[LambdaNetworks]** LambdaNetworks: Modeling Long-Range Interactions Without Attention [[paper](https://openreview.net/pdf?id=xTJEN-ggl1b)] [[code](https://github.com/lucidrains/lambda-networks)]
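
The ViT recipe is compact enough to sketch end to end: split the image into fixed-size patches, linearly embed them, prepend a learnable [CLS] token, add position embeddings, and classify from the [CLS] output of a standard encoder. Below is a minimal toy version, assuming PyTorch; it uses the stock `nn.TransformerEncoder` (post-norm, ReLU) rather than the paper's pre-norm/GELU blocks, and the dimensions are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                      # number of patches
        self.proj = nn.Conv2d(3, dim, patch, patch)  # patchify + linear embed in one conv
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                            # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                    # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))      # -> (2, 1000)
```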

**ACM MM**
- Video Transformer for Deepfake Detection with Incremental Learning [[paper](https://arxiv.org/abs/2108.05307)]
- **[HAT]** HAT: Hierarchical Aggregation Transformers for Person Re-identification [[paper](https://arxiv.org/abs/2107.05946)]
- Token Shift Transformer for Video Classification [[paper](https://arxiv.org/abs/2108.02432)] [[code](https://github.com/VideoNetworks/TokShift-Transformer)]
- **[DPT]** DPT: Deformable Patch-based Transformer for Visual Recognition [[paper](https://arxiv.org/abs/2107.14467)] [[code](https://github.com/CASIA-IVA-Lab/DPT)]

**MICCAI**
- **[UTNet]** UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [[paper](https://arxiv.org/abs/2107.00781)] [[code](https://github.com/yhygao/UTNet)]
- **[MedT]** Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.10662)] [[code](https://github.com/jeya-maria-jose/Medical-Transformer)]
- **[MCTrans]** Multi-Compound Transformer for Accurate Biomedical Image Segmentation [[paper](https://arxiv.org/abs/2106.14385)] [[code](https://github.com/JiYuanFeng/MCTrans)]
- **[PNS-Net]** Progressively Normalized Self-Attention Network for Video Polyp Segmentation [[paper](https://arxiv.org/abs/2105.08468)] [[code](https://github.com/GewelsJI/PNS-Net)]
- **[MBT-Net]** A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [[paper](https://arxiv.org/abs/2106.07557)]

**BMVC**
- **[ACT]** End-to-End Object Detection with Adaptive Clustering Transformer [[paper](https://arxiv.org/abs/2011.09315)]
- **[GSRTR]** Grounded Situation Recognition with Transformers [[paper](https://arxiv.org/abs/2111.10135)] [[code](https://github.com/jhcho99/gsrtr)]
- **[TransFusion]** TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [[paper](https://arxiv.org/abs/2110.09554)] [[code](https://github.com/HowieMa/TransFusion-Pose)]

**ISIE**
- VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization [[paper](https://arxiv.org/abs/2104.10036)]

**CoRL**
- **[DETR3D]** DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [[paper](https://arxiv.org/abs/2110.06922)]

**IJCAI**
- Medical Image Segmentation using Squeeze-and-Expansion Transformers [[paper](https://arxiv.org/abs/2105.09511)]

**IROS**
- **[YOGO]** You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module [[paper](https://arxiv.org/abs/2103.09975)] [[code](https://github.com/chenfengxu714/YOGO.git)]
- **[PTT]** PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds [[paper](https://arxiv.org/abs/2108.06455)] [[code](https://github.com/shanjiayao/PTT)]

**WACV**
- **[LSTR]** End-to-end Lane Shape Prediction with Transformers [[paper](https://arxiv.org/abs/2011.04233)] [[code](https://github.com/liuruijin17/LSTR)]

**ICDAR**
- Vision Transformer for Fast and Efficient Scene Text Recognition [[paper](https://arxiv.org/abs/2105.08582)]

### 2020

- **[DETR]** End-to-End Object Detection with Transformers (**ECCV**) [[paper](https://arxiv.org/abs/2005.12872)] [[code](https://github.com/facebookresearch/detr)] (a loading sketch follows this list)
- **[FPT]** Feature Pyramid Transformer (**CVPR**) [[paper](https://arxiv.org/abs/2007.09451)] [[code](https://github.com/ZHANGDONG-NJUST/FPT)]
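
DETR frames detection as direct set prediction: a fixed set of learned object queries attends to encoder features, and each query emits one class and one box. A minimal sketch of inspecting the pretrained model, assuming the Torch Hub entry published in the facebookresearch/detr repository:

```python
# Minimal sketch: load pretrained DETR via Torch Hub and inspect its outputs.
import torch

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 800, 800))  # any reasonably sized image tensor

# 100 object queries, each predicting class logits (91 COCO classes + no-object)
# and a normalized (cx, cy, w, h) box.
print(out['pred_logits'].shape)  # torch.Size([1, 100, 92])
print(out['pred_boxes'].shape)   # torch.Size([1, 100, 4])
```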

### Other resources
- [[Awesome-Transformer-Attention](https://github.com/cmhungsteve/Awesome-Transformer-Attention)]

### Acknowledgement

Thanks to [Awesome-Crowd-Counting](https://github.com/gjy3035/Awesome-Crowd-Counting) for the template.