



# Awesome Visual-Transformer

A collection of papers on Transformers in computer vision (CV).

If you find overlooked papers, please open an issue or, preferably, a pull request.

## Papers

### Original Transformer paper

- Attention is All You Need (NIPS 2017)

### Technical blog

- [English Blog] Transformers in Vision [[Link](]
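- [Chinese Blog] An Easy Introduction to Vision Transformers in a 30,000-Character Long Read (3W字长文带你轻松入门视觉transformer) [[Link](]
- [Chinese Blog] Vision Transformer: A Highly Detailed Walkthrough (Principle Analysis + Code Walkthrough) (Vision Transformer 超详细解读) [[Link](]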

### Survey
- Multimodal learning with transformers: A survey (IEEE TPAMI) [[paper](] - 2023.05.11
- A Survey of Visual Transformers [[paper](] - 2021.11.30
- Transformers in Vision: A Survey [[paper](] - 2021.02.22
- A Survey on Visual Transformer [[paper](] - 2021.01.30
- A Survey of Transformers [[paper](] - 2020.06.09

### arXiv papers
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [[paper](]
- **[FocusedDecoder]** Focused Decoding Enables 3D Anatomical Detection by Transformers [[paper](] [[code](]
- **[TAG]** TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [[paper](] [[code](]
- **[FastMETRO]** Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [[paper](] [[code](]
- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [[paper](] [[code](]
- **[RelViT]** RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [[paper]]( [[code]](
- **[MViTv2]** Improved Multiscale Vision Transformers for Classification and Detection [[paper](] [[code](]
- DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [[paper](] [[code](]
- Three things everyone should know about Vision Transformers [[paper](]
- **[DeiT III]** DeiT III: Revenge of the ViT [[paper](]
- **[DaViT]** DaViT: Dual Attention Vision Transformers [[paper](] [[code](]
- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](] [[code](]
- **[GSRTR]** Grounded Situation Recognition with Transformers [[paper](] [[code](]
- **[MaxViT]** MaxViT: Multi-Axis Vision Transformer [[paper]](
- **[V2X-ViT]** V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [[paper]](
- **[MemMC-MAE]** Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder [[paper](] [[code](]
- Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [[paper](] [[code](]
- **[VideoMAE]** VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [[paper](] [[code](]
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [[paper](]
- ResViT: Residual vision transformers for multi-modal medical image synthesis [[paper](]
- **[CrossEfficientViT]** Combining EfficientNet and Vision Transformers for Video Deepfake Detection [[paper](] [[code](]
- **[Discrete ViT]** Discrete Representations Strengthen Vision Transformer Robustness [[paper](]
- **[StyleSwin]** StyleSwin: Transformer-based GAN for High-resolution Image Generation [[paper](] [[code](]
- **[SReT]** Sliced Recursive Transformer [[paper](] [[code](]
- Dynamic Token Normalization Improves Vision Transformer [[paper](]
- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [[paper](] [[code](]
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [[paper](]
- **[ORViT]** Object-Region Video Transformers [[paper](] [[code](]
- Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [[paper](] [[code](]
- **[NViT]** NViT: Vision Transformer Compression and Parameter Redistribution [[paper](]
- 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [[paper](]
- Adversarial Token Attacks on Vision Transformers [[paper](]
- Contextual Transformer Networks for Visual Recognition [[paper](] [[code](]
- **[TranSalNet]** TranSalNet: Visual saliency prediction using transformers [[paper](]
- **[MobileViT]** MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [[paper](]
- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [[paper](]
- **[3D-Transformer]** 3D-Transformer: Molecular Representation with Transformer in 3D Space [[paper](]
- **[CCTrans]** CCTrans: Simplifying and Improving Crowd Counting with Transformer [[paper](]
- **[UFO-ViT]** UFO-ViT: High Performance Linear Vision Transformer without Softmax [[paper](]
- Sparse Spatial Transformers for Few-Shot Learning [[paper](]
- Vision Transformer Hashing for Image Retrieval [[paper](]
- **[OH-Former]** OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [[paper](]
- **[Pix2seq]** Pix2seq: A Language Modeling Framework for Object Detection [[paper](]
- **[CoAtNet]** CoAtNet: Marrying Convolution and Attention for All Data Sizes [[paper](]
- **[LOTR]** LOTR: Face Landmark Localization Using Localization Transformer [[paper](]
- Transformer-Unet: Raw Image Processing with Unet [[paper](]
- **[GraFormer]** GraFormer: Graph Convolution Transformer for 3D Pose Estimation [[paper](]
- **[CDTrans]** CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [[paper](]
- PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [[paper](] [[code](]
- Anchor DETR: Query Design for Transformer-Based Detector [[paper](] [[code](]
- **[DAB-DETR]** DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [[paper](] [[code](]
- **[ESRT]** Efficient Transformer for Single Image Super-Resolution [[paper](]
- **[MaskFormer]** MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [[paper](] [[code](]
- **[SwinIR]** SwinIR: Image Restoration Using Swin Transformer [[paper](] [[code](]
- **[Trans4Trans]** Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [[paper](]
- Do Vision Transformers See Like Convolutional Neural Networks? [[paper](]
- Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [[paper](]
- Light Field Image Super-Resolution with Transformers [[paper](] [[code](]
- Focal Self-attention for Local-Global Interactions in Vision Transformers [[paper](] [[code](]
- Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [[paper](] [[code](]
- Mobile-Former: Bridging MobileNet and Transformer [[paper](]
- **[TriTransNet]** TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [[paper](]
- **[PSViT]** PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [[paper](]
- Boosting Few-shot Semantic Segmentation with Transformers [[paper](] [[code](]
- Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [[paper](]
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [[paper](]
- **[Styleformer]** Styleformer: Transformer based Generative Adversarial Networks with Style Vector [[paper](] [[code](]
- **[CMT]** CMT: Convolutional Neural Networks Meet Vision Transformers [[paper](]
- **[TransAttUnet]** TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [[paper](]
- TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [[paper](]
- **[ViTGAN]** ViTGAN: Training GANs with Vision Transformers [[paper](]
- What Makes for Hierarchical Vision Transformer? [[paper](]
- **[Trans4Trans]** Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [[paper](]
- **[FFVT]** Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [[paper](]
- **[TransformerFusion]** TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [[paper](]
- Escaping the Big Data Paradigm with Compact Transformers [[paper](]
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [[paper](]
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [[paper](]
- **[XCiT]** XCiT: Cross-Covariance Image Transformers [[paper](] [[code](]
- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [[paper](] [[code](]
- Video Swin Transformer [[paper](] [[code](]
- **[VOLO]** VOLO: Vision Outlooker for Visual Recognition [[paper](] [[code](]
- Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [[paper](]
- End-to-end Temporal Action Detection with Transformer [[paper](] [[code](]
- Efficient Self-supervised Vision Transformers for Representation Learning [[paper](]
- Space-time Mixing Attention for Video Transformer [[paper](]
- Transformed CNNs: recasting pre-trained convolutional layers with self-attention [[paper](]
- **[CAT]** CAT: Cross Attention in Vision Transformer [[paper](]
- Scaling Vision Transformers [[paper](]
- **[DETReg]** DETReg: Unsupervised Pretraining with Region Priors for Object Detection [[paper](] [[code](]
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [[paper](]
- **[MViT]** MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [[paper](]
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [[paper](]
- On Improving Adversarial Transferability of Vision Transformers [[paper](]
- Fully Transformer Networks for Semantic Image Segmentation [[paper](]
- Visual Transformer for Task-aware Active Learning [[paper](] [[code](]
- Efficient Training of Visual Transformers with Small-Size Datasets [[paper](]
- Reveal of Vision Transformers Robustness against Adversarial Attacks [[paper](]
- Person Re-Identification with a Locally Aware Transformer [[paper](]
- **[Refiner]** Refiner: Refining Self-attention for Vision Transformers [[paper](]
- **[ViTAE]** ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [[paper](]
- Video Instance Segmentation using Inter-Frame Communication Transformers [[paper](]
- Transformer in Convolutional Neural Networks [[paper](] [[code](]
- **[Uformer]** Uformer: A General U-Shaped Transformer for Image Restoration [[paper](] [[code](]
- Patch Slimming for Efficient Vision Transformers [[paper](]
- **[RegionViT]** RegionViT: Regional-to-Local Attention for Vision Transformers [[paper](]
- Associating Objects with Transformers for Video Object Segmentation [[paper](] [[code](]
- Few-Shot Segmentation via Cycle-Consistent Transformer [[paper](]
- Glance-and-Gaze Vision Transformer [[paper](] [[code](]
- Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [[paper](]
- **[DynamicViT]** DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [[paper](] [[code](]
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [[paper](] [[code]()]
- Unsupervised Out-of-Domain Detection via Pre-trained Transformers [[paper](]
- **[TransMIL]** TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [[paper](]
- **[TransVOS]** TransVOS: Video Object Segmentation with Transformers [[paper](]
- **[KVT]** KVT: k-NN Attention for Boosting Vision Transformers [[paper](]
- **[MSG-Transformer]** MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [[paper](] [[code](]
- **[SegFormer]** SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [[paper](] [[code](]
- **[SDNet]** SDNet: multi-branch for single image deraining using swin [[paper](] [[code](]
- **[DVT]** Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [[paper](]
- **[GazeTR]** Gaze Estimation using Transformer [[paper](] [[code](]
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [[paper](]
- Less is More: Pay Less Attention in Vision Transformers [[paper](]
- **[FoveaTer]** FoveaTer: Foveated Transformer for Image Classification [[paper](]
- **[TransDA]** Transformer-Based Source-Free Domain Adaptation [[paper](] [[code](]
- An Attention Free Transformer [[paper](]
- **[PTNet]** PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [[paper](]
- **[ResT]** ResT: An Efficient Transformer for Visual Recognition [[paper](] [[code](]
- **[CogView]** CogView: Mastering Text-to-Image Generation via Transformers [[paper](]
- **[NesT]** Aggregating Nested Transformers [[paper](]
- **[TAPG]** Temporal Action Proposal Generation with Transformers [[paper](]
- Boosting Crowd Counting with Transformers [[paper](]
- **[COTR]** COTR: Convolution in Transformer Network for End to End Polyp Detection [[paper](]
- **[TransVOD]** End-to-End Video Object Detection with Spatial-Temporal Transformers [[paper](] [[code](]
- Intriguing Properties of Vision Transformers [[paper](] [[code](]
- Combining Transformer Generators with Convolutional Discriminators [[paper](]
- Rethinking the Design Principles of Robust Vision Transformer [[paper](]
- Vision Transformers are Robust Learners [[paper](] [[code](]
- Manipulation Detection in Satellite Images Using Vision Transformer [[paper](]
- **[Swin-Unet]** Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [[paper](] [[code](]
- Self-Supervised Learning with Swin Transformers [[paper](] [[code](]
- **[SCTN]** SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [[paper](]
- **[RelationTrack]** RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [[paper](]
- **[VGTR]** Visual Grounding with Transformers [[paper](]
- **[PST]** Visual Composite Set Detection Using Part-and-Sum Transformers [[paper](]
- **[TrTr]** TrTr: Visual Tracking with Transformer [[paper](] [[code](]
- **[MOTR]** MOTR: End-to-End Multiple-Object Tracking with TRansformer [[paper](] [[code](]
- Attention for Image Registration (AiR): an unsupervised Transformer approach [[paper](]
- **[TransHash]** TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [[paper](]
- **[ISTR]** ISTR: End-to-End Instance Segmentation with Transformers [[paper](] [[code](]
- **[CAT]** CAT: Cross-Attention Transformer for One-Shot Object Detection [[paper](]
- **[CoSformer]** CoSformer: Detecting Co-Salient Object with Transformers [[paper](]
- End-to-End Attention-based Image Captioning [[paper](]
- **[PMTrans]** Pyramid Medical Transformer for Medical Image Segmentation [[paper](]
- **[HandsFormer]** HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [[paper](]
- **[GasHis-Transformer]** GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [[paper](]
- Emerging Properties in Self-Supervised Vision Transformers [[paper](]
- **[InTra]** Inpainting Transformer for Anomaly Detection [[paper](]
- **[Twins]** Twins: Revisiting Spatial Attention Design in Vision Transformers [[paper](] [[code](]
- **[MLMSPT]** Point Cloud Learning with Transformer [[paper](]
- Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [[paper](]
- **[ConTNet]** ConTNet: Why not use convolution and transformer at the same time? [[paper](] [[code](]
- **[DTNet]** Dual Transformer for Point Cloud Analysis [[paper](]
- Improve Vision Transformers Training by Suppressing Over-smoothing [[paper](] [[code](]
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [[paper](]
- **[M3DeTR]** M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [[paper](] [[code](]
- **[Skeletor]** Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [[paper](]
- **[FaceT]** Learning to Cluster Faces via Transformer [[paper](]
- **[MViT]** Multiscale Vision Transformers [[paper](] [[code](]
- **[VATT]** VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [[paper](]
- **[So-ViT]** So-ViT: Mind Visual Tokens for Vision Transformer [[paper](] [[code](]
- Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [[paper](] [[code](]
- **[TransRPPG]** TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [[paper](]
- **[VideoGPT]** VideoGPT: Video Generation using VQ-VAE and Transformers [[paper](]
- **[M2TR]** M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [[paper](]
- Transformer Transforms Salient Object Detection and Camouflaged Object Detection [[paper](]
- **[TransCrowd]** TransCrowd: Weakly-Supervised Crowd Counting with Transformer [[paper](] [[code](]
- Visual Transformer Pruning [[paper](]
- Self-supervised Video Retrieval Transformer Network [[paper](]
- Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [[paper](]
- **[TransGAN]** TransGAN: Two Transformers Can Make One Strong GAN [[paper](] [[code](]
- Geometry-Free View Synthesis: Transformers and no 3D Priors [[paper](] [[code](]
- **[CoaT]** Co-Scale Conv-Attentional Image Transformers [[paper](] [[code](]
- **[LocalViT]** LocalViT: Bringing Locality to Vision Transformers [[paper](] [[code](]
- **[CIT]** Cloth Interactive Transformer for Virtual Try-On [[paper](] [[code](]
- Handwriting Transformers [[paper](]
- **[SiT]** SiT: Self-supervised vIsion Transformer [[paper](] [[code](]
- On the Robustness of Vision Transformers to Adversarial Examples [[paper](]
- An Empirical Study of Training Self-Supervised Visual Transformers [[paper](]
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [[paper](]
- **[AOT-GAN]** Aggregated Contextual Transformations for High-Resolution Image Inpainting [[paper](] [[code](]
- Deepfake Detection Scheme Based on Vision Transformer and Distillation [[paper](]
- **[ATAG]** Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [[paper](]
- **[TubeR]** TubeR: Tube-Transformer for Action Detection [[paper](]
- **[AAformer]** AAformer: Auto-Aligned Transformer for Person Re-Identification [[paper](]
- **[TFill]** TFill: Image Completion via a Transformer-Based Architecture [[paper](]
- Group-Free 3D Object Detection via Transformers [[paper](] [[code](]
- **[STGT]** Spatial-Temporal Graph Transformer for Multiple Object Tracking [[paper](]
- Going deeper with Image Transformers [[paper](]
- **[Meta-DETR]** Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [[paper]( [[code](]
- **[DA-DETR]** DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [[paper](]
- Robust Facial Expression Recognition with Convolutional Visual Transformers [[paper](]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](]
- Spatiotemporal Transformer for Video-based Person Re-identification [[paper](]
- **[TransUNet]** TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [[paper](] [[code](]
- **[CvT]** CvT: Introducing Convolutions to Vision Transformers [[paper](] [[code](]
- **[TFPose]** TFPose: Direct Human Pose Estimation with Transformers [[paper](]
- **[TransCenter]** TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [[paper](]
- Face Transformer for Recognition [[paper](]
- On the Adversarial Robustness of Visual Transformers [[paper](]
- Understanding Robustness of Transformers for Image Classification [[paper](]
- Lifting Transformer for 3D Human Pose Estimation in Video [[paper](]
- **[GSA-Net]** Global Self-Attention Networks for Image Recognition [[paper](]
- High-Fidelity Pluralistic Image Completion with Transformers [[paper](] [[code](]
- **[DPT]** Vision Transformers for Dense Prediction [[paper](] [[code](]
- **[TransFG]** TransFG: A Transformer Architecture for Fine-grained Recognition [[paper](]
- **[TimeSformer]** Is Space-Time Attention All You Need for Video Understanding? [[paper](]
- Multi-view 3D Reconstruction with Transformer [[paper](]
- Can Vision Transformers Learn without Natural Images? [[paper](] [[code](]
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [[paper](]
- Instance-level Image Retrieval using Reranking Transformers [[paper](] [[code](]
- **[BossNAS]** BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [[paper](] [[code](]
- **[CeiT]** Incorporating Convolution Designs into Visual Transformers [[paper](]
- **[DeepViT]** DeepViT: Towards Deeper Vision Transformer [[paper](]
- Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [[paper](]
- 3D Human Pose Estimation with Spatial and Temporal Transformers [[paper](] [[code](]
- **[UNETR]** UNETR: Transformers for 3D Medical Image Segmentation [[paper](]
- Scalable Visual Transformers with Hierarchical Pooling [[paper](]
- **[ConViT]** ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [[paper](]
- **[TransMed]** TransMed: Transformers Advance Multi-modal Medical Image Classification [[paper](]
- **[U-Transformer]** U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [[paper](]
- **[SpecTr]** SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [[paper](] [[code](]
- **[TransBTS]** TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [[paper](] [[code](]
- **[SSTN]** SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [[paper](]
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [[paper](] [[code](]
- **[CPVT]** Do We Really Need Explicit Position Encodings for Vision Transformers? [[paper](] [[code](]
- Deepfake Video Detection Using Convolutional Vision Transformer [[paper](]
- Training Vision Transformers for Image Retrieval [[paper](]
- **[VTN]** Video Transformer Network [[paper](]
- **[BoTNet]** Bottleneck Transformers for Visual Recognition [[paper](]
- **[CPTR]** CPTR: Full Transformer Network for Image Captioning [[paper](]
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [[paper](] [[code](]
- **[Trans2Seg]** Segmenting Transparent Object in the Wild with Transformer [[paper](] [[code](]
- Investigating the Vision Transformer Model for Image Retrieval Tasks [[paper](]
- **[Trear]** Trear: Transformer-based RGB-D Egocentric Action Recognition [[paper](]
- **[VisualSparta]** VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [[paper](]
- **[TrackFormer]** TrackFormer: Multi-Object Tracking with Transformers [[paper](]
- **[TAPE]** Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [[paper](]
- **[TRIQ]** Transformer for Image Quality Assessment [[paper](] [[code](]
- **[TransTrack]** TransTrack: Multiple-Object Tracking with Transformer [[paper](] [[code](]
- **[DeiT]** Training data-efficient image transformers & distillation through attention [[paper](] [[code](]
- **[Pointformer]** 3D Object Detection with Pointformer [[paper](]
- **[ViT-FRCNN]** Toward Transformer-Based Object Detection [[paper](]
- **[Taming-transformers]** Taming Transformers for High-Resolution Image Synthesis [[paper](] [[code](]
- **[SceneFormer]** SceneFormer: Indoor Scene Generation with Transformers [[paper](]
- **[PCT]** PCT: Point Cloud Transformer [[paper](]
- **[PED]** DETR for Pedestrian Detection [[paper](]
- **[C-Tran]** General Multi-label Image Classification with Transformers [[paper](]

### 2022


- **[P2T]** P2T: Pyramid Pooling Transformer for Scene Understanding [[paper](]


- **[X-CLIP]** Expanding Language-Image Pretrained Models for General Video Recognition [[paper](] [[code](]
- **[TinyViT]** TinyViT: Fast Pretraining Distillation for Small Vision Transformers [[paper](] [[code](]
- **[FastMETRO]** Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [[paper](] [[code](]
- **[AiATrack]** AiATrack: Attention in Attention for Transformer Visual Tracking [[paper](] [[code](]
- **[OSTrack]** Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [[paper](] [[code](]
- **[Unicorn]** Towards Grand Unification of Object Tracking [[paper](] [[code](]
- **[P3AFormer]** Tracking Objects as Pixel-wise Distributions [[paper](] [[code](]

- **[MAE]** Masked Autoencoders Are Scalable Vision Learners [[paper](] [[code]](
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [[paper](] [[code](]
- Fast Point Transformer [[paper](]
- EDTER: Edge Detection With Transformer [[paper](] [[code](]
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [[paper](]
- MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution [[paper](]
- HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [[paper](] [[code](]
- Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [[paper](]
- MPViT: Multi-Path Vision Transformer for Dense Prediction [[paper](] [[code]](
- A-ViT: Adaptive Tokens for Efficient Vision Transformer [[paper](]
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [[paper](] [[code](]
- Continual Learning With Lifelong Vision Transformer [[paper](]
- Swin Transformer V2: Scaling Up Capacity and Resolution [[paper](] [[code]](
- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds [[paper](] [[code](]
- Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [[paper](]
- Human-Object Interaction Detection via Disentangled Transformer [[paper](]
- LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [[paper](]
- Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [[paper](]
- Vision Transformer With Deformable Attention [[paper](]
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [[paper](]
- **[Restormer]** Restormer: Efficient Transformer for High-Resolution Image Restoration [[paper](] [[code](]
- **[SAM-DETR]** Accelerating DETR Convergence via Semantic-Aligned Matching [[paper](] [[code](]
- **[BEVT]** BEVT: BERT Pretraining of Video Transformers [[paper](] [[code](]
- **[MobileFormer]** Mobile-Former: Bridging MobileNet and Transformer [[paper](]
- **[STRM]** Spatio-temporal Relation Modeling for Few-shot Action Recognition [[paper](] [[code](]
- **[MiniViT]** MiniViT: Compressing Vision Transformers with Weight Multiplexing [[paper](] [[code](]
- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](] [[code](]
- **[DW-ViT]** Beyond Fixation: Dynamic Window Visual Transformer [[paper](] [[code](]
- **[TokenFusion]** Multimodal Token Fusion for Vision Transformers [[paper](]
- **[CMT]** Convolutional Neural Networks Meet Vision Transformers [[paper](]
- Fine-tuning Image Transformers using Learnable Memory [[paper](]
- **[TransMix]** Attend to Mix for Vision Transformers [[paper](] [[code](]
- **[NomMer]** Nominate Synergistic Context in Vision Transformer for Visual Recognition [[paper](] [[code](]
- **[SSA]** Shunted Self-Attention via Multi-Scale Token Aggregation [[paper](] [[code](]
- **[RVT]** Towards Robust Vision Transformer [[paper]( [[code](]
- **[LVT]** Lite Vision Transformer with Enhanced Self-Attention [[paper]( [[code](]
- **[StyTr2]** StyTr2: Image Style Transfer with Transformers [[paper](] [[code](]

- Image-Adaptive Hint Generation via Vision Transformer for Outpainting [[paper](] [[code](]

- **[RelViT]** RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [[paper](] [[code](]
- **[CrossFormer]** CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [[paper](] [[code](]

- Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [[paper](] [[code](]

- **[DAB-DETR]** DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [[paper](] [[code](]

### 2021

- ProTo: Program-Guided Transformer for Program-Guided Tasks [[paper](] [[code](]
- **[Augvit]** Augmented Shortcuts for Vision Transformers [[paper](] [[code](]
- **[YOLOS]** You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [[paper](] [[code](]
- **[CATs]** Semantic Correspondence with Transformers [[paper](] [[code](]
- **[Moment-DETR]** QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [[paper](] [[code](]
- Dual-stream Network for Visual Recognition [[paper](] [[code](]
- **[Container]** Container: Context Aggregation Network [[paper](] [[code](]
- **[TNT]** Transformer in Transformer [[paper](] [[code](]
- T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [[paper](]
- Long Short-Term Transformer for Online Action Detection [[paper](]
- TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [[paper](]
- TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [[paper](]
- TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [[paper](]
- Associating Objects with Transformers for Video Object Segmentation [[paper](]
- Test-Time Personalization with a Transformer for Human Pose Estimation [[paper](]
- Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [[paper](]
- Dynamic Grained Encoder for Vision Transformers [[paper](]
- HRFormer: High-Resolution Vision Transformer for Dense Predict [[paper](]
- Searching the Search Space of Vision Transformer [[paper](]
- Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [[paper](]
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [[paper](]
- Do Vision Transformers See Like Convolutional Neural Networks? [[paper](]
- Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [[paper](]
- Glance-and-Gaze Vision Transformer [[paper](]
- MST: Masked Self-Supervised Transformer for Visual Representation [[paper](]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [[paper](]
- TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [[paper](]
- Augmented Shortcuts for Vision Transformers [[paper](]
- Improved Transformer for High-Resolution GANs [[paper](]
- All Tokens Matter: Token Labeling for Training Better Vision Transformers [[paper](]
- XCiT: Cross-Covariance Image Transformers [[paper](]
- Efficient Training of Visual Transformers with Small Datasets [[paper](]


- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (**Marr Prize**) [[paper](] [[code](]
- **[ICT]** High-Fidelity Pluralistic Image Completion with Transformers [[paper](] [[code](]
- **[PoinTr]** PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (**oral**) [[paper](] [[code](]
- **[STTR]** Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [[paper](] [[code](]
- **[TSP-FCOS]** Rethinking Transformer-based Set Prediction for Object Detection [[paper](]
- Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (**oral**) [[paper](] [[code](]
- 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [[paper](]
- **[T2T-ViT]** Training Vision Transformers from Scratch on ImageNet [[paper](] [[code](]
- **[THUNDR]** THUNDR: Transformer-Based 3D Human Reconstruction With Markers [[paper](]
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [[paper](]
- **[PVT]** Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [[paper](] [[code](]
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [[paper](]
- **[GLiT]** GLiT: Neural Architecture Search for Global and Local Image Transformer [[paper](]
- **[TRAR]** TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [[paper](]
- **[UniT]** UniT: Multimodal Multitask Learning With a Unified Transformer [[paper](] [[code](]
- Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [[paper](]
- Transformer-Based Dual Relation Graph for Multi-Label Image Recognition [[paper](]
- **[LocalTrans]** LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [[paper](]
- Improving 3D Object Detection With Channel-Wise Transformer [[paper](]
- A Latent Transformer for Disentangled Face Editing in Images and Videos [[paper](] [[code](]
- **[GroupFormer]** GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [[paper](]
- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [[paper](]
- **[WB-DETR]** WB-DETR: Transformer-Based Detector Without Backbone [[paper](]
- The Animation Transformer: Visual Correspondence via Segment Matching [[paper](]
- Relaxed Transformer Decoders for Direct Action Proposal Generation [[paper](]
- **[PPT-Net]** Pyramid Point Cloud Transformer for Large-Scale Place Recognition [[paper](] [[code](]
- Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [[paper](]
- Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [[paper](]
- Image Harmonization With Transformer [[paper](] [[code](]
- **[COTR]** COTR: Correspondence Transformer for Matching Across Images [[paper](]
- **[MUSIQ]** MUSIQ: Multi-Scale Image Quality Transformer [[paper](]
- Episodic Transformer for Vision-and-Language Navigation [[paper](]
- Action-Conditioned 3D Human Motion Synthesis With Transformer VAE [[paper](]
- **[CrackFormer]** CrackFormer: Transformer Network for Fine-Grained Crack Detection [[paper](]
- **[HiT]** HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [[paper](]
- Event-Based Video Reconstruction Using Transformer [[paper](]
- **[STVGBert]** STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [[paper](]
- **[HiFT]** HiFT: Hierarchical Feature Transformer for Aerial Tracking [[paper](] [[code](]
- **[DocFormer]** DocFormer: End-to-End Transformer for Document Understanding [[paper](]
- **[LeViT]** LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [[paper](] [[code](]
- **[SignBERT]** SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [[paper](]
- **[VidTr]** VidTr: Video Transformer Without Convolutions [[paper](]
- **[ACTOR]** Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [[paper](]
- **[Segmenter]** Segmenter: Transformer for Semantic Segmentation [[paper](] [[code](]
- **[Visformer]** Visformer: The Vision-friendly Transformer [[paper](] [[code](]
- **[PnP-DETR]** PnP-DETR: Towards Efficient Visual Analysis with Transformers (**ICCV**) [[paper](] [[code](]
- **[VoTr]** Voxel Transformer for 3D Object Detection [[paper](]
- **[TransVG]** TransVG: End-to-End Visual Grounding with Transformers [[paper](]
- **[3DETR]** An End-to-End Transformer Model for 3D Object Detection [[paper](] [[code](]
- **[Eformer]** Eformer: Edge Enhancement based Transformer for Medical Image Denoising [[paper](]
- **[TransFER]** TransFER: Learning Relation-aware Facial Expression Representations with Transformers [[paper](]
- **[Oriented RCNN]** Oriented Object Detection with Transformer [[paper](]
- **[ViViT]** ViViT: A Video Vision Transformer [[paper](]
- **[Stark]** Learning Spatio-Temporal Transformer for Visual Tracking [[paper](] [[code](]
- **[CT3D]** Improving 3D Object Detection with Channel-wise Transformer [[paper](]
- **[VST]** Visual Saliency Transformer [[paper](]
- **[PiT]** Rethinking Spatial Dimensions of Vision Transformers [[paper](] [[code](]
- **[CrossViT]** CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [[paper](] [[code](]
- **[PointTransformer]** Point Transformer [[paper](]
- **[TS-CAM]** TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [[paper](] [[code](]
- **[VTs]** Visual Transformers: Token-based Image Representation and Processing for Computer Vision [[paper](]
- **[TransDepth]** Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction [[paper](] [[code](]
- **[Conditional DETR]** Conditional DETR for Fast Training Convergence [[paper](] [[code](]
- **[PIT]** PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [[paper](] [[code](]
- **[SOTR]** SOTR: Segmenting Objects with Transformers [[paper](] [[code](]
- **[SnowflakeNet]** SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [[paper](] [[code](]
- **[TransPose]** TransPose: Keypoint Localization via Transformer [[paper](] [[code](]
- **[TransReID]** TransReID: Transformer-based Object Re-Identification [[paper](] [[code](]
- **[CWT]** Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [[paper](] [[code](]
- Anticipative Video Transformer [[paper](] [[code](]
- Rethinking and Improving Relative Position Encoding for Vision Transformer [[paper](] [[code](]
- Vision Transformer with Progressive Sampling [[paper](] [[code](]
- **[SMCA]** Fast Convergence of DETR with Spatially Modulated Co-Attention [[paper](] [[code](]
- **[AutoFormer]** AutoFormer: Searching Transformers for Visual Recognition [[paper](] [[code](]

- Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [[paper](]
- **[HOTR]** HOTR: End-to-End Human-Object Interaction Detection with Transformers (**oral**) [[paper](]
- **[METRO]** End-to-End Human Pose and Mesh Reconstruction with Transformers [[paper](]
- **[LETR]** Line Segment Detection Using Transformers without Edges [[paper](]
- **[TransFuser]** Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [[paper](] [[code](]
- Pose Recognition with Cascade Transformers [[paper](]
- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [[paper](]
- **[LoFTR]** LoFTR: Detector-Free Local Feature Matching with Transformers [[paper](] [[code](]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](]
- **[SETR]** Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [[paper](] [[code](]
- **[TransT]** Transformer Tracking [[paper](] [[code](]
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (**oral**) [[paper](]
- **[VisTR]** End-to-End Video Instance Segmentation with Transformers [[paper](]
- Transformer Interpretability Beyond Attention Visualization [[paper](] [[code](]
- **[IPT]** Pre-Trained Image Processing Transformer [[paper](]
- **[UP-DETR]** UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [[paper](]
- **[IQT]** Perceptual Image Quality Assessment with Transformers (**workshop**) [[paper](]
- High-Resolution Complex Scene Synthesis with Transformers (**workshop**) [[paper](]
- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](] [[code](]

- Generative Video Transformer: Can Objects be the Words? [[paper](]
- **[GANsformer]** Generative Adversarial Transformers [[paper](] [[code](]

- **[NDT-Transformer]** NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [[paper](]

- **[VTNet]** VTNet: Visual Transformer Network for Object Goal Navigation [[paper](]
- **[Vision Transformer]** An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[paper](] [[code](]
- **[Deformable DETR]** Deformable DETR: Deformable Transformers for End-to-End Object Detection [[paper](] [[code](]

**ACM MM**
- Video Transformer for Deepfake Detection with Incremental Learning [[paper](]
- **[HAT]** HAT: Hierarchical Aggregation Transformers for Person Re-identification [[paper](]
- Token Shift Transformer for Video Classification [[paper](] [[code](]
- **[DPT]** DPT: Deformable Patch-based Transformer for Visual Recognition [[paper](] [[code](]

- **[UTNet]** UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [[paper](] [[code](]
- **[MedT]** Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [[paper](] [[code](]
- **[MCTrans]** Multi-Compound Transformer for Accurate Biomedical Image Segmentation [[paper](] [[code](]
- **[PNS-Net]** Progressively Normalized Self-Attention Network for Video Polyp Segmentation [[paper](] [[code](]
- **[MBT-Net]** A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [[paper](]

- **[ACT]** End-to-End Object Detection with Adaptive Clustering Transformer [[paper](]
- **[GSRTR]** Grounded Situation Recognition with Transformers [[paper](] [[code](]
- **[TransFusion]** TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [[paper](] [[code](]

- VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (**ISIE**) [[paper](]

- **[DETR3D]** DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [[paper](]

- Medical Image Segmentation using Squeeze-and-Expansion Transformers [[paper](]

- **[YOGO]** You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module (**IROS**) [[paper](] [[code](]
- **[PTT]** PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds [[paper](] [[code](]

- **[LSTR]** End-to-end Lane Shape Prediction with Transformers [[paper](] [[code](]

- Vision Transformer for Fast and Efficient Scene Text Recognition [[paper](]

### 2020

- **[DETR]** End-to-End Object Detection with Transformers (**ECCV**) [[paper](] [[code](]
- **[FPT]** Feature Pyramid Transformer (**CVPR**) [[paper](] [[code](]

### Other resource
- [[Awesome-Transformer-Attention](]

### Acknowledgement

Thanks for the template from [Awesome-Crowd-Counting](