# __Awesome Vision Transformer Collection__
__Variants of Vision Transformer and Vision Transformer for Downstream Tasks__

author: Runwei Guan

affiliation: University of Liverpool / JITRI-Institute of Deep Perception Technology

email: [email protected] / [email protected] / [email protected]

## Image Backbone
* Vision Transformer [paper](https://arxiv.org/abs/2010.11929) [code](https://github.com/google-research/vision_transformer)
* Swin Transformer [paper](https://arxiv.org/abs/2103.14030) [code](https://github.com/microsoft/Swin-Transformer)
* Swin Transformer V2: Scaling Up Capacity and Resolution [paper](https://arxiv.org/abs/2111.09883) [code](https://github.com/microsoft/Swin-Transformer)
* DVT [paper](https://arxiv.org/abs/2105.15075) [code](https://github.com/blackfeather-wang/Dynamic-Vision-Transformer)
* PVT [paper](https://arxiv.org/abs/2102.12122) [code](https://github.com/whai362/PVT)
* LVT: Lite Vision Transformer with Enhanced Self-Attention [paper](https://arxiv.org/abs/2112.10809)
* PiT [paper](https://arxiv.org/abs/2103.16302) [code](https://github.com/naver-ai/pit)
* Twins [paper](https://arxiv.org/abs/2104.13840) [code](https://github.com/Meituan-AutoML/Twins)
* TNT [paper](https://arxiv.org/abs/2103.00112) [code](https://github.com/lucidrains/transformer-in-transformer)
* Mobile-ViT [paper](https://arxiv.org/abs/2110.02178) [code](https://github.com/chinhsuanwu/mobilevit-pytorch)
* Cross-ViT [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Chen_CrossViT_Cross-Attention_Multi-Scale_Vision_Transformer_for_Image_Classification_ICCV_2021_paper.html) [code](https://github.com/IBM/CrossViT)
* LeViT [paper](https://arxiv.org/pdf/2104.01136.pdf) [code](https://github.com/facebookresearch/LeViT)
* ViT-Lite [paper](https://arxiv.org/pdf/2104.05704.pdf)
* Refiner [paper](https://arxiv.org/pdf/2106.03714.pdf) [code](https://github.com/zhoudaquan/Refiner_ViT)
* DeepViT [paper](https://arxiv.org/pdf/2103.11886.pdf) [code](https://github.com/zhoudaquan/dvit_repo)
* CaiT [paper](https://arxiv.org/pdf/2103.17239.pdf) [code](https://github.com/facebookresearch/deit)
* LV-ViT [paper](https://arxiv.org/pdf/2104.10858.pdf) [code](https://github.com/zihangJiang/TokenLabeling)
* DeiT [paper](https://arxiv.org/pdf/2012.12877.pdf) [code](https://github.com/facebookresearch/deit)
* CeiT [paper](https://arxiv.org/pdf/2103.11816.pdf) [code](https://github.com/rishikksh20/CeiT-pytorch)
* BoTNet [paper](https://arxiv.org/abs/2101.11605)
* ViTAE [paper](https://arxiv.org/abs/2106.03348)
* Visformer: The Vision-Friendly Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Chen_Visformer_The_Vision-Friendly_Transformer_ICCV_2021_paper.html) [code](https://github.com/danczs/Visformer)
* Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [paper](https://arxiv.org/abs/2112.03552)
* AdaViT: Adaptive Tokens for Efficient Vision Transformer [paper](https://arxiv.org/abs/2112.07658)
* Improved Multiscale Vision Transformers for Classification and Detection [paper](https://arxiv.org/abs/2112.01526)
* Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zhang_Multi-Scale_Vision_Longformer_A_New_Vision_Transformer_for_High-Resolution_Image_ICCV_2021_paper.html)
* Towards End-to-End Image Compression and Analysis with Transformers [paper](https://arxiv.org/abs/2112.09300)
* MPViT: Multi-Path Vision Transformer for Dense Prediction [paper](https://arxiv.org/abs/2112.11010)
* PolyViT: Co-training Vision Transformers on Images, Videos and Audio [paper](https://arxiv.org/abs/2111.12993)
* MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation [paper](https://arxiv.org/abs/2112.11542)
* ELSA: Enhanced Local Self-Attention for Vision Transformer [paper](https://arxiv.org/abs/2112.12786)
* Vision Transformer for Small-Size Datasets [paper](https://arxiv.org/abs/2112.13492)
* SimViT: Exploring a Simple Vision Transformer with sliding windows [paper](https://arxiv.org/abs/2112.13492)
* SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [paper](https://arxiv.org/abs/2112.13890)
* Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space [paper](https://arxiv.org/abs/2201.00814)
* Vision Transformer with Deformable Attention [paper](https://arxiv.org/abs/2201.00520) [code](https://github.com/LeapLabTHU/DAT)
* PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture [paper](https://arxiv.org/abs/2201.00978)
* QuadTree Attention for Vision Transformers [paper](https://arxiv.org/abs/2201.02767) [code](https://github.com/Tangshitao/QuadtreeAttention)
* TerViT: An Efficient Ternary Vision Transformer [paper](https://arxiv.org/abs/2201.08050)
* BViT: Broad Attention based Vision Transformer [paper](https://arxiv.org/abs/2202.06268)
* CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction [paper](https://arxiv.org/abs/2203.04570)
* EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [paper](https://arxiv.org/abs/2203.03952)
* Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention [paper](https://arxiv.org/abs/2203.03937)
* Coarse-to-Fine Vision Transformer [paper](https://arxiv.org/abs/2203.03821)
* ViT-P: Rethinking Data-efficient Vision Transformers from Locality [paper](https://arxiv.org/abs/2203.02358)
* Event Transformer [paper](https://arxiv.org/abs/2204.05172)
* DaViT: Dual Attention Vision Transformers [paper](https://arxiv.org/abs/2204.03645)
* LightViT: Towards Light-Weight Convolution-Free Vision Transformers [paper](https://arxiv.org/abs/2207.05557)
* UniNet: Unified Architecture Search with Convolution, Transformer, and MLP [paper](https://arxiv.org/abs/2207.05420)
* Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [paper](https://arxiv.org/abs/2207.04978)
* EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [paper](https://arxiv.org/abs/2206.10589)
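
For orientation, here is a minimal PyTorch sketch of the patch-embedding + encoder pipeline that the backbones above build on. All sizes and hyperparameters are illustrative, not taken from any specific paper.

```python
# Minimal ViT-style backbone sketch (illustrative hyperparameters, not from any specific paper).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided conv splits the image into non-overlapping patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=4 * dim,
                                                   batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed       # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                             # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))   # -> (2, 1000)
```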

## Multi-label Classification
* Graph Attention Transformer Network for Multi-Label Image Classification [paper](https://arxiv.org/abs/2203.04049)

## Point Cloud Processing
* Point Cloud Transformer [paper](https://arxiv.org/pdf/2012.09688.pdf)
* Point Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zhao_Point_Transformer_ICCV_2021_paper.html)
* Fast Point Transformer [paper](https://arxiv.org/abs/2112.04702)
* Adaptive Channel Encoding Transformer for Point Cloud Analysis [paper](https://arxiv.org/abs/2112.02507)
* PTTR: Relational 3D Point Cloud Object Tracking with Transformer [paper](https://arxiv.org/abs/2112.02857)
* Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction [paper](https://arxiv.org/abs/2112.09385)
* LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling [paper](https://arxiv.org/abs/2202.06263)
* Geometric Transformer for Fast and Robust Point Cloud Registration [paper](https://arxiv.org/abs/2202.06688)
* HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud [paper](https://arxiv.org/abs/2204.05481)

## Video Processing
* Video Transformers: A Survey [paper](https://arxiv.org/abs/2201.05991)
* ViViT: A Video Vision Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Arnab_ViViT_A_Video_Vision_Transformer_ICCV_2021_paper.html)
* Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos [paper](https://arxiv.org/abs/2112.08117)
* LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach [paper](https://arxiv.org/abs/2112.10066)
* Video Joint Modelling Based on Hierarchical Transformer for Co-summarization [paper](https://arxiv.org/abs/2112.13478)
* InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer [paper](https://arxiv.org/abs/2112.15320)
* TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [paper](https://arxiv.org/abs/2201.05047)
* Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [paper](https://arxiv.org/abs/2201.04676)
* Multiview Transformers for Video Recognition [paper](https://arxiv.org/abs/2201.04288)
* MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [paper](https://arxiv.org/abs/2201.08383)
* Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval [paper](https://arxiv.org/abs/2202.06014)
* A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [paper](https://arxiv.org/abs/2203.04708)
* Learning Trajectory-Aware Transformer for Video Super-Resolution [paper](https://arxiv.org/abs/2204.04216)
* Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [paper](https://arxiv.org/abs/2204.03638)

## Model Compression
* A Unified Pruning Framework for Vision Transformers [paper](https://arxiv.org/abs/2111.15127)
* Multi-Dimensional Model Compression of Vision Transformer [paper](https://arxiv.org/abs/2201.00043)
* Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression [paper](https://arxiv.org/abs/2203.02452)

## Transfer Learning & Pretraining
* Pre-Trained Image Processing Transformer [paper](https://arxiv.org/abs/2012.00364) [code](https://github.com/huawei-noah/Pretrained-IPT)
* UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper](https://arxiv.org/abs/2011.09094) [code](https://github.com/dddzg/up-detr)
* BEVT: BERT Pretraining of Video Transformers [paper](https://arxiv.org/abs/2112.01529)
* Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [paper](https://arxiv.org/abs/2112.07074)
* On Efficient Transformer and Image Pre-training for Low-level Vision [paper](https://arxiv.org/abs/2112.10175)
* Pre-Training Transformers for Domain Adaptation [paper](https://arxiv.org/abs/2112.09965)
* RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training [paper](https://arxiv.org/abs/2201.06857)
* Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification [paper](https://arxiv.org/abs/2203.04771)
* DiT: Self-supervised Pre-training for Document Image Transformer [paper](https://arxiv.org/abs/2203.02378)
* Underwater Image Enhancement Using Pre-trained Transformer [paper](https://arxiv.org/abs/2204.04199)

## Multi-Modal
* Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [paper](https://arxiv.org/abs/2104.09224)
* Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [paper](https://arxiv.org/abs/2112.04446)
* LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [paper](https://arxiv.org/abs/2112.02244)
* MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection [paper](https://arxiv.org/abs/2112.01177)
* Visual-Semantic Transformer for Scene Text Recognition [paper](https://arxiv.org/abs/2112.00948)
* Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [paper](https://arxiv.org/abs/2112.07074)
* LaTr: Layout-Aware Transformer for Scene-Text VQA [paper](https://arxiv.org/abs/2112.12494)
* Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding [paper](https://arxiv.org/abs/2112.12180)
* Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation [paper](https://arxiv.org/abs/2112.14088)
* Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg [paper](https://arxiv.org/abs/2112.14100)
* On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering [paper](https://arxiv.org/abs/2201.03965)
* DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers [paper](https://arxiv.org/abs/2202.04053)
* CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers [paper](https://arxiv.org/abs/2203.04838)
* VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [paper](https://arxiv.org/abs/2203.04099)
* Knowledge Amalgamation for Object Detection with Transformers [paper](https://arxiv.org/abs/2203.03187)
* Are Multimodal Transformers Robust to Missing Modality? [paper](https://arxiv.org/abs/2204.05454)
* Self-supervised Vision Transformers for Joint SAR-optical Representation Learning [paper](https://arxiv.org/abs/2204.05381)
* Video Graph Transformer for Video Question Answering [paper](https://arxiv.org/abs/2207.05342)

## Detection
* YOLOS: You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [paper](https://arxiv.org/abs/2106.00666) [code](https://github.com/hustvl/YOLOS)
* WB-DETR: Transformer-Based Detector without Backbone [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_WB-DETR_Transformer-Based_Detector_Without_Backbone_ICCV_2021_paper.html)
* UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper](https://arxiv.org/abs/2011.09094)
* TSP: Rethinking Transformer-based Set Prediction for Object Detection [paper](https://arxiv.org/abs/2011.10881)
* DETR [paper](https://arxiv.org/abs/2005.12872) [code](https://github.com/facebookresearch/detr)
* Deformable DETR [paper](https://arxiv.org/abs/2010.04159) [code](https://github.com/fundamentalvision/Deformable-DETR)
* DN-DETR: Accelerate DETR Training by Introducing Query DeNoising [paper](https://arxiv.org/abs/2203.01305) [code](https://github.com/FengLi-ust/DN-DETR)
* Rethinking Transformer-Based Set Prediction for Object Detection [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sun_Rethinking_Transformer-Based_Set_Prediction_for_Object_Detection_ICCV_2021_paper.html)
* End-to-End Object Detection with Adaptive Clustering Transformer [paper](https://arxiv.org/abs/2011.09315)
* An End-to-End Transformer Model for 3D Object Detection [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Misra_An_End-to-End_Transformer_Model_for_3D_Object_Detection_ICCV_2021_paper.html)
* End-to-End Human Object Interaction Detection with HOI Transformer [paper](https://arxiv.org/abs/2103.04503) [code](https://github.com/bbepoch/HoiTransformer)
* Adaptive Image Transformer for One-Shot Object Detection [paper](https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Adaptive_Image_Transformer_for_One-Shot_Object_Detection_CVPR_2021_paper.html)
* Improving 3D Object Detection With Channel-Wise Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sheng_Improving_3D_Object_Detection_With_Channel-Wise_Transformer_ICCV_2021_paper.html)
* TransPose: Keypoint Localization via Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Yang_TransPose_Keypoint_Localization_via_Transformer_ICCV_2021_paper.html)
* Voxel Transformer for 3D Object Detection [paper](https://arxiv.org/abs/2109.02497)
* Embracing Single Stride 3D Object Detector with Sparse Transformer [paper](https://arxiv.org/abs/2112.06375)
* OW-DETR: Open-world Detection Transformer [paper](https://arxiv.org/abs/2112.01513)
* A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [paper](https://arxiv.org/abs/2112.09747)
* Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence [paper](https://arxiv.org/abs/2112.13310)
* Short Range Correlation Transformer for Occluded Person Re-Identification [paper](https://arxiv.org/abs/2201.01090)
* TransVPR: Transformer-based place recognition with multi-level attention aggregation [paper](https://arxiv.org/abs/2201.02001)
* Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond [paper](https://arxiv.org/abs/2201.03176)
* Arbitrary Shape Text Detection using Transformers [paper](https://arxiv.org/abs/2202.11221)
* A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial transformer network [paper](https://arxiv.org/abs/2203.04822)
* A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [paper](https://arxiv.org/abs/2203.04708)
* Knowledge Amalgamation for Object Detection with Transformers [paper](https://arxiv.org/abs/2203.03187)
* SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection [paper](https://arxiv.org/abs/2204.05585)
* POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition [paper](https://arxiv.org/abs/2204.04083)
* PSTR: End-to-End One-Step Person Search With Transformers [paper](https://arxiv.org/abs/2204.03340)
* Scaling Novel Object Detection with Weakly Supervised Detection Transformers [paper](https://arxiv.org/abs/2207.05205)
* OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers [paper](https://arxiv.org/abs/2207.02255)
* Exploring Plain Vision Transformer Backbones for Object Detection [paper](https://arxiv.org/abs/2203.16527)
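
Most of the DETR-family detectors above share a set-prediction step that matches object queries to ground-truth boxes with the Hungarian algorithm. Below is a simplified sketch of that matching step; the L1-only box cost and the weights are illustrative (the full DETR cost also includes a generalized-IoU term).

```python
# Sketch of DETR-style bipartite matching between query predictions and ground truth.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, cls_w=1.0, box_w=5.0):
    """pred_logits: (Q, C+1), pred_boxes: (Q, 4), gt_labels: (G,), gt_boxes: (G, 4)."""
    prob = pred_logits.softmax(-1)                     # per-query class probabilities
    cost_cls = -prob[:, gt_labels]                     # (Q, G): higher prob => lower cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G): L1 distance between boxes
    cost = cls_w * cost_cls + box_w * cost_box
    q_idx, g_idx = linear_sum_assignment(cost.detach().numpy())
    return q_idx, g_idx                                # matched (query, ground-truth) index pairs

# Toy usage: 100 queries, 4 classes (+1 no-object), 3 ground-truth boxes.
q, g = hungarian_match(torch.randn(100, 5), torch.rand(100, 4),
                       torch.tensor([0, 2, 3]), torch.rand(3, 4))
```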

## Segmentation
* Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation [paper](https://arxiv.org/pdf/1909.11065v6.pdf) [code](https://github.com/openseg-group/openseg.pytorch?v=2)
* Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention [paper](https://arxiv.org/pdf/2201.01615v1.pdf) [code](https://github.com/yan-hao-tian/lawin)
* MaX-DeepLab: End-to-End Panoptic Segmentation With Mask Transformers [paper](https://arxiv.org/abs/2012.00759) [code](https://github.com/google-research/deeplab2)
* Line Segment Detection Using Transformers without Edges [paper](https://arxiv.org/abs/2101.01909)
* VisTR: End-to-End Video Instance Segmentation with Transformers [paper](https://arxiv.org/abs/2011.14503) [code](https://github.com/Epiphqny/VisTR)
* SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [paper](https://arxiv.org/abs/2012.15840) [code](https://github.com/fudan-zvg/SETR)
* Segmenter: Transformer for Semantic Segmentation [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Strudel_Segmenter_Transformer_for_Semantic_Segmentation_ICCV_2021_paper.html)
* Fully Transformer Networks for Semantic Image Segmentation [paper](https://arxiv.org/abs/2106.04108)
* SOTR: Segmenting Objects with Transformers [paper](https://arxiv.org/abs/2108.06747) [code](https://github.com/easton-cau/SOTR)
* GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation [paper](https://arxiv.org/abs/2112.02841)
* Masked-attention Mask Transformer for Universal Image Segmentation [paper](https://arxiv.org/abs/2112.01527)
* A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [paper](https://arxiv.org/abs/2112.09747)
* iSegFormer: Interactive Image Segmentation with Transformers [paper](https://arxiv.org/abs/2112.11325)
* SOIT: Segmenting Objects with Instance-Aware Transformers [paper](https://arxiv.org/abs/2112.11037)
* SeMask: Semantically Masked Transformers for Semantic Segmentation [paper](https://arxiv.org/abs/2112.12782)
* Siamese Network with Interactive Transformer for Video Object Segmentation [paper](https://arxiv.org/abs/2112.13983)
* Pyramid Fusion Transformer for Semantic Segmentation [paper](https://arxiv.org/abs/2201.04019)
* Swin transformers make strong contextual encoders for VHR image road extraction [paper](https://arxiv.org/abs/2201.03178)
* Transformers in Action: Weakly Supervised Action Segmentation [paper](https://arxiv.org/abs/2201.05675)
* Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation [paper](https://arxiv.org/abs/2202.06498)
* Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers [paper](https://arxiv.org/abs/2203.02664)
* Contextual Attention Network: Transformer Meets U-Net [paper](https://arxiv.org/abs/2203.01932)
* TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [paper](https://arxiv.org/abs/2204.05525)
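
Several of the ViT-based segmenters above (e.g., the SETR line) decode by reshaping patch tokens back into a 2D feature map and upsampling to per-pixel logits. A minimal sketch of that idea, with illustrative shapes and layer sizes:

```python
# Sketch of a naive SETR-style segmentation head on top of ViT patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveSegHead(nn.Module):
    def __init__(self, dim=192, num_classes=21, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(dim, num_classes, 1))

    def forward(self, tokens, img_hw):
        """tokens: (B, N, dim) patch tokens (no [CLS]); img_hw: original (H, W)."""
        B, N, D = tokens.shape
        h, w = img_hw[0] // self.patch_size, img_hw[1] // self.patch_size
        x = tokens.transpose(1, 2).reshape(B, D, h, w)   # tokens -> 2D feature map
        x = self.proj(x)                                 # per-patch class logits
        return F.interpolate(x, size=img_hw, mode="bilinear", align_corners=False)

# Toy usage with a 224x224 image and 16x16 patches (14*14 = 196 tokens).
logits = NaiveSegHead()(torch.randn(2, 196, 192), (224, 224))   # -> (2, 21, 224, 224)
```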

## Pose Estimation
* Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation [paper](https://cse.buffalo.edu/~jsyuan/papers/2020/4836.pdf)
* HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation [paper](https://cse.buffalo.edu/~jsyuan/papers/2020/lin_mm20.pdf)
* End-to-End Human Pose and Mesh Reconstruction with Transformers [paper](https://arxiv.org/pdf/2012.09760.pdf) [code](https://github.com/microsoft/MeshTransformer)
* PE-former: Pose Estimation Transformer [paper](https://arxiv.org/abs/2112.04981)
* Pose Recognition with Cascade Transformers [paper](https://arxiv.org/abs/2104.06976) [code](https://github.com/mlpc-ucsd/PRTR)
* Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer [paper](https://arxiv.org/abs/2112.02466)
* Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [paper](https://arxiv.org/abs/2112.07374)
* Temporal Transformer Networks with Self-Supervision for Action Recognition [paper](https://arxiv.org/abs/2112.07338)
* Co-training Transformer with Videos and Images Improves Action Recognition [paper](https://arxiv.org/abs/2112.07175)
* DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer [paper](https://arxiv.org/abs/2112.08775)
* Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition [paper](https://arxiv.org/abs/2201.02849)
* Motion-Aware Transformer For Occluded Person Re-identification [paper](https://arxiv.org/abs/2202.04243)
* HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders [paper](https://arxiv.org/abs/2202.03548)
* ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [paper](https://arxiv.org/abs/2202.11423)
* Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding [paper](https://arxiv.org/abs/2203.05156)
* Spatial Transformer Network on Skeleton-based Gait Recognition [paper](https://arxiv.org/abs/2204.03873)

## Tracking and Trajectory Prediction
* Transformer Tracking [paper](https://arxiv.org/abs/2103.15436) [code](https://github.com/chenxin-dlut/TransT)
* Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking [paper](https://arxiv.org/abs/2103.11681)
* MOTR: End-to-End Multiple-Object Tracking with TRansformer [paper](https://arxiv.org/abs/2105.03247) [code](https://github.com/megvii-model/MOTR)
* SwinTrack: A Simple and Strong Baseline for Transformer Tracking [paper](https://arxiv.org/abs/2112.00995)
* Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network [paper](https://arxiv.org/abs/2112.06624)
* PTTR: Relational 3D Point Cloud Object Tracking with Transformer [paper](https://arxiv.org/abs/2112.02857)
* Efficient Visual Tracking with Exemplar Transformers [paper](https://arxiv.org/abs/2112.09686)
* TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer [paper](https://arxiv.org/abs/2202.03183)

## Generative Model and Denoising
* 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zhao_3DVG-Transformer_Relation_Modeling_for_Visual_Grounding_on_Point_Clouds_ICCV_2021_paper.html)
* Spatial-Temporal Transformer for Dynamic Scene Graph Generation [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Cong_Spatial-Temporal_Transformer_for_Dynamic_Scene_Graph_Generation_ICCV_2021_paper.html)
* THUNDR: Transformer-Based 3D Human Reconstruction With Markers [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zanfir_THUNDR_Transformer-Based_3D_Human_Reconstruction_With_Markers_ICCV_2021_paper.html)
* DoodleFormer: Creative Sketch Drawing with Transformers [paper](https://arxiv.org/abs/2112.03258)
* Image Transformer [paper](https://arxiv.org/abs/1802.05751)
* Taming Transformers for High-Resolution Image Synthesis [paper](https://arxiv.org/abs/2012.09841) [code](https://github.com/CompVis/taming-transformers)
* TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [paper](https://arxiv.org/abs/2102.07074) [code](https://github.com/VITA-Group/TransGAN)
* U2-Former: A Nested U-shaped Transformer for Image Restoration [paper](https://arxiv.org/abs/2112.02279)
* Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers [paper](https://arxiv.org/abs/2112.09685)
* SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers [paper](https://arxiv.org/abs/2112.09426)
* StyleSwin: Transformer-based GAN for High-resolution Image Generation [paper](https://arxiv.org/abs/2112.10762)
* Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction [paper](https://arxiv.org/abs/2112.13528)
* SGTR: End-to-end Scene Graph Generation with Transformer [paper](https://arxiv.org/abs/2112.12970)
* Flow-Guided Sparse Transformer for Video Deblurring [paper](https://arxiv.org/abs/2201.01893)
* Spherical Transformer [paper](https://arxiv.org/abs/2202.04942)
* MaskGIT: Masked Generative Image Transformer [paper](https://arxiv.org/abs/2202.04200)
* Entroformer: A Transformer-based Entropy Model for Learned Image Compression [paper](https://arxiv.org/abs/2202.05492)
* UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation [paper](https://arxiv.org/abs/2203.02557)
* Stripformer: Strip Transformer for Fast Image Deblurring [paper](https://arxiv.org/abs/2204.04627)
* Vision Transformers for Single Image Dehazing [paper](https://arxiv.org/abs/2204.03883)
* Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [paper](https://arxiv.org/abs/2204.03638)

## Self-Supervised Learning
* Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [paper](https://arxiv.org/abs/2103.13061) [code](https://github.com/amzn/image-to-recipe-transformers)
* iGPT [paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf) [code](https://github.com/openai/image-gpt)
* An Empirical Study of Training Self-Supervised Vision Transformers [paper](https://arxiv.org/abs/2104.02057) [code](https://github.com/facebookresearch/moco-v3)
* Self-supervised Video Transformer [paper](https://arxiv.org/abs/2112.01514)
* TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning [paper](https://arxiv.org/abs/2112.01030)
* TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [paper](https://arxiv.org/abs/2112.08643)
* Transformers in Action: Weakly Supervised Action Segmentation [paper](https://arxiv.org/abs/2201.05675)
* Motion-Aware Transformer For Occluded Person Re-identification [paper](https://arxiv.org/abs/2202.04243)
* Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [paper](https://arxiv.org/abs/2202.03131)
* Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut [paper](https://arxiv.org/abs/2202.11539)
* Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers [paper](https://arxiv.org/abs/2203.03682)
* Multi-class Token Transformer for Weakly Supervised Semantic Segmentation [paper](https://arxiv.org/abs/2203.02891)
* Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers [paper](https://arxiv.org/abs/2203.02664)
* DiT: Self-supervised Pre-training for Document Image Transformer [paper](https://arxiv.org/abs/2203.02378)
* Self-supervised Vision Transformers for Joint SAR-optical Representation Learning [paper](https://arxiv.org/abs/2204.05381)
* DILEMMA: Self-Supervised Shape and Texture Learning with Transformers [paper](https://arxiv.org/abs/2204.04788)
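
Many of the works above train ViTs with a contrastive objective between two augmented views. The sketch below is a generic, simplified InfoNCE loss (shared encoder, no momentum branch or predictor head, illustrative temperature), not the exact recipe of any listed paper.

```python
# Generic InfoNCE sketch for self-supervised training on two augmented views.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """z1, z2: (B, D) embeddings of two views of the same batch of images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) cosine-similarity logits
    targets = torch.arange(z1.size(0))            # positives sit on the diagonal
    # Symmetrized cross-entropy: each view predicts its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```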

## Depth and Height Estimation
* Disentangled Latent Transformer for Interpretable Monocular Height Estimation [paper](https://arxiv.org/abs/2201.06357)
* Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [paper](https://arxiv.org/abs/2202.03131)
* SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification [paper](https://arxiv.org/abs/2207.04224)

## Explainable
* Development and testing of an image transformer for explainable autonomous driving systems [paper](https://arxiv.org/abs/2110.05559)
* Transformer Interpretability Beyond Attention Visualization [paper](https://arxiv.org/abs/2012.09838) [code](https://github.com/hila-chefer/Transformer-Explainability)
* How Do Vision Transformers Work? [paper](https://arxiv.org/abs/2202.06709)
* eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation [paper](https://arxiv.org/abs/2207.05358)
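
A common baseline for visualizing what a ViT attends to is attention rollout; the sketch below shows the generic baseline (head-averaged attention with the residual folded in), not the specific algorithm of any paper listed above.

```python
# Attention rollout: a simple baseline for visualizing ViT relevance.
import torch

def attention_rollout(attentions):
    """attentions: list of (B, heads, T, T) attention tensors, one per layer."""
    B, _, T, _ = attentions[0].shape
    rollout = torch.eye(T).expand(B, T, T).clone()
    for attn in attentions:
        a = attn.mean(dim=1)                       # average over heads -> (B, T, T)
        a = 0.5 * a + 0.5 * torch.eye(T)           # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)        # re-normalize rows
        rollout = a @ rollout                      # compose with earlier layers
    return rollout                                 # (B, T, T); row 0 ~ [CLS] relevance

maps = attention_rollout([torch.rand(1, 3, 197, 197).softmax(-1) for _ in range(4)])
```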

## Robustness
* Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [paper](https://arxiv.org/abs/2111.08413)

## Deep Reinforcement Learning
* Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels [paper](https://arxiv.org/abs/2204.04905)

## Calibration
* CTRL-C: Camera Calibration TRansformer With Line-Classification [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Lee_CTRL-C_Camera_Calibration_TRansformer_With_Line-Classification_ICCV_2021_paper.html) [code](https://github.com/jwlee-vcl/CTRL-C)

## Radar
* Learning class prototypes from Synthetic InSAR with Vision Transformers [paper](https://arxiv.org/abs/2201.03016)
* Radar Transformer [paper](https://www.mdpi.com/1424-8220/21/11/3854)

## Traffic
* SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers [paper](https://arxiv.org/abs/2201.06390)

## AI Medicine
* Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer [paper](https://arxiv.org/abs/2112.04894)
* 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis [paper](https://arxiv.org/abs/2112.04863)
* Hformer: Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks [paper](https://arxiv.org/abs/2112.05761)
* MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification [paper](https://arxiv.org/abs/2112.01767)
* MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer [paper](https://arxiv.org/abs/2112.13513)
* Generalized Wasserstein Dice Loss, Test-time Augmentation, and Transformers for the BraTS 2021 challenge [paper](https://arxiv.org/abs/2112.13054)
* D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation [paper](https://arxiv.org/abs/2201.00462)
* RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark [paper](https://arxiv.org/abs/2201.00466)
* Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images [paper](https://arxiv.org/abs/2201.01266)
* Swin Transformer for Fast MRI [paper](https://arxiv.org/abs/2201.03230) [code](https://github.com/ayanglab/SwinMR)
* Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are? [paper](https://arxiv.org/abs/2201.06251)
* ViTBIS: Vision Transformer for Biomedical Image Segmentation [paper](https://arxiv.org/abs/2201.05920)
* SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation [paper](https://arxiv.org/abs/2201.08582)
* Improving Across-Dataset Brain Tissue Segmentation Using Transformer [paper](https://arxiv.org/abs/2201.08741)
* Brain Cancer Survival Prediction on Treatment-naive MRI using Deep Anchor Attention Learning with Vision Transformer [paper](https://arxiv.org/abs/2202.01857)
* Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers [paper](https://arxiv.org/abs/2202.06076)
* AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation [paper](https://arxiv.org/abs/2202.06431)
* Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification [paper](https://arxiv.org/abs/2203.04614)
* Characterizing Renal Structures with 3D Block Aggregate Transformers [paper](https://arxiv.org/abs/2203.02430)
* Multimodal Transformer for Nursing Activity Recognition [paper](https://arxiv.org/abs/2204.04564)
* RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment [paper](https://arxiv.org/abs/2207.06177)
* Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays [paper](https://arxiv.org/abs/2207.04394)

## Hardware
* VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [paper](https://arxiv.org/abs/2201.06618)