# __Awesome Vision Transformer Collection__
__Variants of Vision Transformer and Vision Transformer for Downstream Tasks__

author: Runwei Guan

affiliation: University of Liverpool / JITRI-Institute of Deep Perception Technology

email: [email protected] / [email protected] / [email protected]

## Image Backbone
* Vision Transformer [paper](https://arxiv.org/abs/2010.11929) [code](https://github.com/google-research/vision_transformer)
* Swin Transformer [paper](https://arxiv.org/abs/2103.14030) [code](https://github.com/microsoft/Swin-Transformer)
* Swin Transformer V2: Scaling Up Capacity and Resolution [paper](https://arxiv.org/abs/2111.09883) [code](https://github.com/microsoft/Swin-Transformer)
* DVT [paper](https://arxiv.org/abs/2105.15075) [code](https://github.com/blackfeather-wang/Dynamic-Vision-Transformer)
* PVT [paper](https://arxiv.org/abs/2102.12122) [code](https://github.com/whai362/PVT)
* PiT [paper](https://arxiv.org/abs/2103.16302) [code](https://github.com/naver-ai/pit)
* Twins [paper](https://arxiv.org/abs/2104.13840) [code](https://github.com/Meituan-AutoML/Twins)
* TNT [paper](https://arxiv.org/abs/2103.00112) [code](https://github.com/lucidrains/transformer-in-transformer)
* Mobile-ViT [paper](https://arxiv.org/abs/2110.02178?context=cs.LG) [code](https://github.com/chinhsuanwu/mobilevit-pytorch)
* Cross-ViT [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Chen_CrossViT_Cross-Attention_Multi-Scale_Vision_Transformer_for_Image_Classification_ICCV_2021_paper.html) [code](https://github.com/IBM/CrossViT)
* LeViT [paper](https://arxiv.org/pdf/2104.01136.pdf) [code](https://github.com/facebookresearch/LeViT)
* ViT-Lite [paper](https://arxiv.org/pdf/2104.05704.pdf)
* Refiner [paper](https://arxiv.org/pdf/2106.03714.pdf) [code](https://github.com/zhoudaquan/Refiner_ViT)
* DeepViT [paper](https://arxiv.org/pdf/2103.11886.pdf) [code](https://github.com/zhoudaquan/dvit_repo)
* CaiT [paper](https://arxiv.org/pdf/2103.17239.pdf) [code](https://github.com/facebookresearch/deit)
* LV-ViT [paper](https://arxiv.org/pdf/2104.10858.pdf) [code](https://github.com/zihangJiang/TokenLabeling)
* DeiT [paper](https://arxiv.org/pdf/2012.12877.pdf) [code](https://github.com/facebookresearch/deit)
* CeiT [paper](https://arxiv.org/pdf/2103.11816.pdf) [code](https://github.com/rishikksh20/CeiT-pytorch)
* BoTNet [paper](https://arxiv.org/abs/2101.11605)
* ViTAE [paper](https://arxiv.org/abs/2106.03348)
* Visformer: The Vision-Friendly Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Chen_Visformer_The_Vision-Friendly_Transformer_ICCV_2021_paper.html) [code](https://github.com/danczs/Visformer)
* Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [paper](https://arxiv.org/abs/2112.03552)
* AdaViT: Adaptive Tokens for Efficient Vision Transformer [paper](https://arxiv.org/abs/2112.07658)
* Improved Multiscale Vision Transformers for Classification and Detection [paper](https://arxiv.org/abs/2112.01526)
* Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zhang_Multi-Scale_Vision_Longformer_A_New_Vision_Transformer_for_High-Resolution_Image_ICCV_2021_paper.html)
* Towards End-to-End Image Compression and Analysis with Transformers [paper](https://arxiv.org/abs/2112.09300)
* MPViT: Multi-Path Vision Transformer for Dense Prediction [paper](https://arxiv.org/abs/2112.11010)
* LVT: Lite Vision Transformer with Enhanced Self-Attention [paper](https://arxiv.org/abs/2112.10809)
* PolyViT: Co-training Vision Transformers on Images, Videos and Audio [paper](https://arxiv.org/abs/2111.12993)
* MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation [paper](https://arxiv.org/abs/2112.11542)
* ELSA: Enhanced Local Self-Attention for Vision Transformer [paper](https://arxiv.org/abs/2112.12786)
* Vision Transformer for Small-Size Datasets [paper](https://arxiv.org/abs/2112.13492)
* SimViT: Exploring a Simple Vision Transformer with sliding windows [paper](https://arxiv.org/abs/2112.13492)
* SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [paper](https://arxiv.org/abs/2112.13890)
* Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space [paper](https://arxiv.org/abs/2201.00814)
* Vision Transformer with Deformable Attention [paper](https://arxiv.org/abs/2201.00520) [code](https://github.com/LeapLabTHU/DAT)
* PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture [paper](https://arxiv.org/abs/2201.00978)
* QuadTree Attention for Vision Transformers [paper](https://arxiv.org/abs/2201.02767) [code](https://github.com/Tangshitao/QuadtreeAttention)
* TerViT: An Efficient Ternary Vision Transformer [paper](https://arxiv.org/abs/2201.08050)
* BViT: Broad Attention based Vision Transformer [paper](https://arxiv.org/abs/2202.06268)
* CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction [paper](https://arxiv.org/abs/2203.04570)
* EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers [paper](https://arxiv.org/abs/2203.03952)
* Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention [paper](https://arxiv.org/abs/2203.03937)
* Coarse-to-Fine Vision Transformer [paper](https://arxiv.org/abs/2203.03821)
* ViT-P: Rethinking Data-efficient Vision Transformers from Locality [paper](https://arxiv.org/abs/2203.02358)
* Event Transformer [paper](https://arxiv.org/abs/2204.05172)
* DaViT: Dual Attention Vision Transformers [paper](https://arxiv.org/abs/2204.03645)
* LightViT: Towards Light-Weight Convolution-Free Vision Transformers [paper](https://arxiv.org/abs/2207.05557)
* UniNet: Unified Architecture Search with Convolution, Transformer, and MLP [paper](https://arxiv.org/abs/2207.05420)
* Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [paper](https://arxiv.org/abs/2207.04978)
* EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [paper](https://arxiv.org/abs/2206.10589)
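
Most of the backbones above share the same core recipe introduced by ViT: split the image into fixed-size patches, embed each patch as a token, and process the token sequence with a standard Transformer encoder. The sketch below illustrates that shared skeleton in PyTorch; it is a minimal, hedged example (names such as `TinyViT` are illustrative), not the implementation of any particular paper in this list.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + Transformer encoder."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution: one token per patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                       # (B, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)         # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)     # one [CLS] token per image
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                     # classify from the [CLS] token


if __name__ == "__main__":
    print(TinyViT()(torch.randn(2, 3, 224, 224)).shape)    # torch.Size([2, 1000])
```

Many of the variants above keep this skeleton and change one ingredient, for example the tokenization, the attention pattern (Swin's shifted windows), or the training recipe (DeiT's distillation).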

## Multi-label Classification
* Graph Attention Transformer Network for Multi-Label Image Classification [paper](https://arxiv.org/abs/2203.04049)

## Point Cloud Processing
* Point Cloud Transformer [paper](https://arxiv.org/pdf/2012.09688.pdf)
* Point Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zhao_Point_Transformer_ICCV_2021_paper.html)
* Fast Point Transformer [paper](https://arxiv.org/abs/2112.04702)
* Adaptive Channel Encoding Transformer for Point Cloud Analysis [paper](https://arxiv.org/abs/2112.02507)
* PTTR: Relational 3D Point Cloud Object Tracking with Transformer [paper](https://arxiv.org/abs/2112.02857)
* Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction [paper](https://arxiv.org/abs/2112.09385)
* LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling [paper](https://arxiv.org/abs/2202.06263)
* Geometric Transformer for Fast and Robust Point Cloud Registration [paper](https://arxiv.org/abs/2202.06688)
* HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud [paper](https://arxiv.org/abs/2204.05481)
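
The point-cloud transformers above treat each 3D point (or a local group of points) as a token and apply self-attention over the set. Below is a hedged, minimal sketch of the global-attention version of that idea (in the spirit of Point Cloud Transformer); the hierarchical grouping and local neighborhood attention used by Point Transformer and Fast Point Transformer are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn


class TinyPointTransformer(nn.Module):
    """Global self-attention over a point set, pooled for shape classification."""

    def __init__(self, dim=128, depth=3, heads=4, num_classes=40):
        super().__init__()
        self.embed = nn.Linear(3, dim)                     # per-point (x, y, z) embedding
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, points):                             # points: (B, N, 3)
        tokens = self.encoder(self.embed(points))          # (B, N, dim)
        # Max pooling keeps the prediction invariant to point ordering.
        return self.head(tokens.max(dim=1).values)


if __name__ == "__main__":
    print(TinyPointTransformer()(torch.randn(2, 1024, 3)).shape)  # torch.Size([2, 40])
```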

## Video Processing
* Video Transformers: A Survey [paper](https://arxiv.org/abs/2201.05991)
* ViViT: A Video Vision Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Arnab_ViViT_A_Video_Vision_Transformer_ICCV_2021_paper.html)
* Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos [paper](https://arxiv.org/abs/2112.08117)
* LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach [paper](https://arxiv.org/abs/2112.10066)
* Video Joint Modelling Based on Hierarchical Transformer for Co-summarization [paper](https://arxiv.org/abs/2112.13478)
* InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer [paper](https://arxiv.org/abs/2112.15320)
* TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [paper](https://arxiv.org/abs/2201.05047)
* Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [paper](https://arxiv.org/abs/2201.04676)
* Multiview Transformers for Video Recognition [paper](https://arxiv.org/abs/2201.04288)
* MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [paper](https://arxiv.org/abs/2201.08383)
* Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval [paper](https://arxiv.org/abs/2202.06014)
* A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [paper](https://arxiv.org/abs/2203.04708)
* Learning Trajectory-Aware Transformer for Video Super-Resolution [paper](https://arxiv.org/abs/2204.04216)
* Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [paper](https://arxiv.org/abs/2204.03638)

## Model Compression
* A Unified Pruning Framework for Vision Transformers [paper](https://arxiv.org/abs/2111.15127)
* Multi-Dimensional Model Compression of Vision Transformer [paper](https://arxiv.org/abs/2201.00043)
* Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression [paper](https://arxiv.org/abs/2203.02452)

## Transfer Learning & Pretraining
* Pre-Trained Image Processing Transformer [paper](https://arxiv.org/abs/2012.00364) [code](https://github.com/huawei-noah/Pretrained-IPT)
* UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper](https://arxiv.org/abs/2011.09094) [code](https://github.com/dddzg/up-detr)
* BEVT: BERT Pretraining of Video Transformers [paper](https://arxiv.org/abs/2112.01529)
* Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [paper](https://arxiv.org/abs/2112.07074)
* On Efficient Transformer and Image Pre-training for Low-level Vision [paper](https://arxiv.org/abs/2112.10175)
* Pre-Training Transformers for Domain Adaptation [paper](https://arxiv.org/abs/2112.09965)
* RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training [paper](https://arxiv.org/abs/2201.06857)
* Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification [paper](https://arxiv.org/abs/2203.04771)
* DiT: Self-supervised Pre-training for Document Image Transformer [paper](https://arxiv.org/abs/2203.02378)
* Underwater Image Enhancement Using Pre-trained Transformer [paper](https://arxiv.org/abs/2204.04199)

## Multi-Modal
* Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [paper](https://arxiv.org/abs/2104.09224)
* Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [paper](https://arxiv.org/abs/2112.04446)
* LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [paper](https://arxiv.org/abs/2112.02244)
* MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection [paper](https://arxiv.org/abs/2112.01177)
* Visual-Semantic Transformer for Scene Text Recognition [paper](https://arxiv.org/abs/2112.00948)
* Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [paper](https://arxiv.org/abs/2112.07074)
* LaTr: Layout-Aware Transformer for Scene-Text VQA [paper](https://arxiv.org/abs/2112.12494)
* Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding [paper](https://arxiv.org/abs/2112.12180)
* Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation [paper](https://arxiv.org/abs/2112.14088)
* Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg [paper](https://arxiv.org/abs/2112.14100)
* On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering [paper](https://arxiv.org/abs/2201.03965)
* DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers [paper](https://arxiv.org/abs/2202.04053)
* CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers [paper](https://arxiv.org/abs/2203.04838)
* VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [paper](https://arxiv.org/abs/2203.04099)
* Knowledge Amalgamation for Object Detection with Transformers [paper](https://arxiv.org/abs/2203.03187)
* Are Multimodal Transformers Robust to Missing Modality? [paper](https://arxiv.org/abs/2204.05454)
* Self-supervised Vision Transformers for Joint SAR-optical Representation Learning [paper](https://arxiv.org/abs/2204.05381)
* Video Graph Transformer for Video Question Answering [paper](https://arxiv.org/abs/2207.05342)

## Detection
* YOLOS: You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [paper](https://arxiv.org/abs/2106.00666) [code](https://github.com/hustvl/YOLOS)
* WB-DETR: Transformer-Based Detector without Backbone [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_WB-DETR_Transformer-Based_Detector_Without_Backbone_ICCV_2021_paper.html)
* UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper](https://arxiv.org/abs/2011.09094)
* TSP: Rethinking Transformer-based Set Prediction for Object Detection [paper](https://arxiv.org/abs/2011.10881)
* DETR [paper](https://arxiv.org/abs/2005.12872) [code](https://github.com/facebookresearch/detr)
* Deformable DETR [paper](https://arxiv.org/abs/2010.04159) [code](https://github.com/fundamentalvision/Deformable-DETR)
* DN-DETR: Accelerate DETR Training by Introducing Query DeNoising [paper](https://arxiv.org/abs/2203.01305) [code](https://github.com/FengLi-ust/DN-DETR)
* Rethinking Transformer-Based Set Prediction for Object Detection [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sun_Rethinking_Transformer-Based_Set_Prediction_for_Object_Detection_ICCV_2021_paper.html)
* End-to-End Object Detection with Adaptive Clustering Transformer [paper](https://arxiv.org/abs/2011.09315)
* An End-to-End Transformer Model for 3D Object Detection [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Misra_An_End-to-End_Transformer_Model_for_3D_Object_Detection_ICCV_2021_paper.html)
* End-to-End Human Object Interaction Detection with HOI Transformer [paper](https://arxiv.org/abs/2103.04503) [code](https://github.com/bbepoch/HoiTransformer)
* Adaptive Image Transformer for One-Shot Object Detection [paper](https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Adaptive_Image_Transformer_for_One-Shot_Object_Detection_CVPR_2021_paper.html)
* Improving 3D Object Detection With Channel-Wise Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sheng_Improving_3D_Object_Detection_With_Channel-Wise_Transformer_ICCV_2021_paper.html)
* TransPose: Keypoint Localization via Transformer [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Yang_TransPose_Keypoint_Localization_via_Transformer_ICCV_2021_paper.html)
* Voxel Transformer for 3D Object Detection [paper](https://arxiv.org/abs/2109.02497)
* Embracing Single Stride 3D Object Detector with Sparse Transformer [paper](https://arxiv.org/abs/2112.06375)
* OW-DETR: Open-world Detection Transformer [paper](https://arxiv.org/abs/2112.01513)
* A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [paper](https://arxiv.org/abs/2112.09747)
* Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence [paper](https://arxiv.org/abs/2112.13310)
* Short Range Correlation Transformer for Occluded Person Re-Identification [paper](https://arxiv.org/abs/2201.01090)
* TransVPR: Transformer-based place recognition with multi-level attention aggregation [paper](https://arxiv.org/abs/2201.02001)
* Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond [paper](https://arxiv.org/abs/2201.03176)
* Arbitrary Shape Text Detection using Transformers [paper](https://arxiv.org/abs/2202.11221)
* A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial transformer network [paper](https://arxiv.org/abs/2203.04822)
* A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [paper](https://arxiv.org/abs/2203.04708)
* Knowledge Amalgamation for Object Detection with Transformers [paper](https://arxiv.org/abs/2203.03187)
* SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection [paper](https://arxiv.org/abs/2204.05585)
* POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition [paper](https://arxiv.org/abs/2204.04083)
* PSTR: End-to-End One-Step Person Search With Transformers [paper](https://arxiv.org/abs/2204.03340)
* Scaling Novel Object Detection with Weakly Supervised Detection Transformers [paper](https://arxiv.org/abs/2207.05205)
* OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers [paper](https://arxiv.org/abs/2207.02255)
* Exploring Plain Vision Transformer Backbones for Object Detection [paper](https://arxiv.org/abs/2203.16527)
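
Several of the detectors above follow the DETR set-prediction recipe: flatten a convolutional feature map into tokens, run a Transformer encoder-decoder over them, and decode a fixed set of learned object queries into class logits and normalized boxes. The sketch below shows that recipe at a glance, assuming PyTorch; it omits the Hungarian matching loss and positional encodings the real models rely on, and the module names are illustrative.

```python
import torch
import torch.nn as nn


class TinyDETR(nn.Module):
    """DETR-style set prediction: CNN features -> Transformer -> object queries."""

    def __init__(self, dim=256, num_queries=100, num_classes=91):
        super().__init__()
        # Stand-in backbone: a few strided convs producing a coarse feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.class_head = nn.Linear(dim, num_classes + 1)   # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)                    # (cx, cy, w, h) in [0, 1]

    def forward(self, images):                               # images: (B, 3, H, W)
        feats = self.backbone(images)                        # (B, dim, h, w)
        src = feats.flatten(2).transpose(1, 2)               # (B, h*w, dim) tokens
        tgt = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, tgt)                      # (B, num_queries, dim)
        return self.class_head(hs), self.box_head(hs).sigmoid()


if __name__ == "__main__":
    logits, boxes = TinyDETR()(torch.randn(2, 3, 256, 256))
    print(logits.shape, boxes.shape)  # (2, 100, 92) and (2, 100, 4)
```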

## Segmentation
* Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation [paper](https://arxiv.org/pdf/1909.11065v6.pdf) [code](https://github.com/openseg-group/openseg.pytorch?v=2)
* Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention [paper](https://arxiv.org/pdf/2201.01615v1.pdf) [code](https://github.com/yan-hao-tian/lawin)
* MaX-DeepLab: End-to-End Panoptic Segmentation With Mask Transformers [paper](https://arxiv.org/abs/2012.00759) [code](https://github.com/google-research/deeplab2)
* Line Segment Detection Using Transformers without Edges [paper](https://arxiv.org/abs/2101.01909)
* VisTR: End-to-End Video Instance Segmentation with Transformers [paper](https://arxiv.org/abs/2011.14503) [code](https://github.com/Epiphqny/VisTR)
* SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [paper](https://arxiv.org/abs/2012.15840) [code](https://github.com/fudan-zvg/SETR)
* Segmenter: Transformer for Semantic Segmentation [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Strudel_Segmenter_Transformer_for_Semantic_Segmentation_ICCV_2021_paper.html)
* Fully Transformer Networks for Semantic Image Segmentation [paper](https://arxiv.org/abs/2106.04108)
* SOTR: Segmenting Objects with Transformers [paper](https://arxiv.org/abs/2108.06747) [code](https://github.com/easton-cau/SOTR)
* GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation [paper](https://arxiv.org/abs/2112.02841)
* Masked-attention Mask Transformer for Universal Image Segmentation [paper](https://arxiv.org/abs/2112.01527)
* A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [paper](https://arxiv.org/abs/2112.09747)
* iSegFormer: Interactive Image Segmentation with Transformers [paper](https://arxiv.org/abs/2112.11325)
* SOIT: Segmenting Objects with Instance-Aware Transformers [paper](https://arxiv.org/abs/2112.11037)
* SeMask: Semantically Masked Transformers for Semantic Segmentation [paper](https://arxiv.org/abs/2112.12782)
* Siamese Network with Interactive Transformer for Video Object Segmentation [paper](https://arxiv.org/abs/2112.13983)
* Pyramid Fusion Transformer for Semantic Segmentation [paper](https://arxiv.org/abs/2201.04019)
* Swin transformers make strong contextual encoders for VHR image road extraction [paper](https://arxiv.org/abs/2201.03178)
* Transformers in Action: Weakly Supervised Action Segmentation [paper](https://arxiv.org/abs/2201.05675)
* Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation [paper](https://arxiv.org/abs/2202.06498)
* Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers [paper](https://arxiv.org/abs/2203.02664)
* Contextual Attention Network: Transformer Meets U-Net [paper](https://arxiv.org/abs/2203.01932)
* TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [paper](https://arxiv.org/abs/2204.05525)
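
SETR, Segmenter, and several other entries above turn segmentation into sequence encoding: the image is tokenized as in ViT, encoded with a Transformer, and the patch tokens are reshaped back into a 2D grid for per-pixel prediction. A hedged, minimal sketch of that pattern follows (with a deliberately naive 1x1-conv decoder); names and sizes are assumptions, not taken from any listed repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySegViT(nn.Module):
    """ViT encoder whose patch tokens are reshaped into a 2D grid for dense prediction."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=21):
        super().__init__()
        self.patch_size = patch_size
        self.patch_embed = nn.Conv2d(3, dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, (image_size // patch_size) ** 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Conv2d(dim, num_classes, kernel_size=1)   # naive per-token classifier

    def forward(self, x):                                    # x: (B, 3, H, W), H = W = image_size
        B, _, H, W = x.shape
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        h, w = H // self.patch_size, W // self.patch_size
        feat = tokens.transpose(1, 2).reshape(B, -1, h, w)          # back to a 2D feature grid
        logits = self.decoder(feat)                                  # (B, classes, h, w)
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    print(TinySegViT()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 21, 224, 224])
```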

## Pose Estimation
* Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation [paper](https://cse.buffalo.edu/~jsyuan/papers/2020/4836.pdf)
* HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation [paper](https://cse.buffalo.edu/~jsyuan/papers/2020/lin_mm20.pdf)
* End-to-End Human Pose and Mesh Reconstruction with Transformers [paper](https://arxiv.org/pdf/2012.09760.pdf) [code](https://github.com/microsoft/MeshTransformer)
* PE-former: Pose Estimation Transformer [paper](https://arxiv.org/abs/2112.04981)
* Pose Recognition with Cascade Transformers [paper](https://arxiv.org/abs/2104.06976) [code](https://github.com/mlpc-ucsd/PRTR)
* Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer [paper](https://arxiv.org/abs/2112.02466)
* Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [paper](https://arxiv.org/abs/2112.07374)
* Temporal Transformer Networks with Self-Supervision for Action Recognition [paper](https://arxiv.org/abs/2112.07338)
* Co-training Transformer with Videos and Images Improves Action Recognition [paper](https://arxiv.org/abs/2112.07175)
* DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer [paper](https://arxiv.org/abs/2112.08775)
* Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition [paper](https://arxiv.org/abs/2201.02849)
* Motion-Aware Transformer For Occluded Person Re-identification [paper](https://arxiv.org/abs/2202.04243)
* HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders [paper](https://arxiv.org/abs/2202.03548)
* ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers [paper](https://arxiv.org/abs/2202.11423)
* Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding [paper](https://arxiv.org/abs/2203.05156)
* Spatial Transformer Network on Skeleton-based Gait Recognition [paper](https://arxiv.org/abs/2204.03873)

## Tracking and Trajectory Prediction
* Transformer Tracking [paper](https://arxiv.org/abs/2103.15436) [code](https://github.com/chenxin-dlut/TransT)
* Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking [paper](https://arxiv.org/abs/2103.11681)
* MOTR: End-to-End Multiple-Object Tracking with TRansformer [paper](https://arxiv.org/abs/2105.03247) [code](https://github.com/megvii-model/MOTR)
* SwinTrack: A Simple and Strong Baseline for Transformer Tracking [paper](https://arxiv.org/abs/2112.00995)
* Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network [paper](https://arxiv.org/abs/2112.06624)
* PTTR: Relational 3D Point Cloud Object Tracking with Transformer [paper](https://arxiv.org/abs/2112.02857)
* Efficient Visual Tracking with Exemplar Transformers [paper](https://arxiv.org/abs/2112.09686)
* TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer [paper](https://arxiv.org/abs/2202.03183)

## Generative Model and Denoising
* 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zhao_3DVG-Transformer_Relation_Modeling_for_Visual_Grounding_on_Point_Clouds_ICCV_2021_paper.html)
* Spatial-Temporal Transformer for Dynamic Scene Graph Generation [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Cong_Spatial-Temporal_Transformer_for_Dynamic_Scene_Graph_Generation_ICCV_2021_paper.html)
* THUNDR: Transformer-Based 3D Human Reconstruction With Markers [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zanfir_THUNDR_Transformer-Based_3D_Human_Reconstruction_With_Markers_ICCV_2021_paper.html)
* DoodleFormer: Creative Sketch Drawing with Transformers [paper](https://arxiv.org/abs/2112.03258)
* Image Transformer [paper](https://arxiv.org/abs/1802.05751)
* Taming Transformers for High-Resolution Image Synthesis [paper](https://arxiv.org/abs/2012.09841) [code](https://github.com/CompVis/taming-transformers)
* TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [code](https://github.com/VITA-Group/TransGAN)
* U2-Former: A Nested U-shaped Transformer for Image Restoration [paper](https://arxiv.org/abs/2112.02279)
* Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers [paper](https://arxiv.org/abs/2112.09685)
* SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers [paper](https://arxiv.org/abs/2112.09426)
* StyleSwin: Transformer-based GAN for High-resolution Image Generation [paper](https://arxiv.org/abs/2112.10762)
* Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction [paper](https://arxiv.org/abs/2112.13528)
* SGTR: End-to-end Scene Graph Generation with Transformer [paper](https://arxiv.org/abs/2112.12970)
* Flow-Guided Sparse Transformer for Video Deblurring [paper](https://arxiv.org/abs/2201.01893)
* Spherical Transformer [paper](https://arxiv.org/abs/2202.04942)
* MaskGIT: Masked Generative Image Transformer [paper](https://arxiv.org/abs/2202.04200)
* Entroformer: A Transformer-based Entropy Model for Learned Image Compression [paper](https://arxiv.org/abs/2202.05492)
* UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation [paper](https://arxiv.org/abs/2203.02557)
* Stripformer: Strip Transformer for Fast Image Deblurring [paper](https://arxiv.org/abs/2204.04627)
* Vision Transformers for Single Image Dehazing [paper](https://arxiv.org/abs/2204.03883)
* Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [paper](https://arxiv.org/abs/2204.03638)

## Self-Supervised Learning
* Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [paper](https://arxiv.org/abs/2103.13061) [code](https://github.com/amzn/image-to-recipe-transformers)
* iGPT [paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf) [code](https://github.com/openai/image-gpt)
* An Empirical Study of Training Self-Supervised Vision Transformers [paper](https://arxiv.org/abs/2104.02057) [code](https://github.com/facebookresearch/moco-v3)
* Self-supervised Video Transformer [paper](https://arxiv.org/abs/2112.01514)
* TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning [paper](https://arxiv.org/abs/2112.01030)
* TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [paper](https://arxiv.org/abs/2112.08643)
* Transformers in Action: Weakly Supervised Action Segmentation [paper](https://arxiv.org/abs/2201.05675)
* Motion-Aware Transformer For Occluded Person Re-identification [paper](https://arxiv.org/abs/2202.04243)
* Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [paper](https://arxiv.org/abs/2202.03131)
* Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut [paper](https://arxiv.org/abs/2202.11539)
* Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers [paper](https://arxiv.org/abs/2203.03682)
* Multi-class Token Transformer for Weakly Supervised Semantic Segmentation [paper](https://arxiv.org/abs/2203.02891)
* Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers [paper](https://arxiv.org/abs/2203.02664)
* DiT: Self-supervised Pre-training for Document Image Transformer [paper](https://arxiv.org/abs/2203.02378)
* Self-supervised Vision Transformers for Joint SAR-optical Representation Learning [paper](https://arxiv.org/abs/2204.05381)
* DILEMMA: Self-Supervised Shape and Texture Learning with Transformers [paper](https://arxiv.org/abs/2204.04788)
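
Reconstructive pre-training, used by entries such as iGPT and RePre above, learns visual representations by hiding part of the input and regressing it back. The following is a simplified, hedged sketch of masked-patch reconstruction for a ViT encoder; it is not the procedure of any specific paper in this list, and all names, shapes, and ratios are illustrative.

```python
import torch
import torch.nn as nn


def masked_patch_loss(encoder, patch_embed, decoder, images, patch_size=16, mask_ratio=0.5):
    """Zero out a random subset of patch tokens and regress their raw pixels."""
    B, C, H, W = images.shape
    # Patchify the image into flattened pixel patches: (B, N, C * patch_size * patch_size).
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)

    tokens = patch_embed(images).flatten(2).transpose(1, 2)          # (B, N, dim)
    mask = torch.rand(B, tokens.size(1), device=images.device) < mask_ratio
    tokens = torch.where(mask.unsqueeze(-1), torch.zeros_like(tokens), tokens)

    recon = decoder(encoder(tokens))                                  # (B, N, patch pixels)
    return ((recon - patches) ** 2)[mask].mean()                      # loss only on masked patches


if __name__ == "__main__":
    dim, p = 192, 16
    patch_embed = nn.Conv2d(3, dim, p, stride=p)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, 3, 4 * dim, batch_first=True, norm_first=True), 2)
    decoder = nn.Linear(dim, 3 * p * p)
    loss = masked_patch_loss(encoder, patch_embed, decoder, torch.randn(2, 3, 224, 224))
    print(loss.item())
```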

## Depth and Height Estimation
* Disentangled Latent Transformer for Interpretable Monocular Height Estimation [paper](https://arxiv.org/abs/2201.06357)
* Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [paper](https://arxiv.org/abs/2202.03131)
* SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification [paper](https://arxiv.org/abs/2207.04224)

## Explainable
* Development and testing of an image transformer for explainable autonomous driving systems [paper](https://arxiv.org/abs/2110.05559)
* Transformer Interpretability Beyond Attention Visualization [paper](https://arxiv.org/abs/2012.09838) [code](https://github.com/hila-chefer/Transformer-Explainability)
* How Do Vision Transformers Work? [paper](https://arxiv.org/abs/2202.06709)
* eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation [paper](https://arxiv.org/abs/2207.05358)
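
A common baseline behind many ViT explanation methods is to trace how attention flows from the [CLS] token back to the input patches by composing the per-layer attention maps ("attention rollout"). The papers above go further (for example, relevance propagation beyond raw attention), but the baseline is easy to sketch; the input format assumed below (a list of per-layer attention maps of shape (batch, heads, tokens, tokens), [CLS] at index 0) is an illustrative assumption.

```python
import torch


def attention_rollout(attentions):
    """Compose per-layer attention maps and return [CLS] attention over patch tokens."""
    result = None
    for attn in attentions:
        attn = attn.mean(dim=1)                              # average over heads: (B, T, T)
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = attn + eye                                     # account for the residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)          # re-normalize rows
        result = attn if result is None else attn @ result    # compose with earlier layers
    return result[:, 0, 1:]                                   # (B, num_patches)


if __name__ == "__main__":
    layers = [torch.rand(2, 3, 197, 197).softmax(dim=-1) for _ in range(4)]
    print(attention_rollout(layers).shape)                    # torch.Size([2, 196])
```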

## Robustness
* Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [paper](https://arxiv.org/abs/2111.08413)

## Deep Reinforcement Learning
* Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels [paper](https://arxiv.org/abs/2204.04905)

## Calibration
* CTRL-C: Camera Calibration TRansformer With Line-Classification [paper](https://openaccess.thecvf.com/content/ICCV2021/html/Lee_CTRL-C_Camera_Calibration_TRansformer_With_Line-Classification_ICCV_2021_paper.html) [code](https://github.com/jwlee-vcl/CTRL-C)

## Radar
* Learning class prototypes from Synthetic InSAR with Vision Transformers [paper](https://arxiv.org/abs/2201.03016)
* Radar Transformer [paper](https://www.mdpi.com/1424-8220/21/11/3854)

## Traffic
* SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers [paper](https://arxiv.org/abs/2201.06390)

## AI Medicine
* Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer [paper](https://arxiv.org/abs/2112.04894)
* 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis [paper](https://arxiv.org/abs/2112.04863)
* Hformer: Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks [paper](https://arxiv.org/abs/2112.05761)
* MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification [paper](https://arxiv.org/abs/2112.01767)
* MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer [paper](https://arxiv.org/abs/2112.13513)
* Generalized Wasserstein Dice Loss, Test-time Augmentation, and Transformers for the BraTS 2021 challenge [paper](https://arxiv.org/abs/2112.13054)
* D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation [paper](https://arxiv.org/abs/2201.00462)
* RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark [paper](https://arxiv.org/abs/2201.00466)
* Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images [paper](https://arxiv.org/abs/2201.01266)
* Swin Transformer for Fast MRI [paper](https://arxiv.org/abs/2201.03230) [code](https://github.com/ayanglab/SwinMR)
* Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are? [paper](https://arxiv.org/abs/2201.06251)
* ViTBIS: Vision Transformer for Biomedical Image Segmentation [paper](https://arxiv.org/abs/2201.05920)
* SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation [paper](https://arxiv.org/abs/2201.08582)
* Improving Across-Dataset Brain Tissue Segmentation Using Transformer [paper](https://arxiv.org/abs/2201.08741)
* Brain Cancer Survival Prediction on Treatment-naive MRI using Deep Anchor Attention Learning with Vision Transformer [paper](https://arxiv.org/abs/2202.01857)
* Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers [paper](https://arxiv.org/abs/2202.06076)
* AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation [paper](https://arxiv.org/abs/2202.06431)
* Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification [paper](https://arxiv.org/abs/2203.04614)
* Characterizing Renal Structures with 3D Block Aggregate Transformers [paper](https://arxiv.org/abs/2203.02430)
* Multimodal Transformer for Nursing Activity Recognition [paper](https://arxiv.org/abs/2204.04564)
* RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment [paper](https://arxiv.org/abs/2207.06177)
* Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays [paper](https://arxiv.org/abs/2207.04394)

## Hardware
* VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [paper](https://arxiv.org/abs/2201.06618)