{"id":13409088,"url":"https://github.com/dk-liang/Awesome-Visual-Transformer","last_synced_at":"2025-03-14T14:30:57.546Z","repository":{"id":37159666,"uuid":"318087408","full_name":"dk-liang/Awesome-Visual-Transformer","owner":"dk-liang","description":"Collect some papers about transformer with vision. Awesome Transformer with Computer Vision (CV)","archived":false,"fork":false,"pushed_at":"2023-05-24T07:34:10.000Z","size":171,"stargazers_count":3267,"open_issues_count":1,"forks_count":390,"subscribers_count":106,"default_branch":"main","last_synced_at":"2024-05-18T20:52:49.082Z","etag":null,"topics":["detr","transformer","transformer-awesome","transformer-cv","transformer-with-cv","visual-transformer"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dk-liang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-12-03T05:40:09.000Z","updated_at":"2024-05-18T15:21:14.000Z","dependencies_parsed_at":"2022-07-12T16:14:06.689Z","dependency_job_id":"e89e3552-0f6b-4c82-8097-ee7cd13451e2","html_url":"https://github.com/dk-liang/Awesome-Visual-Transformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dk-liang%2FAwesome-Visual-Transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dk-liang%2FAwesome-Visual-Transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dk-liang%2FAwesome-Visual-Transformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/dk-liang%2FAwesome-Visual-Transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dk-liang","download_url":"https://codeload.github.com/dk-liang/Awesome-Visual-Transformer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243593283,"owners_count":20316159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["detr","transformer","transformer-awesome","transformer-cv","transformer-with-cv","visual-transformer"],"created_at":"2024-07-30T20:00:57.924Z","updated_at":"2025-03-14T14:30:57.525Z","avatar_url":"https://github.com/dk-liang.png","language":null,"funding_links":[],"categories":["Acknowledgement","Uncategorized","References","Awesome Lists","Table of Contents","Learning Resources\u003ca title=\"Suggest an addition to the list!\" href=\"https://forms.gle/aPA41GT5AmbxrTwq5\"\u003e\u003cimg alt=\"Click button to suggest an addition\" align=\"right\" src=\"https://raw.githubusercontent.com/AI4LAM/awesome-ai4lam/main/.graphics/suggest-addition-small.svg\"\u003e\u003c/a\u003e","Others","Acknowledgment","其他_机器视觉","Computer Vision ##","Other Lists","Transformer","Reference","Multimodal, Vision-Language, and Generative AI","Foundation Models","Awesome Computer Vision"],"sub_categories":["Guidelines","Uncategorized","arXiv papers (From oldest to latest)","Attention for Others","Other \"awesome\" lists in AI and ML","Others","Other Video Tasks","网络服务_其他","TeX Lists","2020","Computer Vision","Misc resources"],"readme":"# Awesome Visual-Transformer 
[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\n\nA collection of papers on Transformers in computer vision (CV). \n\nIf you find an overlooked paper, please open an issue or, preferably, a pull request.\n\n## Papers\n\n### Transformer original paper\n\n- [Attention is All You Need](https://arxiv.org/abs/1706.03762) (NIPS 2017)\n\n### Technical blog\n\n- [English Blog] Transformers in Vision [[Link](https://davide-coccomini.medium.com/)]\n- [Chinese Blog] A 30,000-Character Guide to Getting Started with Vision Transformers [[Link](https://zhuanlan.zhihu.com/p/308301901)]\n- [Chinese Blog] Vision Transformer: A Very Detailed Walkthrough (Principles + Code) [[Link](https://zhuanlan.zhihu.com/p/348593638)]\n\n### Survey\n  - Multimodal learning with transformers: A survey (IEEE TPAMI) [[paper](https://arxiv.org/abs/2206.06488)]  - 2023.05.11\n  - A Survey of Visual Transformers [[paper](https://arxiv.org/abs/2111.06091)]  - 2021.11.30\n  - Transformers in Vision: A Survey [[paper](https://arxiv.org/abs/2101.01169)]   - 2021.02.22\n  - A Survey on Visual Transformer [[paper](https://arxiv.org/abs/2012.12556)]   - 2021.1.30\n  - A Survey of Transformers  [[paper](https://arxiv.org/abs/2106.04554)]   - 2021.6.09\n\n### arXiv papers\n- **[Superpoint Transformer]** Efficient 3D Semantic Segmentation with Superpoint Transformer [[paper](https://arxiv.org/abs/2306.08045)] [[code](https://github.com/drprojects/superpoint_transformer)]\n- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive [[paper](https://arxiv.org/abs/2305.04722)]\n- **[FocusedDecoder]** Focused Decoding Enables 3D Anatomical Detection by Transformers [[paper](https://arxiv.org/abs/2207.10774v4)] [[code](https://github.com/bwittmann/transoar)]\n- **[TAG]** TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [[paper](https://arxiv.org/abs/2208.01813)] [[code](https://github.com/HenryJunW/TAG)]\n- **[FastMETRO]** 
Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [[paper](https://arxiv.org/abs/2207.13820)] [[code](https://github.com/postech-ami/FastMETRO)]\n- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [[paper](https://arxiv.org/abs/2203.01522)] [[code](https://github.com/zhihou7/BatchFormer)]\n- **[RelViT]** RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [[paper]](https://arxiv.org/pdf/2204.11167.pdf) [[code]](https://github.com/NVlabs/RelViT)\n- **[MViTv2]** Improved Multiscale Vision Transformers for Classification and Detection [[paper](https://arxiv.org/pdf/2112.01526.pdf)] [[code](https://github.com/facebookresearch/mvit)]\n- DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [[paper](https://arxiv.org/pdf/2203.03605.pdf)] [[code](https://github.com/IDEACVR/DINO)]\n- Three things everyone should know about Vision Transformers [[paper](https://arxiv.org/pdf/2203.09795.pdf)] \n- **[DeiT III]** DeiT III: Revenge of the ViT [[paper](https://arxiv.org/pdf/2204.07118.pdf)] \n- **[DaViT]** DaViT: Dual Attention Vision Transformers\n[[paper](https://arxiv.org/pdf/2204.03645.pdf)] [[code](https://github.com/dingmyu/davit)]\n- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition\n[[paper](https://arxiv.org/abs/2203.16518)] [[code](https://github.com/jhcho99/CoFormer)]\n- **[GSRTR]** Grounded Situation Recognition with Transformers\n[[paper](https://arxiv.org/abs/2111.10135)] [[code](https://github.com/jhcho99/gsrtr)]\n- **[MaxViT]** MaxViT: Multi-Axis Vision Transformer [[paper]](https://arxiv.org/abs/2204.01697)\n- **[V2X-ViT]** V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [[paper]](https://arxiv.org/abs/2203.10638)\n- **[MemMC-MAE]** Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder 
[[paper](https://arxiv.org/abs/2203.11725)] [[code](https://github.com/tianyu0207/MemMC-MAE)]\n- Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [[paper](https://arxiv.org/abs/2203.12121)] [[code](https://github.com/tianyu0207/weakly-polyp)]\n- **[VideoMAE]** VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [[paper](https://arxiv.org/abs/2203.12602)] [[code](https://github.com/MCG-NJU/VideoMAE)]\n- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [[paper](https://arxiv.org/pdf/2111.12710.pdf)]\n- ResViT: Residual vision transformers for multi-modal medical image synthesis [[paper](https://arxiv.org/abs/2106.16031)]\n- **[CrossEfficientViT]** Combining EfficientNet and Vision Transformers for Video Deepfake Detection [[paper](https://arxiv.org/abs/2107.02612)] [[code](https://github.com/davide-coccomini/Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection)]\n- **[Discrete ViT]** Discrete Representations Strengthen Vision Transformer Robustness [[paper](https://arxiv.org/abs/2111.10493)]\n- **[StyleSwin]** StyleSwin: Transformer-based GAN for High-resolution Image Generation [[paper](https://arxiv.org/abs/2112.10762)] [[code](https://github.com/microsoft/StyleSwin)]\n- **[SReT]** Sliced Recursive Transformer [[paper](https://arxiv.org/abs/2111.05297)] [[code](https://github.com/szq0214/SReT)]\n- Dynamic Token Normalization Improves Vision Transformer [[paper](https://arxiv.org/abs/2112.02624)]\n- TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? 
[[paper](https://arxiv.org/abs/2106.11297)] [[code](https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner)]\n- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [[paper](https://arxiv.org/abs/2111.08413)]\n- **[ORViT]** Object-Region Video Transformers [[paper](https://arxiv.org/abs/2110.06915)] [[code](https://roeiherz.github.io/ORViT/)]\n- Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [[paper](https://arxiv.org/abs/2110.05092)] [[code](https://github.com/lelexx/MTF-Transformer)]\n- **[NViT]** NViT: Vision Transformer Compression and Parameter Redistribution [[paper](https://arxiv.org/abs/2110.04869)]\n- 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [[paper](https://arxiv.org/abs/2110.04792)]\n- Adversarial Token Attacks on Vision Transformers [[paper](https://arxiv.org/abs/2110.04337)]\n- Contextual Transformer Networks for Visual Recognition [[paper](https://arxiv.org/pdf/2107.12292.pdf)] [[code](https://github.com/JDAI-CV/CoTNet)]\n- **[TranSalNet]** TranSalNet: Visual saliency prediction using transformers [[paper](https://arxiv.org/abs/2110.03593)]\n- **[MobileViT]** MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [[paper](https://arxiv.org/abs/2110.02178)]\n- A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [[paper](https://arxiv.org/abs/2110.01240)]\n- **[3D-Transformer]** 3D-Transformer: Molecular Representation with Transformer in 3D Space [[paper](https://arxiv.org/abs/2110.01191)]\n- **[CCTrans]** CCTrans: Simplifying and Improving Crowd Counting with Transformer [[paper](https://arxiv.org/abs/2109.14483)]\n- **[UFO-ViT]** UFO-ViT: High Performance Linear Vision Transformer without Softmax [[paper](https://arxiv.org/abs/2109.14382)]\n- Sparse Spatial Transformers for Few-Shot Learning 
[[paper](https://arxiv.org/abs/2109.12932)]\n- Vision Transformer Hashing for Image Retrieval [[paper](https://arxiv.org/abs/2109.12564)]\n- **[OH-Former]** OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [[paper](https://arxiv.org/abs/2109.11159)]\n- **[Pix2seq]** Pix2seq: A Language Modeling Framework for Object Detection [[paper](https://arxiv.org/abs/2109.10852)]\n- **[CoAtNet]** CoAtNet: Marrying Convolution and Attention for All Data Sizes [[paper](https://arxiv.org/pdf/2106.04803.pdf)]\n- **[LOTR]** LOTR: Face Landmark Localization Using Localization Transformer [[paper](https://arxiv.org/abs/2109.10057)]\n- Transformer-Unet: Raw Image Processing with Unet [[paper](https://arxiv.org/abs/2109.08417)]\n- **[GraFormer]** GraFormer: Graph Convolution Transformer for 3D Pose Estimation [[paper](https://arxiv.org/abs/2109.08364)]\n- **[CDTrans]** CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [[paper](https://arxiv.org/abs/2109.06165)]\n- PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [[paper](https://arxiv.org/abs/2109.05566)] [[code](https://github.com/OPEN-AIR-SUN/PQ-Transformer)]\n- Anchor DETR: Query Design for Transformer-Based Detector [[paper](https://arxiv.org/abs/2109.07107)] [[code](https://github.com/megvii-model/AnchorDETR)]\n- **[DAB-DETR]** DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [[paper](https://arxiv.org/abs/2201.12329)] [[code](https://github.com/IDEA-opensource/DAB-DETR)]\n- **[ESRT]** Efficient Transformer for Single Image Super-Resolution [[paper](https://arxiv.org/abs/2108.11084)]\n- **[MaskFormer]** MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [[paper](http://arxiv.org/abs/2107.06278)] [[code](https://github.com/facebookresearch/MaskFormer)]\n- **[SwinIR]** SwinIR: Image Restoration Using Swin Transformer [[paper](https://arxiv.org/abs/2108.10257)] [[code](https://github.com/JingyunLiang/SwinIR)]\n- 
**[Trans4Trans]** Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [[paper](https://arxiv.org/abs/2108.09174)]\n- Do Vision Transformers See Like Convolutional Neural Networks? [[paper](https://arxiv.org/abs/2108.08810)]\n- Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [[paper](https://arxiv.org/abs/2108.07851)]\n- Light Field Image Super-Resolution with Transformers [[paper](https://arxiv.org/abs/2108.07597)] [[code](https://github.com/ZhengyuLiang24/LFT)]\n- Focal Self-attention for Local-Global Interactions in Vision Transformers [[paper](https://arxiv.org/abs/2107.00641)] [[code](https://github.com/microsoft/Focal-Transformer)]\n- Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [[paper](https://arxiv.org/abs/2108.06932)] [[code](https://github.com/DengPingFan/Polyp-PVT)]\n- Mobile-Former: Bridging MobileNet and Transformer [[paper](https://arxiv.org/abs/2108.05895)]\n- **[TriTransNet]** TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [[paper](https://arxiv.org/abs/2108.03798)]\n- **[PSViT]** PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [[paper](https://arxiv.org/abs/2108.03428)]\n- Boosting Few-shot Semantic Segmentation with Transformers [[paper](https://arxiv.org/abs/2108.02266)] [[code](https://github.com/GuoleiSun/TRFS)]\n- Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [[paper](https://arxiv.org/abs/2108.00584)]\n- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [[paper](https://arxiv.org/abs/2108.01390)]\n- **[Styleformer]** Styleformer: Transformer based Generative Adversarial Networks with Style Vector [[paper](https://arxiv.org/abs/2106.07023)] [[code](https://github.com/Jeeseung-Park/Styleformer)]\n- **[CMT]** CMT: Convolutional Neural Networks Meet Vision Transformers 
[[paper](https://arxiv.org/abs/2107.06263)]\n- **[TransAttUnet]** TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2107.05274)]\n- TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [[paper](https://arxiv.org/abs/2107.05188)]\n- **[ViTGAN]** ViTGAN: Training GANs with Vision Transformers [[paper](https://arxiv.org/abs/2107.04589)]\n- What Makes for Hierarchical Vision Transformer? [[paper](https://arxiv.org/abs/2107.02174)]\n- **[Trans4Trans]** Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [[paper](https://arxiv.org/abs/2107.03172)] \n- **[FFVT]** Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [[paper](https://arxiv.org/abs/2107.02341)] \n- **[TransformerFusion]** TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [[paper](https://arxiv.org/abs/2107.02191)]\n- Escaping the Big Data Paradigm with Compact Transformers [[paper](https://arxiv.org/pdf/2104.05704.pdf)]\n- How to train your ViT? 
Data, Augmentation, and Regularization in Vision Transformers [[paper](https://arxiv.org/pdf/2106.10270.pdf)]\n- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [[paper](https://arxiv.org/pdf/2105.02358.pdf)]\n- **[XCiT]** XCiT: Cross-Covariance Image Transformers [[paper](https://arxiv.org/pdf/2106.09681.pdf)] [[code](https://github.com/facebookresearch/xcit)]\n- Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [[paper](https://arxiv.org/abs/2106.03650)] [[code](https://github.com/mulinmeng/Shuffle-Transformer)]\n- Video Swin Transformer [[paper](https://arxiv.org/abs/2106.13230)] [[code](https://github.com/SwinTransformer/Video-Swin-Transformer)]\n- **[VOLO]** VOLO: Vision Outlooker for Visual Recognition [[paper](https://arxiv.org/abs/2106.13112)] [[code](https://github.com/sail-sg/volo)]\n- Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [[paper](https://arxiv.org/abs/2106.12413)] \n- End-to-end Temporal Action Detection with Transformer [[paper](https://arxiv.org/abs/2106.10271)] [[code](https://github.com/xlliu7/TadTR)]\n- How to train your ViT? 
Data, Augmentation, and Regularization in Vision Transformers [[paper](https://arxiv.org/abs/2106.10270)]\n- Efficient Self-supervised Vision Transformers for Representation Learning [[paper](https://arxiv.org/abs/2106.09785)]\n- Space-time Mixing Attention for Video Transformer [[paper](https://arxiv.org/abs/2106.05968)]\n- Transformed CNNs: recasting pre-trained convolutional layers with self-attention [[paper](https://arxiv.org/abs/2106.05795)]\n- **[CAT]** CAT: Cross Attention in Vision Transformer [[paper](https://arxiv.org/abs/2106.05786)]\n- Scaling Vision Transformers [[paper](https://arxiv.org/abs/2106.04560)]\n- **[DETReg]** DETReg: Unsupervised Pretraining with Region Priors for Object Detection [[paper](https://arxiv.org/abs/2106.04550)] [[code](https://amirbar.net/detreg)]\n- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [[paper](https://arxiv.org/abs/2106.04533)]\n- **[MViT]** MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [[paper](https://arxiv.org/abs/2106.04520)]\n- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [[paper](https://arxiv.org/abs/2106.04263)]\n- On Improving Adversarial Transferability of Vision Transformers [[paper](https://arxiv.org/abs/2106.04169)]\n- Fully Transformer Networks for Semantic Image Segmentation [[paper](https://arxiv.org/abs/2106.04108)]\n- Visual Transformer for Task-aware Active Learning [[paper](https://arxiv.org/abs/2106.03801)] [[code](https://github.com/razvancaramalau/Visual-Transformer-for-Task-aware-Active-Learning)]\n- Efficient Training of Visual Transformers with Small-Size Datasets [[paper](https://arxiv.org/abs/2106.03746)] \n- Reveal of Vision Transformers Robustness against Adversarial Attacks [[paper](https://arxiv.org/abs/2106.03734)]\n- Person Re-Identification with a Locally Aware Transformer [[paper](https://arxiv.org/abs/2106.03720)]\n- **[Refiner]** Refiner: Refining Self-attention for Vision 
Transformers [[paper](https://arxiv.org/abs/2106.03714)]\n- **[ViTAE]** ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [[paper](https://arxiv.org/abs/2106.03348)]\n- Video Instance Segmentation using Inter-Frame Communication Transformers [[paper](https://arxiv.org/abs/2106.03299)]\n- Transformer in Convolutional Neural Networks [[paper](https://arxiv.org/abs/2106.03180)] [[code](https://github.com/yun-liu/TransCNN)]\n- **[Uformer]** Uformer: A General U-Shaped Transformer for Image Restoration [[paper](https://arxiv.org/abs/2106.03106)] [[code](https://github.com/ZhendongWang6/Uformer)]\n- Patch Slimming for Efficient Vision Transformers [[paper](https://arxiv.org/abs/2106.02852)]\n- **[RegionViT]** RegionViT: Regional-to-Local Attention for Vision Transformers [[paper](https://arxiv.org/abs/2106.02689)]\n- Associating Objects with Transformers for Video Object Segmentation [[paper](https://arxiv.org/abs/2106.02638)] [[code](https://github.com/z-x-yang/AOT)]\n- Few-Shot Segmentation via Cycle-Consistent Transformer [[paper](https://arxiv.org/abs/2106.02320)]\n- Glance-and-Gaze Vision Transformer [[paper](https://arxiv.org/abs/2106.02277)] [[code](https://github.com/yucornetto/GG-Transformer)]\n- Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [[paper](https://arxiv.org/pdf/2105.08059.pdf)]\n- **[DynamicViT]** DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [[paper](https://arxiv.org/abs/2106.02034)] [[code](https://dynamicvit.ivg-research.xyz/)]\n- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [[paper](https://arxiv.org/abs/2106.01548)]\n- Unsupervised Out-of-Domain Detection via Pre-trained Transformers [[paper](https://arxiv.org/abs/2106.00948)]\n- **[TransMIL]** TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [[paper](https://arxiv.org/abs/2106.00908)]\n- 
**[TransVOS]**  TransVOS: Video Object Segmentation with Transformers [[paper](https://arxiv.org/abs/2106.00588)]\n- **[KVT]** KVT: k-NN Attention for Boosting Vision Transformers [[paper](https://arxiv.org/abs/2106.00515)] \n- **[MSG-Transformer]** MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [[paper](https://arxiv.org/abs/2105.15168)] [[code](https://github.com/hustvl/MSG-Transformer)]\n- **[SegFormer]** SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [[paper](https://arxiv.org/abs/2105.15203)] [[code](https://github.com/NVlabs/SegFormer)]\n- **[SDNet]** SDNet: mutil-branch for single image deraining using swin [[paper](https://arxiv.org/abs/2105.15077)] [[code](https://github.com/H-tfx/SDNet)]\n- **[DVT]** Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [[paper](https://arxiv.org/abs/2105.15075)]\n- **[GazeTR]** Gaze Estimation using Transformer [[paper](https://arxiv.org/abs/2105.14424)] [[code](https://github.com/yihuacheng/GazeTR)]\n- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [[paper](https://arxiv.org/abs/2105.14432)]\n- Less is More: Pay Less Attention in Vision Transformers [[paper](https://arxiv.org/abs/2105.14217)] \n- **[FoveaTer]** FoveaTer: Foveated Transformer for Image Classification [[paper](https://arxiv.org/abs/2105.14173)]\n- **[TransDA]** Transformer-Based Source-Free Domain Adaptation [[paper](https://arxiv.org/abs/2105.14138)] [[code](https://github.com/ygjwd12345/TransDA)]\n- An Attention Free Transformer [[paper](https://arxiv.org/abs/2105.14103)]\n- **[PTNet]** PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [[paper](https://arxiv.org/abs/2105.13993)]\n- **[ResT]** ResT: An Efficient Transformer for Visual Recognition [[paper](https://arxiv.org/abs/2105.13677)] [[code](https://github.com/wofmanaf/ResT)]\n- **[CogView]** CogView: Mastering Text-to-Image 
Generation via Transformers [[paper](https://arxiv.org/abs/2105.13290)]\n- **[NesT]** Aggregating Nested Transformers [[paper](https://arxiv.org/abs/2105.12723)] \n- **[TAPG]** Temporal Action Proposal Generation with Transformers [[paper](https://arxiv.org/abs/2105.12043)] \n- Boosting Crowd Counting with Transformers [[paper](https://arxiv.org/abs/2105.10926)] \n- **[COTR]** COTR: Convolution in Transformer Network for End to End Polyp Detection [[paper](https://arxiv.org/abs/2105.10925)]\n- **[TransVOD]** End-to-End Video Object Detection with Spatial-Temporal Transformers [[paper](https://arxiv.org/abs/2105.10920)] [[code](https://github.com/SJTU-LuHe/TransVOD)]\n- Intriguing Properties of Vision Transformers [[paper](https://arxiv.org/abs/2105.10497)] [[code](https://git.io/Js15X)] \n- Combining Transformer Generators with Convolutional Discriminators [[paper](https://arxiv.org/abs/2105.10189)]\n- Rethinking the Design Principles of Robust Vision Transformer [[paper](https://arxiv.org/abs/2105.07926)]\n- Vision Transformers are Robust Learners [[paper](https://arxiv.org/abs/2105.07581)] [[code](https://git.io/J3VO0)]\n- Manipulation Detection in Satellite Images Using Vision Transformer [[paper](https://arxiv.org/abs/2105.06373)]\n- **[Swin-Unet]** Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2105.05537)] [[code](https://github.com/HuCaoFighting/Swin-Unet)]\n- Self-Supervised Learning with Swin Transformers [[paper](https://arxiv.org/abs/2105.04553)] [[code](https://github.com/SwinTransformer/Transformer-SSL)]\n- **[SCTN]** SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [[paper](https://arxiv.org/abs/2105.04447)] \n- **[RelationTrack]** RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [[paper](https://arxiv.org/abs/2105.04322)]\n- **[VGTR]** Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2105.04281)]\n- **[PST]** Visual 
Composite Set Detection Using Part-and-Sum Transformers [[paper](https://arxiv.org/abs/2105.02170)] \n- **[TrTr]** TrTr: Visual Tracking with Transformer [[paper](https://arxiv.org/abs/2105.03817)] [[code](https://github.com/tongtybj/TrTr)]\n- **[MOTR]** MOTR: End-to-End Multiple-Object Tracking with TRansformer [[paper](https://arxiv.org/abs/2105.03247)] [[code](https://github.com/megvii-model/MOTR)]\n- Attention for Image Registration (AiR): an unsupervised Transformer approach [[paper](https://arxiv.org/abs/2105.02282)] \n- **[TransHash]** TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [[paper](https://arxiv.org/abs/2105.01823)]\n- **[ISTR]** ISTR: End-to-End Instance Segmentation with Transformers [[paper](https://arxiv.org/abs/2105.00637)] [[code](https://github.com/hujiecpp/ISTR)]\n- **[CAT]** CAT: Cross-Attention Transformer for One-Shot Object Detection [[paper](https://arxiv.org/abs/2104.14984)] \n- **[CoSformer]** CoSformer: Detecting Co-Salient Object with Transformers [[paper](https://arxiv.org/abs/2104.14729)]\n- End-to-End Attention-based Image Captioning [[paper](https://arxiv.org/abs/2104.14721)]\n- **[PMTrans]** Pyramid Medical Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2104.14702)]\n- **[HandsFormer]** HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [[paper](https://arxiv.org/abs/2104.14639)]\n- **[GasHis-Transformer]** GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [[paper](https://arxiv.org/abs/2104.14528)] \n- Emerging Properties in Self-Supervised Vision Transformers [[paper](https://arxiv.org/abs/2104.14294)]\n- **[InTra]** Inpainting Transformer for Anomaly Detection [[paper](https://arxiv.org/abs/2104.13897)] \n- **[Twins]** Twins: Revisiting Spatial Attention Design in Vision Transformers [[paper](https://arxiv.org/abs/2104.13840)] 
[[code](https://github.com/Meituan-AutoML/Twins)]\n- **[MLMSPT]** Point Cloud Learning with Transformer [[paper](https://arxiv.org/abs/2104.13636)]\n- Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [[paper](https://arxiv.org/abs/2104.13633)]\n- **[ConTNet]** ConTNet: Why not use convolution and transformer at the same time? [[paper](https://arxiv.org/abs/2104.13497)] [[code](https://github.com/yan-hao-tian/ConTNet)]\n- **[DTNet]** Dual Transformer for Point Cloud Analysis [[paper](https://arxiv.org/abs/2104.13044)] \n- Improve Vision Transformers Training by Suppressing Over-smoothing [[paper](https://arxiv.org/abs/2104.12753)] [[code](https://github.com/ChengyueGongR/PatchVisionTransformer)]\n- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [[paper](https://arxiv.org/abs/2104.12137)]\n- **[M3DeTR]** M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [[paper](https://arxiv.org/abs/2104.11896)] [[code](https://github.com/rayguan97/M3DeTR)]\n- **[Skeletor]** Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [[paper](https://arxiv.org/abs/2104.11712)] \n- **[FaceT]** Learning to Cluster Faces via Transformer [[paper](https://arxiv.org/abs/2104.11502)]\n- **[MViT]** Multiscale Vision Transformers [[paper](https://arxiv.org/abs/2104.11227)] [[code](https://github.com/facebookresearch/SlowFast)]\n- **[VATT]** VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [[paper](https://arxiv.org/abs/2104.11178)]\n- **[So-ViT]** So-ViT: Mind Visual Tokens for Vision Transformer [[paper](https://arxiv.org/abs/2104.10935)] [[code](https://github.com/jiangtaoxie/So-ViT)]\n- Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [[paper](https://arxiv.org/abs/2104.10858)] [[code](https://github.com/zihangJiang/TokenLabeling)]\n- **[TransRPPG]** TransRPPG: Remote 
Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [[paper](https://arxiv.org/abs/2104.07419)]\n- **[VideoGPT]** VideoGPT: Video Generation using VQ-VAE and Transformers [[paper](https://arxiv.org/abs/2104.10157)]\n- **[M2TR]** M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [[paper](https://arxiv.org/abs/2104.09770)]\n- Transformer Transforms Salient Object Detection and Camouflaged Object Detection [[paper](https://arxiv.org/abs/2104.10127)]\n- **[TransCrowd]** TransCrowd: Weakly-Supervised Crowd Counting with Transformer [[paper](https://arxiv.org/abs/2104.09116)] [[code](https://github.com/dk-liang/TransCrowd)]\n- Visual Transformer Pruning [[paper](https://arxiv.org/abs/2104.08500)]\n- Self-supervised Video Retrieval Transformer Network [[paper](https://arxiv.org/abs/2104.07993)]\n- Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [[paper](https://arxiv.org/abs/2104.07235)]\n- **[TransGAN]** TransGAN: Two Transformers Can Make One Strong GAN [[paper](https://arxiv.org/abs/2102.07074)] [[code](https://github.com/VITA-Group/TransGAN)]\n- Geometry-Free View Synthesis: Transformers and no 3D Priors [[paper](https://arxiv.org/abs/2104.07652)] [[code](https://git.io/JOnwn)]\n- **[CoaT]** Co-Scale Conv-Attentional Image Transformers [[paper](https://arxiv.org/abs/2104.06399)] [[code](https://github.com/mlpc-ucsd/CoaT)]\n- **[LocalViT]** LocalViT: Bringing Locality to Vision Transformers [[paper](https://arxiv.org/abs/2104.05707)] [[code](https://github.com/ofsoundof/LocalViT)]\n- **[CIT]** Cloth Interactive Transformer for Virtual Try-On [[paper](https://arxiv.org/abs/2104.05519)] [[code](https://arxiv.org/abs/2104.05519)]\n- Handwriting Transformers [[paper](https://arxiv.org/abs/2104.03964)]\n- **[SiT]** SiT: Self-supervised vIsion Transformer [[paper](https://arxiv.org/abs/2104.03602)] [[code](https://github.com/Sara-Ahmed/SiT)]\n- On the Robustness of 
Vision Transformers to Adversarial Examples [[paper](https://arxiv.org/abs/2104.02610)]\n- An Empirical Study of Training Self-Supervised Visual Transformers [[paper](https://arxiv.org/abs/2104.02057)]\n- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2104.01745)]\n- **[AOT-GAN]** Aggregated Contextual Transformations for High-Resolution Image Inpainting [[paper](https://arxiv.org/abs/2104.01431)] [[code](https://github.com/researchmm/AOT-GAN-for-Inpainting)]\n- Deepfake Detection Scheme Based on Vision Transformer and Distillation [[paper](https://arxiv.org/abs/2104.01353)]\n- **[ATAG]** Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [[paper](https://arxiv.org/pdf/2103.16024)] \n- **[TubeR]** TubeR: Tube-Transformer for Action Detection [[paper](https://arxiv.org/abs/2104.00969)]\n- **[AAformer]** AAformer: Auto-Aligned Transformer for Person Re-Identification [[paper](https://arxiv.org/abs/2104.00921)]\n- **[TFill]** TFill: Image Completion via a Transformer-Based Architecture [[paper](https://arxiv.org/abs/2104.00845)]\n- Group-Free 3D Object Detection via Transformers [[paper](https://arxiv.org/abs/2104.00678)] [[code](https://github.com/zeliu98/Group-Free-3D)]\n- **[STGT]** Spatial-Temporal Graph Transformer for Multiple Object Tracking [[paper](https://arxiv.org/abs/2104.00194)] \n- Going deeper with Image Transformers [[paper](https://arxiv.org/abs/2103.17239)] \n- **[Meta-DETR]** Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [[paper](https://arxiv.org/abs/2103.11731)] [[code](https://github.com/ZhangGongjie/Meta-DETR)]\n- **[DA-DETR]** DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [[paper](https://arxiv.org/abs/2103.17084)]\n- Robust Facial Expression Recognition with Convolutional Visual Transformers [[paper](https://arxiv.org/abs/2103.16854)]\n- Thinking Fast and Slow: Efficient Text-to-Visual 
Retrieval with Transformers [[paper](https://arxiv.org/abs/2103.16553)]\n- Spatiotemporal Transformer for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2103.16469)]\n- **[TransUNet]** TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.04306)] [[code](https://github.com/Beckschen/TransUNet)]\n- **[CvT]** CvT: Introducing Convolutions to Vision Transformers [[paper](https://arxiv.org/abs/2103.15808)] [[code](https://github.com/leoxiaobin/CvT)]\n- **[TFPose]** TFPose: Direct Human Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.15320)]\n- **[TransCenter]** TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [[paper](https://arxiv.org/abs/2103.15145)]\n- Face Transformer for Recognition [[paper](https://arxiv.org/abs/2103.14803)]\n- On the Adversarial Robustness of Visual Transformers [[paper](https://arxiv.org/abs/2103.15670)]\n- Understanding Robustness of Transformers for Image Classification [[paper](https://arxiv.org/abs/2103.14586)]\n- Lifting Transformer for 3D Human Pose Estimation in Video [[paper](https://arxiv.org/abs/2103.14304)]\n- **[GSA-Net]** Global Self-Attention Networks for Image Recognition [[paper](https://arxiv.org/abs/2010.03019)]\n- High-Fidelity Pluralistic Image Completion with Transformers [[paper](https://arxiv.org/abs/2103.14031)] [[code](http://raywzy.com/ICT)]\n- **[DPT]** Vision Transformers for Dense Prediction [[paper](https://arxiv.org/abs/2103.13413)] [[code](https://github.com/intel-isl/DPT)]\n- **[TransFG]** TransFG: A Transformer Architecture for Fine-grained Recognition [[paper](https://arxiv.org/abs/2103.07976)]\n- **[TimeSformer]** Is Space-Time Attention All You Need for Video Understanding? [[paper](https://arxiv.org/abs/2102.05095)]\n- Multi-view 3D Reconstruction with Transformer [[paper](https://arxiv.org/abs/2103.12957)]\n- Can Vision Transformers Learn without Natural Images? 
[[paper](https://arxiv.org/abs/2103.13023)] [[code](https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/)]\n- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.12115)]\n- Instance-level Image Retrieval using Reranking Transformers [[paper](https://arxiv.org/abs/2103.12236)]\n- **[BossNAS]** BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [[paper](https://arxiv.org/abs/2103.12424)] [[code](https://github.com/changlin31/BossNAS)]\n- **[CeiT]** Incorporating Convolution Designs into Visual Transformers [[paper](https://arxiv.org/abs/2103.11816)]\n- **[DeepViT]** DeepViT: Towards Deeper Vision Transformer [[paper](https://arxiv.org/abs/2103.11886)]\n- Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [[paper](https://arxiv.org/abs/2103.10043)]\n- 3D Human Pose Estimation with Spatial and Temporal Transformers [[paper](https://arxiv.org/abs/2103.10455)] [[code](https://github.com/zczcwh/PoseFormer)]\n- **[UNETR]** UNETR: Transformers for 3D Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.10504)]\n- Scalable Visual Transformers with Hierarchical Pooling [[paper](https://arxiv.org/abs/2103.10619)]\n- **[ConViT]** ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [[paper](https://arxiv.org/abs/2103.10697)]\n- **[TransMed]** TransMed: Transformers Advance Multi-modal Medical Image Classification [[paper](https://arxiv.org/abs/2103.05940)]\n- **[U-Transformer]** U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.06104)]\n- **[SpecTr]** SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [[paper](https://arxiv.org/abs/2103.03604)] [[code](https://github.com/hfut-xc-yun/SpecTr)]\n- **[TransBTS]** 
TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [[paper](https://arxiv.org/abs/2103.04430)] [[code](https://github.com/Wenxuan-1119/TransBTS)]\n- **[SSTN]** SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [[paper](https://arxiv.org/abs/2103.03150)]\n- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [[paper](https://arxiv.org/abs/2102.10772)] [[code](https://mmf.sh/)]\n- **[CPVT]** Do We Really Need Explicit Position Encodings for Vision Transformers? [[paper](https://arxiv.org/abs/2102.10882)] [[code](https://github.com/Meituan-AutoML/CPVT)]\n- Deepfake Video Detection Using Convolutional Vision Transformer [[paper](https://arxiv.org/abs/2102.11126)]\n- Training Vision Transformers for Image Retrieval [[paper](https://arxiv.org/abs/2102.05644)]\n- **[VTN]** Video Transformer Network [[paper](https://arxiv.org/abs/2102.00719)]\n- **[BoTNet]** Bottleneck Transformers for Visual Recognition [[paper](https://arxiv.org/abs/2101.11605)]\n- **[CPTR]** CPTR: Full Transformer Network for Image Captioning [[paper](https://arxiv.org/abs/2101.10804)]\n- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [[paper](https://arxiv.org/abs/2101.08779)] [[code](https://google.github.io/aichoreographer/)]\n- **[Trans2Seg]** Segmenting Transparent Object in the Wild with Transformer [[paper](https://arxiv.org/abs/2101.08461)] [[code](https://github.com/xieenze/Trans2Seg)]\n- Investigating the Vision Transformer Model for Image Retrieval Tasks [[paper](https://arxiv.org/abs/2101.03771)]\n- **[Trear]** Trear: Transformer-based RGB-D Egocentric Action Recognition [[paper](https://arxiv.org/abs/2101.03904)]\n- **[VisualSparta]** VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [[paper](https://arxiv.org/abs/2101.00265)]\n- **[TrackFormer]** TrackFormer: Multi-Object Tracking with Transformers 
[[paper](https://arxiv.org/abs/2101.02702)]\n- **[TAPE]** Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [[paper](https://arxiv.org/abs/2101.02143)]\n- **[TRIQ]** Transformer for Image Quality Assessment [[paper](https://arxiv.org/abs/2101.01097)] [[code](https://github.com/junyongyou/triq)]\n- **[TransTrack]** TransTrack: Multiple-Object Tracking with Transformer [[paper](https://arxiv.org/abs/2012.15460)] [[code](https://github.com/PeizeSun/TransTrack)]\n- **[DeiT]** Training data-efficient image transformers \u0026 distillation through attention [[paper](https://arxiv.org/abs/2012.12877)] [[code](https://github.com/facebookresearch/deit)]\n- **[Pointformer]** 3D Object Detection with Pointformer [[paper](https://arxiv.org/abs/2012.11409)]\n- **[ViT-FRCNN]** Toward Transformer-Based Object Detection [[paper](https://arxiv.org/abs/2012.09958)]\n- **[Taming-transformers]** Taming Transformers for High-Resolution Image Synthesis [[paper](https://arxiv.org/abs/2012.09841)] [[code](https://compvis.github.io/taming-transformers/)]\n- **[SceneFormer]** SceneFormer: Indoor Scene Generation with Transformers [[paper](https://arxiv.org/abs/2012.09793)]\n- **[PCT]** PCT: Point Cloud Transformer [[paper](https://arxiv.org/abs/2012.09688)]\n- **[PED]** DETR for Pedestrian Detection [[paper](https://arxiv.org/abs/2012.06785)]\n- **[C-Tran]** General Multi-label Image Classification with Transformers [[paper](https://arxiv.org/abs/2011.14027)]\n\n### 2022\n\n**TPAMI**\n\n- **[P2T]** P2T: Pyramid Pooling Transformer for Scene Understanding [[paper](https://ieeexplore.ieee.org/document/9870559)]\n\n**ECCV**\n\n- **[X-CLIP]** Expanding Language-Image Pretrained Models for General Video Recognition [[paper](https://arxiv.org/abs/2208.02816)] [[code](https://aka.ms/X-CLIP)]\n- **[TinyViT]** TinyViT: Fast Pretraining 
Distillation for Small Vision Transformers [[paper](https://arxiv.org/abs/2207.10666)] [[code](https://github.com/microsoft/Cream/tree/main/TinyViT)]\n- **[FastMETRO]** Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [[paper](https://arxiv.org/abs/2207.13820)] [[code](https://github.com/postech-ami/FastMETRO)]\n- **[AiATrack]** AiATrack: Attention in Attention for Transformer Visual Tracking [[paper](https://arxiv.org/abs/2207.09603)] [[code](https://github.com/Little-Podi/AiATrack)]\n- **[OSTrack]** Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [[paper](https://arxiv.org/abs/2203.11991)] [[code](https://github.com/botaoye/OSTrack)]\n- **[Unicorn]** Towards Grand Unification of Object Tracking [[paper](https://arxiv.org/abs/2207.07078)] [[code](https://github.com/MasterBin-IIAU/Unicorn)]\n- **[P3AFormer]** Tracking Objects as Pixel-wise Distributions [[paper](https://arxiv.org/abs/2207.05518)] [[code](https://github.com/dvlab-research/ECCV22-P3AFormer-Tracking-Objects-as-Pixel-wise-Distributions)]\n\n**CVPR**\n- **[MAE]** Masked Autoencoders Are Scalable Vision Learners [[paper](https://arxiv.org/abs/2111.06377)] [[code](https://github.com/facebookresearch/mae)]\n- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [[paper](https://arxiv.org/abs/2107.00652)] [[code](https://github.com/microsoft/CSWin-Transformer)]\n- Fast Point Transformer [[paper](https://arxiv.org/abs/2112.04702)]\n- EDTER: Edge Detection With Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Pu_EDTER_Edge_Detection_With_Transformer_CVPR_2022_paper.html)] [[code](https://github.com/MengyangPu/EDTER)]\n- Bridged Transformer for Vision and Point Cloud 3D Object Detection [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Bridged_Transformer_for_Vision_and_Point_Cloud_3D_Object_Detection_CVPR_2022_paper.html)]\n- MNSRNet: Multimodal Transformer Network for 
3D Surface Super-Resolution [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Xie_MNSRNet_Multimodal_Transformer_Network_for_3D_Surface_Super-Resolution_CVPR_2022_paper.html)]\n- HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Bandara_HyperTransformer_A_Textural_and_Spectral_Feature_Fusion_Transformer_for_Pansharpening_CVPR_2022_paper.html)] [[code](https://github.com/wgcban/HyperTransformer)]\n- Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Hampali_Keypoint_Transformer_Solving_Joint_Identification_in_Challenging_Hands_and_Object_CVPR_2022_paper.html)]\n- MPViT: Multi-Path Vision Transformer for Dense Prediction [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Lee_MPViT_Multi-Path_Vision_Transformer_for_Dense_Prediction_CVPR_2022_paper.html)] [[code](https://github.com/youngwanLEE/MPViT)]\n- A-ViT: Adaptive Tokens for Efficient Vision Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Yin_A-ViT_Adaptive_Tokens_for_Efficient_Vision_Transformer_CVPR_2022_paper.html)]\n- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zhang_TopFormer_Token_Pyramid_Transformer_for_Mobile_Semantic_Segmentation_CVPR_2022_paper.html)] [[code](https://github.com/hustvl/TopFormer)]\n- Continual Learning With Lifelong Vision Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Continual_Learning_With_Lifelong_Vision_Transformer_CVPR_2022_paper.html)]\n- Swin Transformer V2: Scaling Up Capacity and Resolution [[paper](https://arxiv.org/abs/2111.09883)] [[code](https://github.com/microsoft/Swin-Transformer)]\n- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds 
[[paper](https://arxiv.org/abs/2203.10314)] [[code](https://github.com/skyhehe123/VoxSeT)]\n- Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [[paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Xu_Multi-Class_Token_Transformer_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2022_paper.pdf)]\n- Human-Object Interaction Detection via Disentangled Transformer [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zhou_Human-Object_Interaction_Detection_via_Disentangled_Transformer_CVPR_2022_paper.html)]\n- LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Jiang_LGT-Net_Indoor_Panoramic_Room_Layout_Estimation_With_Geometry-Aware_Transformer_Network_CVPR_2022_paper.html)]\n- Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Xia_Sparse_Local_Patch_Transformer_for_Robust_Face_Alignment_and_Landmarks_CVPR_2022_paper.html)]\n- Vision Transformer With Deformable Attention [[paper](https://openaccess.thecvf.com/content/CVPR2022/html/Xia_Vision_Transformer_With_Deformable_Attention_CVPR_2022_paper.html)]\n- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [[paper](https://arxiv.org/abs/2204.12997)]\n- **[Restormer]** Restormer: Efficient Transformer for High-Resolution Image Restoration [[paper](https://arxiv.org/abs/2111.09881)] [[code](https://github.com/swz30/Restormer)]\n- **[SAM-DETR]** Accelerating DETR Convergence via Semantic-Aligned Matching [[paper](https://arxiv.org/abs/2203.06883)] [[code](https://github.com/ZhangGongjie/SAM-DETR)]\n- **[BEVT]** BEVT: BERT Pretraining of Video Transformers [[paper](https://arxiv.org/pdf/2112.01529.pdf)] [[code](https://github.com/xyzforever/BEVT)]\n- **[MobileFormer]** Mobile-Former: Bridging MobileNet and Transformer 
[[paper](https://arxiv.org/pdf/2108.05895.pdf)]\n- **[STRM]** Spatio-temporal Relation Modeling for Few-shot Action Recognition [[paper](https://arxiv.org/pdf/2112.05132.pdf)] [[code](https://github.com/Anirudh257/strm)]\n- **[MiniViT]** MiniViT: Compressing Vision Transformers with Weight Multiplexing [[paper](https://arxiv.org/abs/2204.07154)] [[code](https://github.com/microsoft/Cream/tree/main/MiniViT)]\n- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition [[paper](https://arxiv.org/abs/2203.16518)] [[code](https://github.com/jhcho99/CoFormer)]\n- **[DW-ViT]** Beyond Fixation: Dynamic Window Visual Transformer [[paper](https://arxiv.org/pdf/2203.12856.pdf)] [[code](https://github.com/pzhren/DW-ViT)]\n- **[TokenFusion]** Multimodal Token Fusion for Vision Transformers [[paper](https://arxiv.org/pdf/2204.08721.pdf)]\n- **[CMT]** Convolutional Neural Networks Meet Vision Transformers [[paper](https://arxiv.org/pdf/2107.06263.pdf)]\n- Fine-tuning Image Transformers using Learnable Memory [[paper](https://arxiv.org/pdf/2203.15243.pdf)]\n- **[TransMix]** Attend to Mix for Vision Transformers [[paper](https://arxiv.org/pdf/2111.09833.pdf)] [[code](https://github.com/Beckschen/TransMix)]\n- **[NomMer]** Nominate Synergistic Context in Vision Transformer for Visual Recognition [[paper](https://arxiv.org/pdf/2111.12994.pdf)] [[code](https://github.com/TencentYoutuResearch/VisualRecognition-NomMer)]\n- **[SSA]** Shunted Self-Attention via Multi-Scale Token Aggregation [[paper](https://arxiv.org/pdf/2111.15193.pdf)] [[code](https://github.com/OliverRensu/Shunted-Transformer)]\n- **[RVT]** Towards Robust Vision Transformer [[paper](https://arxiv.org/pdf/2105.07926.pdf)] [[code](https://github.com/vtddggg/Robust-Vision-Transformer)]\n- **[LVT]** Lite Vision Transformer with Enhanced Self-Attention [[paper](https://arxiv.org/pdf/2112.10809.pdf)] [[code](https://github.com/Chenglin-Yang/LVT)]\n- **[StyTr2]** StyTr2: Image Style Transfer with 
Transformers [[paper](https://arxiv.org/pdf/2105.14576.pdf)] [[code](https://github.com/diyiiyiii/StyTR-2)]\n\n**WACV** \n- Image-Adaptive Hint Generation via Vision Transformer for Outpainting [[paper](https://openaccess.thecvf.com/content/WACV2022/papers/Kong_Image-Adaptive_Hint_Generation_via_Vision_Transformer_for_Outpainting_WACV_2022_paper.pdf)] [[code](https://github.com/kdh4672/hgonet)]\n\n**ICLR**\n- **[RelViT]** RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [[paper](https://arxiv.org/pdf/2204.11167.pdf)] [[code](https://github.com/NVlabs/RelViT)]\n- **[CrossFormer]** CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [[paper](https://arxiv.org/abs/2108.00154)] [[code](https://github.com/cheerss/CrossFormer)]\n\n- Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [[paper](https://arxiv.org/abs/2201.04676)] [[code](https://github.com/Sense-X/UniFormer)]\n\n- **[DAB-DETR]** DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [[paper](https://arxiv.org/abs/2201.12329)] [[code](https://github.com/IDEA-opensource/DAB-DETR)]\n\n### 2021\n**NeurIPS**  \n\n- ProTo: Program-Guided Transformer for Program-Guided Tasks [[paper](https://arxiv.org/abs/2110.00804)] [[code](https://github.com/sjtuytc/Neurips21-ProTo-Program-guided-Transformers-for-Program-guided-Tasks)]\n- **[Augvit]** Augmented Shortcuts for Vision Transformers [[paper](https://arxiv.org/abs/2106.15941)] [[code](https://github.com/huawei-noah/CV-Backbones/tree/master/augvit_pytorch)]\n- **[YOLOS]** You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [[paper](https://arxiv.org/abs/2106.00666)] [[code](https://github.com/hustvl/YOLOS)]\n- **[CATs]** Semantic Correspondence with Transformers [[paper](https://arxiv.org/abs/2106.02520)] [[code](https://github.com/SunghwanHong/CATs)] \n- **[Moment-DETR]** QVHighlights: Detecting Moments and Highlights in Videos via Natural Language 
Queries [[paper](https://arxiv.org/abs/2107.09609)] [[code](https://github.com/jayleicn/moment_detr)]\n- Dual-stream Network for Visual Recognition [[paper](https://arxiv.org/abs/2105.14734)] [[code](https://github.com/gaopengcuhk/DSNet)]\n- **[Container]** Container: Context Aggregation Network [[paper](https://arxiv.org/abs/2106.01401)] [[code](https://github.com/gaopengcuhk/Container)]\n- **[TNT]** Transformer in Transformer [[paper](https://arxiv.org/abs/2103.00112)] [[code](https://github.com/huawei-noah/noah-research/tree/master/TNT)]\n- T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [[paper](https://arxiv.org/abs/2109.10948)]\n- Long Short-Term Transformer for Online Action Detection [[paper](https://papers.nips.cc/paper/2021/hash/08b255a5d42b89b0585260b6f2360bdd-Abstract.html)]\n- TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [[paper](https://papers.nips.cc/paper/2021/hash/0a87257e5308197df43230edf4ad1dae-Abstract.html)]\n- TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [[paper](https://papers.nips.cc/paper/2021/hash/0f49c89d1e7298bb9930789c8ed59d48-Abstract.html)]\n- TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [[paper](https://papers.nips.cc/paper/2021/hash/10c272d06794d3e5785d5e7c5356e9ff-Abstract.html)]\n- Associating Objects with Transformers for Video Object Segmentation [[paper](https://papers.nips.cc/paper/2021/hash/147702db07145348245dc5a2f2fe5683-Abstract.html)]\n- Test-Time Personalization with a Transformer for Human Pose Estimation [[paper](https://papers.nips.cc/paper/2021/hash/1517c8664be296f0d87d9e5fc54fdd60-Abstract.html)]\n- Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [[paper](https://papers.nips.cc/paper/2021/hash/21be992eb8016e541a15953eee90760e-Abstract.html)]\n- Dynamic Grained Encoder for Vision Transformers 
[[paper](https://papers.nips.cc/paper/2021/hash/2d969e2cee8cfa07ce7ca0bb13c7a36d-Abstract.html)]\n- HRFormer: High-Resolution Vision Transformer for Dense Predict [[paper](https://papers.nips.cc/paper/2021/hash/3bbfdde8842a5c44a0323518eec97cbe-Abstract.html)]\n- Searching the Search Space of Vision Transformer [[paper](https://papers.nips.cc/paper/2021/hash/48e95c45c8217961bf6cd7696d80d238-Abstract.html)]\n- Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [[paper](https://papers.nips.cc/paper/2021/hash/64517d8435994992e682b3e4aa0a0661-Abstract.html)]\n- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [[paper](https://papers.nips.cc/paper/2021/hash/64f1f27bf1b4ec22924fd0acb550c235-Abstract.html)]\n- Do Vision Transformers See Like Convolutional Neural Networks? [[paper](https://papers.nips.cc/paper/2021/hash/652cf38361a209088302ba2b8b7f51e0-Abstract.html)]\n- Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [[paper](https://papers.nips.cc/paper/2021/hash/67f7fb873eaf29526a11a9b7ac33bfac-Abstract.html)]\n- Glance-and-Gaze Vision Transformer [[paper](https://papers.nips.cc/paper/2021/hash/6c524f9d5d7027454a783c841250ba71-Abstract.html)]\n- MST: Masked Self-Supervised Transformer for Visual Representation [[paper](https://papers.nips.cc/paper/2021/hash/6dbbe6abe5f14af882ff977fc3f35501-Abstract.html)]\n- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [[paper](https://papers.nips.cc/paper/2021/hash/747d3443e319a22747fbb873e8b2f9f2-Abstract.html)]\n- TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [[paper](https://papers.nips.cc/paper/2021/hash/7c220a2091c26a7f5e9f1cfb099511e3-Abstract.html)]\n- Augmented Shortcuts for Vision Transformers [[paper](https://papers.nips.cc/paper/2021/hash/818f4654ed39a1c147d1e51a00ffb4cb-Abstract.html)]\n- Improved Transformer for High-Resolution GANs 
[[paper](https://papers.nips.cc/paper/2021/hash/98dce83da57b0395e163467c9dae521b-Abstract.html)]\n- All Tokens Matter: Token Labeling for Training Better Vision Transformers [[paper](https://papers.nips.cc/paper/2021/hash/9a49a25d845a483fae4be7e341368e36-Abstract.html)]\n- XCiT: Cross-Covariance Image Transformers [[paper](https://papers.nips.cc/paper/2021/hash/a655fbe4b8d7439994aa37ddad80de56-Abstract.html)]\n- Efficient Training of Visual Transformers with Small Datasets [[paper](https://papers.nips.cc/paper/2021/hash/c81e155d85dae5430a8cee6f2242e82c-Abstract.html)]\n\n**ICCV**\n\n- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (**Marr Prize**) [[paper](https://arxiv.org/abs/2103.14030)] [[code](https://github.com/microsoft/Swin-Transformer)]\n- **[ICT]** High-Fidelity Pluralistic Image Completion with Transformers [[paper](https://arxiv.org/pdf/2103.14031.pdf)] [[code](https://github.com/raywzy/ICT)]\n- **[PoinTr]** PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (**oral**) [[paper](https://arxiv.org/abs/2108.08839)] [[code](https://github.com/yuxumin/PoinTr)]\n- **[STTR]** Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [[paper](https://arxiv.org/abs/2011.02910v2)] [[code](https://github.com/mli0603/stereo-transformer)]\n- **[TSP-FCOS]** Rethinking Transformer-based Set Prediction for Object Detection [[paper](https://arxiv.org/abs/2011.10881)]\n- Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (**oral**) [[paper](https://arxiv.org/abs/2108.03798)] [[code](https://github.com/Huage001/PaintTransformer)]\n- 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_3DVG-Transformer_Relation_Modeling_for_Visual_Grounding_on_Point_Clouds_ICCV_2021_paper.pdf)]\n- **[T2T-ViT]** Training Vision Transformers from Scratch on ImageNet 
[[paper](https://arxiv.org/abs/2101.11986)] [[code](https://github.com/yitu-opensource/T2T-ViT)]\n- **[THUNDR]** THUNDR: Transformer-Based 3D Human Reconstruction With Markers [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Zanfir_THUNDR_Transformer-Based_3D_Human_Reconstruction_With_Markers_ICCV_2021_paper.html)]\n- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [[paper](https://arxiv.org/abs/2103.15358)]\n- **[PVT]** Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [[paper](https://arxiv.org/abs/2102.12122)] [[code](https://github.com/whai362/PVT)]\n- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Cong_Spatial-Temporal_Transformer_for_Dynamic_Scene_Graph_Generation_ICCV_2021_paper.pdf)]\n- **[GLiT]** GLiT: Neural Architecture Search for Global and Local Image Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_GLiT_Neural_Architecture_Search_for_Global_and_Local_Image_Transformer_ICCV_2021_paper.pdf)]\n- **[TRAR]** TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhou_TRAR_Routing_the_Attention_Spans_in_Transformer_for_Visual_Question_ICCV_2021_paper.pdf)]\n- **[UniT]** UniT: Multimodal Multitask Learning With a Unified Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Hu_UniT_Multimodal_Multitask_Learning_With_a_Unified_Transformer_ICCV_2021_paper.html)] [[code](https://mmf.sh)]\n- Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Voskou_Stochastic_Transformer_Networks_With_Linear_Competing_Units_Application_To_End-to-End_ICCV_2021_paper.pdf)]\n- Transformer-Based Dual Relation Graph for Multi-Label Image Recognition 
[[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Zhao_Transformer-Based_Dual_Relation_Graph_for_Multi-Label_Image_Recognition_ICCV_2021_paper.pdf)]\n- **[LocalTrans]** LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Shao_LocalTrans_A_Multiscale_Local_Transformer_Network_for_Cross-Resolution_Homography_Estimation_ICCV_2021_paper.pdf)]\n- Improving 3D Object Detection With Channel-Wise Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sheng_Improving_3D_Object_Detection_With_Channel-Wise_Transformer_ICCV_2021_paper.html)]\n- A Latent Transformer for Disentangled Face Editing in Images and Videos [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yao_A_Latent_Transformer_for_Disentangled_Face_Editing_in_Images_and_ICCV_2021_paper.pdf)] [[code](https://github.com/InterDigitalInc/latent-transformer)]\n- **[GroupFormer]** GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Li_GroupFormer_Group_Activity_Recognition_With_Clustered_Spatial-Temporal_Transformer_ICCV_2021_paper.html)]\n- Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Matsumori_Unified_Questioner_Transformer_for_Descriptive_Question_Generation_in_Goal-Oriented_Visual_ICCV_2021_paper.pdf)]\n- **[WB-DETR]** WB-DETR: Transformer-Based Detector Without Backbone [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_WB-DETR_Transformer-Based_Detector_Without_Backbone_ICCV_2021_paper.pdf)]\n- The Animation Transformer: Visual Correspondence via Segment Matching [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Casey_The_Animation_Transformer_Visual_Correspondence_via_Segment_Matching_ICCV_2021_paper.pdf)]\n
- Relaxed Transformer Decoders for Direct Action Proposal Generation [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Tan_Relaxed_Transformer_Decoders_for_Direct_Action_Proposal_Generation_ICCV_2021_paper.html)]\n- **[PPT-Net]** Pyramid Point Cloud Transformer for Large-Scale Place Recognition [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Hui_Pyramid_Point_Cloud_Transformer_for_Large-Scale_Place_Recognition_ICCV_2021_paper.pdf)] [[code](https://github.com/fpthink/PPT-Net)]\n- Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Multimodal_Co-Attention_Transformer_for_Survival_Prediction_in_Gigapixel_Whole_Slide_ICCV_2021_paper.pdf)]\n- Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Uncertainty-Guided_Transformer_Reasoning_for_Camouflaged_Object_Detection_ICCV_2021_paper.pdf)]\n- Image Harmonization With Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Guo_Image_Harmonization_With_Transformer_ICCV_2021_paper.html)] [[code](https://github.com/zhenglab/HarmonyTransformer)]\n- **[COTR]** COTR: Correspondence Transformer for Matching Across Images [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Jiang_COTR_Correspondence_Transformer_for_Matching_Across_Images_ICCV_2021_paper.pdf)]\n- **[MUSIQ]** MUSIQ: Multi-Scale Image Quality Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Ke_MUSIQ_Multi-Scale_Image_Quality_Transformer_ICCV_2021_paper.pdf)]\n- Episodic Transformer for Vision-and-Language Navigation 
[[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Pashevich_Episodic_Transformer_for_Vision-and-Language_Navigation_ICCV_2021_paper.pdf)]\n- Action-Conditioned 3D Human Motion Synthesis With Transformer VAE [[paper](https://openaccess.thecvf.com/content/ICCV2021/html/Petrovich_Action-Conditioned_3D_Human_Motion_Synthesis_With_Transformer_VAE_ICCV_2021_paper.html)]\n- **[CrackFormer]** CrackFormer: Transformer Network for Fine-Grained Crack Detection [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_CrackFormer_Transformer_Network_for_Fine-Grained_Crack_Detection_ICCV_2021_paper.pdf)]\n- **[HiT]** HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_HiT_Hierarchical_Transformer_With_Momentum_Contrast_for_Video-Text_Retrieval_ICCV_2021_paper.pdf)]\n- Event-Based Video Reconstruction Using Transformer [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Weng_Event-Based_Video_Reconstruction_Using_Transformer_ICCV_2021_paper.pdf)]\n- **[STVGBert]** STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Su_STVGBert_A_Visual-Linguistic_Transformer_Based_Framework_for_Spatio-Temporal_Video_Grounding_ICCV_2021_paper.pdf)]\n- **[HiFT]** HiFT: Hierarchical Feature Transformer for Aerial Tracking [[paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Cao_HiFT_Hierarchical_Feature_Transformer_for_Aerial_Tracking_ICCV_2021_paper.pdf)] [[code](https://github.com/vision4robotics/HiFT)]\n- **[DocFormer]** DocFormer: End-to-End Transformer for Document Understanding [[paper](https://arxiv.org/abs/2106.11539)]\n- **[LeViT]** LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [[paper](https://arxiv.org/abs/2104.01136)] [[code](https://github.com/facebookresearch/LeViT)]\n- **[SignBERT]** SignBERT: Pre-Training of 
Hand-Model-Aware Representation for Sign Language Recognition [[paper](https://arxiv.org/abs/2110.05382)]\n- **[VidTr]** VidTr: Video Transformer Without Convolutions [[paper](https://arxiv.org/abs/2104.11746)]\n- **[ACTOR]** Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [[paper](https://arxiv.org/abs/2104.05670)]\n- **[Segmenter]** Segmenter: Transformer for Semantic Segmentation [[paper](https://arxiv.org/abs/2105.05633)] [[code](https://github.com/rstrudel/segmenter)]\n- **[Visformer]** Visformer: The Vision-friendly Transformer [[paper](https://arxiv.org/abs/2104.12533)] [[code](https://github.com/danczs/Visformer)]\n- **[PnP-DETR]** PnP-DETR: Towards Efficient Visual Analysis with Transformers [[paper](https://arxiv.org/abs/2109.07036)] [[code](https://github.com/twangnh/pnp-detr)]\n- **[VoTr]** Voxel Transformer for 3D Object Detection [[paper](https://arxiv.org/abs/2109.02497)]\n- **[TransVG]** TransVG: End-to-End Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2104.08541)]\n- **[3DETR]** An End-to-End Transformer Model for 3D Object Detection [[paper](https://arxiv.org/abs/2109.08141)] [[code](https://github.com/facebookresearch/3detr)]\n- **[Eformer]** Eformer: Edge Enhancement based Transformer for Medical Image Denoising [[paper](https://arxiv.org/abs/2109.08044)]\n- **[TransFER]** TransFER: Learning Relation-aware Facial Expression Representations with Transformers [[paper](https://arxiv.org/abs/2108.11116)]\n- **[Oriented RCNN]** Oriented Object Detection with Transformer [[paper](https://arxiv.org/abs/2106.03146)]\n- **[ViViT]** ViViT: A Video Vision Transformer [[paper](https://arxiv.org/abs/2103.15691)]\n- **[Stark]** Learning Spatio-Temporal Transformer for Visual Tracking [[paper](https://arxiv.org/abs/2103.17154)] [[code](https://github.com/researchmm/Stark)]\n- **[CT3D]** Improving 3D Object Detection with Channel-wise Transformer [[paper](https://arxiv.org/abs/2108.10723)]\n- **[VST]** 
Visual Saliency Transformer [[paper](https://arxiv.org/abs/2104.12099)] \n- **[PiT]** Rethinking Spatial Dimensions of Vision Transformers  [[paper](https://arxiv.org/abs/2103.16302)] [[code](https://github.com/naver-ai/pit)]\n- **[CrossViT]** CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [[paper](https://arxiv.org/abs/2103.14899)] [[code](https://github.com/IBM/CrossViT)]\n- **[PointTransformer]** Point Transformer [[paper](https://arxiv.org/abs/2012.09164)]\n- **[TS-CAM]** TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization  [[paper](https://arxiv.org/abs/2103.14862)] [[code](https://github.com/vasgaowei/TS-CAM.git)]\n- **[VTs]** Visual Transformers: Token-based Image Representation and Processing for Computer Vision [[paper](https://arxiv.org/abs/2006.03677)]\n- **[TransDepth]** Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction  [[paper](https://arxiv.org/pdf/2103.12091.pdf)] [[code](https://github.com/ygjwd12345/TransDepth)]\n- **[Conditional DETR]** Conditional DETR for Fast Training Convergence [[paper](https://arxiv.org/abs/2108.06152)] [[code](https://github.com/Atten4Vis/ConditionalDETR)]\n- **[PIT]** PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [[paper](https://arxiv.org/abs/2108.07142)] [[code](https://github.com/sheepooo/PIT-Position-Invariant-Transform)]\n- **[SOTR]** SOTR: Segmenting Objects with Transformers  [[paper](https://arxiv.org/abs/2108.06747)] [[code](https://github.com/easton-cau/SOTR)]\n- **[SnowflakeNet]** SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [[paper](https://arxiv.org/abs/2108.04444)] [[code](https://github.com/AllenXiangX/SnowflakeNet.)]\n- **[TransPose]** TransPose: Keypoint Localization via Transformer [[paper](https://arxiv.org/abs/2012.14214)] [[code](https://github.com/yangsenius/TransPose)]\n- **[TransReID]** TransReID: Transformer-based Object Re-Identification  
[[paper](https://arxiv.org/abs/2102.04378)] [[code](https://github.com/heshuting555/TransReID)]\n- **[CWT]** Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [[paper](https://arxiv.org/abs/2108.03032)] [[code](https://github.com/zhiheLu/CWT-for-FSS)]\n-  Anticipative Video Transformer [[paper](https://arxiv.org/abs/2106.02036)] [[code](http://facebookresearch.github.io/AVT)]\n- Rethinking and Improving Relative Position Encoding for Vision Transformer [[paper](https://arxiv.org/abs/2107.14222)] [[code](https://github.com/microsoft/Cream/tree/main/iRPE)]\n- Vision Transformer with Progressive Sampling  [[paper](https://arxiv.org/abs/2108.01684)] [[code](https://github.com/yuexy/PS-ViT)]\n- **[SMCA]**  Fast Convergence of DETR with Spatially Modulated Co-Attention [[paper](https://arxiv.org/abs/2101.07448)] [[code](https://github.com/abc403/SMCA-replication)]\n- **[AutoFormer]** AutoFormer: Searching Transformers for Visual Recognition [[paper](https://arxiv.org/pdf/2107.00651.pdf)] [[code](https://github.com/microsoft/AutoML)]\n\n**CVPR**\n- Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [[paper](https://arxiv.org/abs/2106.04095)]\n- **[HOTR]** HOTR: End-to-End Human-Object Interaction Detection with Transformers (**oral**) [[paper](https://arxiv.org/abs/2104.13682)] \n- **[METRO]** End-to-End Human Pose and Mesh Reconstruction with Transformers [[paper](https://arxiv.org/abs/2012.09760)]\n- **[LETR]** Line Segment Detection Using Transformers without Edges [[paper](https://arxiv.org/abs/2101.01909)]\n- **[TransFuser]** Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [[paper](https://arxiv.org/abs/2104.09224)] [[code](https://github.com/autonomousvision/transfuser)]\n- Pose Recognition with Cascade Transformers  [[paper](https://arxiv.org/abs/2104.06976)]\n- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning  
[[paper](https://arxiv.org/abs/2104.03135)]\n- **[LoFTR]** LoFTR: Detector-Free Local Feature Matching with Transformers [[paper](https://arxiv.org/abs/2104.00680)] [[code](https://zju3dv.github.io/loftr/)]\n- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](https://arxiv.org/abs/2103.16553)] \n- **[SETR]** Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [[paper](https://arxiv.org/abs/2012.15840)] [[code](https://fudan-zvg.github.io/SETR/)]\n- **[TransT]** Transformer Tracking  [[paper](https://arxiv.org/abs/2103.15436)] [[code](https://github.com/chenxin-dlut/TransT)]\n- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (**oral**) [[paper](https://arxiv.org/abs/2103.11681)]\n- **[VisTR]** End-to-End Video Instance Segmentation with Transformers [[paper](https://arxiv.org/abs/2011.14503)]\n- Transformer Interpretability Beyond Attention Visualization [[paper](https://arxiv.org/abs/2012.09838)] [[code](https://github.com/hila-chefer/Transformer-Explainability)]\n- **[IPT]** Pre-Trained Image Processing Transformer [[paper](https://arxiv.org/abs/2012.00364)]\n- **[UP-DETR]** UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [[paper](https://arxiv.org/abs/2011.09094)]\n- **[IQT]** Perceptual Image Quality Assessment with Transformers (**workshop**) [[paper](https://arxiv.org/abs/2104.14730)]\n- High-Resolution Complex Scene Synthesis with Transformers (**workshop**) [[paper](https://arxiv.org/abs/2105.06458)]\n- **[CoFormer]** Collaborative Transformers for Grounded Situation Recognition\n[[paper](https://arxiv.org/abs/2203.16518)] [[code](https://github.com/jhcho99/CoFormer)]\n\n**ICML**\n- Generative Video Transformer: Can Objects be the Words?  
[[paper](https://arxiv.org/abs/2107.09240)]\n- **[GANsformer]** Generative Adversarial Transformers [[paper](https://arxiv.org/abs/2103.01209)] [[code](https://github.com/dorarad/gansformer)]\n\n**ICRA**\n- **[NDT-Transformer]** NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [[paper](https://arxiv.org/abs/2103.12292)] \n\n**ICLR**\n- **[VTNet]** VTNet: Visual Transformer Network for Object Goal Navigation [[paper](https://arxiv.org/abs/2105.09447)]\n- **[Vision Transformer]** An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [[paper](https://arxiv.org/abs/2010.11929)] [[code](https://github.com/google-research/vision_transformer)]\n- **[Deformable DETR]** Deformable DETR: Deformable Transformers for End-to-End Object Detection [[paper](https://arxiv.org/abs/2010.04159)] [[code](https://github.com/fundamentalvision/Deformable-DETR)]\n- **[LambdaNetworks]** LambdaNetworks: Modeling Long-Range Interactions Without Attention  [[paper](https://openreview.net/pdf?id=xTJEN-ggl1b)] [[code](https://github.com/lucidrains/lambda-networks)]\n\n**ACM MM**\n- Video Transformer for Deepfake Detection with Incremental Learning [[paper](https://arxiv.org/abs/2108.05307)] \n- **[HAT]** HAT: Hierarchical Aggregation Transformers for Person Re-identification [[paper](https://arxiv.org/abs/2107.05946)]\n- Token Shift Transformer for Video Classification  [[paper](https://arxiv.org/abs/2108.02432)] [[code](https://github.com/VideoNetworks/TokShift-Transformer)]\n- **[DPT]** DPT: Deformable Patch-based Transformer for Visual Recognition [[paper](https://arxiv.org/abs/2107.14467)] [[code](https://github.com/CASIA-IVA-Lab/DPT)]\n\n**MICCAI**  \n- **[UTNet]** UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation  [[paper](https://arxiv.org/abs/2107.00781)] [[code](https://github.com/yhygao/UTNet)]\n- **[MedT]** Medical Transformer: Gated Axial-Attention for Medical Image Segmentation 
[[paper](https://arxiv.org/abs/2102.10662)] [[code](https://github.com/jeya-maria-jose/Medical-Transformer)]\n- **[MCTrans]** Multi-Compound Transformer for Accurate Biomedical Image Segmentation  [[paper](https://arxiv.org/abs/2106.14385)] [[code](https://github.com/JiYuanFeng/MCTrans)]\n- **[PNS-Net]** Progressively Normalized Self-Attention Network for Video Polyp Segmentation  [[paper](https://arxiv.org/abs/2105.08468)] [[code](https://github.com/GewelsJI/PNS-Net)]\n- **[MBT-Net]** A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [[paper](https://arxiv.org/abs/2106.07557)]\n\n**BMVC**\n- **[ACT]** End-to-End Object Detection with Adaptive Clustering Transformer [[paper](https://arxiv.org/abs/2011.09315)]\n- **[GSRTR]** Grounded Situation Recognition with Transformers\n[[paper](https://arxiv.org/abs/2111.10135)] [[code](https://github.com/jhcho99/gsrtr)]\n- **[TransFusion]** TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [[paper](https://arxiv.org/abs/2110.09554)] [[code](https://github.com/HowieMa/TransFusion-Pose)]\n\n**ISIE**  \n- VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (**ISIE**) [[paper](https://arxiv.org/abs/2104.10036)]\n\n**CORL**\n- **[DETR3D]** DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [[paper](https://arxiv.org/abs/2110.06922)] \n\n**IJCAI**\n- Medical Image Segmentation using Squeeze-and-Expansion Transformers  [[paper](https://arxiv.org/abs/2105.09511)]\n\n**IROS**   \n- **[YOGO]** You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module (**IROS**)  [[paper](https://arxiv.org/abs/2103.09975)] [[code](https://github.com/chenfengxu714/YOGO.git)]\n- **[PTT]** PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds  [[paper](https://arxiv.org/abs/2108.06455)] [[code](https://github.com/shanjiayao/PTT)]\n\n**WACV**  \n- **[LSTR]** End-to-end 
Lane Shape Prediction with Transformers [[paper](https://arxiv.org/abs/2011.04233)] [[code](https://github.com/liuruijin17/LSTR)]\n\n**ICDAR**  \n- Vision Transformer for Fast and Efficient Scene Text Recognition [[paper](https://arxiv.org/abs/2105.08582)]\n### 2020\n\n- **[DETR]** End-to-End Object Detection with Transformers (**ECCV**) [[paper](https://arxiv.org/abs/2005.12872)] [[code](https://github.com/facebookresearch/detr)]\n- **[FPT]** Feature Pyramid Transformer (**CVPR**) [[paper](https://arxiv.org/abs/2007.09451)] [[code](https://github.com/ZHANGDONG-NJUST/FPT)]\n\n### Other resource\n- [[Awesome-Transformer-Attention](https://github.com/cmhungsteve/Awesome-Transformer-Attention)]\n\n### Acknowledgement\n\nThanks to [Awesome-Crowd-Counting](https://github.com/gjy3035/Awesome-Crowd-Counting) for the template.\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdk-liang%2FAwesome-Visual-Transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdk-liang%2FAwesome-Visual-Transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdk-liang%2FAwesome-Visual-Transformer/lists"}