{"id":14110224,"url":"https://github.com/AIprogrammer/Visual-Transformer-Paper-Summary","last_synced_at":"2025-08-01T09:33:21.427Z","repository":{"id":133428218,"uuid":"339915678","full_name":"AIprogrammer/Visual-Transformer-Paper-Summary","owner":"AIprogrammer","description":"Summary of Transformer applications for computer vision tasks.","archived":false,"fork":false,"pushed_at":"2021-08-07T03:49:24.000Z","size":175,"stargazers_count":58,"open_issues_count":1,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-22T14:27:31.487Z","etag":null,"topics":["attention","attention-visualization","awesome","computer-vision","detr","papers","segmentation","survey","transformer","transformer-networks","visual-transformer","vit"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AIprogrammer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-18T02:39:35.000Z","updated_at":"2024-10-12T14:37:06.000Z","dependencies_parsed_at":"2023-04-19T09:47:36.643Z","dependency_job_id":null,"html_url":"https://github.com/AIprogrammer/Visual-Transformer-Paper-Summary","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIprogrammer%2FVisual-Transformer-Paper-Summary","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIprogrammer%2FVisual-Transformer-Paper-Summary/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/AIprogrammer%2FVisual-Transformer-Paper-Summary/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AIprogrammer%2FVisual-Transformer-Paper-Summary/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AIprogrammer","download_url":"https://codeload.github.com/AIprogrammer/Visual-Transformer-Paper-Summary/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228360668,"owners_count":17907952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","attention-visualization","awesome","computer-vision","detr","papers","segmentation","survey","transformer","transformer-networks","visual-transformer","vit"],"created_at":"2024-08-14T10:02:43.725Z","updated_at":"2024-12-05T19:31:38.409Z","avatar_url":"https://github.com/AIprogrammer.png","language":null,"readme":"# Awesome-Transformer-CV\n\nIf you have any problems, suggestions or improvements, please submit the issue or PR.\n\n## Contents\n* [Attention](#attention)\n* [OverallSurvey](#OverallSurvey)\n* [NLP](#nlp)\n    * [Language](#language)\n    * [Speech](#Speech)\n* [CV](#cv)\n    * [Backbone_Classification](#Backbone_Classification)\n    * [Self-Supervised](#Self-Supervised)\n    * [Interpretability and Robustness](#Interpretability-and-Robustness)\n    * [Detection](#Detection)\n    * [HOI](#HOI)\n    * [Tracking](#Tracking)\n    * [Segmentation](#Segmentation)\n    * [Reid](#Reid)\n    * [Localization](#Localization)\n    * [Generation](#Generation)\n    * [Inpainting](#Inpainting)\n    * [Image 
enhancement](#Image-enhancement)\n    * [Pose Estimation](#Pose-Estimation)\n    * [Face](#Face)\n    * [Video Understanding](#Video-Understanding)\n    * [Depth Estimation](#Depth-Estimation)\n    * [Prediction](#Prediction)\n    * [NAS](#NAS)\n    * [PointCloud](#PointCloud)\n    * [Fashion](#Fashion)\n    * [Medical](#Medical)\n* [Cross-Modal](#Cross-Modal)\n* [Reference](#Reference)\n* [Acknowledgement](#Acknowledgement)\n\n\n## Attention\n- Recurrent Models of Visual Attention [2014 deepmind NIPS]\n- Neural Machine Translation by Jointly Learning to Align and Translate [ICLR 2015]\n\n## OverallSurvey\n- Efficient Transformers: A Survey [[paper](https://arxiv.org/abs/2009.06732)]\n- A Survey on Visual Transformer [[paper](https://arxiv.org/abs/2012.12556)]\n- Transformers in Vision: A Survey [[paper](https://arxiv.org/abs/2101.01169)]\n \n## NLP\n\n\u003ca name=\"language\"\u003e\u003c/a\u003e\n### Language\n- Sequence to Sequence Learning with Neural Networks [NIPS 2014] [[paper](https://arxiv.org/abs/1409.3215)] [[code](https://github.com/bentrevett/pytorch-seq2seq)]\n- End-To-End Memory Networks [NIPS 2015] [[paper](https://arxiv.org/abs/1503.08895)] [[code](https://github.com/nmhkahn/MemN2N-pytorch)]\n- Attention is all you need [NIPS 2017] [[paper](https://arxiv.org/abs/1706.03762)] [[code]()]\n- **B**idirectional **E**ncoder **R**epresentations from **T**ransformers: BERT [[paper]()] [[code](https://huggingface.co/transformers/)] [[pretrained-models](https://huggingface.co/transformers/pretrained_models.html)]\n- Reformer: The Efficient Transformer [ICLR2020] [[paper](https://arxiv.org/abs/2001.04451)] [[code](https://github.com/lucidrains/reformer-pytorch)]\n- Linformer: Self-Attention with Linear Complexity [arxiv2020] [[paper](https://arxiv.org/abs/2006.04768)] [[code](https://github.com/lucidrains/linformer)]\n- GPT-3: Language Models are Few-Shot Learners [NIPS 2020] [[paper](https://arxiv.org/abs/2005.14165)] 
[[code](https://github.com/openai/gpt-3)]\n\n\u003ca name=\"Speech\"\u003e\u003c/a\u003e\n### Speech\n- Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation [INTERSPEECH 2020] [[paper](https://arxiv.org/abs/2007.13975)] [[code](https://github.com/ujscjj/DPTNet)]\n\n## CV\n\u003ca name=\"Backbone_Classification\"\u003e\u003c/a\u003e\n### Backbone_Classification\n#### Papers and Codes\n- CoaT: Co-Scale Conv-Attentional Image Transformers [arxiv 2021] [[paper](http://arxiv.org/abs/2104.06399)] [[code](https://github.com/mlpc-ucsd/CoaT)]\n- SiT: Self-supervised vIsion Transformer [arxiv 2021] [[paper](https://arxiv.org/abs/2104.03602)] [[code](https://github.com/Sara-Ahmed/SiT)]\n- VIT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [VIT] [ICLR 2021] [[paper](https://arxiv.org/abs/2010.11929)] [[code](https://github.com/lucidrains/vit-pytorch)]\n    - Trained with extra private data; does not generalize well when trained on insufficient amounts of data\n- DeiT: Data-efficient Image Transformers [arxiv2021] [[paper](https://arxiv.org/abs/2012.12877)] [[code](https://github.com/facebookresearch/deit)]\n    - Token-based strategy, built upon VIT and convolutional models\n- Transformer in Transformer [arxiv 2021] [[paper](https://arxiv.org/abs/2103.00112)] [[code1](https://github.com/lucidrains/transformer-in-transformer)] [[code-official](https://github.com/huawei-noah/noah-research/tree/master/TNT)]\n- OmniNet: Omnidirectional Representations from Transformers [arxiv2021] [[paper](https://arxiv.org/abs/2103.01075)]\n- Gaussian Context Transformer [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Ruan_Gaussian_Context_Transformer_CVPR_2021_paper.pdf)]\n- General Multi-Label Image Classification With Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2011.14027)] [[code](https://github.com/QData/C-Tran)]\n- Scaling Local Self-Attention for Parameter Efficient 
Visual Backbones [CVPR 2021] [[paper](https://arxiv.org/abs/2103.12731)]\n- T2T-ViT: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [ICCV 2021] [[paper](https://arxiv.org/abs/2101.11986)] [[code](https://github.com/yitu-opensource/T2T-ViT)]\n- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [ICCV 2021] [[paper](https://arxiv.org/abs/2103.14030)] [[code](https://github.com/microsoft/Swin-Transformer)]\n- Bias Loss for Mobile Neural Networks [ICCV 2021] [[paper](https://arxiv.org/abs/2107.11170)] [[code]()]\n- Vision Transformer with Progressive Sampling [ICCV 2021] [[paper](https://arxiv.org/abs/2108.01684)] [[code](https://github.com/yuexy/PS-ViT)]\n- Rethinking Spatial Dimensions of Vision Transformers [ICCV 2021] [[paper](https://arxiv.org/abs/2103.16302)] [[code](https://github.com/naver-ai/pit)]\n- Rethinking and Improving Relative Position Encoding for Vision Transformer [ICCV 2021] [[paper](https://arxiv.org/abs/2107.14222)] [[code](https://github.com/microsoft/AutoML/tree/main/iRPE)]\n\n#### Interesting Repos\n- [Convolutional Cifar10](https://github.com/kuangliu/pytorch-cifar/blob/master/main.py)\n- [vision-transformers-cifar10](https://github.com/kentaroy47/vision-transformers-cifar10)\n    - Found that performance was worse than a simple resnet18\n    - The influence of hyper-parameters: dim of vit, etc.\n- [ViT-pytorch](https://github.com/jeonsworld/ViT-pytorch)\n    - Using pretrained weights gives better results\n\n\u003ca name=\"Self-Supervised\"\u003e\u003c/a\u003e\n### Self-Supervised\n- Emerging Properties in Self-Supervised Vision Transformers [ICCV 2021] [[paper](https://arxiv.org/abs/2104.14294)] [[code](https://github.com/facebookresearch/dino)]\n- An Empirical Study of Training Self-Supervised Vision Transformers [ICCV 2021] [[paper](https://arxiv.org/abs/2104.02057)] [[code](https://github.com/searobbersduck/MoCo_v3_pytorch)]\n\n\u003ca 
name=\"Interpretability-and-Robustness\"\u003e\u003c/a\u003e\n### Interpretability and Robustness\n- Transformer Interpretability Beyond Attention Visualization [CVPR 2021] [[paper](https://arxiv.org/abs/2012.09838)] [[code](https://github.com/hila-chefer/Transformer-Explainability)]\n- On the Adversarial Robustness of Visual Transformers [arxiv 2021] [[paper](https://arxiv.org/abs/2103.15670)] \n- Robustness Verification for Transformers [ICLR 2020] [[paper](https://arxiv.org/abs/2002.06622)] [[code](https://github.com/shizhouxing/Robustness-Verification-for-Transformers)]\n- Pretrained Transformers Improve Out-of-Distribution Robustness [ACL 2020] [[paper](https://arxiv.org/abs/2004.06100)] [[code](https://github.com/camelop/NLP-Robustness)]\n\n\u003ca name=\"Detection\"\u003e\u003c/a\u003e\n### Detection\n- DETR: End-to-End Object Detection with Transformers [ECCV2020] [[paper](https://arxiv.org/abs/2005.12872)] [[code](https://github.com/facebookresearch/detr)]\n- Deformable DETR: Deformable Transformers for End-to-End Object Detection [ICLR2021] [[paper](https://openreview.net/forum?id=gZ9hCDWe6ke)] [[code](https://github.com/fundamentalvision/Deformable-DETR)]\n- End-to-End Object Detection with Adaptive Clustering Transformer [arxiv2020] [[paper](https://arxiv.org/abs/2011.09315)]\n- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [arxiv2020] [[paper](https://arxiv.org/abs/2011.09094)]\n- Rethinking Transformer-based Set Prediction for Object Detection [arxiv2020] [[paper](https://arxiv.org/pdf/2011.10881.pdf)] [[zhihu](https://zhuanlan.zhihu.com/p/326647798)]\n- End-to-end Lane Shape Prediction with Transformers [WACV 2021] [[paper](https://arxiv.org/pdf/2011.04233.pdf)] [[code](https://github.com/liuruijin17/LSTR)]\n- ViT-FRCNN: Toward Transformer-Based Object Detection [arxiv2020] [[paper](https://arxiv.org/abs/2012.09958)]\n- Line Segment Detection Using Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2101.01909)] 
[[code](https://github.com/mlpc-ucsd/LETR)]\n- Facial Action Unit Detection With Transformers [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Jacob_Facial_Action_Unit_Detection_With_Transformers_CVPR_2021_paper.pdf)] [[code]()]\n- Adaptive Image Transformer for One-Shot Object Detection [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Adaptive_Image_Transformer_for_One-Shot_Object_Detection_CVPR_2021_paper.pdf)] [[code]()]\n- Self-attention based Text Knowledge Mining for Text Detection [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Wan_Self-Attention_Based_Text_Knowledge_Mining_for_Text_Detection_CVPR_2021_paper.pdf)] [[code](https://github.com/CVI-SZU/STKM)]\n- Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [ICCV 2021] [[paper](https://arxiv.org/abs/2102.12122)] [[code](https://github.com/whai362/PVT)]\n- Group-Free 3D Object Detection via Transformers [ICCV 2021] [[paper](https://arxiv.org/abs/2104.00678)] [[code](https://github.com/zeliu98/Group-Free-3D)]\n- Fast Convergence of DETR with Spatially Modulated Co-Attention [ICCV 2021] [[paper](https://arxiv.org/abs/2101.07448)] [[code](https://github.com/abc403/SMCA-replication)]\n\n\n\u003ca name=\"HOI\"\u003e\u003c/a\u003e\n### HOI\n- End-to-End Human Object Interaction Detection with HOI Transformer [CVPR 2021] [[paper](https://arxiv.org/abs/2103.04503)] [[code](https://github.com/bbepoch/HoiTransformer)]\n- HOTR: End-to-End Human-Object Interaction Detection with Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2104.13682)] [[code](https://github.com/kakaobrain/HOTR)]\n\n\u003ca name=\"Tracking\"\u003e\u003c/a\u003e\n### Tracking\n- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking [CVPR 2021] [[paper](https://arxiv.org/abs/2103.11681)] [[code](https://github.com/594422814/TransformerTrack)]\n- TransTrack: Multiple-Object Tracking with 
Transformer [CVPR 2021] [[paper](https://arxiv.org/abs/2012.15460)] [[code](https://github.com/PeizeSun/TransTrack)]\n- Transformer Tracking [CVPR 2021] [[paper](https://arxiv.org/abs/2103.15436)] [[code](https://github.com/chenxin-dlut/TransT)]\n- Learning Spatio-Temporal Transformer for Visual Tracking [ICCV 2021] [[paper](https://arxiv.org/abs/2103.17154)] [[code](https://github.com/researchmm/Stark)]\n\n\u003ca name=\"Segmentation\"\u003e\u003c/a\u003e\n### Segmentation\n- SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2012.15840)] [[code](https://github.com/fudan-zvg/SETR)]\n- Trans2Seg: Transparent Object Segmentation with Transformer [arxiv2021] [[paper](https://arxiv.org/abs/2101.08461)] [[code](https://github.com/xieenze/Trans2Seg)]\n- End-to-End Video Instance Segmentation with Transformers [arxiv2020] [[paper](https://arxiv.org/abs/2011.14503)] [[zhihu](https://zhuanlan.zhihu.com/p/343286325)]\n- MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers [CVPR 2021] [[paper](https://arxiv.org/pdf/2012.00759.pdf)] [[official-code](https://github.com/google-research/deeplab2/blob/main/g3doc/projects/max_deeplab.md)] [[unofficial-code](https://github.com/conradry/max-deeplab)]\n- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [arxiv 2021] [[paper](https://arxiv.org/pdf/2102.10662.pdf)] [[code](https://github.com/jeya-maria-jose/Medical-Transformer)]\n- SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation [CVPR 2021] [[paper](https://arxiv.org/abs/2101.08833)] [[code](https://github.com/dukebw/SSTVOS)]\n\n\u003ca name=\"Reid\"\u003e\u003c/a\u003e\n### Reid\n- Diverse Part Discovery: Occluded Person Re-Identification With Part-Aware Transformer [CVPR 2021] [[paper](https://arxiv.org/abs/2106.04095)] [[code]()]\n\n\u003ca name=\"Localization\"\u003e\u003c/a\u003e\n### Localization\n- LoFTR: Detector-Free Local 
Feature Matching with Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2104.00680)] [[code](https://github.com/zju3dv/LoFTR)]\n- MIST: Multiple Instance Spatial Transformer [CVPR 2021] [[paper](https://arxiv.org/pdf/1811.10725)] [[code](https://github.com/ubc-vision/mist)]\n\n\u003ca name=\"Generation\"\u003e\u003c/a\u003e\n### Generation\n- Variational Transformer Networks for Layout Generation [CVPR 2021] [[paper](https://arxiv.org/abs/2104.02416)] [[code](https://github.com/zlinao/Variational-Transformer)]\n- TransGAN: Two Transformers Can Make One Strong GAN [[paper](https://arxiv.org/pdf/2102.07074.pdf)] [[code](https://github.com/VITA-Group/TransGAN)]\n- Taming Transformers for High-Resolution Image Synthesis [CVPR 2021] [[paper](https://arxiv.org/abs/2012.09841)] [[code](https://github.com/CompVis/taming-transformers)]\n- iGPT: Generative Pretraining from Pixels [ICML 2020] [[paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf)] [[code](https://github.com/openai/image-gpt)]\n- Generative Adversarial Transformers [arxiv 2021] [[paper](https://arxiv.org/abs/2103.01209)] [[code](https://github.com/dorarad/gansformer)]\n- LayoutTransformer: Scene Layout Generation With Conceptual and Spatial Diversity [CVPR2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/html/Yang_LayoutTransformer_Scene_Layout_Generation_With_Conceptual_and_Spatial_Diversity_CVPR_2021_paper.html)] [[code](https://github.com/davidhalladay/LayoutTransformer)]\n- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [ICCV 2021] [[paper](https://arxiv.org/abs/2107.12309)]\n\n\n\u003ca name=\"Inpainting\"\u003e\u003c/a\u003e\n### Inpainting\n- STTN: Learning Joint Spatial-Temporal Transformations for Video Inpainting [ECCV 2020] [[paper](https://arxiv.org/abs/2007.10247)] [[code](https://github.com/researchmm/STTN)]\n\n\u003ca name=\"Image-enhancement\"\u003e\u003c/a\u003e\n### Image enhancement\n- Pre-Trained Image Processing Transformer [CVPR 
2021] [[paper](https://arxiv.org/abs/2012.00364)]\n- TTSR: Learning Texture Transformer Network for Image Super-Resolution [CVPR2020] [[paper](https://arxiv.org/abs/2006.04139)] [[code](https://github.com/researchmm/TTSR)]\n\n\u003ca name=\"Pose-Estimation\"\u003e\u003c/a\u003e\n### Pose Estimation\n- Pose Recognition with Cascade Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2104.06976)] [[code](https://github.com/mlpc-ucsd/PRTR)]\n- TransPose: Towards Explainable Human Pose Estimation by Transformer [arxiv 2020] [[paper](https://arxiv.org/abs/2012.14214)] [[code](https://github.com/yangsenius/TransPose)]\n- Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation [ECCV 2020] [[paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123700018.pdf)]\n- HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation [ACMMM 2020] [[paper](https://cse.buffalo.edu/~jmeng2/publications/hotnet_mm20)]\n- End-to-End Human Pose and Mesh Reconstruction with Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2012.09760)] [[code](https://github.com/microsoft/MeshTransformer)]\n- 3D Human Pose Estimation with Spatial and Temporal Transformers [arxiv 2021] [[paper](https://arxiv.org/pdf/2103.10455.pdf)] [[code](https://github.com/zczcwh/PoseFormer)]\n- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [arxiv 2021] [[paper](https://arxiv.org/abs/2103.12115)]\n\n\n\u003ca name=\"Face\"\u003e\u003c/a\u003e\n### Face\n- Robust Facial Expression Recognition with Convolutional Visual Transformers [arxiv 2021] [[paper](https://arxiv.org/abs/2103.16854)]\n- Clusformer: A Transformer Based Clustering Approach to Unsupervised Large-Scale Face and Visual Landmark Recognition [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/html/Nguyen_Clusformer_A_Transformer_Based_Clustering_Approach_to_Unsupervised_Large-Scale_Face_CVPR_2021_paper.html)] 
[[code]()]\n\n\n\u003ca name=\"Video-Understanding\"\u003e\u003c/a\u003e\n### Video Understanding\n- Is Space-Time Attention All You Need for Video Understanding? [arxiv 2021] [[paper](https://arxiv.org/abs/2102.05095)] [[code](https://github.com/lucidrains/TimeSformer-pytorch)]\n- Temporal-Relational CrossTransformers for Few-Shot Action Recognition [CVPR 2021] [[paper](https://arxiv.org/abs/2101.06184)] [[code](https://github.com/tobyperrett/trx)]\n- Self-Supervised Video Hashing via Bidirectional Transformers [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/html/Li_Self-Supervised_Video_Hashing_via_Bidirectional_Transformers_CVPR_2021_paper.html)]\n- SSAN: Separable Self-Attention Network for Video Representation Learning [CVPR 2021] [[paper](https://arxiv.org/abs/2105.13033)]\n\n\u003ca name=\"Depth-Estimation\"\u003e\u003c/a\u003e\n### Depth-Estimation\n- AdaBins: Depth Estimation using Adaptive Bins [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Bhat_AdaBins_Depth_Estimation_Using_Adaptive_Bins_CVPR_2021_paper.pdf)] [[code](https://github.com/shariqfarooq123)]\n\n\n\u003ca name=\"Prediction\"\u003e\u003c/a\u003e\n### Prediction\n- Multimodal Motion Prediction with Stacked Transformers [CVPR 2021] [[paper](https://arxiv.org/pdf/2103.11624.pdf)] [[code](https://github.com/decisionforce/mmTransformer)]\n- Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case [[paper](https://arxiv.org/pdf/2001.08317.pdf)]\n- Transformer networks for trajectory forecasting [ICPR 2020] [[paper](https://arxiv.org/abs/2003.08111)] [[code](https://github.com/FGiuliari/Trajectory-Transformer)]\n- Spatial-Channel Transformer Network for Trajectory Prediction on the Traffic Scenes [arxiv 2021] [[paper](https://arxiv.org/abs/2101.11472)] [[code]()]\n- Pedestrian Trajectory Prediction using Context-Augmented Transformer Networks [ICRA 2020] 
[[paper](https://arxiv.org/abs/2012.01757)] [[code]()]\n- Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction [ECCV 2020] [[paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123570494.pdf)] [[code](https://github.com/Majiker/STAR)]\n- Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction [[paper](https://www.ijcai.org/Proceedings/2020/0640.pdf)]\n- Single-Shot Motion Completion with Transformer [arxiv2021] [[paper](https://arxiv.org/abs/2103.00776)] [[code](https://github.com/FuxiCV/SSMCT)]\n\n\u003ca name=\"NAS\"\u003e\u003c/a\u003e\n### NAS\n- HR-NAS: Searching Efficient High-Resolution Neural Architectures with Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2106.06560)] [[code](https://github.com/dingmyu/HR-NAS)]\n- AutoFormer: Searching Transformers for Visual Recognition [ICCV 2021] [[paper](https://arxiv.org/abs/2107.00651)] [[code](https://github.com/microsoft/AutoML)]\n\n\u003ca name=\"PointCloud\"\u003e\u003c/a\u003e\n### PointCloud\n- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [CVPR 2021] [[paper](https://arxiv.org/abs/2104.09224)] [[code](https://github.com/autonomousvision/transfuser)]\n- Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/html/Fan_Point_4D_Transformer_Networks_for_Spatio-Temporal_Modeling_in_Point_Cloud_CVPR_2021_paper.html)]\n\n\u003ca name=\"Fashion\"\u003e\u003c/a\u003e\n### Fashion\n- Kaleido-BERT: Vision-Language Pre-training on Fashion Domain [CVPR 2021] [[paper](https://arxiv.org/abs/2103.16110)] [[code](https://github.com/mczhuge/Kaleido-BERT)]\n\n\u003ca name=\"Medical\"\u003e\u003c/a\u003e\n### Medical\n- Lesion-Aware Transformers for Diabetic Retinopathy Grading [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/html/Sun_Lesion-Aware_Transformers_for_Diabetic_Retinopathy_Grading_CVPR_2021_paper.html)]\n\n\u003ca 
name=\"Cross-Modal\"\u003e\u003c/a\u003e\n## Cross-Modal\n- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [CVPR 2021] [[paper](https://arxiv.org/abs/2103.16553)]\n- Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [CVPR2021] [[paper](https://www.amazon.science/publications/revamping-cross-modal-recipe-retrieval-with-hierarchical-transformers-and-self-supervised-learning)] [[code](https://github.com/amzn/image-to-recipe-transformers)]\n- Topological Planning With Transformers for Vision-and-Language Navigation [CVPR 2021] [[paper](https://arxiv.org/abs/2012.05292)]\n- Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos [CVPR 2021] [[paper](https://openaccess.thecvf.com/content/CVPR2021/html/Zhang_Multi-Stage_Aggregated_Transformer_Network_for_Temporal_Language_Localization_in_Videos_CVPR_2021_paper.html)]\n- VLN BERT: A Recurrent Vision-and-Language BERT for Navigation [CVPR 2021] [[paper](https://arxiv.org/abs/2011.13922)] [[code](https://github.com/YicongHong/Recurrent-VLN-BERT)]\n- Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [CVPR 2021] [[paper](https://arxiv.org/abs/2102.06183)] [[code](https://github.com/jayleicn/ClipBERT)]\n\n# Reference\n- The attention mechanism explained in detail, parts 1 and 2 [zhihu1](https://zhuanlan.zhihu.com/p/47063917) [zhihu2](https://zhuanlan.zhihu.com/p/47282410)\n- [The Self-attention Mechanism in Natural Language Processing](https://www.cnblogs.com/robert-dlut/p/8638283.html)\n- The Transformer model explained in detail [[zhihu](https://zhuanlan.zhihu.com/p/44121378)] [[csdn](https://blog.csdn.net/longxinchen_ml/article/details/86533005)]\n- [A Complete Analysis of RNN, Seq2Seq, and the Attention Mechanism](https://zhuanlan.zhihu.com/p/51383402)\n- [Seq2Seq and transformer implementation](https://github.com/bentrevett/pytorch-seq2seq)\n- End-To-End Memory Networks [[zhihu](https://zhuanlan.zhihu.com/p/29679742)]\n- [Illustrating the key, query, value in 
attention](https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190)\n- [Transformer in CV](https://towardsdatascience.com/transformer-in-cv-bbdb58bf335e)\n- [CVPR2021-Papers-with-Code](https://github.com/amusi/CVPR2021-Papers-with-Code)\n- [ICCV2021-Papers-with-Code](https://github.com/amusi/ICCV2021-Papers-with-Code)\n# Acknowledgement\nThanks to the authors of the awesome Transformer survey papers.","funding_links":[],"categories":["Other Lists"],"sub_categories":["TeX Lists"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAIprogrammer%2FVisual-Transformer-Paper-Summary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAIprogrammer%2FVisual-Transformer-Paper-Summary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAIprogrammer%2FVisual-Transformer-Paper-Summary/lists"}