Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, code, and related websites.
https://github.com/cmhungsteve/Awesome-Transformer-Attention

List: Awesome-Transformer-Attention

attention-mechanism attention-mechanisms awesome-list computer-vision deep-learning detr papers self-attention transformer transformer-architecture transformer-awesome transformer-cv transformer-models transformer-with-cv transformers vision-transformer visual-transformer vit

README

# Ultimate-Awesome-Transformer-Attention [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

This repo contains a comprehensive paper list of **Vision Transformer & Attention**, including papers, code, and related websites.

This list is maintained by [Min-Hung Chen](https://minhungchen.netlify.app/). (*Actively* updated)

If you find any missing papers, **feel free to [*create a pull request*](https://github.com/cmhungsteve/Awesome-Transformer-Attention/blob/main/How-to-PR.md), [*open an issue*](https://github.com/cmhungsteve/Awesome-Transformer-Attention/issues/new), or [*email* me](mailto:[email protected])**.

Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider **[citing](#citation)** and **★STARing** this list.

Feel free to share this list with others!

**[Update: January, 2024]** Added all the related papers from *NeurIPS 2023*!

**[Update: December, 2023]** Added all the related papers from *ICCV 2023*!

**[Update: September, 2023]** Split the multi-modal paper list out into [README_multimodal.md](README_multimodal.md)!

**[Update: June, 2023]** Added all the related papers from *ICML 2023*!

**[Update: June, 2023]** Added all the related papers from *CVPR 2023*!

**[Update: February, 2023]** Added all the related papers from *ICLR 2023*!

**[Update: December, 2022]** Added attention-free papers from [Networks Beyond Attention (GitHub)](https://github.com/FocalNet/Networks-Beyond-Attention) made by [Jianwei Yang](https://github.com/jwyang)

**[Update: November, 2022]** Added all the related papers from *NeurIPS 2022*!

**[Update: October, 2022]** Split the 2nd half of the paper list out into [README_2.md](README_2.md)!

**[Update: October, 2022]** Added all the related papers from *ECCV 2022*!

**[Update: September, 2022]** Added the [Transformer tutorial slides](http://lucasb.eyer.be/transformer) made by [Lucas Beyer](https://twitter.com/giffmana)!

**[Update: June, 2022]** Added all the related papers from *CVPR 2022*!

---
## Overview

- [Citation](#citation)
- [Survey](#survey)
- [Image Classification / Backbone](#image-classification--backbone)
    - [Replace Conv w/ Attention](#replace-conv-w-attention)
        - [Pure Attention](#pure-attention)
        - [Conv-stem + Attention](#conv-stem--attention)
        - [Conv + Attention](#conv--attention)
    - [Vision Transformer](#vision-transformer)
        - [General Vision Transformer](#general-vision-transformer)
        - [Efficient Vision Transformer](#efficient-vision-transformer)
        - [Conv + Transformer](#conv--transformer)
        - [Training + Transformer](#training--transformer)
        - [Robustness + Transformer](#robustness--transformer)
        - [Model Compression + Transformer](#model-compression--transformer)
    - [Attention-Free](#attention-free)
        - [MLP-Series](#mlp-series)
        - [Other Attention-Free](#other-attention-free)
    - [Analysis for Transformer](#analysis-for-transformer)
- [Detection](#detection)
    - [Object Detection](#object-detection)
    - [3D Object Detection](#3d-object-detection)
    - [Multi-Modal Detection](#multi-modal-detection)
    - [HOI Detection](#hoi-detection)
    - [Salient Object Detection](#salient-object-detection)
    - [Other Detection Tasks](#other-detection-tasks)
- [Segmentation](#segmentation)
    - [Semantic Segmentation](#semantic-segmentation)
    - [Depth Estimation](#depth-estimation)
    - [Object Segmentation](#object-segmentation)
    - [Other Segmentation Tasks](#other-segmentation-tasks)
- [Video (High-level)](#video-high-level)
    - [Action Recognition](#action-recognition)
    - [Action Detection/Localization](#action-detectionlocalization)
    - [Action Prediction/Anticipation](#action-predictionanticipation)
    - [Video Object Segmentation](#video-object-segmentation)
    - [Video Instance Segmentation](#video-instance-segmentation)
    - [Other Video Tasks](#other-video-tasks)
- [References](#references)

------ (The following papers have been moved to [README_multimodal.md](README_multimodal.md)) ------

- [Multi-Modality](README_multimodal.md#multi-modality)
    - [Visual Captioning](README_multimodal.md#visual-captioning)
    - [Visual Question Answering](README_multimodal.md#visual-question-answering)
    - [Visual Grounding](README_multimodal.md#visual-grounding)
    - [Multi-Modal Representation Learning](README_multimodal.md#multi-modal-representation-learning)
    - [Multi-Modal Retrieval](README_multimodal.md#multi-modal-retrieval)
    - [Multi-Modal Generation](README_multimodal.md#multi-modal-generation)
    - [Prompt Learning/Tuning](README_multimodal.md#prompt-learningtuning)
    - [Visual Document Understanding](README_multimodal.md#visual-document-understanding)
    - [Other Multi-Modal Tasks](README_multimodal.md#other-multi-modal-tasks)

------ (The following papers have been moved to [README_2.md](README_2.md)) ------

- [Other High-level Vision Tasks](README_2.md#other-high-level-vision-tasks)
    - [Point Cloud / 3D](README_2.md#point-cloud--3d)
    - [Pose Estimation](README_2.md#pose-estimation)
    - [Tracking](README_2.md#tracking)
    - [Re-ID](README_2.md#re-id)
    - [Face](README_2.md#face)
    - [Scene Graph](README_2.md#scene-graph)
    - [Neural Architecture Search](README_2.md#neural-architecture-search)
    - [Transfer / X-Supervised / X-Shot / Continual Learning](README_2.md#transfer--x-supervised--x-shot--continual-learning)
- [Low-level Vision Tasks](README_2.md#low-level-vision-tasks)
    - [Image Restoration](README_2.md#image-restoration)
    - [Video Restoration](README_2.md#video-restoration)
    - [Inpainting / Completion / Outpainting](README_2.md#inpainting--completion--outpainting)
    - [Image Generation](README_2.md#image-generation)
    - [Video Generation](README_2.md#video-generation)
    - [Transfer / Translation / Manipulation](README_2.md#transfer--translation--manipulation)
    - [Other Low-Level Tasks](README_2.md#other-low-level-tasks)
- [Reinforcement Learning](README_2.md#reinforcement-learning)
    - [Navigation](README_2.md#navigation)
    - [Other RL Tasks](README_2.md#other-rl-tasks)
- [Medical](README_2.md#medical)
    - [Medical Segmentation](README_2.md#medical-segmentation)
    - [Medical Classification](README_2.md#medical-classification)
    - [Medical Detection](README_2.md#medical-detection)
    - [Medical Reconstruction](README_2.md#medical-reconstruction)
    - [Medical Low-Level Vision](README_2.md#medical-low-level-vision)
    - [Medical Vision-Language](README_2.md#medical-vision-language)
    - [Medical Others](README_2.md#medical-others)
- [Other Tasks](README_2.md#other-tasks)
- [Attention Mechanisms in Vision/NLP](README_2.md#attention-mechanisms-in-visionnlp)
    - [Attention for Vision](README_2.md#attention-for-vision)
    - [NLP](README_2.md#attention-for-nlp)
    - [Both](README_2.md#attention-for-both)
    - [Others](README_2.md#attention-for-others)

---

## Citation
If you find this repository useful, please consider citing this list:
```
@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}
```

---

## Survey
* "A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024 (*Purdue*). [[Paper](https://arxiv.org/abs/2311.12320)][[GitHub](https://github.com/IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving)]
* "Efficient Multimodal Large Language Models: A Survey", arXiv, 2024 (*Tencent*). [[Paper](https://arxiv.org/abs/2405.10739)][[GitHub](https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey)]
* "From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024 (*Newcastle University, UK*). [[Paper](https://arxiv.org/abs/2405.10674)][[GitHub](https://github.com/soraw-ai/Awesome-Text-to-Video-Generation)]
* "When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024 (*Oxford*). [[Paper](https://arxiv.org/abs/2405.10255)][[GitHub](https://github.com/ActiveVisionLab/Awesome-LLM-3D)]
* "Foundation Models for Video Understanding: A Survey", arXiv, 2024 (*Aalborg University, Denmark*). [[Paper](https://arxiv.org/abs/2405.03770)][[GitHub](https://github.com/NeeluMadan/ViFM_Survey)]
* "Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024 (*Chongqing University*). [[Paper](https://arxiv.org/abs/2405.04404)][[GitHub](https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy)]
* "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024 (*GigaAI, China*). [[Paper](https://arxiv.org/abs/2405.03520)][[GitHub](https://github.com/GigaAI-research/General-World-Models-Survey)]
* "Video Diffusion Models: A Survey", arXiv, 2024 (*Bielefeld University, Germany*). [[Paper](https://arxiv.org/abs/2405.03150)][[GitHub](https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models)]
* "Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024 (*Lehigh + UPenn*). [[Paper](https://arxiv.org/abs/2404.18961)]
* "Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2404.18930)][[GitHub](https://github.com/showlab/Awesome-MLLM-Hallucination)]
* "A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024 (*HKUST*). [[Paper](https://arxiv.org/abs/2404.18861)][[GitHub](https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models)]
* "State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024 (*Anhui University*). [[Paper](https://arxiv.org/abs/2404.09516)][[GitHub](https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List)]
* "Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024 (*IIT Patna*). [[Paper](https://arxiv.org/abs/2404.07214)]
* "From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2403.12027)][[GitHub](https://github.com/khuangaf/Awesome-Chart-Understanding)]
* "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024 (*Northeastern*). [[Paper](https://arxiv.org/abs/2403.14608)]
* "Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2403.05131)]
* "Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2403.04279)][[GitHub](https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models)]
* "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024 (*Lehigh University, Pennsylvania*). [[Paper](https://arxiv.org/abs/2402.17177)][[GitHub](https://github.com/lichao-sun/SoraReview)]
* "Large Multimodal Agents: A Survey", arXiv, 2024 (*CUHK*). [[Paper](https://arxiv.org/abs/2402.15116)][[GitHub](https://github.com/jun0wanan/awesome-large-multimodal-agents)]
* "Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024 (*BIGAI*). [[Paper](https://arxiv.org/abs/2402.02242)][[GitHub](https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning)]
* "Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024 (*Qufu Normal University, China*). [[Paper](https://arxiv.org/abs/2402.14304)]
* "The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024 (*University of Modena and Reggio Emilia (UniMoRE), Italy*). [[Paper](https://arxiv.org/abs/2402.12451)]
* "Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024 (*Westlake University, China*). [[Paper](https://arxiv.org/abs/2401.00897)][[GitHub](https://github.com/Lupin1998/Awesome-MIM)]
* "Transformer for Object Re-Identification: A Survey", arXiv, 2024 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2401.06960)]
* "Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2401.08045)][[GtiHub](https://github.com/zhanghm1995/Forge_VFM4AD)]
* "MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024 (*Tencent*). [[Paper](https://arxiv.org/abs/2401.13601)]
* "From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2401.15071)]
* "A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2402.00253)]
* "A Survey for Foundation Models in Autonomous Driving", arXiv, 2024 (*Motional, Massachusetts*). [[Paper](https://arxiv.org/abs/2402.01105)]
* "A Survey on Transformer Compression", arXiv, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2402.05964)]
* "Vision + Language Applications: A Survey", CVPRW, 2023 (*Ritsumeikan University, Japan*). [[Paper](https://arxiv.org/abs/2305.14598)][[GitHub](https://github.com/Yutong-Zhou-cv/Awesome-Text-to-Image)]
* "Multimodal Learning With Transformers: A Survey", TPAMI, 2023 (*Tsinghua & Oxford*). [[Paper](https://arxiv.org/abs/2206.06488)]
* "A Survey of Visual Transformers", TNNLS, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2111.06091)][[GitHub](https://github.com/arekavandi/Transformer-SOD)]
* "Video Understanding with Large Language Models: A Survey", arXiv, 2023 (*University of Rochester*). [[Paper](https://arxiv.org/abs/2312.17432)][[GitHub](https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding)]
* "Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2312.16602)]
* "A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2312.11562)][[GitHub](https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models)]
* "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2312.12436)][GitHub](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)]
* "Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023 (*JHU*). [[Paper](https://arxiv.org/abs/2312.10163)]
* "Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023 (*Institute for Research in Fundamental Sciences (IPM), Iran*). [[Paper](https://arxiv.org/abs/2311.06786)]
* "Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2311.08172)][[GitHub (in construction)](https://github.com/palchenli/VL-Instruction-Tuning)]
* "Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023 (*York University*). [[Paper](https://arxiv.org/abs/2310.12296)]
* "Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023 (*valeo.ai, France*). [[Paper](https://arxiv.org/abs/2310.12904)][[GitHub](https://github.com/valeoai/Awesome-Unsupervised-Object-Localization)]
* "A Survey on Video Diffusion Models", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2310.10647)][[GitHub](https://github.com/ChenHsing/Awesome-Video-Diffusion-Models)]
* "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2309.17421)]
* "Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2309.10020)]
* "Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023 (*University of Western Australia*). [[Paper](https://arxiv.org/abs/2309.04902)]
* "RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 (*University of Sydney*). [[Paper](https://arxiv.org/abs/2309.00810)]
* "A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2309.02031)]
* "From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 (*UESTC*). [[Paper](https://arxiv.org/abs/2308.05305)]
* "Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2307.13721)][[GitHub](https://github.com/awaisrauf/Awesome-CV-Foundational-Models)]
* "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2307.12980)]
* "Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 (*Xi'an Jiaotong University*). [[Paper](https://arxiv.org/abs/2307.11471)]
* "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2307.09220)]
* "Transformers in Reinforcement Learning: A Survey", arXiv, 2023 (*Mila*). [[Paper](https://arxiv.org/abs/2307.05979)]
* "Vision Language Transformers: A Survey", arXiv, 2023 (*Boise State University, Idaho*). [[Paper](https://arxiv.org/abs/2307.03254)]
* "Towards Open Vocabulary Learning: A Survey", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2306.15880)][[GitHub](https://github.com/jianzongwu/Awesome-Open-Vocabulary)]
* "Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2306.14895)]
* "A Survey on Multimodal Large Language Models", arXiv, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2306.13549)][[GitHub](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)]
* "2D Object Detection with Transformers: A Review", arXiv, 2023 (*German Research Center for Artificial Intelligence, Germany*). [[Paper](https://arxiv.org/abs/2306.04670)]
* "Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 (*Eldorado’s Institute of Technology, Brazil*). [[Paper](https://arxiv.org/abs/2305.11033)]
* "Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 (*NYU*). [[Paper](https://arxiv.org/abs/2305.05726)]
* "Visual Tuning", arXiv, 2023 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2305.06061)]
* "Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2305.04691)]
* "Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 (*University of Peradeniya, Sri Lanka*). [[Paper](https://arxiv.org/abs/2305.03273)]
* "A Review of Deep Learning for Video Captioning", arXiv, 2023 (*Deakin University, Australia*). [[Paper](https://arxiv.org/abs/2304.11431)]
* "Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2304.09854)][[GitHub](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer)]
* "Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 (*?*). [[Paper](https://arxiv.org/abs/2304.00685)][[GitHub (in construction)](https://github.com/jingyi0000/VLM_survey)]
* "Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2303.07909)]
* "Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 (*Berkeley + Google*). [[Paper](https://arxiv.org/abs/2303.04129)]
* "Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2301.03505)][[GitHub](https://github.com/mindflow-institue/Awesome-Transformer)]
* "Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (*IBM*). [[Paper](https://arxiv.org/abs/2302.08374)][[GitHub](https://github.com/badripatro/efficient360)]
* "Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (*Indian Institute of Information Technology*). [[Paper](https://arxiv.org/abs/2302.08641)]
* "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 (*Pengcheng Laboratory*). [[Paper](https://arxiv.org/abs/2302.10035)][[GitHub](https://github.com/wangxiao5791509/MultiModal_BigModels_Survey)]
* "A Survey on Visual Transformer", TPAMI, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2012.12556)]
* "Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022 (*Tsinghua University, China*). [[Paper](https://arxiv.org/abs/2111.07624)][[Springer](https://link.springer.com/article/10.1007/s41095-022-0271-y)][[Github](https://github.com/MenghaoGuo/Awesome-Vision-Attentions)]
* "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (*NavInfo Europe, Netherlands*). [[Paper](https://arxiv.org/abs/2201.08683)]
* "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2204.07356)]
* "Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (*Covenant University, Nigeria*). [[Paper](https://arxiv.org/abs/2211.10043)]
* "A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (*Sejong University*). [[Paper](https://arxiv.org/abs/2211.06004)]
* "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.09263)]
* "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (*Illinois Institute of Technology, Chicago*). [[Paper](https://arxiv.org/abs/2210.02884)]
* "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (*Charles Sturt University, Australia*). [[Paper](https://arxiv.org/abs/2209.05700)]
* "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2202.09061)]
* "Transformers in Remote Sensing: A Survey", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2209.01206)][[Github](https://github.com/VIROBO-15/Transformer-in-Remote-Sensing)]
* "Medical image analysis based on transformer: A Review", arXiv, 2022 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2208.06643)]
* "3D Vision with Transformers: A Survey", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2208.04309)][[GitHub](https://github.com/lahoud/3d-vision-transformers)]
* "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (*NYCU*). [[Paper](https://arxiv.org/abs/2207.03041)]
* "Transformers in Medical Imaging: A Survey", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2201.09873)][[GitHub](https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging)]
* "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2206.06488)]
* "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2206.01136)]
* "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (*University of Waterloo*). [[Paper](https://arxiv.org/abs/2205.07417)]
* "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (*INESC TEC and University of Porto, Portugal*). [[Paper](https://arxiv.org/abs/2204.12406)]
* "Efficient Transformers: A Survey", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2009.06732)]
* "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2111.04060)]
* "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (*National University of Sciences and Technology (NUST), Pakistan*). [[Paper](https://arxiv.org/abs/2203.15269)]
* "Video Transformers: A Survey", arXiv, 2022 (*Universitat de Barcelona, Spain*). [[Paper](https://arxiv.org/abs/2201.05991)]
* "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2202.12165)]
* "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (*?*). [[Paper](https://arxiv.org/abs/2203.01536)]
* "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (*Xidian University*). [[Paper](https://arxiv.org/abs/2203.12944)]
* "Image Captioning In the Transformer Age", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2204.07374)][[GitHub](https://github.com/SjokerLily/awesome-image-captioning)]
* "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (*Fayoum University, Egypt*). [[Paper](https://arxiv.org/abs/2204.07756)]
* "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2101.01169)]
* "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (*Renmin University of China*). [[Paper](https://arxiv.org/abs/2109.09920)]
* "A Survey of Transformers", arXiv, 2021 (*Fudan*). [[Paper](https://arxiv.org/abs/2106.04554)]
* "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (*University of Kashmir, India*). [[Paper](https://arxiv.org/abs/2106.07550)]

[[Back to Overview](#overview)]

## Image Classification / Backbone
### Replace Conv w/ Attention
#### Pure Attention
* **LR-Net**: "Local Relation Networks for Image Recognition", ICCV, 2019 (*Microsoft*). [[Paper](https://arxiv.org/abs/1904.11491)][[PyTorch (gan3sh500)](https://github.com/gan3sh500/local-relational-nets)]
* **SASA**: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 (*Google*). [[Paper](https://arxiv.org/abs/1906.05909)][[PyTorch-1 (leaderj1001)](https://github.com/leaderj1001/Stand-Alone-Self-Attention)][[PyTorch-2 (MerHS)](https://github.com/MerHS/SASA-pytorch)]
* **Axial-Transformer**: "Axial Attention in Multidimensional Transformers", arXiv, 2019 (*Google*). [[Paper](https://openreview.net/forum?id=H1e5GJBtDr)][[PyTorch (lucidrains)](https://github.com/lucidrains/axial-attention)]
* **SAN**: "Exploring Self-attention for Image Recognition", CVPR, 2020 (*CUHK + Intel*). [[Paper](https://arxiv.org/abs/2004.13621)][[PyTorch](https://github.com/hszhao/SAN)]
* **Axial-DeepLab**: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 (*Google*). [[Paper](https://arxiv.org/abs/2003.07853)][[PyTorch](https://github.com/csrhddlam/axial-deeplab)]
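
The papers above replace the spatial aggregation of a convolution with self-attention computed over a small local neighborhood around each pixel. As a rough, hedged sketch of that idea (single-head, no relative position encoding, illustrative module name and hyperparameters; not code from any of the linked repositories):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Attend over a k x k neighborhood of each pixel instead of convolving it."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        self.q = nn.Conv2d(dim, dim, 1)           # 1x1 projections for query/key/value
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.scale = dim ** -0.5

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).view(B, C, 1, H * W)
        k, v = self.kv(x).chunk(2, dim=1)
        pad = self.k // 2
        # Gather the k*k neighborhood of keys/values around every spatial position.
        k = F.unfold(k, self.k, padding=pad).view(B, C, self.k * self.k, H * W)
        v = F.unfold(v, self.k, padding=pad).view(B, C, self.k * self.k, H * W)
        attn = ((q * k).sum(dim=1, keepdim=True) * self.scale).softmax(dim=2)
        out = (attn * v).sum(dim=2)                # weighted sum over the neighborhood
        return out.view(B, C, H, W)

# Drop-in replacement for the spatial mixing of a 3x3 conv block:
y = LocalSelfAttention2d(dim=64)(torch.randn(2, 64, 32, 32))   # -> (2, 64, 32, 32)
```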
#### Conv-stem + Attention
* **GSA-Net**: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (*Google*). [[Paper](https://arxiv.org/abs/2010.03019)][[PyTorch (lucidrains)](https://github.com/lucidrains/global-self-attention-network)]
* **HaloNet**: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2103.12731)][[PyTorch (lucidrains)](https://github.com/lucidrains/halonet-pytorch)]
* **CoTNet**: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (*JD*). [[Paper](https://arxiv.org/abs/2107.12292)][[PyTorch](https://github.com/JDAI-CV/CoTNet)]
* **HAT-Net**: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2106.03180)][[PyTorch (in construction)](https://github.com/yun-liu/HAT-Net)]
#### Conv + Attention
* **AA**: "Attention Augmented Convolutional Networks", ICCV, 2019 (*Google*). [[Paper](https://arxiv.org/abs/1904.09925)][[PyTorch (leaderj1001)](https://github.com/leaderj1001/Attention-Augmented-Conv2d)][[Tensorflow (titu1994)](https://github.com/titu1994/keras-attention-augmented-convs)]
* **GCNet**: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) (*Microsoft*). [[Paper](https://arxiv.org/abs/2012.13375)][[PyTorch](https://github.com/xvjiarui/GCNet)]
* **LambdaNetworks**: "LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021 (*Google*). [[Paper](https://openreview.net/forum?id=xTJEN-ggl1b)][[PyTorch-1 (lucidrains)](https://github.com/lucidrains/lambda-networks)][[PyTorch-2 (leaderj1001)](https://github.com/leaderj1001/LambdaNetworks)]
* **BoTNet**: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2101.11605)][[PyTorch-1 (lucidrains)](https://github.com/lucidrains/bottleneck-transformer-pytorch)][[PyTorch-2 (leaderj1001)](https://github.com/leaderj1001/BottleneckTransformers)]
* **GCT**: "Gaussian Context Transformer", CVPR, 2021 (*Zhejiang University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2021/html/Ruan_Gaussian_Context_Transformer_CVPR_2021_paper.html)]
* **CoAtNet**: "CoAtNet: Marrying Convolution and Attention for All Data Sizes", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.04803)]
* **ACmix**: "On the Integration of Self-Attention and Convolution", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2111.14556)][[PyTorch](https://github.com/LeapLabTHU/ACmix)]
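
The hybrid models above keep convolution for local features and add self-attention for global context. Below is a minimal sketch of that pattern, assuming a simple additive fusion of a conv branch and a global multi-head attention branch; the class name and the fusion choice are illustrative (AA, for instance, concatenates the two branches instead of adding them):

```python
import torch
import torch.nn as nn

class ConvPlusAttention(nn.Module):
    """Fuse a local convolution branch with a global self-attention branch."""
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)
        self.proj = nn.Conv2d(in_ch, out_ch, 1)    # lift input to the attention width

    def forward(self, x):                           # x: (B, C_in, H, W)
        conv_out = self.conv(x)                     # local features
        B, C, H, W = conv_out.shape
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)     # global features
        attn_out = attn_out.transpose(1, 2).view(B, C, H, W)
        return conv_out + attn_out                  # fuse local + global context

out = ConvPlusAttention(64, 128)(torch.randn(2, 64, 16, 16))   # -> (2, 128, 16, 16)
```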

[[Back to Overview](#overview)]

### Vision Transformer
#### General Vision Transformer
* **ViT**: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (*Google*). [[Paper](https://openreview.net/forum?id=YicbFdNTTy)][[Tensorflow](https://github.com/google-research/vision_transformer)][[PyTorch (lucidrains)](https://github.com/lucidrains/vit-pytorch)][[JAX (conceptofmind)](https://github.com/conceptofmind/vit-flax)]
* **Perceiver**: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (*DeepMind*). [[Paper](https://arxiv.org/abs/2103.03206)][[PyTorch (lucidrains)](https://github.com/lucidrains/perceiver-pytorch)]
* **PiT**: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (*NAVER*). [[Paper](https://arxiv.org/abs/2103.16302)][[PyTorch](https://github.com/naver-ai/pit)]
* **VT**: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (*Facebook*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Wu_Visual_Transformers_Where_Do_Transformers_Really_Belong_in_Vision_Models_ICCV_2021_paper.html)][[PyTorch (tahmid0007)](https://github.com/tahmid0007/VisualTransformers)]
* **PVT**: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2102.12122)][[PyTorch](https://github.com/whai362/PVT)]
* **iRPE**: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2107.14222)][[PyTorch](https://github.com/microsoft/Cream/tree/main/iRPE)]
* **CaiT**: "Going deeper with Image Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2103.17239)][[PyTorch](https://github.com/facebookresearch/deit)]
* **Swin-Transformer**: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2103.14030)][[PyTorch](https://github.com/microsoft/Swin-Transformer)][[PyTorch (berniwal)](https://github.com/berniwal/swin-transformer-pytorch)]
* **T2T-ViT**: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (*Yitu*). [[Paper](https://arxiv.org/abs/2101.11986)][[PyTorch](https://github.com/yitu-opensource/T2T-ViT)]
* **FFNBN**: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (*Microsoft*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/html/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.html)]
* **DPT**: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2107.14467)][[PyTorch](https://github.com/CASIA-IVA-Lab/DPT)]
* **Focal**: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2107.00641)][[PyTorch](https://github.com/microsoft/Focal-Transformer)]
* **XCiT**: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.09681)]
* **Twins**: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (*Meituan*). [[Paper](https://arxiv.org/abs/2104.13840)][[PyTorch](https://github.com/Meituan-AutoML/Twins)]
* **ARM**: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2110.15156)][[GitHub (in construction)](https://github.com/amazon-research/anti-aliasing-transformer)]
* **DVT**: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2105.15075)][[PyTorch](https://github.com/blackfeather-wang/Dynamic-Vision-Transformer)]
* **Aug-S**: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2106.15941)]
* **TNT**: "Transformer in Transformer", NeurIPS, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2103.00112)][[PyTorch](https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch)][[PyTorch (lucidrains)](https://github.com/lucidrains/transformer-in-transformer)]
* **ViTAE**: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2106.03348)][[PyTorch](https://github.com/Annbless/ViTAE)]
* **DeepViT**: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (*NUS + ByteDance*). [[Paper](https://arxiv.org/abs/2103.11886)][[Code](https://github.com/zhoudaquan/dvit_repo)]
* **So-ViT**: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2104.10935)][[PyTorch](https://github.com/jiangtaoxie/So-ViT)]
* **LV-ViT**: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (*ByteDance*). [[Paper](https://arxiv.org/abs/2104.10858)][[PyTorch](https://github.com/zihangJiang/TokenLabeling)]
* **NesT**: "Aggregating Nested Transformers", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2105.12723)][[Tensorflow](https://github.com/google-research/nested-transformer)]
* **KVT**: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2106.00515)]
* **Refined-ViT**: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2106.03714)][[PyTorch](https://github.com/zhoudaquan/Refiner_ViT)]
* **Shuffle-Transformer**: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2106.03650)]
* **CAT**: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (*KuaiShou*). [[Paper](https://arxiv.org/abs/2106.05786)][[PyTorch](https://github.com/linhezheng19/CAT)]
* **V-MoE**: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.05974)]
* **P2T**: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (*Nankai University*). [[Paper](https://arxiv.org/abs/2106.12011)]
* **PVTv2**: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2106.13797)][[PyTorch](https://github.com/whai362/PVT)]
* **LG-Transformer**: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (*IIAI, UAE*). [[Paper](https://arxiv.org/abs/2107.04735)]
* **ViP**: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (*Oxford*). [[Paper](https://arxiv.org/abs/2107.05790)]
* **Scaled-ReLU**: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2109.03810)]
* **LIT**: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (*Monash University*). [[Paper](https://arxiv.org/abs/2105.14217)][[PyTorch](https://github.com/zip-group/LIT)]
* **DTN**: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2112.02624)][[PyTorch (in construction)](https://github.com/wqshao126/DTN)]
* **RegionViT**: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (*MIT-IBM Watson*). [[Paper](https://arxiv.org/abs/2106.02689)][[PyTorch](https://github.com/ibm/regionvit)]
* **CrossFormer**: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2108.00154)][[PyTorch](https://github.com/cheerss/CrossFormer)]
* **?**: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (*UT Austin*). [[Paper](https://openreview.net/forum?id=O476oWmiNNp)]
* **ViT-G**: "Scaling Vision Transformers", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2106.04560)]
* **CSWin**: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2107.00652)][[PyTorch](https://github.com/microsoft/CSWin-Transformer)]
* **MPViT**: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2112.11010)][[PyTorch](https://github.com/youngwanLEE/MPViT)]
* **Diverse-ViT**: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2203.06345)][[PyTorch](https://github.com/VITA-Group/Diverse-ViT)]
* **DW-ViT**: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (*Dark Matter AI, China*). [[Paper](https://arxiv.org/abs/2203.12856)][[PyTorch (in construction)](https://github.com/pzhren/DW-ViT)]
* **MixFormer**: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2204.02557)][[Paddle](https://github.com/PaddlePaddle/PaddleClas)]
* **DAT**: "Vision Transformer with Deformable Attention", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2201.00520)][[PyTorch](https://github.com/LeapLabTHU/DAT)]
* **Swin-Transformer-V2**: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.09883)][[PyTorch](https://github.com/microsoft/Swin-Transformer)]
* **MSG-Transformer**: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (*Huazhong University of Science & Technology*). [[Paper](https://arxiv.org/abs/2105.15168)][[PyTorch](https://github.com/hustvl/MSG-Transformer)]
* **NomMer**: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2111.12994)][[PyTorch](https://github.com/TencentYoutuResearch/VisualRecognition-NomMer)]
* **Shunted**: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (*NUS*). [[Paper](https://arxiv.org/abs/2111.15193)][[PyTorch](https://github.com/OliverRensu/Shunted-Transformer)]
* **PyramidTNT**: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2201.00978)][[PyTorch](https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch)]
* **X-ViT**: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (*Kakao*). [[Paper](https://arxiv.org/abs/2205.13805)]
* **ReMixer**: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (*KAIST*). [[Paper](https://drive.google.com/file/d/1E6rXtj5h6tXiJR8Ae8u1vQcwyNyTZSVc/view)][[PyTorch](https://github.com/alinlab/remixer)]
* **UN**: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (*Hikvision*). [[Paper](https://arxiv.org/abs/2208.01313)][[Code (in construction)](https://github.com/hikvision-research/Unified-Normalization)]
* **Wave-ViT**: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2207.04978)][[PyTorch](https://github.com/YehLi/ImageNetModel)]
* **DaViT**: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2204.03645)][[PyTorch](https://github.com/dingmyu/davit)]
* **ScalableViT**: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2203.10790)]
* **MaxViT**: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2204.01697)][[Tensorflow](https://github.com/google-research/maxvit)]
* **VSA**: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2204.08446)][[PyTorch](https://github.com/ViTAE-Transformer/ViTAE-VSA)]
* **?**: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.01035)]
* **Ortho**: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (*CAS*). [[Paper](https://openreview.net/forum?id=GGtH47T31ZC)]
* **PerViT**: "Peripheral Vision Transformer", NeurIPS, 2022 (*POSTECH*). [[Paper](https://arxiv.org/abs/2206.06801)]
* **LITv2**: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (*Monash University*). [[Paper](https://arxiv.org/abs/2205.13213)][[PyTorch](https://github.com/zip-group/LITv2)]
* **BViT**: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2202.06268)]
* **O-ViT**: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (*East China Normal University*). [[Paper](https://arxiv.org/abs/2201.12133)]
* **MOA-Transformer**: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (*University of Kansas*). [[Paper](https://arxiv.org/abs/2201.12903)][[PyTorch](https://github.com/krushi1992/MOA-transformer)]
* **BOAT**: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (*Baidu + HKU*). [[Paper](https://arxiv.org/abs/2201.13027)]
* **ViTAEv2**: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2202.10108)]
* **HiP**: "Hierarchical Perceiver", arXiv, 2022 (*DeepMind*). [[Paper](https://arxiv.org/abs/2202.10890)]
* **PatchMerger**: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2202.12015)]
* **DGT**: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2203.03937)]
* **NAT**: "Neighborhood Attention Transformer", arXiv, 2022 (*Oregon*). [[Paper](https://arxiv.org/abs/2204.07143)][[PyTorch](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer)]
* **ASF-former**: "Adaptive Split-Fusion Transformer", arXiv, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2204.12196)][[PyTorch (in construction)](https://github.com/szx503045266/ASF-former)]
* **SP-ViT**: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2206.07662)]
* **EATFormer**: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2206.09325)]
* **LinGlo**: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (*TCL Research Wuhan*). [[Paper](https://arxiv.org/abs/2207.00188)]
* **Dual-ViT**: "Dual Vision Transformer", arXiv, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2207.04976)][[PyTorch](https://github.com/YehLi/ImageNetModel)]
* **MMA**: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (*Centre for Research and Technology Hellas, Greece*). [[Paper](https://arxiv.org/abs/2207.08569)]
* **MAFormer**: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2209.01620)]
* **AEWin**: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (*Southwest Jiaotong University*). [[Paper](https://arxiv.org/abs/2209.08726)]
* **GrafT**: "Grafting Vision Transformers", arXiv, 2022 (*Stony Brook*). [[Paper](https://arxiv.org/abs/2210.15943)]
* **?**: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2211.01785)]
* **LTH-ViT**: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (*Northeastern University, China*). [[Paper](https://arxiv.org/abs/2211.01484)]
* **TT**: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (*Hangzhou Dianzi University*). [[Paper](https://arxiv.org/abs/2211.06083)]
* **INTERN**: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2111.08687)][[Website](https://opengvlab.shlab.org.cn/)]
* **GGeM**: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (*NAVER*). [[Paper](https://arxiv.org/abs/2212.04114)]
* **GPViT**: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (*University of Edinburgh, Scotland + UCSD*). [[Paper](https://arxiv.org/abs/2212.06795)][[PyTorch](https://github.com/ChenhongyiYang/GPViT)]
* **CPVT**: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (*Meituan*). [[Paper](https://openreview.net/forum?id=3KWnuT-R1bh)][[Code (in construction)](https://github.com/Meituan-AutoML/CPVT)]
* **LipsFormer**: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2304.09856)][[Code (in construction)](https://github.com/IDEA-Research/LipsFormer)]
* **BiFormer**: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.08810)][[PyTorch](https://github.com/rayleizhu/BiFormer)]
* **AbSViT**: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2303.13043)][[PyTorch](https://github.com/bfshi/AbSViT)][[Website](https://sites.google.com/view/absvit)]
* **DependencyViT**: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2304.03282)][[Code (in construction)](https://github.com/dingmyu/DependencyViT)]
* **ResFormer**: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2212.00776)][[PyTorch (in construction)](https://github.com/ruitian12/resformer)]
* **SViT**: "Vision Transformer with Super Token Sampling", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.11167)]
* **PaCa-ViT**: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (*NC State*). [[Paper](https://arxiv.org/abs/2203.11987)][[PyTorch](https://github.com/iVMCL/PaCaViT)]
* **GC-ViT**: "Global Context Vision Transformers", ICML, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2206.09959)][[PyTorch](https://github.com/NVlabs/GCViT)]
* **MAGNETO**: "MAGNETO: A Foundation Transformer", ICML, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.06423)]
* **Fcaformer**: "Fcaformer: Forward Cross Attention in Hybrid Vision Transformer", ICCV, 2023 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2211.07198)][[PyTorch](https://github.com/hkzhang91/CabViT)]
* **SMT**: "Scale-Aware Modulation Meet Transformer", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2307.08579)][[PyTorch](https://github.com/AFeng-x/SMT)]
* **FLatten-Transformer**: "FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2308.00442)][[PyTorch](https://github.com/LeapLabTHU/FLatten-Transformer)]
* **Path-Ensemble**: "Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.06548)]
* **SG-Former**: "SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2308.12216)][[PyTorch](https://github.com/OliverRensu/SG-Former)]
* **SimPool**: "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?", ICCV, 2023 (*National Technical University of Athens*). [[Paper](https://arxiv.org/abs/2309.06891)]
* **LaPE**: "LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization", ICCV, 2023 (*Peking*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Yu_LaPE_Layer-adaptive_Position_Embedding_for_Vision_Transformers_with_Independent_Layer_ICCV_2023_paper.html)][[PyTorch](https://github.com/Ingrid725/LaPE)]
* **CB**: "Scratching Visual Transformer's Back with Uniform Attention", ICCV, 2023 (*NAVER*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Hyeon-Woo_Scratching_Visual_Transformers_Back_with_Uniform_Attention_ICCV_2023_paper.html)]
* **STL**: "Fully Attentional Networks with Self-emerging Token Labeling", ICCV, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2401.03844)][[PyTorch](https://github.com/NVlabs/STL)]
* **ClusterFormer**: "ClusterFormer: Clustering As A Universal Visual Learner", NeurIPS, 2023 (*Rochester Institute of Technology (RIT)*). [[Paper](https://arxiv.org/abs/2309.13196)]
* **SVT**: "Scattering Vision Transformer: Spectral Mixing Matters", NeurIPS, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2311.01310)][[PyTorch](https://github.com/badripatro/svt)][[Website](https://badripatro.github.io/svt/)]
* **CrossFormer++**: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2303.06908)][[PyTorch](https://github.com/cheerss/CrossFormer)]
* **QFormer**: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2303.15105)][[Code (in construction)](https://github.com/ViTAE-Transformer/QFormer)]
* **ViT-Calibrator**: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2304.04354)]
* **SpectFormer**: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2304.06446)][[PyTorch](https://github.com/badripatro/SpectFormers)][[Website](https://badripatro.github.io/SpectFormers/)]
* **UniNeXt**: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2304.13700)]
* **CageViT**: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2305.09924)]
* **?**: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (*UIUC*). [[Paper](https://arxiv.org/abs/2305.16316)]
* **2-D-SSM**: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2306.06635)][[PyTorch](https://github.com/ethanbar11/ssm_2d)]
* **NaViT**: "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2307.06304)]
* **DAT++**: "DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2309.01430)][[PyTorch](https://github.com/LeapLabTHU/DAT)]
* **?**: "Replacing softmax with ReLU in Vision Transformers", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2309.08586)]
* **RMT**: "RMT: Retentive Networks Meet Vision Transformers", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2309.11523)]
* **reg**: "Vision Transformers Need Registers", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2309.16588)]
* **ChannelViT**: "Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words", arXiv, 2023 (*Insitro, CA*). [[Paper](https://arxiv.org/abs/2309.16108)]
* **EViT**: "EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention", arXiv, 2023 (*Nankai University*). [[Paper](https://arxiv.org/abs/2310.06629)]
* **ViR**: "ViR: Vision Retention Networks", arXiv, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2310.19731)]
* **abs-win**: "Window Attention is Bugged: How not to Interpolate Position Embeddings", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2311.05613)]
* **FMViT**: "FMViT: A multiple-frequency mixing Vision Transformer", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2311.05707)][[Code (in construction)](https://github.com/tany0699/FMViT)]
* **GroupMixFormer**: "Advancing Vision Transformers with Group-Mix Attention", arXiv, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2311.15157)][[PyTorch](https://github.com/AILab-CVC/GroupMixFormer)]
* **PGT**: "Perceptual Group Tokenizer: Building Perception with Iterative Grouping", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2311.18296)]
* **SCHEME**: "SCHEME: Scalable Channel Mixer for Vision Transformers", arXiv, 2023 (*UCSD*). [[Paper](https://arxiv.org/abs/2312.00412)]
* **Agent-Attention**: "Agent Attention: On the Integration of Softmax and Linear Attention", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.08874)][[PyTorch](https://github.com/LeapLabTHU/Agent-Attention)]
* **ViTamin**: "ViTamin: Designing Scalable Vision Models in the Vision-Language Era", CVPR, 2024 (*ByteDance*). [[Paper](https://arxiv.org/abs/2404.02132)][[PyTorch](https://github.com/Beckschen/ViTamin)]
* **HIRI-ViT**: "HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs", TPAMI, 2024 (*HiDream.ai, China*). [[Paper](https://arxiv.org/abs/2403.11999)]
* **SPFormer**: "SPFormer: Enhancing Vision Transformer with Superpixel Representation", arXiv, 2024 (*JHU*). [[Paper](https://arxiv.org/abs/2401.02931)]
* **manifold-K**: "A Manifold Representation of the Key in Vision Transformers", arXiv, 2024 (*University of Oslo, Norway*). [[Paper](https://arxiv.org/abs/2402.00534)]
* **BiXT**: "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers", arXiv, 2024 (*University of Melbourne*). [[Paper](https://arxiv.org/abs/2402.12138)]
* **VisionLLaMA**: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks", arXiv, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2403.00522)][[Code (in construction)](https://github.com/Meituan-AutoML/VisionLLaMA)]
* **xT**: "xT: Nested Tokenization for Larger Context in Large Images", arXiv, 2024 (*Berkeley*). [[Paper](https://arxiv.org/abs/2403.01915)]
* **ACC-ViT**: "ACC-ViT: Atrous Convolution's Comeback in Vision Transformers", arXiv, 2024 (*Purdue*). [[Paper](https://arxiv.org/abs/2403.04200)]
* **ViTAR**: "ViTAR: Vision Transformer with Any Resolution", arXiv, 2024 (*CAS*). [[Paper](https://arxiv.org/abs/2403.18361)]
* **iLLaMA**: "Adapting LLaMA Decoder to Vision Transformer", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2404.06773)]
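
Most backbones listed in this subsection build on the original ViT recipe: split the image into non-overlapping patches, linearly embed them, prepend a learnable class token, add position embeddings, and feed the sequence to a standard Transformer encoder. Here is a minimal, self-contained sketch of that recipe with illustrative names and sizes (not the reference implementation of any listed paper):

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    """Patch embedding + [CLS] token + standard Transformer encoder + linear head."""
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                           # number of patch tokens
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tok = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        tok = torch.cat([self.cls.expand(len(x), -1, -1), tok], 1) + self.pos
        return self.head(self.encoder(tok)[:, 0])          # classify from [CLS]

logits = ViTSketch()(torch.randn(2, 3, 224, 224))           # -> (2, 1000)
```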
#### Efficient Vision Transformer
* **DeiT**: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2012.12877)][[PyTorch](https://github.com/facebookresearch/deit)]
* **ConViT**: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2103.10697)][[Code](https://github.com/facebookresearch/convit)]
* **?**: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (*NavInfo Europe, Netherlands*). [[Paper](https://arxiv.org/abs/2106.16006)]
* **PS-ViT**: "Vision Transformer with Progressive Sampling", ICCV, 2021 (*CPII*). [[Paper](https://arxiv.org/abs/2108.01684)]
* **HVT**: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (*Monash University*). [[Paper](https://arxiv.org/abs/2103.10619)][[PyTorch](https://github.com/MonashAI/HVT)]
* **CrossViT**: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (*MIT-IBM*). [[Paper](https://arxiv.org/abs/2103.14899)][[PyTorch](https://github.com/IBM/CrossViT)]
* **ViL**: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2103.15358)][[PyTorch](https://github.com/microsoft/vision-longformer)]
* **Visformer**: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (*Beihang University*). [[Paper](https://arxiv.org/abs/2104.12533)][[PyTorch](https://github.com/danczs/Visformer)]
* **MultiExitViT**: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (*Aarhus University, Denmark*). [[Paper](https://arxiv.org/abs/2106.15183)][[Tensorflow](https://gitlab.au.dk/maleci/multiexitvit)]
* **SViTE**: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (*UT Austin*). [[Paper](https://arxiv.org/abs/2106.04533)][[PyTorch](https://github.com/VITA-Group/SViTE)]
* **DGE**: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (*Megvii*). [[Paper](https://papers.nips.cc/paper/2021/hash/2d969e2cee8cfa07ce7ca0bb13c7a36d-Abstract.html)][[PyTorch](https://github.com/StevenGrove/vtpack)]
* **GG-Transformer**: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (*JHU*). [[Paper](https://arxiv.org/abs/2106.02277)][[Code (in construction)](https://github.com/yucornetto/GG-Transformer)]
* **DynamicViT**: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2106.02034)][[PyTorch](https://github.com/raoyongming/DynamicViT)][[Website](https://dynamicvit.ivg-research.xyz/)]
* **ResT**: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2105.13677)][[PyTorch](https://github.com/wofmanaf/ResT)]
* **Adder-Transformer**: "Adder Attention for Vision Transformer", NeurIPS, 2021 (*Huawei*). [[Paper](https://proceedings.neurips.cc/paper/2021/hash/a57e8915461b83adefb011530b711704-Abstract.html)]
* **SOFT**: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (*Fudan*). [[Paper](https://arxiv.org/abs/2110.11945)][[PyTorch](https://github.com/fudan-zvg/SOFT)][[Website](https://fudan-zvg.github.io/SOFT/)]
* **IA-RED2**: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (*MIT-IBM*). [[Paper](https://arxiv.org/abs/2106.12620)][[Website](http://people.csail.mit.edu/bpan/ia-red/)]
* **LocalViT**: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (*ETHZ*). [[Paper](https://arxiv.org/abs/2104.05707)][[PyTorch](https://github.com/ofsoundof/LocalViT)]
* **CCT**: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (*University of Oregon*). [[Paper](https://arxiv.org/abs/2104.05704)][[PyTorch](https://github.com/SHI-Labs/Compact-Transformers)]
* **DiversePatch**: "Vision Transformers with Patch Diversification", arXiv, 2021 (*UT Austin + Facebook*). [[Paper](https://arxiv.org/abs/2104.12753)][[PyTorch](https://github.com/ChengyueGongR/PatchVisionTransformer)]
* **SL-ViT**: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (*Aarhus University*). [[Paper](https://arxiv.org/abs/2105.09121)]
* **?**: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (*Aarhus University, Denmark*). [[Paper](https://arxiv.org/abs/2106.15183)]
* **ViX**: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (*Indian Institute of Technology Bombay*). [[Paper](https://arxiv.org/abs/2107.02239)]
* **Transformer-LS**: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2107.02192)][[PyTorch](https://github.com/NVIDIA/transformer-ls)]
* **WideNet**: "Go Wider Instead of Deeper", arXiv, 2021 (*NUS*). [[Paper](https://arxiv.org/abs/2107.11817)]
* **Armour**: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (*Arm*). [[Paper](https://arxiv.org/abs/2108.01778)]
* **IPE**: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (*CUHK*). [[Paper](https://arxiv.org/abs/2108.13015)]
* **DS-Net++**: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (*Monash University*). [[Paper](https://arxiv.org/abs/2109.10060)][[PyTorch](https://github.com/changlin31/DS-Net)]
* **UFO-ViT**: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (*Kakao*). [[Paper](https://arxiv.org/abs/2109.14382)]
* **Evo-ViT**: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2108.01390)][[PyTorch](https://github.com/YifanXu74/Evo-ViT)]
* **PS-Attention**: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2112.14000)][[Paddle](https://github.com/BR-IDL/PaddleViT)]
* **ShiftViT**: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2201.10801)][[PyTorch](https://github.com/microsoft/SPACH)]
* **EViT**: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2202.07800)][[PyTorch](https://github.com/youweiliang/evit)]
* **QuadTree**: "QuadTree Attention for Vision Transformers", ICLR, 2022 (*Simon Fraser + Alibaba*). [[Paper](https://arxiv.org/abs/2201.02767)][[PyTorch](https://github.com/Tangshitao/QuadtreeAttention)]
* **Anti-Oversmoothing**: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2203.05962)][[PyTorch](https://github.com/VITA-Group/ViT-Anti-Oversmoothing)]
* **QnA**: "Learned Queries for Efficient Local Attention", CVPR, 2022 (*Tel-Aviv*). [[Paper](https://arxiv.org/abs/2112.11435)][[JAX](https://github.com/moabarar/qna)]
* **LVT**: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (*Adobe*). [[Paper](https://arxiv.org/abs/2112.10809)][[PyTorch](https://github.com/Chenglin-Yang/LVT)]
* **A-ViT**: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2112.07658)][[Website](https://a-vit.github.io/)]
* **PS-ViT**: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2106.02852)]
* **Rev-MViT**: "Reversible Vision Transformers", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2302.04869)][[PyTorch-1](https://github.com/karttikeya/minREV)][[PyTorch-2](https://github.com/facebookresearch/slowfast)]
* **AdaViT**: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2111.15668)]
* **DQS**: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (*Sorbonne Université, France*). [[Paper](https://arxiv.org/abs/2205.10873)]
* **ATS**: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.15667)][[Website](https://adaptivetokensampling.github.io/)]
* **EdgeViT**: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2205.03436)][[PyTorch](https://github.com/saic-fi/edgevit)]
* **SReT**: "Sliced Recursive Transformer", ECCV, 2022 (*CMU + MBZUAI*). [[Paper](https://arxiv.org/abs/2111.05297)][[PyTorch](https://github.com/szq0214/SReT)]
* **SiT**: "Self-slimmed Vision Transformer", ECCV, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2111.12624)][[PyTorch](https://github.com/Sense-X/SiT)]
* **DFvT**: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (*Alibaba*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/322_ECCV_2022_paper.php)]
* **M3ViT**: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2210.14793)][[PyTorch](https://github.com/VITA-Group/M3ViT)]
* **ResT-V2**: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2204.07366)][[PyTorch](https://github.com/wofmanaf/ResT)]
* **DeiT-Manifold**: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2107.01378)]
* **EfficientFormer**: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (*Snap*). [[Paper](https://arxiv.org/abs/2206.01191)][[PyTorch](https://github.com/snap-research/EfficientFormer)]
* **GhostNetV2**: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2211.12905)][[PyTorch](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv2_pytorch)]
* **?**: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (*Borealis AI, Canada*). [[Paper](https://arxiv.org/abs/2211.05187)]
* **TerViT**: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2201.08050)]
* **MT-ViT**: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2203.01587)]
* **ViT-P**: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (*Chongqing University of Technology*). [[Paper](https://arxiv.org/abs/2203.02358)]
* **CF-ViT**: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (*Xiamen University + Tencent*). [[Paper](https://arxiv.org/abs/2203.03821)][[PyTorch](https://github.com/ChenMnZ/CF-ViT)]
* **EIT**: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (*Academy of Military Sciences, China*). [[Paper](https://arxiv.org/abs/2203.07116)]
* **SepViT**: "SepViT: Separable Vision Transformer", arXiv, 2022 (*University of Electronic Science and Technology of China*). [[Paper](https://arxiv.org/abs/2203.15380)]
* **TRT-ViT**: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2205.09579)]
* **SuperViT**: "Super Vision Transformer", arXiv, 2022 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2205.11397)][[PyTorch](https://github.com/lmbxmu/SuperViT)]
* **Tutel**: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.03382)][[PyTorch](https://github.com/microsoft/tutel)]
* **SimA**: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (*Maryland + UC Davis*). [[Paper](https://arxiv.org/abs/2206.08898)][[PyTorch](https://github.com/UCDvision/sima)]
* **EdgeNeXt**: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2206.10589)][[PyTorch](https://github.com/mmaaz60/EdgeNeXt)]
* **VVT**: "Vicinity Vision Transformer", arXiv, 2022 (*Australian National University*). [[Paper](https://arxiv.org/abs/2206.10552)][[Code (in construction)](https://github.com/OpenNLPLab/Vicinity-Vision-Transformer)]
* **SOFT**: "Softmax-free Linear Transformers", arXiv, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2207.03341)][[PyTorch](https://github.com/fudan-zvg/SOFT)]
* **MaiT**: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2207.03006)]
* **LightViT**: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2207.05557)][[Code (in construction)](https://github.com/hunto/LightViT)]
* **Next-ViT**: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2207.05501)]
* **XFormer**: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/pdf/2207.07268.pdf)]
* **PatchDropout**: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (*KTH, Sweden*). [[Paper](https://arxiv.org/abs/2208.07220)]
* **ClusTR**: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2208.13138)]
* **DiNAT**: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (*University of Oregon*). [[Paper](https://arxiv.org/abs/2209.15001)][[PyTorch](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer)]
* **MobileViTv3**: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (*Micron*). [[Paper](https://arxiv.org/abs/2209.15159)][[PyTorch](https://github.com/micronDLA/MobileViTv3)]
* **ViT-LSLA**: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (*Southwest University*). [[Paper](https://arxiv.org/abs/2210.17115)]
* **Token-Pooling**: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (*Apple*). [[Paper](https://openaccess.thecvf.com/content/WACV2023/html/Marin_Token_Pooling_in_Vision_Transformers_for_Image_Classification_WACV_2023_paper.html)]
* **Tri-Level**: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2211.10801)][[Code (in construction)](https://github.com/ZLKong/Tri-Level-ViT)]
* **ViTCoD**: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2210.09573)]
* **ViTALiTy**: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (*Rice University*). [[Paper](https://arxiv.org/abs/2211.05109)]
* **HeatViT**: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2211.08110)]
* **ToMe**: "Token Merging: Your ViT But Faster", ICLR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2210.09461)][[PyTorch](https://github.com/facebookresearch/ToMe)]
* **HiViT**: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2205.14949)][[PyTorch](https://github.com/zhangxiaosong18/hivit)]
* **STViT**: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2303.08685)][[PyTorch](https://github.com/changsn/STViT-R)]
* **SparseViT**: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2303.17605)][[Website](https://sparsevit.mit.edu/)]
* **Slide-Transformer**: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2304.04237)][[Code (in construction)](https://github.com/LeapLabTHU/Slide-Transformer)]
* **RIFormer**: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2304.05659)][[PyTorch](https://github.com/open-mmlab/mmpretrain/tree/main/configs/riformer)][[Website](https://techmonsterwang.github.io/RIFormer/)]
* **EfficientViT**: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2305.07027)][[PyTorch](https://github.com/microsoft/Cream/tree/main/EfficientViT)]
* **Castling-ViT**: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2211.10526)]
* **ViT-Ti**: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (*UMich*). [[Paper](https://arxiv.org/abs/2211.16421)]
* **Sparsifiner**: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (*University of Toronto*). [[Paper](https://arxiv.org/abs/2303.13755)]
* **?**: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2211.11315)]
* **LTMP**: "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 (*Ghent University, Belgium*). [[Paper](https://arxiv.org/abs/2307.10780)][[PyTorch](https://github.com/Mxbonn/ltmp)][[Website](https://maxim.bonnaerens.com/publication/ltmp/)]
* **ReViT**: "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML PKDD, 2023 (*Midea Group, China*). [[Paper](https://arxiv.org/abs/2307.02092)]
* **EfficientViT**: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2205.14756)][[PyTorch](https://github.com/mit-han-lab/efficientvit)]
* **MPCViT**: "MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention", ICCV, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2211.13955)][[PyTorch](https://github.com/PKU-SEC-Lab/mpcvit)]
* **MST**: "Masked Spiking Transformer", ICCV, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2210.01208)]
* **EfficientFormerV2**: "Rethinking Vision Transformers for MobileNet Size and Speed", ICCV, 2023 (*Snap*). [[Paper](https://arxiv.org/abs/2212.08059)][[PyTorch](https://github.com/snap-research/EfficientFormer)]
* **DiffRate**: "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2305.17997)][[PyTorch](https://github.com/OpenGVLab/DiffRate)]
* **ElasticViT**: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2303.09730)]
* **FastViT**: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2303.14189)][[PyTorch](https://github.com/apple/ml-fastvit)]
* **SeiT**: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", ICCV, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2303.11114)][[PyTorch](https://github.com/naver-ai/seit)]
* **TokenReduction**: "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 (*Aalborg University, Denmark*). [[Paper](https://arxiv.org/abs/2308.04657)][[PyTorch](https://github.com/JoakimHaurum/TokenReduction)][[Website](https://vap.aau.dk/tokens/)]
* **LGViT**: "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2308.00255)]
* **LBP-WHT**: "Efficient Low-rank Backpropagation for Vision Transformer Adaptation", NeurIPS, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2309.15275)]
* **FAT**: "Lightweight Vision Transformer with Bidirectional Interaction", NeurIPS, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2306.00396)][[PyTorch](https://github.com/qhfan/FAT)]
* **MCUFormer**: "MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2310.16898)][[PyTorch](https://github.com/liangyn22/MCUFormer)]
* **SoViT**: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2305.13035)]
* **CloFormer**: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2303.17803)]
* **Quadformer**: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2304.00287)][[Code (in construction)](https://github.com/TomerRonen34/mixed-resolution-vit)]
* **SparseFormer**: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2304.03768)][[Code (in construction)](https://github.com/showlab/sparseformer)]
* **EMO**: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.01146)][[PyTorch](https://github.com/zhangzjn/EMO)]
* **ByteFormer**: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2306.00238)]
* **?**: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (*Jilin University*). [[Paper](https://arxiv.org/abs/2306.04897)]
* **FasterViT**: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2306.06189)]
* **NextViT**: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2306.10875)]
* **SkipAt**: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2301.02240)]
* **MSViT**: "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2307.02321)]
* **DiT**: "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 (*Meituan*). [[Paper](https://arxiv.org/abs/2308.03409)][[Code (in construction)](https://github.com/Maycbj/DiT)]
* **?**: "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 (*German Research Center for Artificial Intelligence (DFKI)*). [[Paper](https://arxiv.org/abs/2308.09372)][[PyTorch](https://github.com/tobna/WhatTransformerToFavor)]
* **Mobile-V-MoEs**: "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2309.04354)]
* **PPT**: "PPT: Token Pruning and Pooling for Efficient Vision Transformers", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2310.01812)]
* **MatFormer**: "MatFormer: Nested Transformer for Elastic Inference", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2310.07707)]
* **SparseFormer**: "Bootstrapping SparseFormers from Vision Foundation Models", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2312.01987)][[PyTorch](https://github.com/showlab/sparseformer)]
* **GTP-ViT**: "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation", WACV, 2024 (*CSIRO Data61, Australia*). [[Paper](https://arxiv.org/abs/2311.03035)][[PyTorch](https://github.com/Ackesnal/GTP-ViT)]
* **ToFu**: "Token Fusion: Bridging the Gap between Token Pruning and Token Merging", WACV, 2024 (*Samsung*). [[Paper](https://arxiv.org/abs/2312.01026)]
* **Cached-Transformer**: "Cached Transformers: Improving Transformers with Differentiable Memory Cache", AAAI, 2024 (*CUHK*). [[Paper](https://arxiv.org/abs/2312.12742)]
* **LF-ViT**: "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition", AAAI, 2024 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2402.00033)][[PyTorch](https://github.com/edgeai1/LF-ViT)]
* **EfficientMod**: "Efficient Modulation for Vision Networks", ICLR, 2024 (*Microsoft*). [[Paper](https://arxiv.org/abs/2403.19963)][[PyTorch](https://github.com/ma-xu/EfficientMod)]
* **NOSE**: "MLP Can Be A Good Transformer Learner", CVPR, 2024 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2404.05657)][[PyTorch](https://github.com/sihaoevery/lambda_vit)]
* **SLAB**: "SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization", ICML, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2405.11582)][[PyTorch](https://github.com/xinghaochen/SLAB)]
* **S2**: "When Do We Not Need Larger Vision Models?", arXiv, 2024 (*Berkeley*). [[Paper](https://arxiv.org/abs/2403.13043)][[PyTorch](https://github.com/bfshi/scaling_on_scales)]
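Many of the efficient-ViT entries above (e.g. EViT, ICLR 2022; ToMe, ICLR 2023) reduce compute by keeping fewer tokens as depth increases. The sketch below is a deliberately simplified, hypothetical version of that idea, not the exact procedure of any single paper: keep the patch tokens that receive the most [CLS] attention and fuse the rest into a single attention-weighted token. The `keep_ratio` and the use of mean [CLS] attention as the importance score are assumptions.

```python
# Illustrative attention-based token reduction (simplified sketch, not a paper's method).
import torch

def reduce_tokens(tokens, cls_attn, keep_ratio=0.5):
    """tokens: (B, N, D) patch tokens; cls_attn: (B, N) [CLS]-to-patch attention scores."""
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    _, idx = cls_attn.topk(k, dim=1)                             # most-attended tokens
    keep = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(b, k, d))
    # fuse the discarded tokens into one token, weighted by their attention scores
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, idx, torch.zeros_like(idx, dtype=torch.bool))
    rest = tokens[mask].view(b, n - k, d)
    rest_w = cls_attn[mask].view(b, n - k, 1)
    fused = (rest * rest_w).sum(1, keepdim=True) / rest_w.sum(1, keepdim=True).clamp_min(1e-6)
    return torch.cat([keep, fused], dim=1)                       # (B, k + 1, D)

x = torch.randn(2, 196, 384)
attn = torch.rand(2, 196).softmax(dim=-1)
print(reduce_tokens(x, attn).shape)                              # torch.Size([2, 99, 384])
```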
#### Conv + Transformer
* **LeViT**: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.01136)][[PyTorch](https://github.com/facebookresearch/LeViT)]
* **CeiT**: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2103.11816)][[PyTorch (rishikksh20)](https://github.com/rishikksh20/CeiT)]
* **Conformer**: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2105.03889)][[PyTorch](https://github.com/pengzhiliang/Conformer)]
* **CoaT**: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (*UCSD*). [[Paper](https://arxiv.org/abs/2104.06399)][[PyTorch](https://github.com/mlpc-ucsd/CoaT)]
* **CvT**: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2103.15808)][[Code](https://github.com/leoxiaobin/CvT)]
* **ViTc**: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.14881)]
* **ConTNet**: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (*ByteDance*). [[Paper](https://arxiv.org/abs/2104.13497)][[PyTorch](https://github.com/yan-hao-tian/ConTNet)]
* **SPACH**: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2108.13002)]
* **MobileViT**: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (*Apple*). [[Paper](https://arxiv.org/abs/2110.02178)][[PyTorch](https://github.com/apple/ml-cvnets)]
* **CMT**: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2107.06263)]
* **Mobile-Former**: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2108.05895)][[PyTorch (in construction)](https://github.com/aaboys/mobileformer)]
* **TinyViT**: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.10666)][[PyTorch](https://github.com/microsoft/Cream/tree/main/TinyViT)]
* **CETNet**: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (*OPPO*). [[Paper](https://arxiv.org/abs/2207.13317)]
* **ParC-Net**: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2203.03952)][[PyTorch](https://github.com/hkzhang91/ParC-Net)]
* **?**: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2210.07240)][[PyTorch](https://github.com/hananshafi/vits-for-small-scale-datasets)]
* **DHVT**: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (*USTC*). [[Paper](https://arxiv.org/abs/2210.05958)][[Code (in construction)](https://github.com/ArieSeirack/DHVT)]
* **iFormer**: "Inception Transformer", NeurIPS, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2205.12956)][[PyTorch](https://github.com/sail-sg/iFormer)]
* **DenseDCT**: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (*University of Kansas*). [[Paper](https://arxiv.org/abs/2210.14319)]
* **CXV**: "Convolutional Xformers for Vision", arXiv, 2022 (*IIT Bombay*). [[Paper](https://arxiv.org/abs/2201.10271)][[PyTorch](https://github.com/pranavphoenix/CXV)]
* **ConvMixer**: "Patches Are All You Need?", arXiv, 2022 (*CMU*). [[Paper](https://arxiv.org/abs/2201.09792)][[PyTorch](https://github.com/locuslab/convmixer)]
* **MobileViTv2**: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (*Apple*). [[Paper](https://arxiv.org/abs/2206.02680)][[PyTorch](https://github.com/apple/ml-cvnets)]
* **UniFormer**: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2201.09450)][[PyTorch](https://github.com/Sense-X/UniFormer)]
* **MoCoViT**: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2205.12635)]
* **DynamicViT**: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2207.01580)][[PyTorch](https://github.com/raoyongming/DynamicViT)]
* **ConvFormer**: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (*National University of Defense Technology, China*). [[Paper](https://arxiv.org/abs/2209.07738)]
* **Fast-ParC**: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2210.04020)]
* **MetaFormer**: "MetaFormer Baselines for Vision", arXiv, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2210.13452)][[PyTorch](https://github.com/sail-sg/metaformer)]
* **STM**: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2211.05781)][[Code (in construction)](https://github.com/OpenGVLab/STM-Evaluation)]
* **ParCNetV2**: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2211.07157)]
* **VAN**: "Visual Attention Network", arXiv, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2202.09741)][[PyTorch](https://github.com/Visual-Attention-Network)]
* **SD-MAE**: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (*Hangzhou Dianzi University*). [[Paper](https://arxiv.org/abs/2212.05677)][[PyTorch (in construction)](https://github.com/Talented-Q/SDMAE)]
* **SATA**: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (*University of Kansas*). [[Paper](https://arxiv.org/abs/2210.12333)][[PyTorch (in construction)](https://github.com/xiangyu8/SATA)]
* **SparK**: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (*Bytedance*). [[Paper](https://openreview.net/forum?id=NRxydtWup1S)][[PyTorch](https://github.com/keyu-tian/SparK)]
* **MOAT**: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2210.01820)][[Tensorflow](https://github.com/google-research/deeplab2)]
* **InternImage**: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2211.05778)][[PyTorch](https://github.com/OpenGVLab/InternImage)]
* **SwiftFormer**: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", ICCV, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2303.15446)][[PyTorch](https://github.com/Amshaker/SwiftFormer)]
* **SCSC**: "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2308.07110)]
* **PSLT**: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2304.03481)][[Website](https://isee-ai.cn/wugaojie/PSLT.html)]
* **RepViT**: "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2307.09283)][[PyTorch](https://github.com/jameslahm/RepViT)]
* **?**: "Interpret Vision Transformers as ConvNets with Dynamic Convolutions", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2309.10713)]
* **UPDP**: "UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer", AAAI, 2024 (*AMD*). [[Paper](https://arxiv.org/abs/2401.06426)]
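A pattern shared by many entries in this subsection (e.g. CvT, MobileViT, CMT) is to pair a convolution for local inductive bias with self-attention for global token mixing. The block below is a minimal sketch of that pattern under assumed dimensions and layer ordering; it is illustrative only and does not reproduce any one paper's architecture.

```python
# Minimal conv + attention hybrid block (illustrative sketch).
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.local = nn.Sequential(                      # local mixing via depthwise conv
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, C, H, W) feature map
        x = x + self.local(x)                            # convolutional branch (residual)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)                 # global self-attention branch
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 64, 14, 14)
print(ConvAttnBlock()(feat).shape)                       # torch.Size([2, 64, 14, 14])
```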
#### Training + Transformer
* **iGPT**: "Generative Pretraining From Pixels", ICML, 2020 (*OpenAI*). [[Paper](http://proceedings.mlr.press/v119/chen20s.html)][[Tensorflow](https://github.com/openai/image-gpt)]
* **CLIP**: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (*OpenAI*). [[Paper](https://arxiv.org/abs/2103.00020)][[PyTorch](https://github.com/openai/CLIP)]
* **MoCo-V3**: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.02057)]
* **DINO**: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.14294)][[PyTorch](https://github.com/facebookresearch/dino)]
* **drloc**: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (*University of Trento*). [[Paper](https://arxiv.org/abs/2106.03746)][[PyTorch](https://github.com/yhlleo/VTs-Drloc)]
* **CARE**: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2110.05340)][[PyTorch](https://github.com/ChongjianGE/CARE)]
* **MST**: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2106.05656)]
* **SiT**: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (*University of Surrey*). [[Paper](https://arxiv.org/abs/2104.03602)][[PyTorch](https://github.com/Sara-Ahmed/SiT)]
* **MoBY**: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2105.04553)][[PyTorch](https://github.com/SwinTransformer/Transformer-SSL)]
* **?**: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (*Pune Institute of Computer Technology, India*). [[Paper](https://arxiv.org/abs/2110.05270)]
* **Annotations-1.3B**: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (*Pinterest*). [[Paper](https://arxiv.org/abs/2108.05887)]
* **BEiT**: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.08254)][[PyTorch](https://github.com/microsoft/unilm/tree/master/beit)]
* **EsViT**: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.09785)]
* **iBOT**: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2111.07832)][[PyTorch](https://github.com/bytedance/ibot)]
* **MaskFeat**: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2112.09133)]
* **AutoProg**: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2203.14509)][[Code (in construction)](https://github.com/changlin31/AutoProg)]
* **MAE**: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2111.06377)][[PyTorch](https://github.com/facebookresearch/mae)][[PyTorch (pengzhiliang)](https://github.com/pengzhiliang/MAE-pytorch)]
* **SimMIM**: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.09886)][[PyTorch](https://github.com/microsoft/SimMIM)]
* **SelfPatch**: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2206.07990)][[PyTorch](https://github.com/alinlab/SelfPatch)]
* **Bootstrapping-ViTs**: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2112.03552)][[PyTorch](https://github.com/zhfeing/Bootstrapping-ViTs-pytorch)]
* **TransMix**: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (*JHU*). [[Paper](https://arxiv.org/abs/2111.09833)][[PyTorch](https://github.com/Beckschen/TransMix)]
* **PatchRot**: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (*Arizona State*). [[Paper](https://drive.google.com/file/d/1ZHdBMa-MCx05Y0teqb0vmgiiYj8t5xBB/view)]
* **SplitMask**: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2112.10740)]
* **MC-SSL**: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2111.15340)]
* **RelViT**: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (*University of Padova, Italy*). [[Paper](https://arxiv.org/abs/2206.00481?context=cs)]
* **data2vec**: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2202.03555)][[PyTorch](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec)]
* **SSTA**: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (*Tencent*). [[Paper](https://proceedings.mlr.press/v162/wu22c.html)][[Code (in construction)](https://github.com/GlassyWu/SSTA)]
* **MP3**: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (*Apple*). [[Paper](https://arxiv.org/abs/2207.07611)][[PyTorch](https://github.com/arshadshk/Position-Prediction-Pretraining)]
* **CutMixSL**: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (*Yonsei University, Korea*). [[Paper](https://arxiv.org/abs/2207.00234)]
* **BootMAE**: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.07116)][[PyTorch](https://github.com/LightDXY/BootMAE)]
* **TokenMix**: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.08409)][[PyTorch](https://github.com/Sense-X/TokenMix)]
* **?**: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2207.10026)][[PyTorch](https://github.com/lkhl/tiny-transformers)]
* **HAT**: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2204.00993)][[PyTorch](https://github.com/jiawangbai/HAT)]
* **IDMM**: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2201.10728)]
* **AttMask**: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (*National Technical University of Athens*). [[Paper](https://arxiv.org/abs/2203.12719)][[PyTorch](https://github.com/gkakogeorgiou/attmask)]
* **SLIP**: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (*Berkeley + Meta*). [[Paper](https://arxiv.org/abs/2112.12750)][[Pytorch](https://github.com/facebookresearch/SLIP)]
* **mc-BEiT**: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (*Peking University*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/1197_ECCV_2022_paper.php)]
* **SL2O**: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (*UT Austin*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/2909_ECCV_2022_paper.php)][[PyTorch](https://github.com/VITA-Group/Scalable-L2O)]
* **TokenMixup**: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2210.07562)][[PyTorch](https://github.com/mlvlab/TokenMixup)]
* **PatchRot**: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (*Arizona State University*). [[Paper](https://arxiv.org/abs/2210.15722)]
* **GreenMIM**: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (*The University of Tokyo*). [[Paper](https://arxiv.org/abs/2205.13515)][[PyTorch](https://github.com/LayneH/GreenMIM)]
* **DP-CutMix**: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2210.15986)]
* **?**: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (*Google*). [[Paper](https://openreview.net/forum?id=4nPswr1KcP)][[Tensorflow](https://github.com/google-research/vision_transformer)][[PyTorch (rwightman)](https://github.com/rwightman/pytorch-image-models)]
* **PeCo**: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.12710)]
* **RePre**: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2201.06857)]
* **Beyond-Masking**: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2203.14313)][[Code (in construction)](https://github.com/sunsmarterjie/beyond_masking)]
* **Kronecker-Adaptation**: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.16329)]
* **DILEMMA**: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (*University of Bern, Switzerland*). [[Paper](https://arxiv.org/abs/2204.04788)]
* **DeiT-III**: "DeiT III: Revenge of the ViT", arXiv, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2204.07118)]
* **?**: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2205.01580)][[Tensorflow](https://github.com/google-research/big_vision)]
* **ConvMAE**: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2205.03892)][[PyTorch (in construction)](https://github.com/Alpha-VL/ConvMAE)]
* **UM-MAE**: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (*Nanjing University of Science and Technology*). [[Paper](https://arxiv.org/abs/2205.10063)][[PyTorch](https://github.com/implus/UM-MAE)]
* **GMML**: "GMML is All you Need", arXiv, 2022 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2205.14986)][[PyTorch](https://github.com/Sara-Ahmed/GMML)]
* **SIM**: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2206.01204)]
* **SupMAE**: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2205.14540)][[PyTorch](https://github.com/cmu-enyac/supmae)]
* **LoMaR**: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (*KAUST*). [[Paper](https://arxiv.org/abs/2206.00790)]
* **SAR**: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (*University of Trento, Italy*). [[Paper](https://arxiv.org/abs/2206.04636)]
* **ExtreMA**: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.04667)]
* **?**: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (*Nankai University*). [[Paper](https://arxiv.org/abs/2206.05184)]
* **?**: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (*Southwest Jiaotong University*). [[Paper](https://arxiv.org/abs/2206.04981)]
* **Jigsaw-ViT**: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (*KU Leuven, Belgium*). [[Paper](https://arxiv.org/abs/2207.11971)][[PyTorch](https://github.com/yingyichen-cyy/Nested-Co-teaching)][[Website](https://yingyichen-cyy.github.io/Jigsaw-ViT/)]
* **BEiT-v2**: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.06366)][[PyTorch](https://github.com/microsoft/unilm/tree/master/beit)]
* **MILAN**: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (*Princeton*). [[Paper](https://arxiv.org/abs/2208.06049)][[PyTorch (in construction)](https://github.com/zejiangh/MILAN)]
* **PSS**: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (*Franklin and Marshall College, Pennsylvania*). [[Paper](https://arxiv.org/abs/2208.09520)][[PyTorch](https://github.com/BradMcDanel/pss)]
* **dBOT**: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2209.03917)]
* **PatchErasing**: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2209.15006)]
* **Self-Distillation**: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2210.02871)]
* **AutoView**: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2210.08458)][[Code (in construction)](https://github.com/Trent-tangtao/AutoView)]
* **LOCA**: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2212.02400)]
* **FT-CLIP**: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2212.06138)][[Code (in construction)](https://github.com/LightDXY/FT-CLIP)]
* **MixPro**: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (*Beijing University of Chemical Technology*). [[Paper](https://arxiv.org/abs/2304.12043)][[PyTorch (in construction)](https://github.com/fistyee/MixPro)]
* **ConMIM**: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2205.09616)][[Pytorch](https://github.com/TencentARC/ConMIM)]
* **ccMIM**: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (*Shanghai Jiao Tong*). [[Paper](https://openreview.net/forum?id=A3sgyt4HWp)]
* **CIM**: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (*Microsoft*). [[Paper](https://openreview.net/forum?id=09hVcSDkea)]
* **MFM**: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (*NTU, Singapore*). [[Paper](https://openreview.net/forum?id=9-umxtNPx5E)][[Website](https://www.mmlab-ntu.com/project/mfm/index.html)]
* **Mask3D**: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2302.14746)]
* **VisualAtom**: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (*National Institute of Advanced Industrial Science and Technology (AIST), Japan*). [[Paper](https://arxiv.org/abs/2303.01112)][[PyTorch](https://github.com/masora1030/CVPR2023-FDSL-on-VisualAtom)][[Website](https://masora1030.github.io/Visual-Atoms-Pre-training-Vision-Transformers-with-Sinusoidal-Waves/)]
* **MixedAE**: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2303.17152)]
* **TBM**: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (*Singapore University of Technology and Design*). [[Paper](https://arxiv.org/abs/2304.04175)]
* **LGSimCLR**: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (*UMich*). [[Paper](https://arxiv.org/abs/2302.12248)][[PyTorch](https://github.com/mbanani/lgssl)]
* **DisCo-CLIP**: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2304.08480)][[PyTorch (in construction)](https://github.com/IDEA-Research/DisCo-CLIP)]
* **MaskCLIP**: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.12262)][[Code (in construction)](https://github.com/LightDXY/MaskCLIP)]
* **MAGE**: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2211.09117)][[PyTorch](https://github.com/LTH14/mage)]
* **MixMIM**: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2205.13137)][[PyTorch](https://github.com/Sense-X/MixMIM)]
* **iTPN**: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.12735)][[PyTorch](https://github.com/sunsmarterjie/iTPN)]
* **DropKey**: "DropKey for Vision Transformer", CVPR, 2023 (*Meitu*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Li_DropKey_for_Vision_Transformer_CVPR_2023_paper.html)]
* **FlexiViT**: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.08013)][[Tensorflow](https://github.com/google-research/big_vision)]
* **RA-CLIP**: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (*Alibaba*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Xie_RA-CLIP_Retrieval_Augmented_Contrastive_Language-Image_Pre-Training_CVPR_2023_paper.html)]
* **CLIPPO**: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.08045)][[JAX](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/clippo/README.md)]
* **DMAE**: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (*JHU + UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2208.12256)][[PyTorch](https://github.com/UCSC-VLAA/DMAE)]
* **HPM**: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2304.05919)][[PyTorch](https://github.com/Haochen-Wang409/HPM)]
* **LocalMIM**: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2303.05251)]
* **MaskAlign**: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2211.08887)][[PyTorch](https://github.com/OpenDriveLab/maskalign)]
* **RILS**: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.06958)][[Code (in construction)](https://github.com/hustvl/RILS)]
* **RelaxMIM**: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2208.04164)]
* **FDT**: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.14865)][[Code (in construction)](https://github.com/yuxiaochen1103/FDT)]
* **?**: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2206.01125)]
* **OpenCLIP**: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (*LAION*). [[Paper](https://arxiv.org/abs/2212.07143)][[PyTorch](https://github.com/LAION-AI/scaling-laws-openclip)]
* **DiHT**: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.02280)][[PyTorch](https://github.com/facebookresearch/diht)]
* **M3I-Pretraining**: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2211.09807)][[Code (in construction)](https://github.com/OpenGVLab/M3I-Pretraining)]
* **SN-Net**: "Stitchable Neural Networks", CVPR, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2302.06586)][[PyTorch](https://github.com/ziplab/SN-Net)]
* **MAE-Lite**: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2205.14443)][[PyTorch](https://github.com/wangsr126/mae-lite)]
* **ViT-22B**: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2302.05442)]
* **GHN-3**: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2303.04143)][[PyTorch](https://github.com/SamsungSAILMontreal/ghn3)]
* **A2MIM**: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (*Westlake University, China*). [[Paper](https://arxiv.org/abs/2205.13943)][[PyTorch](https://github.com/Westlake-AI/openmixup)]
* **PQCL**: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (*Alibaba*). [[Paper](https://openreview.net/forum?id=Si9pBgOGeD)][[PyTorch](https://github.com/Sherrylone/Query_Contrastive)]
* **DreamTeacher**: "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2307.07487)][[Website](https://research.nvidia.com/labs/toronto-ai/DreamTeacher/)]
* **OFDB**: "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 (*National Institute of Advanced Industrial Science and Technology (AIST), Japan*). [[Paper](https://arxiv.org/abs/2307.14710)][[PyTorch](https://github.com/ryoo-nakamura/OFDB/)]
* **MFF**: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2308.00261)][[PyTorch](https://github.com/open-mmlab/mmpretrain)]
* **TL-Align**: "Token-Label Alignment for Vision Transformers", ICCV, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2210.06455)][[PyTorch](https://github.com/Euphoria16/TL-Align)]
* **SMMix**: "SMMix: Self-Motivated Image Mixing for Vision Transformers", ICCV, 2023 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2212.12977)][[PyTorch](https://github.com/ChenMnZ/SMMix)]
* **DiffMAE**: "Diffusion Models as Masked Autoencoders", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.03283)][[Website](https://weichen582.github.io/diffmae.html)]
* **MAWS**: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2303.13496)][[PyTorch](https://github.com/facebookresearch/maws)]
* **CountBench**: "Teaching CLIP to Count to Ten", ICCV, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2302.12066)]
* **CLIPpy**: "Perceptual Grouping in Vision-Language Models", ICCV, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2210.09996)]
* **CiT**: "CiT: Curation in Training for Effective Vision-Language Data", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.02241)][[PyTorch](https://github.com/facebookresearch/CiT)]
* **I-JEPA**: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.08243)]
* **EfficientTrain**: "EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2211.09703)][[PyTorch](https://github.com/LeapLabTHU/EfficientTrain)]
* **StableRep**: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.00984)][[PyTorch](https://github.com/google-research/syn-rep-learn)]
* **LaCLIP**: "Improving CLIP Training with Language Rewrites", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2305.20088)][[PyTorch](https://github.com/LijieFan/LaCLIP)]
* **DesCo**: "DesCo: Learning Object Recognition with Rich Language Descriptions", NeurIPS, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2306.14060)]
* **?**: "Stable and low-precision training for large-scale vision-language models", NeurIPS, 2023 (*UW*). [[Paper](https://arxiv.org/abs/2304.13013)]
* **CapPa**: "Image Captioners Are Scalable Vision Learners Too", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2306.07915)][[JAX](https://github.com/google-research/big_vision)]
* **IV-CL**: "Does Visual Pretraining Help End-to-End Reasoning?", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2307.08506)]
* **CLIPA**: "An Inverse Scaling Law for CLIP Training", NeurIPS, 2023 (*UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2305.07017)][[PyTorch](https://github.com/UCSC-VLAA/CLIPA)]
* **Hummingbird**: "Towards In-context Scene Understanding", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2306.01667)]
* **RevColV2**: "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", NeurIPS, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2309.01005)][[PyTorch](https://github.com/megvii-research/RevCol)]
* **ALIA**: "Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation", NeurIPS, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2305.16289)][[PyTorch](https://github.com/lisadunlap/ALIA)]
* **?**: "Improving Multimodal Datasets with Image Captioning", NeurIPS (Datasets and Benchmarks), 2023 (*UW*). [[Paper](https://arxiv.org/abs/2307.10350)]
* **CCViT**: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2303.04664)]
* **SoftCLIP**: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2303.17561)]
* **RECLIP**: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.06028)]
* **DINOv2**: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.07193)]
* **?**: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.13089)]
* **Filter**: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2305.05095)]
* **?**: "Improved baselines for vision-language pre-training", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2305.08675)]
* **3T**: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2305.16999)]
* **ADDP**: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (*CUHK + Tsinghua*). [[Paper](https://arxiv.org/abs/2306.05423)]
* **MOFI**: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2306.07952)]
* **MaPeT**: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (*UniMoRE, Italy*). [[Paper](https://arxiv.org/abs/2306.07346)][[PyTorch](https://github.com/aimagelab/MaPeT)]
* **RECO**: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.07196)]
* **CLIPA-v2**: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (*UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2306.15658)][[PyTorch](https://github.com/UCSC-VLAA/CLIPA)]
* **PatchMixing**: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (*Boston*). [[Paper](https://arxiv.org/abs/2306.17848)][[Website](https://arielnlee.github.io/PatchMixing/)]
* **SN-Netv2**: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2307.00154)][[PyTorch (in construction)](https://github.com/ziplab/SN-Netv2)]
* **CLIP-GPT**: "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 (*Dublin City University, Ireland*). [[Paper](https://arxiv.org/abs/2307.11661)]
* **FlexPredict**: "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2308.00566)]
* **Soft-MoE**: "From Sparse to Soft Mixtures of Experts", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2308.00951)]
* **DropPos**: "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2309.03576)][[PyTorch](https://github.com/Haochen-Wang409/DropPos)]
* **MIRL**: "Masked Image Residual Learning for Scaling Deeper Vision Transformers", NeurIPS, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2309.14136)]
* **CMM**: "Investigating the Limitation of CLIP Models: The Worst-Performing Categories", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2310.03324)]
* **LC-MAE**: "Longer-range Contextualized Masked Autoencoder", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2310.13593)]
* **SILC**: "SILC: Improving Vision Language Pretraining with Self-Distillation", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2310.13355)]
* **CLIPTex**: "CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2310.14108)]
* **NxTP**: "Object Recognition as Next Token Prediction", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.02142)][[PyTorch](https://github.com/kaiyuyue/nxtp)]
* **?**: "Scaling Laws of Synthetic Images for Model Training ... for Now", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.04567)][[PyTorch](https://github.com/google-research/syn-rep-learn)]
* **SynCLR**: "Learning Vision from Models Rivals Learning Vision from Data", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.17742)][[PyTorch](https://github.com/google-research/syn-rep-learn)]
* **EWA**: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.06093)]
* **DTM**: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2401.00254)]
* **SSAT**: "Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders", WACV, 2024 (*UNC Charlotte*). [[Paper](https://arxiv.org/abs/2310.20704)][[Code (in construction)](https://github.com/dominickrei/Limited-data-vits)]
* **FEC**: "Neural Clustering based Visual Representation Learning", CVPR, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2403.17409)]
* **EfficientTrain++**: "EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training", TPAMI, 2024 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2405.08768)][[PyTorch](https://github.com/LeapLabTHU/EfficientTrain)]
* **DVT**: "Denoising Vision Transformers", arXiv, 2024 (*USC*). [[Paper](https://arxiv.org/abs/2401.02957)][[PyTorch](https://github.com/Jiawei-Yang/Denoising-ViT)][[Website](https://jiawei-yang.github.io/DenoisingViT/)]
* **AIM**: "Scalable Pre-training of Large Autoregressive Image Models", arXiv, 2024 (*Apple*). [[Paper](https://arxiv.org/abs/2401.08541)][[PyTorch](https://github.com/apple/ml-aim)]
* **DDM**: "Deconstructing Denoising Diffusion Models for Self-Supervised Learning", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2401.14404)]
* **CrossMAE**: "Rethinking Patch Dependence for Masked Autoencoders", arXiv, 2024 (*Berkeley*). [[Paper](https://arxiv.org/abs/2401.14391)][[PyTorch](https://github.com/TonyLianLong/CrossMAE)][[Website](https://crossmae.github.io/)]
* **IWM**: "Learning and Leveraging World Models in Visual Representation Learning", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2403.00504)]
* **?**: "Can Generative Models Improve Self-Supervised Representation Learning?", arXiv, 2024 (*Vector Institute*). [[Paper](https://arxiv.org/abs/2403.05966)]
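
Many of the language-image entries above (e.g., OpenCLIP, LaCLIP, CLIPA, SILC) build on the same symmetric contrastive (InfoNCE) objective over paired image and text embeddings. The snippet below is a minimal, illustrative PyTorch sketch of that loss; the embedding dimension, batch size, and temperature are assumptions for the example, not values taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from two encoders (assumed here).
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Toy usage with random features standing in for encoder outputs
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```
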
#### Robustness + Transformer
* **ViT-Robustness**: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2103.14586)]
* **SAGA**: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (*University of Connecticut*). [[Paper](https://arxiv.org/abs/2104.02610)]
* **?**: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (*KAIST*). [[Paper](https://arxiv.org/abs/2110.02797)][[PyTorch](https://github.com/phibenz/robustness_comparison_vit_mlp-mixer_cnn)]
* **ViTs-vs-CNNs**: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (*JHU + UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2111.05464)][[PyTorch](https://github.com/ytongbai/ViTs-vs-CNNs)]
* **T-CNN**: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.05795)]
* **Transformer-Attack**: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (*Xi'an Jiaotong*). [[Paper](https://arxiv.org/abs/2103.15670)]
* **?**: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (*University of Rennes*). [[Paper](https://arxiv.org/abs/2106.03734)]
* **?**: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (*ANU*). [[Paper](https://arxiv.org/abs/2106.04169)][[PyTorch](https://github.com/Muzammal-Naseer/Improving-Adversarial-Transferability-of-Vision-Transformers)]
* **?**: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (*University of Pittsburgh*). [[Paper](https://arxiv.org/abs/2106.13122)]
* **Token-Attack**: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (*New York University*). [[Paper](https://arxiv.org/abs/2110.04337)]
* **?**: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2111.10493)]
* **?**: "Vision Transformers are Robust Learners", AAAI, 2022 (*PyImageSearch + IBM*). [[Paper](https://arxiv.org/abs/2105.07581)][[Tensorflow](https://github.com/sayakpaul/robustness-vit)]
* **PNA**: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (*Fudan + Maryland*). [[Paper](https://arxiv.org/abs/2109.04176)][[PyTorch](https://github.com/zhipeng-wei/PNA-PatchOut)]
* **MIA-Former**: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (*Rice University*). [[Paper](https://arxiv.org/abs/2112.11542)]
* **Patch-Fool**: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (*Rice University*). [[Paper](https://arxiv.org/abs/2203.08392)][[PyTorch](https://github.com/RICE-EIC/Patch-Fool)]
* **Generalization-Enhanced-ViT**: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (*Beihang University + NTU, Singapore*). [[Paper](https://arxiv.org/abs/2106.07617)]
* **ECViT**: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2203.08519)]
* **Attention-Fool**: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (*Bosch*). [[Paper](https://arxiv.org/abs/2203.13639)]
* **Memory-Token**: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2203.15243)]
* **APRIL**: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2112.14087)]
* **Smooth-ViT**: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (*MIT*). [[Paper](https://arxiv.org/abs/2110.07719)][[PyTorch](https://github.com/MadryLab/smoothed-vit)]
* **RVT**: "Towards Robust Vision Transformer", CVPR, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2105.07926)][[PyTorch](https://github.com/alibaba/easyrobust)]
* **Pyramid**: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2111.15121)]
* **VARS**: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (*Berkeley + Microsoft*). [[Paper](https://arxiv.org/abs/2204.10962)][[PyTorch](https://github.com/bfshi/VARS)]
* **FAN**: "Understanding The Robustness in Vision Transformers", ICML, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2204.12451)][[PyTorch](https://github.com/NVlabs/FAN)]
* **CFA**: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (*The University of Tokyo*). [[Paper](https://arxiv.org/abs/2206.13951)][[PyTorch](https://github.com/kojima-takeshi188/CFA)]
* **?**: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (*University of Exeter, UK*). [[Paper](https://arxiv.org/abs/2208.00906)][[PyTorch](https://github.com/TrustAI/ODE4RobustViT)]
* **?**: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2207.11347)]
* **AGAT**: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2207.10498)]
* **?**: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2111.10659)]
* **ViP**: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (*UC Santa Cruz*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/7173_ECCV_2022_paper.php)][[PyTorch](https://github.com/UCSC-VLAA/vit_cert)]
* **?**: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2210.07540)][[PyTorch](https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers)]
* **PAR**: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2112.03492)]
* **RobustViT**: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (*Tel-Aviv*). [[Paper](https://arxiv.org/abs/2206.01161)][[PyTorch](https://github.com/hila-chefer/RobustViT)]
* **?**: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2110.07858)]
* **NVD**: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (*Boston*). [[Paper](https://openreview.net/forum?id=Aisi2oEq1sc)]
* **?**: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (*UW-Madison*). [[Paper](https://arxiv.org/abs/2203.09125)]
* **MA**: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.12204)]
* **?**: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (*Fudan + Microsoft*). [[Paper](https://arxiv.org/abs/2204.12143)]
* **?**: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (*Tokyo Metropolitan University*). [[Paper](https://arxiv.org/abs/2205.12041)]
* **FedWAvg**: "Federated Adversarial Training with Transformers", arXiv, 2022 (*Institute of Electronics and Digital Technologies (IETR), France*). [[Paper](https://arxiv.org/abs/2206.02131)]
* **Backdoor-Transformer**: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (*Maryland + UC Davis*). [[Paper](https://arxiv.org/abs/2206.08477)][[Code (in construction)](https://github.com/UCDvision/backdoor_transformer)]
* **?**: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2206.12381)]
* **?**: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (*Tokyo Metropolitan University*). [[Paper](https://arxiv.org/abs/2207.05366)]
* **?**: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2208.09602)]
* **CLIPping Privacy**: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2209.07341)]
* **?**: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (*EPFL*). [[Paper](https://arxiv.org/abs/2209.07399)]
* **?**: "Attacking Compressed Vision Transformers", arXiv, 2022 (*NYU*). [[Paper](https://arxiv.org/abs/2209.13785)]
* **C-AVP**: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (*Michigan State*). [[Paper](https://arxiv.org/abs/2210.06284)]
* **?**: "Curved Representation Space of Vision Transformers", arXiv, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2210.05742)]
* **RKDE**: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2210.05794)]
* **MRAP**: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (*Arizona State University*). [[Paper](https://arxiv.org/abs/2210.07663)]
* **model-soup**: "Revisiting adapters with adversarial training", ICLR, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2210.04886)]
* **?**: "Budgeted Training for Vision Transformer", ICLR, 2023 (*Tsinghua*). [[Paper](https://openreview.net/forum?id=sVzBN-DlJRi)]
* **RobustCNN**: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (*UC Santa Cruz + JHU*). [[Paper](https://arxiv.org/abs/2206.03452)][[PyTorch](https://github.com/UCSC-VLAA/RobustCNN)]
* **DMAE**: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2210.06983)][[PyTorch](https://github.com/quanlin-wu/dmae)]
* **TGR**: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.15754)][[PyTorch](https://github.com/jpzhang1810/TGR)]
* **TrojViT**: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (*Indiana University Bloomington*). [[Paper](https://arxiv.org/abs/2208.13049)]
* **RSPC**: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (*MPI*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Guo_Improving_Robustness_of_Vision_Transformers_by_Reducing_Sensitivity_To_Patch_CVPR_2023_paper.html)]
* **TORA-ViT**: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (*The University of Sydney*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Li_Trade-Off_Between_Robustness_and_Accuracy_of_Vision_Transformers_CVPR_2023_paper.html)]
* **BadViT**: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Yuan_You_Are_Catching_My_Attention_Are_Vision_Transformers_Bad_Learners_CVPR_2023_paper.html)]
* **?**: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (*University of Pittsburgh*). [[Paper](https://openreview.net/forum?id=GR4c6Onxfw)]
* **RobustMAE**: "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2308.10315)][[PyTorch (in construction)](https://github.com/shikiw/RobustMAE)]
* **?**: "Efficiently Robustify Pre-trained Models", ICCV, 2023 (*IIT Roorkee, India*). [[Paper](https://arxiv.org/abs/2309.07499)]
* **?**: "Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients", ICCV, 2023 (*Tsinghua*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Ma_Transferable_Adversarial_Attack_for_Both_Vision_Transformers_and_Convolutional_Networks_ICCV_2023_paper.html)]
* **CleanCLIP**: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", ICCV, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2303.03323)][[PyTorch](https://github.com/nishadsinghi/CleanCLIP)]
* **QBBA**: "Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks", BMVC, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2309.06438)]
* **RBFormer**: "RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias", BMVC, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2309.13245)]
* **PreLayerNorm**: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (*POSTECH*). [[Paper](https://arxiv.org/abs/2111.08413)]
* **CertViT**: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (*INRIA*). [[Paper](https://arxiv.org/abs/2302.10287)][[PyTorch](https://github.com/sagarverma/transformer-lipschitz)]
* **RoCLIP**: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2303.06854)]
* **DeepMIM**: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2303.08817)][[Code (in construction)](https://github.com/OliverRensu/DeepMIM)]
* **TAP-ADL**: "Robustifying Token Attention for Vision Transformers", ICCV, 2023 (*MPI*). [[Paper](https://arxiv.org/abs/2303.11126)][[PyTorch](https://github.com/guoyongcs/TAPADL)]
* **EWA**: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.06093)]
* **SlowFormer**: "SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers", arXiv, 2023 (*UC Davis*). [[Paper](https://arxiv.org/abs/2310.02544)][[PyTorch](https://github.com/UCDvision/SlowFormer)]
* **DTM**: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2401.00254)]
* **SWARM**: "Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers", CVPR, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2405.10612)][[Code (in construction)](https://github.com/20000yshust/SWARM)]
* **?**: "Safety of Multimodal Large Language Models on Images and Text", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2402.00357)]
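
A recurring baseline in the robustness papers above is evaluating a ViT under small adversarial perturbations. The sketch below implements a single-step FGSM attack in PyTorch as a generic point of reference; the model stub, epsilon, and toy data are assumptions for illustration, and the snippet does not reproduce the protocol of any particular paper in this list.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=8 / 255):
    """One-step FGSM: perturb inputs along the sign of the input gradient.

    model: any differentiable classifier (e.g., a ViT); assumed here.
    images: (B, C, H, W) in [0, 1]; labels: (B,) class indices.
    """
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Move each pixel by +/- epsilon in the direction that increases the loss
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0, 1).detach()

if __name__ == "__main__":
    # Toy stand-in classifier so the sketch runs end-to-end
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(4, 3, 32, 32)
    y = torch.randint(0, 10, (4,))
    x_adv = fgsm_attack(model, x, y)
    print((x_adv - x).abs().max().item())  # roughly equal to epsilon
```
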
#### Model Compression + Transformer
* **ViT-quant**: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2106.14156)]
* **VTP**: "Visual Transformer Pruning", arXiv, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2104.08500)]
* **MD-ViT**: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (*Princeton*). [[Paper](https://arxiv.org/abs/2201.00043)]
* **FQ-ViT**: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2111.13824)][[PyTorch](https://github.com/linyang-zhh/FQ-ViT)]
* **UVC**: "Unified Visual Transformer Compression", ICLR, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2203.08243)][[PyTorch](https://github.com/VITA-Group/UVC)]
* **MiniViT**: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2204.07154)][[PyTorch](https://github.com/microsoft/Cream/tree/main/MiniViT)]
* **Auto-ViT-Acc**: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2208.05163)]
* **APQ-ViT**: "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2303.14341)]
* **SPViT**: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2112.13890)][[PyTorch](https://github.com/PeiyanFlying/SPViT)]
* **PSAQ-ViT**: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2203.02250)][[PyTorch](https://github.com/zkkli/PSAQ-ViT)]
* **PTQ4ViT**: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2111.12293)]
* **EAPruning**: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2210.00181)]
* **Q-ViT**: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2210.06707)][[PyTorch](https://github.com/YanjingLi0202/Q-ViT)]
* **SAViT**: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 (*Hikvision*). [[Paper](https://openreview.net/forum?id=w5DacXWzQ-Q)]
* **VTC-LFC**: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 (*Alibaba*). [[Paper](https://openreview.net/forum?id=HuiLIB6EaOk)][[PyTorch](https://github.com/Daner-Wang/VTC-LFC)]
* **Q-ViT**: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2201.07703)]
* **VAQF**: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2201.06618)]
* **VTP**: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (*UCLA*). [[Paper](https://arxiv.org/abs/2203.13444)]
* **SiDT**: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (*UC Irvine*). [[Paper](https://arxiv.org/abs/2204.07722)]
* **PSAQ-ViT-V2**: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2209.05687)][[PyTorch](https://github.com/zkkli/PSAQ-ViT)]
* **AS**: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2209.13802)]
* **SaiT**: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2210.05832)]
* **oViT**: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (*IST Austria*). [[Paper](https://arxiv.org/abs/2210.09223)]
* **CPT-V**: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2211.09643)]
* **TPS**: "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2304.10716)][[PyTorch](https://github.com/megvii-research/TPS-CVPR2023)]
* **GPUSQ-ViT**: "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2305.10727)]
* **X-Pruner**: "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 (*James Cook University, Australia*). [[Paper](https://arxiv.org/abs/2303.04935)][[PyTorch (in construction)](https://github.com/vickyyu90/XPruner)]
* **NoisyQuant**: "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2211.16056)]
* **NViT**: "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2110.04869)]
* **BinaryViT**: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2306.16678)][[PyTorch](https://github.com/phuoc-hoan-le/binaryvit)]
* **OFQ**: "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2302.02210)][[PyTorch](https://github.com/nbasyl/OFQ)]
* **UPop**: "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2301.13741)][[PyTorch](https://github.com/sdc17/UPop)]
* **COMCAT**: "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 (*Rutgers*). [[Paper](https://arxiv.org/abs/2305.17235)][[PyTorch](https://github.com/jinqixiao/ComCAT)]
* **Evol-Q**: "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2308.10814)][[Code (in construction)](https://github.com/enyac-group/evol-q)]
* **BiViT**: "BiViT: Extremely Compressed Binary Vision Transformer", ICCV, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2211.07091)]
* **I-ViT**: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2207.01405)][[PyTorch](https://github.com/zkkli/I-ViT)]
* **RepQ-ViT**: "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2212.08254)][[PyTorch](https://github.com/zkkli/RepQ-ViT)]
* **LLM-FP4**: "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", EMNLP, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2310.16836)][[Code (in construction)](https://github.com/nbasyl/LLM-FP4)]
* **Q-HyViT**: "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 (*Electronics and Telecommunications Research Institute (ETRI), Korea*). [[Paper](https://arxiv.org/abs/2303.12557)]
* **Bi-ViT**: "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2305.12354)]
* **BinaryViT**: "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2305.14730)]
* **Zero-TP**: "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 (*Princeton*). [[Paper](https://arxiv.org/abs/2305.17328)]
* **?**: "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2306.12929)]
* **VVTQ**: "Variation-aware Vision Transformer Quantization", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2307.00331)][[PyTorch](https://github.com/HuangOwen/VVTQ)]
* **DIMAP**: "Data-independent Module-aware Pruning for Hierarchical Vision Transformers", ICLR, 2024 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2404.13648)][[Code (in construction)](https://github.com/he-y/Data-independent-Module-Aware-Pruning)]
* **MADTP**: "MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer", CVPR, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2403.02991)][[Code (in construction)](https://github.com/double125/MADTP)]
* **DC-ViT**: "Dense Vision Transformer Compression with Few Samples", CVPR, 2024 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2403.18708)]
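
Several of the compression entries above (e.g., PTQ4ViT, FQ-ViT, RepQ-ViT) start from the same primitive: uniform quantization of a layer's weights after training. The sketch below shows symmetric, per-tensor fake quantization of a single weight tensor in PyTorch; the 8-bit setting and the max-based scale are illustrative assumptions, not the calibration scheme of any listed method.

```python
import torch

def quantize_symmetric(weight, num_bits=8):
    """Fake-quantize a tensor with symmetric, per-tensor uniform quantization.

    Returns the dequantized float tensor plus the integer codes and the scale,
    mimicking what post-training quantization applies to each ViT linear layer.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g., 127 for 8-bit
    scale = weight.abs().max() / qmax         # per-tensor scale (illustrative)
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale, q.to(torch.int8), scale

if __name__ == "__main__":
    w = torch.randn(768, 768)                 # a ViT-Base-sized linear weight
    w_dq, w_int8, s = quantize_symmetric(w)
    print("max abs error:", (w - w_dq).abs().max().item())
```
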

[[Back to Overview](#overview)]

### Attention-Free
#### MLP-Series
* **RepMLP**: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2105.01883)][[PyTorch](https://github.com/DingXiaoH/RepMLP)]
* **EAMLP**: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2105.02358)]
* **Forward-Only**: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (*Oxford*). [[Paper](https://arxiv.org/abs/2105.02723)][[PyTorch](https://github.com/lukemelas/do-you-even-need-attention)]
* **ResMLP**: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2105.03404)]
* **?**: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2105.15078)]
* **ViP**: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2106.12368)][[PyTorch](https://github.com/Andrew-Qibin/VisionPermutator)]
* **CCS**: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.14882)]
* **S2-MLPv2**: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2108.01072)]
* **RaftMLP**: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (*Rikkyo University, Japan*). [[Paper](https://arxiv.org/abs/2108.04384)][[PyTorch](https://github.com/okojoalg/raft-mlp)]
* **Hire-MLP**: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2108.13341)]
* **Sparse-MLP**: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (*NUS*). [[Paper](https://arxiv.org/abs/2109.02008)]
* **ConvMLP**: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (*University of Oregon*). [[Paper](https://arxiv.org/abs/2109.04454)][[PyTorch](https://github.com/SHI-Labs/Convolutional-MLPs)]
* **sMLP**: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2109.05422)]
* **MLP-Mixer**: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2105.01601)][[Tensorflow](https://github.com/google-research/vision_transformer)][[PyTorch-1 (lucidrains)](https://github.com/lucidrains/mlp-mixer-pytorch)][[PyTorch-2 (rishikksh20)](https://github.com/rishikksh20/MLP-Mixer-pytorch)]
* **gMLP**: "Pay Attention to MLPs", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2105.08050)][[PyTorch (antonyvigouret)](https://github.com/antonyvigouret/Pay-Attention-to-MLPs)]
* **S2-MLP**: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.07477)]
* **CycleMLP**: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (*HKU*). [[Paper](https://arxiv.org/abs/2107.10224)][[PyTorch](https://github.com/ShoufaChen/CycleMLP)]
* **AS-MLP**: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (*ShanghaiTech University*). [[Paper](https://arxiv.org/abs/2107.08391)][[PyTorch](https://github.com/svip-lab/AS-MLP)]
* **Wave-MLP**: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2111.12294)][[PyTorch](https://github.com/huawei-noah/CV-Backbones/tree/master/wavemlp_pytorch)]
* **DynaMixer**: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2201.12083)][[PyTorch](https://github.com/ziyuwwang/DynaMixer)]
* **STD**: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (*Huawei*). [[Paper](https://proceedings.mlr.press/v162/li22c.html)]
* **AMixer**: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (*Tsinghua University*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/4464_ECCV_2022_paper.php)]
* **MS-MLP**: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2202.06510)]
* **ActiveMLP**: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.06108)]
* **MDMLP**: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (*Jiangsu University*). [[Paper](https://arxiv.org/abs/2205.14477)][[PyTorch](https://github.com/Amoza-Theodore/MDMLP)]
* **PosMLP**: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2207.07284)][[PyTorch](https://github.com/Zhicaiwww/PosMLP)]
* **SplitMixer**: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (*Quintic AI, California*). [[Paper](https://arxiv.org/abs/2207.10255)][[PyTorch](https://github.com/aliborji/splitmixer)]
* **gSwin**: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (*PKSHATechnology, Japan*). [[Paper](https://arxiv.org/abs/2208.11718)]
* **?**: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (*Berkeley*). [[Paper](https://arxiv.org/abs/2209.06383)]
* **AFFNet**: "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2307.14008)]
* **Strip-MLP**: "Strip-MLP: Efficient Token Interaction for Vision MLP", ICCV, 2023 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2307.11458)][[PyTorch](https://github.com/Med-Process/Strip_MLP)]
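
The MLP-series backbones above replace self-attention with MLPs applied alternately across tokens and channels; MLP-Mixer is the canonical formulation. Below is a minimal sketch of one Mixer-style block in PyTorch; the hidden sizes and patch count are illustrative defaults, and the code is a simplified reading of the idea rather than the reference implementation.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer-style block: token-mixing MLP, then channel-mixing MLP."""

    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):                      # x: (B, num_patches, dim)
        # Token mixing operates across the patch axis
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing operates across the feature axis
        x = x + self.channel_mlp(self.norm2(x))
        return x

if __name__ == "__main__":
    block = MixerBlock(num_patches=196, dim=768)
    print(block(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 196, 768])
```
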
#### Other Attention-Free
* **DWNet**: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (*Nankai University*). [[Paper](https://arxiv.org/abs/2106.04263)][[PyTorch](https://github.com/Atten4Vis/DemystifyLocalViT)]
* **PoolFormer**: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2111.11418)][[PyTorch](https://github.com/sail-sg/poolformer)]
* **ConvNext**: "A ConvNet for the 2020s", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2201.03545)][[PyTorch](https://github.com/facebookresearch/ConvNeXt)]
* **RepLKNet**: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2203.06717)][[MegEngine](https://github.com/MegEngine/RepLKNet)][[PyTorch](https://github.com/DingXiaoH/RepLKNet-pytorch)]
* **FocalNet**: "Focal Modulation Networks", NeurIPS, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.11926)][[PyTorch](https://github.com/microsoft/FocalNet)]
* **HorNet**: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2207.14284)][[PyTorch](https://github.com/raoyongming/HorNet)][[Website](https://hornet.ivg-research.xyz/)]
* **S4ND**: "S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces", NeurIPS, 2022 (*Stanford*). [[Paper](https://arxiv.org/abs/2210.06583)]
* **Sequencer**: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (*Rikkyo University, Japan*). [[Paper](https://arxiv.org/abs/2205.01972)]
* **MogaNet**: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (*Westlake University, China*). [[Paper](https://arxiv.org/abs/2211.03295)]
* **Conv2Former**: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2211.11943)]
* **CoC**: "Image as Set of Points", ICLR, 2023 (*Northeastern*). [[Paper](https://arxiv.org/abs/2303.01494)][[PyTorch](https://github.com/ma-xu/Context-Cluster)]
* **SLaK**: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2207.03620)][[PyTorch](https://github.com/VITA-Group/SLaK)]
* **ConvNeXt-V2**: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.00808)][[PyTorch](https://github.com/facebookresearch/ConvNeXt-V2)]
* **SPANet**: "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 (*Korea Institute of Science and Technology*). [[Paper](https://arxiv.org/abs/2308.11568)][[Code (in construction)](https://github.com/DoranLyong/SPANet-official)][[Website](https://doranlyong.github.io/projects/spanet/)]
* **DFFormer**: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (*Rikkyo University, Japan*). [[Paper](https://arxiv.org/abs/2303.03932)][[Code (in construction)](https://github.com/okojoalg/dfformer)]
* **?**: "ConvNets Match Vision Transformers at Scale", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2310.16764)]
* **VMamba**: "VMamba: Visual State Space Model", arXiv, 2024 (*CAS*). [[Paper](https://arxiv.org/abs/2401.10166)][[PyTorch](https://github.com/MzeroMiko/VMamba)]
* **Vim**: "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", arXiv, 2024 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2401.09417)][[PyTorch](https://github.com/hustvl/Vim)]
* **VRWKV**: "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.02308)][[PyTorch](https://github.com/OpenGVLab/Vision-RWKV)]
* **LocalMamba**: "LocalMamba: Visual State Space Model with Windowed Selective Scan", arXiv, 2024 (*University of Sydney*). [[Paper](https://arxiv.org/abs/2403.09338)][[PyTorch](https://github.com/hunto/LocalMamba)]
* **SiMBA**: "SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series", arXiv, 2024 (*Microsoft*). [[Paper](https://arxiv.org/abs/2403.15360)][[PyTorch](https://github.com/badripatro/Simba)]
* **PlainMamba**: "PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition", arXiv, 2024 (*University of Edinburgh, Scotland*). [[Paper](https://arxiv.org/abs/2403.17695)][[PyTorch](https://github.com/ChenhongyiYang/PlainMamba)]
* **EfficientVMamba**: "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba", arXiv, 2024 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2403.09977)][[PyTorch](https://github.com/TerryPei/EfficientVMamba)]
* **RDNet**: "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs", arXiv, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2403.19588)]
* **MambaOut**: "MambaOut: Do We Really Need Mamba for Vision?", arXiv, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2405.07992)][[PyTorch](https://github.com/yuweihao/MambaOut)]
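
Most of the other attention-free backbones above keep the MetaFormer block layout and only swap the token mixer (pooling, large-kernel convolution, state-space models, etc.). The sketch below follows the PoolFormer-style pooling mixer as the simplest instance; the pool size and feature-map shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """PoolFormer-style token mixer: average pooling with the identity subtracted."""

    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W) feature map of patch tokens
        # Subtracting x makes the mixer act purely as a residual branch
        return self.pool(x) - x

if __name__ == "__main__":
    mixer = PoolingTokenMixer()
    print(mixer(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```
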

[[Back to Overview](#overview)]

### Analysis for Transformer
* **Attention-CNN**: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (*EPFL*). [[Paper](https://openreview.net/forum?id=HJlnC1rKPB)][[PyTorch](https://github.com/epfml/attention-cnn)][[Website](https://epfml.github.io/attention-cnn/)]
* **Transformer-Explainability**: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2012.09838)][[PyTorch](https://github.com/hila-chefer/Transformer-Explainability)]
* **?**: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (*Princeton*). [[Paper](https://arxiv.org/abs/2105.07197)]
* **?**: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (*HKU*). [[Paper](https://arxiv.org/abs/2108.05305)]
* **?**: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2108.08810)]
* **?**: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2105.10497)][[PyTorch](https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers)]
* **FoveaTer**: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (*UCSB*). [[Paper](https://arxiv.org/abs/2105.14173)]
* **?**: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.04263)]
* **?**: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.07998)]
* **?**: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (*Horizon Robotic*). [[Paper](https://arxiv.org/abs/2107.02174)]
* **?**: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (*Temple University*). [[Paper](https://openaccess.thecvf.com/content/WACV2022/html/Black_Visualizing_Paired_Image_Similarity_in_Transformer_Networks_WACV_2022_paper.html)][[PyTorch](https://github.com/vidarlab/xformer-paired-viz)]
* **FDSL**: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (*AIST*). [[Paper](https://arxiv.org/abs/2103.13023)][[PyTorch](https://github.com/nakashima-kodai/FractalDB-Pretrained-ViT-PyTorch)][[Website](https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/)]
* **AlterNet**: "How Do Vision Transformers Work?", ICLR, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2202.06709)][[PyTorch](https://github.com/xxxnell/how-do-vits-work)]
* **?**: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2106.01548)][[Tensorflow](https://github.com/google-research/vision_transformer)]
* **?**: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (*Stanford*). [[Paper](https://arxiv.org/abs/2205.08078)]
* **?**: "Three things everyone should know about Vision Transformers", ECCV, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2203.09795)]
* **?**: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 (*Princeton*). [[Paper](https://arxiv.org/abs/2210.09221)]
* **AWD-ViT**: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2203.05922)]
* **?**: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (*Quintic AI, CA*). [[Paper](https://arxiv.org/abs/2203.11678)][[Code](https://github.com/aliborji/hybrid_images)]
* **MJP**: "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2205.12551)][[PyTorch](https://github.com/yhlleo/MJP)]
* **?**: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (*University of Electronic Science and Technology of China*). [[Paper](https://arxiv.org/abs/2206.11073)]
* **?**: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (*University of Groningen, The Netherlands*). [[Paper](https://arxiv.org/abs/2208.04693)]
* **?**: "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 (*Technion Israel Institute Of Technology*). [[Paper](https://arxiv.org/abs/2208.08191)]
* **ProtoPFormer**: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2208.10431)][[PyTorch](https://github.com/zju-vipa/ProtoPFormer)]
* **ICLIP**: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2209.07046)][[Code (in construction)](https://github.com/xmed-lab/ICLIP)]
* **?**: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2210.06313)]
* **?**: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (*Monash University*). [[Paper](https://arxiv.org/abs/2210.07646)][[PyTorch](https://github.com/byM1902/ViT_visualization)]
* **ViT-CX**: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2211.03064)]
* **?**: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2211.08543)]
* **IAV**: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2211.15428)]
* **ViT-Shapley**: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 (*UW*). [[Paper](https://arxiv.org/abs/2206.05282)][[PyTorch](https://github.com/suinleelab/vit-shapley)]
* **ImageNet-X**: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 (*Meta*). [[Paper](https://openreview.net/forum?id=HXz7Vcm3VgM)]
* **?**: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 (*Rensselaer Polytechnic Institute, NY*). [[Paper](https://openreview.net/forum?id=jClGv3Qjhb)]
* **?**: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2305.00729)][[PyTorch (in construction)](https://github.com/naver-ai/cl-vs-mim)]
* **?**: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2210.01936)]
* **CLIP-Dissect**: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 (*UCSD*). [[Paper](https://arxiv.org/abs/2204.10965)]
* **?**: "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2306.04898)]
* **?**: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 (*Maryland*). [[Paper](https://arxiv.org/abs/2212.03862)][[PyTorch](https://github.com/mwalmer-umd/vit_analysis)][[Website](https://www.cs.umd.edu/~sakshams/vit_analysis/)]
* **?**: "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2301.07836)]
* **?**: "On Data Scaling in Masked Image Modeling", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.04664)][[PyTorch](https://github.com/microsoft/SimMIM)]
* **?**: "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2205.13543)]
* **Vision-DiffMask**: "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2304.06391)][[PyTorch](https://github.com/AngelosNal/Vision-DiffMask)]
* **?**: "A Multidimensional Analysis of Social Biases in Vision Transformers", ICCV, 2023 (*University of Mannheim, Germany*). [[Paper](https://arxiv.org/abs/2308.01948)][[PyTorch](https://github.com/jannik-brinkmann/social-biases-in-vision-transformers)]
* **?**: "Analyzing Vision Transformers for Image Classification in Class Embedding Space", NeurIPS, 2023 (*Goethe University Frankfurt, Germany*). [[Paper](https://arxiv.org/abs/2310.18969)]
* **BoB**: "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks", NeurIPS, 2023 (*NYU*). [[Paper](https://arxiv.org/abs/2310.19909)][[PyTorch](https://github.com/hsouri/Battle-of-the-Backbones)]
* **ViT-CoT**: "Are Vision Transformers More Data Hungry Than Newborn Visual Systems?", NeurIPS, 2023 (*Indiana University Bloomington, Indiana*). [[Paper](https://arxiv.org/abs/2312.02843)]
* **AtMan**: "AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation", NeurIPS, 2023 (*Aleph Alpha, Germany*). [[Paper](https://arxiv.org/abs/2301.08110)][[PyTorch](https://github.com/Aleph-Alpha/AtMan)]
* **AttentionViz**: "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 (*Harvard*). [[Paper](https://arxiv.org/abs/2305.03210)][[Website](http://attentionviz.com/)]
* **?**: "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 (*POSTECH*). [[Paper](https://arxiv.org/abs/2305.04722)]
* **?**: "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 (*Maryland*). [[Paper](https://arxiv.org/abs/2306.07470)]
* **ViT-ReciproCAM**: "ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer", arXiv, 2023 (*Intel*). [[Paper](https://arxiv.org/abs/2310.02588)]
* **Eureka-moment**: "Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems", arXiv, 2023 (*Bosch*). [[Paper](https://arxiv.org/abs/2310.12956)]
* **INTR**: "A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis", arXiv, 2023 (*OSU*). [[Paper](https://arxiv.org/abs/2311.04157)][[PyTorch](https://github.com/Imageomics/INTR)]
* **?**: "Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention", AAAI, 2024 (*Korea Institute of Science and Technology (KIST)*). [[Paper](https://arxiv.org/abs/2402.04563)][[PyTorch](https://github.com/LeemSaebom/Attention-Guided-CAM-Visual-Explanations-of-Vision-Transformer-Guided-by-Self-Attention)]
* **RelatiViT**: "Can Transformers Capture Spatial Relations between Objects?", ICLR, 2024 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2403.00729)][[Code (in construction)](https://github.com/AlvinWen428/spatial-relation)][[Website](https://sites.google.com/view/spatial-relation)]
* **TokenTM**: "Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer", CVPR, 2024 (*Illinois Institute of Technology*). [[Paper](https://arxiv.org/abs/2403.14552)]
* **SaCo**: "On the Faithfulness of Vision Transformer Explanations", CVPR, 2024 (*Illinois Institute of Technology*). [[Paper](https://arxiv.org/abs/2404.01415)]
* **?**: "A Decade's Battle on Dataset Bias: Are We There Yet?", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2403.08632)][[Code (in construction)](https://github.com/liuzhuang13/bias)]
* **LeGrad**: "LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity", arXiv, 2024 (*University of Bonn, Germany*). [[Paper](https://arxiv.org/abs/2404.03214)][[PyTorch](https://github.com/WalBouss/LeGrad)]
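
Many of the analysis and explainability papers above start from per-layer attention maps. A common baseline for aggregating them is attention rollout, which multiplies head-averaged attention matrices (with the residual connection folded in) across layers. The sketch below assumes you have already collected the per-layer attention tensors from a ViT; it is a generic baseline, not the method of any single paper listed here.

```python
import torch

def attention_rollout(attentions):
    """Attention rollout: propagate head-averaged attention across layers.

    attentions: list of (B, heads, N, N) tensors, one per transformer layer
    (assumed to be collected from a ViT with attention outputs enabled).
    Returns a (B, N, N) map of how strongly each output token traces back to each input token.
    """
    rollout = None
    for attn in attentions:
        a = attn.mean(dim=1)                            # average over heads -> (B, N, N)
        a = a + torch.eye(a.size(-1), device=a.device)  # fold in the residual connection
        a = a / a.sum(dim=-1, keepdim=True)             # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout

if __name__ == "__main__":
    # Toy attention maps: 12 layers, 12 heads, 196 patches + 1 class token
    layers = [torch.rand(1, 12, 197, 197).softmax(dim=-1) for _ in range(12)]
    print(attention_rollout(layers).shape)  # torch.Size([1, 197, 197])
```
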

[[Back to Overview](#overview)]

## Detection
### Object Detection
* General:
* **detrex**: "detrex: Benchmarking Detection Transformers", arXiv, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2306.07265)][[PyTorch](https://github.com/IDEA-Research/detrex)]
* CNN-based backbone:
* **DETR**: "End-to-End Object Detection with Transformers", ECCV, 2020 (*Facebook*). [[Paper](https://arxiv.org/abs/2005.12872)][[PyTorch](https://github.com/facebookresearch/detr)]
* **Deformable DETR**: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2010.04159)][[PyTorch](https://github.com/fundamentalvision/Deformable-DETR)]
* **UP-DETR**: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2011.09094)][[PyTorch](https://github.com/dddzg/up-detr)]
* **SMCA**: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (*CUHK*). [[Paper](https://arxiv.org/abs/2108.02404)][[PyTorch](https://github.com/gaopengcuhk/SMCA-DETR)]
* **Conditional-DETR**: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2108.06152)]
* **PnP-DETR**: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (*Yitu*). [[Paper](https://arxiv.org/abs/2109.07036)][[Code (in construction)](https://github.com/twangnh/pnp-detr)]
* **TSP**: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (*CMU*). [[Paper](https://arxiv.org/abs/2011.10881)]
* **Dynamic-DETR**: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (*Microsoft*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Dai_Dynamic_DETR_End-to-End_Object_Detection_With_Dynamic_Attention_ICCV_2021_paper.html)]
* **ViT-YOLO**: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (*Xidian University*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021W/VisDrone/html/Zhang_ViT-YOLOTransformer-Based_YOLO_for_Object_Detection_ICCVW_2021_paper.html)]
* **ACT**: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (*Peking + CUHK*). [[Paper](https://arxiv.org/abs/2011.09315)][[PyTorch](https://github.com/gaopengcuhk/SMCA-DETR/)]
* **DIL-ViT**: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (*Monash University Malaysia*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0675.pdf)]
* **Efficient-DETR**: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2104.01318)]
* **CA-FPN**: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2105.09464)]
* **DETReg**: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (*Tel-Aviv + Berkeley*). [[Paper](https://arxiv.org/abs/2106.04550)][[Website](https://www.amirbar.net/detreg/)]
* **GQPos**: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2108.09691)]
* **Anchor-DETR**: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2109.07107)][[PyTorch](https://github.com/megvii-research/AnchorDETR)]
* **Sparse-DETR**: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (*Kakao*). [[Paper](https://arxiv.org/abs/2111.14330)][[PyTorch](https://github.com/kakaobrain/sparse-detr)]
* **DAB-DETR**: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2201.12329)][[PyTorch](https://github.com/SlongLiu/DAB-DETR)]
* **DN-DETR**: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (*International Digital Economy Academy (IDEA), China*). [[Paper](https://arxiv.org/abs/2203.01305)][[PyTorch](https://github.com/FengLi-ust/DN-DETR)]
* **SAM-DETR**: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2203.06883)][[PyTorch](https://github.com/ZhangGongjie/SAM-DETR)]
* **AdaMixer**: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2203.16507)][[Code (in construction)](https://github.com/MCG-NJU/AdaMixer)]
* **DESTR**: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (*Oregon State*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/He_DESTR_Object_Detection_With_Split_Transformer_CVPR_2022_paper.html)]
* **REGO**: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2112.04632)][[PyTorch](https://github.com/zhechen/Deformable-DETR-REGO)]
* **?**: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (*Ant Group*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Hong_Training_Object_Detectors_From_Scratch_An_Empirical_Study_in_the_CVPR_2022_paper.html)]
* **DE-DETR**: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2203.09507)][[PyTorch](https://github.com/encounter1997/DE-DETRs)]
* **DFFT**: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2206.06829)]
* **Cornerformer**: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (*Huawei*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/4286_ECCV_2022_paper.php)]
* **?**: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (*Microsoft*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/8094_ECCV_2022_paper.php)][[Code (in construction)](https://github.com/SwinTransformer/Simple-21K-Detection)]
* **Obj2Seq**: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2209.13948)][[PyTorch](https://github.com/CASIA-IVA-Lab/Obj2Seq)]
* **KA**: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2203.03187)]
* **TCC**: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2207.06603)]
* **Conditional-DETR-V2**: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2207.08914)]
* **SAM-DETR++**: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2207.14172)][[PyTorch](https://github.com/ZhangGongjie/SAM-DETR)]
* **ComplETR**: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2209.05654)]
* **Pair-DETR**: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2210.16476)]
* **Group-DETR-v2**: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2211.03594)]
* **KD-DETR**: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2211.08071)]
* **D3ETR**: "D3ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2211.09768)]
* **Teach-DETR**: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2211.11953)][[Code (in construction)](https://github.com/LeonHLJ/Teach-DETR)]
* **DETA**: "NMS Strikes Back", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2212.06137)][[PyTorch](https://github.com/jozhang97/DETA)]
* **ViT-Adapter**: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2205.08534)][[PyTorch](https://github.com/czczup/ViT-Adapter)]
* **DDQ**: "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.12776)][[PyTorch](https://github.com/jshilong/DDQ)]
* **SiameseDETR**: "Siamese DETR", CVPR, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2303.18144)][[PyTorch](https://github.com/Zx55/SiameseDETR)]
* **SAP-DETR**: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.02006)]
* **Q-DETR**: "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2304.00253)][[Code (in construction)](https://github.com/SteveTsui/Q-DETR)]
* **Lite-DETR**: "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.07335)][[PyTorch](https://github.com/IDEA-Research/Lite-DETR)]
* **H-DETR**: "DETRs with Hybrid Matching", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.13080)][[PyTorch](https://github.com/HDETR)]
* **MaskDINO**: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2206.02777)][[PyTorch](https://github.com/IDEACVR/MaskDINO)]
* **IMFA**: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2208.11356)][[Code (in construction)](https://github.com/ZhangGongjie/IMFA)]
* **SQR**: "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2212.07593)][[PyTorch](https://github.com/Fangyi-Chen/SQR)]
* **DQ-Det**: "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2307.12239)]
* **SpeedDETR**: "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 (*Northeastern University*). [[Paper](https://openreview.net/forum?id=5VdcSxrlTK)]
* **AlignDet**: "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2307.11077)][[PyTorch](https://github.com/liming-ai/AlignDet)][[Website](https://liming-ai.github.io/AlignDet/)]
* **Focus-DETR**: "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2307.12612)][[PyTorch](https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR)][[MindSpore](https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR)]
* **Plain-DETR**: "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.01904)][[Code (in construction)](https://github.com/impiga/Plain-DETR)]
* **ASAG**: "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2308.09242)][[PyTorch](https://github.com/iSEE-Laboratory/ASAG)]
* **MIMDet**: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", ICCV, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2204.02964)][[PyTorch](https://github.com/hustvl/MIMDet)]
* **Stable-DINO**: "Detection Transformer with Stable Matching", ICCV, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2304.04742)][[Code (in construction)](https://github.com/IDEA-Research/Stable-DINO)]
* **imTED**: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2205.09613)][[PyTorch](https://github.com/LiewFeng/imTED)]
* **Group-DETR**: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", ICCV, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2207.13085)][[Code (in construction)](https://github.com/Atten4Vis/GroupDETR)]
* **Co-DETR**: "DETRs with Collaborative Hybrid Assignments Training", ICCV, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2211.12860)][[PyTorch](https://github.com/Sense-X/Co-DETR)]
* **DETRDistill**: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2211.10156)]
* **Decoupled-DETR**: "Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection", ICCV, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2310.15955)]
* **StageInteractor**: "StageInteractor: Query-based Object Detector with Cross-stage Interaction", ICCV, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2304.04978)]
* **Rank-DETR**: "Rank-DETR for High Quality Object Detection", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2310.08854)][[PyTorch](https://github.com/LeapLabTHU/Rank-DETR)]
* **Cal-DETR**: "Cal-DETR: Calibrated Detection Transformer", NeurIPS, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2311.03570)][[PyTorch](https://github.com/akhtarvision/cal-detr)]
* **KS-DETR**: "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 (*Toyota Technological Institute*). [[Paper](https://arxiv.org/abs/2302.11208)][[PyTorch](https://github.com/edocanonymous/KS-DETR)]
* **FeatAug-DETR**: "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.01503)][[Code (in construction)](https://github.com/rongyaofang/FeatAug-DETR)]
* **RT-DETR**: "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2304.08069)]
* **Align-DETR**: "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2304.07527)][[PyTorch](https://github.com/FelixCaae/AlignDETR)]
* **Box-DETR**: "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2307.08353)][[PyTorch (in construction)](https://github.com/tiny-smart/box-detr)]
* **RefineBox**: "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2307.11828)][[Code (in construction)](https://github.com/YiqunChen1999/RefineBox)]
* **?**: "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 (*Toronto*). [[Paper](https://arxiv.org/abs/2308.01300)]
* **Gen2Det**: "Gen2Det: Generate to Detect", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.04566)]
* **ViT-CoMer**: "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions", CVPR, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2403.07392)][[PyTorch](https://github.com/Traffic-X/ViT-CoMer)]
* **Salience-DETR**: "Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement", CVPR, 2024 (*Xi'an Jiaotong University*). [[Paper](https://arxiv.org/abs/2403.16131)][[PyTorch](https://github.com/xiuqhou/Salience-DETR)]
* **MS-DETR**: "MS-DETR: Efficient DETR Training with Mixed Supervision", arXiv, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2401.03989)][[Code (in construction)](https://github.com/Atten4Vis/MS-DETR)]
* Transformer-based backbone:
* **ViT-FRCNN**: "Toward Transformer-Based Object Detection", arXiv, 2020 (*Pinterest*). [[Paper](https://arxiv.org/abs/2012.09958)]
* **WB-DETR**: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (*CAS*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_WB-DETR_Transformer-Based_Detector_Without_Backbone_ICCV_2021_paper.html)]
* **YOLOS**: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (*Horizon Robotics*). [[Paper](https://arxiv.org/abs/2106.00666)][[PyTorch](https://github.com/hustvl/YOLOS)]
* **?**: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2111.11429)]
* **ViDT**: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (*NAVER*). [[Paper](https://arxiv.org/abs/2110.03921)][[PyTorch](https://github.com/naver-ai/vidt)]
* **FP-DETR**: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (*USTC*). [[Paper](https://openreview.net/forum?id=yjMQuLLcGWK)]
* **DETR++**: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2206.02977)]
* **ViTDet**: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2203.16527)]
* **UViT**: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2112.09747)]
* **CFDT**: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 (*Huawei*). [[Paper](https://openreview.net/forum?id=iuW96ssPQX)]
* **D2ETR**: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2203.00860)]
* **DINO**: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2203.03605)][[PyTorch](https://github.com/IDEACVR/DINO)]
* **SimPLR**: "SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation", arXiv, 2023 (*UvA*). [[Paper](https://arxiv.org/abs/2310.05920)]
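
Despite the many variants above, the DETR family shares a common interface: an image passes through a backbone and a transformer encoder-decoder, and a fixed set of object queries is decoded into class logits and normalized boxes, typically without NMS. Below is a minimal inference sketch using the original DETR's published torch.hub entry point (from the facebookresearch/detr repository linked above); the image path is a placeholder and the pretrained weights are downloaded on first use.

```python
import torch
from PIL import Image
import torchvision.transforms as T

# Original DETR (ResNet-50 backbone) via the repository's torch.hub entry point.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = transform(image).unsqueeze(0)            # [1, 3, H, W]

with torch.no_grad():
    outputs = model(inputs)

# 100 object queries -> class logits [1, 100, 92] and normalized (cx, cy, w, h) boxes [1, 100, 4].
probs = outputs["pred_logits"].softmax(-1)[0, :, :-1]  # drop the trailing "no object" class
keep = probs.max(-1).values > 0.9                      # simple confidence threshold
print(outputs["pred_boxes"][0, keep])
```

Most of the later detectors in this list keep this input/output contract and instead change how the queries are initialized, matched, or supervised.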

[[Back to Overview](#overview)]

### 3D Object Detection
* **AST-GRU**: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (*Baidu*). [[Paper](https://arxiv.org/abs/2004.01389)][[Code (in construction)](https://github.com/yinjunbo/3DVID)]
* **Pointformer**: "3D Object Detection with Pointformer", arXiv, 2020 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2012.11409)]
* **CT3D**: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2108.10723)][[Code (in construction)](https://github.com/hlsheng1/CT3D)]
* **Group-Free-3D**: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2104.00678)][[PyTorch](https://github.com/zeliu98/Group-Free-3D)]
* **VoTr**: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (*CUHK + NUS*). [[Paper](https://arxiv.org/abs/2109.02497)]
* **3DETR**: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2109.08141)][[PyTorch](https://github.com/facebookresearch/3detr)][[Website](https://facebookresearch.github.io/3detr/)]
* **DETR3D**: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (*MIT*). [[Paper](https://arxiv.org/abs/2110.06922)]
* **M3DETR**: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (*University of Maryland*). [[Paper](https://arxiv.org/abs/2104.11896)][[PyTorch](https://github.com/rayguan97/M3DETR)]
* **MonoDTR**: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (*NTU*). [[Paper](https://arxiv.org/abs/2203.10981)][[Code (in construction)](https://github.com/kuanchihhuang/MonoDTR)]
* **VoxSeT**: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2203.10314)][[PyTorch](https://github.com/skyhehe123/VoxSeT)]
* **TransFusion**: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2203.11496)][[PyTorch](https://github.com/XuyangBai/TransFusion)]
* **CAT-Det**: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2204.00325)]
* **TokenFusion**: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2204.08721)]
* **SST**: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2112.06375)][[PyTorch](https://github.com/TuSimple/SST)]
* **LIFT**: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_LIFT_Learning_4D_LiDAR_Image_Fusion_Transformer_for_3D_Object_CVPR_2022_paper.html)]
* **BoxeR**: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2111.13087)][[PyTorch](https://github.com/kienduynguyen/BoxeR)]
* **BrT**: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2210.01391)]
* **VISTA**: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2203.09704)][[PyTorch](https://github.com/Gorilla-Lab-SCUT/VISTA)]
* **STRL**: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (*Bosch*). [[Paper](https://drive.google.com/file/d/1_2RedCoqCH4cM6J-TOy18nevVd9RTr8c/view)]
* **MTrans**: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (*HKU*). [[Paper](https://arxiv.org/abs/2207.09805)][[PyTorch](https://github.com/Cliu2/MTrans)]
* **CenterFormer**: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (*TuSimple*). [[Paper](https://arxiv.org/abs/2209.05588)][[Code (in construction)](https://github.com/TuSimple/centerformer)]
* **BUTD-DETR**: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (*CMU*). [[Paper](https://arxiv.org/abs/2112.08879)][[PyTorch](https://github.com/nickgkan/butd_detr)][[Website](https://butd-detr.github.io/)]
* **SpatialDETR**: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (*Mercedes-Benz*). [[Paper](https://markus-enzweiler.de/downloads/publications/ECCV2022-spatial_detr.pdf)][[PyTorch](https://github.com/cgtuebingen/SpatialDETR)]
* **CramNet**: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (*Waymo*). [[Paper](https://arxiv.org/abs/2210.09267)]
* **SWFormer**: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (*Waymo*). [[Paper](https://arxiv.org/abs/2210.07372)]
* **EMMF-Det**: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (*Hikvision*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/6955_ECCV_2022_paper.php)]
* **UVTR**: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2206.00630)][[PyTorch](https://github.com/dvlab-research/UVTR)]
* **MsSVT**: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 (*Beijing Institute of Technology*). [[Paper](https://openreview.net/forum?id=hOVEBHpHrMu)][[PyTorch](https://github.com/dscdyc/MsSVT)]
* **DeepInteraction**: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2208.11112)][[PyTorch](https://github.com/fudan-zvg/DeepInteraction)]
* **PETR**: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2203.05625)]
* **Graph-DETR3D**: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2204.11582)]
* **PolarFormer**: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2206.15398)][[Code (in construction)](https://github.com/fudan-zvg/PolarFormer)]
* **AST-GRU**: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2207.12659)]
* **SEFormer**: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2209.01745)]
* **CRAFT**: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2209.06535)]
* **CrossDTR**: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (*NTU*). [[Paper](https://arxiv.org/abs/2209.13507)][[Code (in construction)](https://github.com/sty61010/CrossDTR)]
* **Focal-PETR**: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2212.05505)]
* **Li3DeTr**: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 (*University of Coimbra, Portugal*). [[Paper](https://arxiv.org/abs/2210.15365)]
* **PiMAE**: "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2303.08129)][[PyTorch](https://github.com/BLVLab/PiMAE)]
* **OcTr**: "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2303.12621)]
* **MonoATT**: "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2303.13018)]
* **PVT-SSD**: "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2305.06621)][[Code (in construction)](https://github.com/Nightmare-n/PVT-SSD)]
* **ConQueR**: "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2212.07289)][[PyTorch](https://github.com/poodarchu/EFG)][[Website](https://benjin.me/projects/2022_conquer/)]
* **FrustumFormer**: "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2301.04467)][[PyTorch (in construction)](https://github.com/Robertwyq/Frustum)]
* **DSVT**: "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2301.06051)][[PyTorch](https://github.com/Haiyang-W/DSVT)]
* **AShapeFormer**: "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 (*Hunan University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Li_AShapeFormer_Semantics-Guided_Object-Level_Active_Shape_Encoding_for_3D_Object_Detection_CVPR_2023_paper.html)][[Code (in construction)](https://github.com/ZechuanLi/AShapeFormer)]
* **MV-JAR**: "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.13510)][[Code (in construction)](https://github.com/SmartBot-PJLab/MV-JAR)]
* **FocalFormer3D**: "FocalFormer3D: Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2308.04556)][[PyTorch](https://github.com/NVlabs/FocalFormer3D)]
* **3DPPE**: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", ICCV, 2023 (*Houmo AI, China*). [[Paper](https://arxiv.org/abs/2211.14710)][[PyTorch](https://github.com/drilistbox/3DPPE)]
* **PARQ**: "Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection", ICCV, 2023 (*Northeastern*). [[Paper](https://arxiv.org/abs/2310.01401)][[PyTorch](https://github.com/ymingxie/parq)][[Website](https://ymingxie.github.io/parq/)]
* **CMT**: "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection", ICCV, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2301.01283)][[PyTorch](https://github.com/junjie18/CMT)]
* **MonoDETR**: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", ICCV, 2023 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2203.13310)][[PyTorch](https://github.com/ZrrSkywalker/MonoDETR)]
* **DTH**: "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", ICCV, 2023 (*Cruise*). [[Paper](https://arxiv.org/abs/2303.05078)]
* **PETRv2**: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", ICCV, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2206.01256)][[PyTorch](https://github.com/megvii-research/PETR)]
* **MV2D**: "Object as Query: Lifting any 2D Object Detector to 3D Detection", ICCV, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2301.02364)]
* **?**: "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2308.04054)]
* **Uni3DETR**: "Uni3DETR: Unified 3D Detection Transformer", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2310.05699)][[PyTorch](https://github.com/zhenyuw16/Uni3DETR)]
* **Diffusion-SS3D**: "Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection", NeurIPS, 2023 (*NYCU*). [[Paper](https://arxiv.org/abs/2312.02966)][[PyTorch](https://github.com/luluho1208/Diffusion-SS3D)]
* **STEMD**: "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2307.00347)][[Code (in construction)](https://github.com/Eaphan/STEMD)]
* **V-DETR**: "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.04409)][[Code (in construction)](https://github.com/yichaoshen-MS/V-DETR)]
* **3DiffTection**: "3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features", arXiv, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2311.04391)][[Code (in construction)](https://github.com/nv-tlabs/3DiffTection)][[Website](https://research.nvidia.com/labs/toronto-ai/3difftection/)]
* **PTT**: "PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection", arXiv, 2023 (*UC Merced*). [[Paper](https://arxiv.org/abs/2312.08371)][[Code (in construction)](https://github.com/kuanchihhuang/PTT)]
* **Point-DETR3D**: "Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection", AAAI, 2024 (*USTC*). [[Paper](https://arxiv.org/abs/2403.15317)]
* **MixSup**: "MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection", ICLR, 2024 (*CAS*). [[Paper](https://arxiv.org/abs/2401.16305)][[PyTorch](https://github.com/BraveGroup/PointSAM-for-MixSup)]
* **QAF2D**: "Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors", CVPR, 2024 (*Nullmax, China*). [[Paper](https://arxiv.org/abs/2403.06093)]
* **ScatterFormer**: "ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention", arXiv, 2024 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2401.00912)][[Code (in construction)](https://github.com/skyhehe123/ScatterFormer)]
* **MsSVT++**: "MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection", arXiv, 2024 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2401.11718)][[PyTorch](https://github.com/dscdyc/MsSVT)]
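
Many of the query-based 3D detectors above (e.g., 3DETR, DETR3D, PETR, CMT) inherit DETR's set-prediction training recipe: a fixed set of learned queries is matched one-to-one to the ground-truth boxes with the Hungarian algorithm before the loss is computed, and unmatched queries are supervised as "no object". The toy sketch below shows only that matching step, assuming SciPy is available; the random cost matrix stands in for the real classification + box-regression costs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 6, 3
# In a real DETR-style detector the cost combines class probability and box
# distances (e.g., L1 + generalized IoU); random values stand in here.
cost = np.random.rand(num_queries, num_gt)

query_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one assignment minimizing total cost
for q, g in zip(query_idx, gt_idx):
    print(f"query {q} -> ground-truth box {g} (cost {cost[q, g]:.3f})")
# The remaining (unmatched) queries are trained to predict the "no object" class.
```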

[[Back to Overview](#overview)]

### Multi-Modal Detection
* **OVR-CNN**: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (*Snap*). [[Paper](https://arxiv.org/abs/2011.10678)][[PyTorch](https://github.com/alirezazareian/ovr-cnn)]
* **MDETR**: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (*NYU*). [[Paper](https://arxiv.org/abs/2104.12763)][[PyTorch](https://github.com/ashkamath/mdetr)][[Website](https://ashkamath.github.io/mdetr_page/)]
* **FETNet**: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (*Tsinghua*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/1400.pdf)]
* **MEDUSA**: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (*Google*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0568.pdf)][[PyTorch](https://github.com/songhwanjun/MEDUSA)]
* **StrucTexT**: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2108.02923)]
* **MAVL**: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2111.11430)][[PyTorch](https://github.com/mmaaz60/mvits_for_class_agnostic_od)]
* **OWL-ViT**: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2205.06230)][[JAX](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)][[Hugging Face](https://huggingface.co/docs/transformers/model_doc/owlvit)]
* **X-DETR**: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2204.05626)]
* **simCrossTrans**: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (*The City University of New York*). [[Paper](https://arxiv.org/abs/2203.10456)][[PyTorch](https://github.com/liketheflower/simCrossTrans)]
* **?**: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (*USC*). [[Paper](https://arxiv.org/abs/2206.09592)]
* **YONOD**: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (*CUNY*). [[Paper](https://arxiv.org/abs/2207.01071)][[PyTorch](https://github.com/liketheflower/YONOD)]
* **OmDet**: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (*Binjiang Institute of Zhejiang University*). [[Paper](https://arxiv.org/abs/2209.05946)]
* **ContFormer**: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2210.02953)]
* **DQ-DETR**: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 (*International Digital Economy Academy (IDEA)*). [[Paper](https://arxiv.org/abs/2211.15516)][[Code (in construction)](https://github.com/IDEA-Research/DQ-DETR)]
* **F-VLM**: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2209.15639)][[Website](https://sites.google.com/view/f-vlm)]
* **OV-3DET**: "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2304.00788)][[PyTorch](https://github.com/lyhdet/OV-3DET)]
* **Detection-Hub**: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 (*Fudan + Microsoft*). [[Paper](https://arxiv.org/abs/2206.03484)]
* **OmniLabel**: "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", ICCV, 2023 (*NEC*). [[Paper](https://arxiv.org/abs/2304.11463)][[GitHub](https://github.com/samschulter/omnilabeltools)][[Website](https://www.omnilabel.org/)]
* **MM-OVOD**: "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2306.05493?s=31)][[Code (in construction)](https://github.com/prannaykaul/mm-ovod)][[Website](https://www.robots.ox.ac.uk/~vgg/research/mm-ovod/)]
* **CoDA**: "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection", NeurIPS, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2310.02960)][[PyTorch](https://github.com/yangcaoai/CoDA_NeurIPS2023)][[Website](https://yangcaoai.github.io/publications/CoDA.html)]
* **ContextDET**: "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2305.18279)][[Code (in construction)](https://github.com/yuhangzang/ContextDET)][[Website](https://www.mmlab-ntu.com/project/contextdet/index.html)]
* **Object2Scene**: "Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection", arXiv, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2309.09456)]
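
A recurring pattern among the multi-modal and open-vocabulary detectors above is querying the detector with free-form text instead of a fixed label set. Below is a minimal sketch of that usage with OWL-ViT through its Hugging Face interface (linked in the OWL-ViT entry above); the image path and text prompts are placeholders, while the checkpoint name and post-processing call follow the transformers documentation example for OWL-ViT.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")    # placeholder image path
texts = [["a photo of a cat", "a photo of a dog"]]  # one list of text queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the normalized predicted boxes to the original image size and filter by score.
target_sizes = torch.tensor([image.size[::-1]])     # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][int(label)], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```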

[[Back to Overview](#overview)]

### HOI Detection
* **HOI-Transformer**: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2103.04503)][[PyTorch](https://github.com/bbepoch/HoiTransformer)]
* **HOTR**: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (*Kakao + Korea University*). [[Paper](https://arxiv.org/abs/2104.13682)][[PyTorch](https://github.com/kakaobrain/HOTR)]
* **MSTR**: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (*Kakao*). [[Paper](https://arxiv.org/abs/2203.14709)]
* **SSRT**: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2204.00746)]
* **CPC**: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2204.04836)][[PyTorch (in construction)](https://github.com/mlvlab/CPChoi)]
* **DisTR**: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2204.09290)]
* **STIP**: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2206.06291)][[PyTorch](https://github.com/zyong812/STIP)]
* **DOQ**: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (*South China University of Technology*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Qu_Distillation_Using_Oracle_Queries_for_Transformer-Based_Human-Object_Interaction_Detection_CVPR_2022_paper.html)]
* **UPT**: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (*Australian Centre for Robotic Vision*). [[Paper](https://arxiv.org/abs/2112.01838)][[PyTorch](https://github.com/fredzzhang/upt)][[Website](https://fredzzhang.com/unary-pairwise-transformers/)]
* **CATN**: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2204.04911)]
* **GEN-VLKT**: "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2203.13954)][[PyTorch](https://github.com/YueLiao/gen-vlkt)]
* **HQM**: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2207.05293)][[PyTorch](https://github.com/MuchHair/HQM)]
* **Iwin**: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2203.10537)]
* **RLIP**: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2209.01814)][[PyTorch](https://github.com/JacobYuan7/RLIP)]
* **TUTOR**: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2206.01908)]
* **?**: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2210.05668)][[PyTorch](https://github.com/Yang-Li-2000/Understanding-Embodied-Reference-with-Touch-Line-Transformer)]
* **?**: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 (*KU Leuven*). [[Paper](https://openreview.net/forum?id=resApVNcqSB)]
* **HOICLIP**: "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2303.15786)][[Code (in construction)](https://github.com/Artanic30/HOICLIP)]
* **ViPLO**: "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 (*mAy-I, Korea*). [[Paper](https://arxiv.org/abs/2304.08114)][[PyTorch](https://github.com/Jeeseung-Park/ViPLO)]
* **OpenCat**: "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 (*Renmin University of China*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Zheng_Open-Category_Human-Object_Interaction_Pre-Training_via_Language_Modeling_Framework_CVPR_2023_paper.html)]
* **CQL**: "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2303.14005)][[Code (in construction)](https://github.com/charles-xie/CQL)]
* **RmLR**: "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 (*Southeast University, China*). [[Paper](https://arxiv.org/abs/2307.13529)]
* **PViC**: "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.06202)][[PyTorch](https://github.com/fredzzhang/pvic)]
* **AGER**: "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2308.08370)][[Code (in construction)](https://github.com/six6607/AGER)]
* **RLIPv2**: "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.09351)][[PyTorch](https://github.com/JacobYuan7/RLIPv2)]
* **EgoPCA**: "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2309.02423)][[Website](https://mvig-rhos.com/ego_pca)]
* **UniHOI**: "Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models", NeurIPS, 2023 (*Southeast University*). [[Paper](https://arxiv.org/abs/2311.03799)][[Code (in construction)](https://github.com/Caoyichao/UniHOI)]
* **LogicHOI**: "Neural-Logic Human-Object Interaction Detection", NeurIPS, 2023 (*University of Technology Sydney*). [[Paper](https://arxiv.org/abs/2311.09817)][[Code (in construction)](https://github.com/weijianan1/LogicHOI)]
* **?**: "Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels", arXiv, 2023 (*KU Leuven*). [[Paper](https://arxiv.org/abs/2309.05069)]
* **DP-HOI**: "Disentangled Pre-training for Human-Object Interaction Detection", CVPR, 2024 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2404.01725)][[Code (in construction)](https://github.com/xingaoli/DP-HOI)]
* **HOI-Ref**: "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision", arXiv, 2024 (*University of Bristol, UK*). [[Paper](https://arxiv.org/abs/2404.09933)][[PyTorch](https://github.com/Sid2697/HOI-Ref)][[Website](https://sid2697.github.io/hoi-ref/)]

[[Back to Overview](#overview)]

### Salient Object Detection
* **VST**: "Visual Saliency Transformer", ICCV, 2021 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2104.12099)]
* **?**: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2112.13528)]
* **SwinNet**: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (*Anhui University*). [[Paper](https://arxiv.org/abs/2204.05585)][[Code](https://github.com/liuzywen/SwinNet)]
* **SOD-Transformer**: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2104.10127)]
* **GLSTR**: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2108.02759)]
* **TriTransNet**: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (*Anhui University*). [[Paper](https://arxiv.org/abs/2108.03990)]
* **AbiU-Net**: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (*Nankai University*). [[Paper](https://arxiv.org/abs/2108.07851)]
* **TranSalNet**: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (*Cardiff University, UK*). [[Paper](https://arxiv.org/abs/2110.03593)]
* **DFTR**: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2203.06429)]
* **GroupTransNet**: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (*Nankai University*). [[Paper](https://arxiv.org/abs/2203.10785)]
* **SelfReformer**: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2205.11283)]
* **DTMINet**: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2206.03105)]
* **MCNet**: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2207.03558)][[PyTorch](https://github.com/jxr326/SwinMCNet)]
* **SiaTrans**: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (*Shandong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2207.04224)]
* **PSFormer**: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 (*Nanjing University of Aeronautics and Astronautics*). [[Paper](https://arxiv.org/abs/2210.15933)]
* **RMFormer**: "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2308.03826)]

[[Back to Overview](#overview)]

### Other Detection Tasks
* X-supervised:
* **LOST**: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (*Valeo.ai*). [[Paper](https://arxiv.org/abs/2109.14279)][[PyTorch](https://github.com/valeoai/LOST)]
* **Omni-DETR**: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2203.16089)][[PyTorch](https://github.com/amazon-research/omni-detr)]
* **TokenCut**: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (*Univ. Grenoble Alpes, France*). [[Paper](https://arxiv.org/abs/2202.11539)][[PyTorch](https://github.com/YangtaoWANG95/TokenCut)][[Website](https://www.m-psi.fr/Papers/TokenCut2022/)]
* **WS-DETR**: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.05205)]
* **TRT**: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2208.01838)][[PyTorch](https://github.com/su-hui-zz/ReAttentionTransformer)]
* **TokenCut**: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (*Univ. Grenoble Alpes, France*). [[Paper](https://arxiv.org/abs/2209.00383)][[PyTorch](https://github.com/YangtaoWANG95/TokenCut)][[Website](https://www.m-psi.fr/Papers/TokenCut2022/)]
* **Semi-DETR**: "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2307.08095)][[Paddle (in construction)](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/semi_det/semi_detr)][[PyTorch (JCZ404)](https://github.com/JCZ404/Semi-DETR)]
* **MoTok**: "Object Discovery from Motion-Guided Tokens", CVPR, 2023 (*Toyota*). [[Paper](https://arxiv.org/abs/2303.15555)][[PyTorch](https://github.com/zpbao/MoTok/)][[Website](https://zpbao.github.io/projects/MoTok/)]
* **CutLER**: "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.11320)][[PyTorch](https://github.com/facebookresearch/CutLER)][[Website](http://people.eecs.berkeley.edu/~xdwang/projects/CutLER/)]
* **ISA-TS**: "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2302.04973)]
* **MOST**: "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.05387)][[PyTorch](https://github.com/rssaketh/MOST/)][[Website](https://rssaketh.github.io/most)]
* **GenPromp**: "Generative Prompt Model for Weakly Supervised Object Localization", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2307.09756)][[PyTorch](https://github.com/callsys/GenPromp)]
* **SAT**: "Spatial-Aware Token for Weakly Supervised Object Localization", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2303.10438)][[PyTorch](https://github.com/wpy1999/SAT)]
* **ALWOD**: "ALWOD: Active Learning for Weakly-Supervised Object Detection", ICCV, 2023 (*Rutgers*). [[Paper](https://arxiv.org/abs/2309.07914)][[Code (in construction)](https://github.com/seqam-lab/ALWOD)]
* **HASSOD**: "HASSOD: Hierarchical Adaptive Self-Supervised Object Detection", NeurIPS, 2023 (*UIUC*). [[Paper](https://github.com/Shengcao-Cao/HASSOD)][[PyTorch](https://github.com/Shengcao-Cao/HASSOD)][[Website](https://hassod-neurips23.github.io/)]
* **SeqCo-DETR**: "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2303.08481)]
* **R-MAE**: "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2306.05411)]
* **SimDETR**: "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2307.15697)]
* **U2Seg**: "Unsupervised Universal Image Segmentation", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2312.17243)][[PyTorch](https://github.com/u2seg/U2Seg)]
* **CuVLER**: "CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers", CVPR, 2024 (*Technion - Israel Institute of Technology*). [[Paper](https://arxiv.org/abs/2403.07700)][[PyTorch](https://github.com/shahaf-arica/CuVLER)]
* **Sparse-Semi-DETR**: "Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection", CVPR, 2024 (*DFKI, Germany*). [[Paper](https://arxiv.org/abs/2404.01819)]
* X-Shot Object Detection:
* **AIT**: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (*Academia Sinica*). [[Paper](https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Adaptive_Image_Transformer_for_One-Shot_Object_Detection_CVPR_2021_paper.html)]
* **Meta-DETR**: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (*NTU Singapore*). [[Paper](https://arxiv.org/abs/2103.11731)][[PyTorch](https://github.com/ZhangGongjie/Meta-DETR)]
* **CAT**: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2104.14984)]
* **FCT**: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (*Columbia*). [[Paper](https://arxiv.org/abs/2203.15021)]
* **SaFT**: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.09093)]
* **TENET**: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (*ANU*). [[Paper](https://arxiv.org/abs/2210.16897)][[PyTorch](https://github.com/ZS123-lang/TENET)]
* **Meta-DETR**: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2208.00219)]
* **Incremental-DETR**: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (*NUS*). [[Paper](https://arxiv.org/abs/2205.04042)]
* **FS-DETR**: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", ICCV, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2210.04845)]
* **Meta-ZSDETR**: "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.09540)]
* **?**: "Revisiting Few-Shot Object Detection with Vision-Language Models", arXiv, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2312.14494)]
* Open-World/Vocabulary:
* **OW-DETR**: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (*IIAI*). [[Paper](https://arxiv.org/abs/2112.01513)][[PyTorch](https://github.com/akshitac8/OW-DETR)]
* **DetPro**: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2203.14940)][[PyTorch](https://github.com/dyabel/detpro)]
* **RegionCLIP**: "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2112.09106)][[PyTorch](https://github.com/microsoft/RegionCLIP)]
* **PromptDet**: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2203.16513)][[PyTorch](https://github.com/fcjian/PromptDet)][[Website](https://fcjian.github.io/promptdet/)]
* **OV-DETR**: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2203.11876)]
* **VL-PLM**: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (*Rutgers University*). [[Paper](https://arxiv.org/abs/2207.08954)][[PyTorch](https://github.com/xiaofeng94/VL-PLM)][[Website](https://www.nec-labs.com/~mas/VL-PLM/)]
* **DetCLIP**: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2209.09407)]
* **WWbL**: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 (*Tel-Aviv*). [[Paper](https://arxiv.org/abs/2206.09358)][[PyTorch](https://github.com/talshaharabany/what-is-where-by-looking)][[Demo](https://replicate.com/talshaharabany/what-is-where-by-looking)]
* **P3OVD**: "P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2211.00849)]
* **Open-World-DETR**: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 (*NUS*). [[Paper](https://arxiv.org/abs/2212.02969)]
* **BARON**: "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2302.13996)][[PyTorch](https://github.com/wusize/ovdet)]
* **CapDet**: "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2303.02489)]
* **CORA**: "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.13076)][[PyTorch](https://github.com/tgxs002/CORA)]
* **UniDetector**: "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2303.11749)][[PyTorch](https://github.com/zhenyuw16/UniDetector)]
* **DetCLIPv2**: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2304.04514)]
* **RO-ViT**: "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2305.07011)]
* **CAT**: "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 (*Northeast University, China*). [[Paper](https://arxiv.org/abs/2301.01970)][[PyTorch](https://github.com/xiaomabufei/CAT)]
* **CondHead**: "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 (*Sichuan University*). [[Paper](https://arxiv.org/abs/2212.12130)]
* **OADP**: "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2303.05892)][[PyTorch](https://github.com/LutingWang/OADP)]
* **OVAD**: "Open-vocabulary Attribute Detection", CVPR, 2023 (*University of Freiburg, Germany*). [[Paper](https://arxiv.org/abs/2211.12914)][[Website](https://ovad-benchmark.github.io/)]
* **OvarNet**: "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 (*Xiaohongshu*). [[Paper](https://arxiv.org/abs/2301.09506)][[Website](https://kyanchen.github.io/OvarNet/)][[PyTorch](https://github.com/KyanChen/OvarNet)]
* **ALLOW**: "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Ma_Annealing-Based_Label-Transfer_Learning_for_Open_World_Object_Detection_CVPR_2023_paper.html)][[PyTorch](https://github.com/DIG-Beihang/ALLOW)]
* **PROB**: "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2212.01424)][[PyTorch](https://github.com/orrzohar/PROB)][[Website](https://orrzohar.github.io/projects/prob/)]
* **RandBox**: "Random Boxes Are Open-world Object Detectors", ICCV, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2307.08249)][[PyTorch](https://github.com/scuwyh2000/RandBox)]
* **Cascade-DETR**: "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 (*ETHZ + HKUST*). [[Paper](https://arxiv.org/abs/2307.11035)][[PyTorch](https://github.com/SysCV/cascade-detr)]
* **EdaDet**: "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2309.01151)][[Website](https://chengshiest.github.io/edadet/)]
* **V3Det**: "V3Det: Vast Vocabulary Visual Detection Dataset", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2304.03752)][[GitHub](https://github.com/V3Det/V3Det)][[Website](https://v3det.openxlab.org.cn/)]
* **CoDet**: "CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection", NeurIPS, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2310.16667)][[PyTorch](https://github.com/CVMI-Lab/CoDet)]
* **DAMEX**: "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", NeurIPS, 2023 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2311.04894)][[Code (in construction)](https://github.com/jinga-lala/DAMEX)]
* **OWL-ST**: "Scaling Open-Vocabulary Object Detection", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2306.09683)]
* **MQ-Det**: "Multi-modal Queried Object Detection in the Wild", NeurIPS, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2305.18980)][[PyTorch](https://github.com/YifanXu74/MQ-Det)]
* **Grounding-DINO**: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.05499)]
* **GridCLIP**: "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 (*Queen Mary University of London*). [[Paper](https://arxiv.org/abs/2303.09252)]
* **?**: "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2303.13518)]
* **PCL**: "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 (*Kakao*). [[Paper](https://arxiv.org/abs/2303.13040)]
* **Prompt-OVD**: "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2303.14386)]
* **LOWA**: "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 (*Mineral, California*). [[Paper](https://arxiv.org/abs/2305.20047)]
* **SGDN**: "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2307.03339)]
* **SAS-Det**: "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 (*NEC*). [[Paper](https://arxiv.org/abs/2308.06412)]
* **DE-ViT**: "Detect Every Thing with Few Examples", arXiv, 2023 (*Rutgers*). [[Paper](https://arxiv.org/abs/2309.12969)][[PyTorch](https://github.com/mlzxy/devit)]
* **CLIPSelf**: "CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2310.01403)][[PyTorch](https://github.com/wusize/CLIPSelf)]
* **DST-Det**: "DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2310.01393)][[Code (in construction)](https://github.com/xushilin1/dst-det)]
* **DITO**: "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2310.00161)]
* **RegionSpot**: "Recognize Any Regions", arXiv, 2023 (*University Of Surrey, England*). [[Paper](https://arxiv.org/abs/2311.01373)][[Code (in construction)](https://github.com/Surrey-UPLab/Recognize-Any-Regions)]
* **DECOLA**: "Language-conditioned Detection Transformer", arXiv, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2311.17902)][[PyTorch](https://github.com/janghyuncho/DECOLA)]
* **PLAC**: "Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection", arXiv, 2023 (*Kakao*). [[Paper](https://arxiv.org/abs/2312.02103)]
* **FOMO**: "Open World Object Detection in the Era of Foundation Models", arXiv, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2312.05745)][[Website](https://orrzohar.github.io/projects/fomo/)]
* **LP-OVOD**: "LP-OVOD: Open-Vocabulary Object Detection by Linear Probing", WACV, 2024 (*VinAI, Vietnam*). [[Paper](https://arxiv.org/abs/2310.17109)]
* **ProxyDet**: "ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection", WACV, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2312.07266)]
* **WSOVOD**: "Weakly Supervised Open-Vocabulary Object Detection", AAAI, 2024 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2312.12437)][[Code (in construction)](https://github.com/HunterJ-Lin/WSOVOD)]
* **CLIM**: "CLIM: Contrastive Language-Image Mosaic for Region Representation", AAAI, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2312.11376)][[PyTorch](https://github.com/wusize/CLIM)]
* **SS-OWFormer**: "Semi-supervised Open-World Object Detection", AAAI, 2024 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2402.16013)][[PyTorch](https://github.com/sahalshajim/SS-OWFormer)]
* **DVDet**: "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors", ICLR, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2402.04630)]
* **GenerateU**: "Generative Region-Language Pretraining for Open-Ended Object Detection", CVPR, 2024 (*Monash University*). [[Paper](https://arxiv.org/abs/2403.10191)][[PyTorch](https://github.com/FoundationVision/GenerateU)]
* **DetCLIPv3**: "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection", CVPR, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2404.09216)]
* **RALF**: "Retrieval-Augmented Open-Vocabulary Object Detection", CVPR, 2024 (*Korea University*). [[Paper](https://arxiv.org/abs/2404.05687)][[Code (in construction)](https://github.com/mlvlab/RALF)]
* **SHiNe**: "SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection", CVPR, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2405.10053)]
* **MM-Grounding-DINO**: "An Open and Comprehensive Pipeline for Unified Object Grounding and Detection", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2401.02361)][[PyTorch](https://github.com/open-mmlab/mmdetection/tree/main/configs/grounding_dino)]
* **YOLO-World**: "YOLO-World: Real-Time Open-Vocabulary Object Detection", arXiv, 2024 (*Tencent*). [[Paper](https://arxiv.org/abs/2401.17270)][[Code (in construction)](https://github.com/AILab-CVC/YOLO-World)]
* **T-Rex2**: "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy", arXiv, 2024 (*IDEA*). [[Paper](https://arxiv.org/abs/2403.14610)][[PyTorch](https://github.com/IDEA-Research/T-Rex)][[Website](https://deepdataspace.com/home)]
* **Grounding-DINO-1.5**: "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", arXiv, 2024 (*IDEA*). [[Paper](https://arxiv.org/abs/2405.10300)][[Code](https://github.com/IDEA-Research/Grounding-DINO-1.5-API)]
* Pedestrian Detection:
* **PED**: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2012.06785)][[PyTorch](https://github.com/Hatmm/PED-DETR-for-Pedestrian-Detection)]
* **?**: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 (*ICL*). [[Paper](https://openreview.net/forum?id=eow_ZGaw24j)]
* **Pedestron**: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (*IIAI*). [[Paper](https://arxiv.org/abs/2201.03176)][[PyTorch](https://github.com/hasanirtiza/Pedestron)]
* **VLPD**: "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 (*University of Science and Technology Beijing*). [[Paper](https://arxiv.org/abs/2304.03135)][[PyTorch](https://github.com/lmy98129/VLPD)]
* Lane Detection:
* **LSTR**: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (*Xi'an Jiaotong*). [[Paper](https://arxiv.org/abs/2011.04233)][[PyTorch](https://github.com/liuruijin17/LSTR)]
* **LETR**: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (*UCSD*). [[Paper](https://arxiv.org/abs/2101.01909)][[PyTorch](https://github.com/mlpc-ucsd/LETR)]
* **Laneformer**: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2203.09830)]
* **TLC**: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (*Peking University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Tong_Transformer_Based_Line_Segment_Classifier_With_Image_Context_for_Real-Time_CVPR_2022_paper.html)]
* **PersFormer**: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2203.11089)][[PyTorch](https://github.com/OpenPerceptionX/OpenLane)]
* **MHVA**: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (*Beihang University*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/3918_ECCV_2022_paper.php)]
* **PriorLane**: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (*Zhejiang Lab*). [[Paper](https://arxiv.org/abs/2209.06994)][[PyTorch](https://github.com/vincentqqb/priorlane)]
* **CurveFormer**: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (*NullMax, China*). [[Paper](https://arxiv.org/abs/2209.07989)]
* **LATR**: "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.04583)][[PyTorch](https://github.com/JMoonr/LATR)]
* **O2SFormer**: "End to End Lane detection with One-to-Several Transformer", arXiv, 2023 (*Southeast University, China*). [[Paper](https://arxiv.org/abs/2305.00675)][[PyTorch](https://github.com/zkyseu/O2SFormer)]
* **Lane2Seq**: "Lane2Seq: Towards Unified Lane Detection via Sequence Generation", CVPR, 2024 (*Southeast University, China*). [[Paper](https://arxiv.org/abs/2402.17172)]
* Object Localization:
* **TS-CAM**: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2103.14862)]
* **LCTR**: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2112.05291)]
* **ViTOL**: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (*Mercedes-Benz*). [[Paper](https://arxiv.org/abs/2204.06772)][[PyTorch](https://github.com/Saurav-31/ViTOL)]
* **SCM**: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.10447)][[PyTorch](https://github.com/164140757/SCM)]
* **CaFT**: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2201.00475)]
* **CoW**: "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 (*Columbia*). [[Paper](https://arxiv.org/abs/2203.10421)][[PyTorch](https://github.com/columbia-ai-robotics/cow)][[Website](https://cow.cs.columbia.edu/)]
* **ESC**: "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 (*UCSC*). [[Paper](https://arxiv.org/abs/2301.13166)]
* Relation Detection:
* **PST**: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2105.02170)]
* **PST**: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2105.02170)]
* **TROI**: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2106.03162)]
* **RelTransformer**: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (*KAUST*). [[Paper](https://arxiv.org/abs/2104.11934)][[PyTorch](https://github.com/Vision-CAIR/RelTransformer)]
* **VReBERT**: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (*ANU*). [[Paper](https://arxiv.org/abs/2206.09111)]
* **UniVRD**: "Unified Visual Relationship Detection with Vision and Language Models", ICCV, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2303.08998)][[Code (in construction)](https://github.com/google-research/scenic/tree/main/scenic/projects/univrd)]
* **RECODE**: "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", NeurIPS, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2305.12476)]
* **SG-ViT**: "Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection", arXiv, 2024 (*DeepMind*). [[Paper](https://arxiv.org/abs/2403.14270)]
* Anomaly Detection:
* **VT-ADL**: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (*University of Udine, Italy*). [[Paper](https://arxiv.org/abs/2104.10036)]
* **InTra**: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (*Fujitsu*). [[Paper](https://arxiv.org/abs/2104.13897)]
* **AnoViT**: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2203.10808)]
* **WinCLIP**: "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2303.14814)]
* **M3DM**: "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2303.00601)][[PyTorch](https://github.com/nomewang/M3DM)]
* Cross-Domain:
* **SSTN**: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (*Gwangju Institute of Science and Technology*). [[Paper](https://arxiv.org/abs/2103.03150)]
* **MTTrans**: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2205.01643)]
* **OAA-OTA**: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.14195)]
* **SSTA**: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (*University of Electronic Science and Technology of China*). [[Paper](https://arxiv.org/abs/2206.00222)]
* **DETR-GA**: "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2304.07082)]
* **DA-DETR**: "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2103.17084)]
* **?**: "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 (*EPFL*). [[Paper](https://arxiv.org/abs/2301.05499)][[PyTorch](https://github.com/vidit09/domaingen)]
* **PM-DETR**: "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2307.00313)]
* Co-Salient Object Detection:
* **CoSformer**: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2104.14729)]
* Oriented Object Detection:
* **O2DETR**: "Oriented Object Detection with Transformer", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.03146)]
* **AO2-DETR**: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2205.12785)]
* **ARS-DETR**: "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2303.04989)][[PyTorch](https://github.com/httle/ARS-DETR)]
* **RHINO**: "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 (*SI Analytics*). [[Paper](https://arxiv.org/abs/2305.07598)]
* Multiview Detection:
* **MVDeTr**: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (*ANU*). [[Paper](https://arxiv.org/abs/2108.05888)]
* Polygon Detection:
* **?**: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (*Delft University of Technology, Netherlands*). [[Paper](https://arxiv.org/abs/2108.07533)]
* Drone-view:
* **TPH**: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (*Beihang University*). [[Paper](https://arxiv.org/abs/2108.11539)]
* **TransVisDrone**: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (*UCF*). [[Paper](https://arxiv.org/abs/2210.08423)][[Code (in construction)](https://github.com/tusharsangam/TransVisDrone)]
* Infrared:
* **?**: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (*Chongqing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2109.14379)]
* **MiPa**: "MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection", arXiv, 2024 (*ETS Montreal*). [[Paper](https://arxiv.org/abs/2404.18849)][[Code (in construction)](https://github.com/heitorrapela/MiPa)]
* Text Detection:
* **SwinTextSpotter**: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2203.10209)][[PyTorch](https://github.com/mxin262/SwinTextSpotter)]
* **TESTR**: "Text Spotting Transformers", CVPR, 2022 (*UCSD*). [[Paper](https://arxiv.org/abs/2204.01918)][[PyTorch](https://github.com/mlpc-ucsd/TESTR)]
* **TTS**: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2202.05508)]
* **oCLIP**: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2203.03911)]
* **TransDETR**: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2203.10539)][[PyTorch](https://github.com/weijiawu/TransDETR)]
* **?**: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (*University of Waterloo, Canada*). [[Paper](https://arxiv.org/abs/2202.11221)]
* **?**: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (*University of Science and Technology Beijing*). [[Paper](https://arxiv.org/abs/2205.05320)][[Code (in construction)](https://github.com/GXYM/TextBPN-Plus-Plus)]
* **DPTNet**: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2208.09878)]
* **ATTR**: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2211.13984)]
* **DPText-DETR**: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2207.04491)][[PyTorch](https://github.com/ymy-k/DPText-DETR)]
* **TCM**: "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2302.14338)][[PyTorch](https://github.com/wenwenyu/TCM)]
* **DeepSolo**: "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2211.10772)][[PyTorch](https://github.com/ViTAE-Transformer/DeepSolo)]
* **ESTextSpotter**: "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2308.10147)][[PyTorch](https://github.com/mxin262/ESTextSpotter)]
* **PBFormer**: "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2308.15004)]
* **DeepSolo++**: "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2305.19957)][[PyTorch](https://github.com/ViTAE-Transformer/DeepSolo)]
* **FastTCM**: "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2308.10408)][[PyTorch](https://github.com/wenwenyu/TCM)]
* **SRFormer**: "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2308.10531)]
* **TGA**: "Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis", CVPR, 2024 (*Microsoft*). [[Paper](https://arxiv.org/abs/2405.07481)]
* **SwinTextSpotter-v2**: "SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting", arXiv, 2024 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2401.07641)]
* Change Detection:
* **ChangeFormer**: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (*JHU*). [[Paper](https://arxiv.org/abs/2201.01293)][[PyTorch](https://github.com/wgcban/ChangeFormer)]
* **IDET**: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (*Civil Aviation University of China*). [[Paper](https://arxiv.org/abs/2207.09240)]
* Edge Detection:
* **EDTER**: "EDTER: Edge Detection with Transformer", CVPR, 2022 (*Beijing Jiaotong University*). [[Paper](https://arxiv.org/abs/2203.08566)][[Code (in construction)](https://github.com/MengyangPu/EDTER)]
* **HEAT**: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (*Simon Fraser*). [[Paper](https://arxiv.org/abs/2111.15143)][[PyTorch](https://github.com/woodfrog/heat)][[Website](https://heat-structured-reconstruction.github.io/)]
* Person Search:
* **COAT**: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (*Kitware*). [[Paper](https://arxiv.org/abs/2203.09642)][[PyTorch](https://github.com/Kitware/COAT)]
* **PSTR**: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2204.03340)][[PyTorch](https://github.com/JialeCao001/PSTR)]
* Manipulation Detection:
* **ObjectFormer**: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2203.14681)]
* Mirror Detection:
* **SATNet**: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2207.06332)][[PyTorch](https://github.com/tyhuang0428/SATNet)]
* Shadow Detection:
* **SCOTCH-SODA**: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 (*University of Cambridge*). [[Paper](https://arxiv.org/abs/2211.06885)]
* Keypoint Detection:
* **SalViT**: "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 (*ANU*). [[Paper](https://arxiv.org/abs/2304.03140)]
* Continual Learning:
* **CL-DETR**: "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 (*MPI*). [[Paper](https://arxiv.org/abs/2304.03110)]
* Visual Query Detection/Localization:
* **CocoFormer**: "Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2211.10528)][[PyTorch](https://github.com/facebookresearch/vq2d_cvpr)]
* **VQLoC**: "Single-Stage Visual Query Localization in Egocentric Videos", NeurIPS, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2306.09324)][[PyTorch](https://github.com/hwjiang1510/VQLoC)][[Website](https://hwjiang1510.github.io/VQLoC/)]
* Task-Driven Object Detection:
* **CoTDet**: "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2309.01093)]
* Diffusion:
* **DiffusionEngine**: "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2309.03893)][[PyTorch](https://github.com/bytedance/DiffusionEngine)][[Website](https://mettyz.github.io/DiffusionEngine/)]
* **TADP**: "Text-image Alignment for Diffusion-based Perception", arXiv, 2023 (*CalTech*). [[Paper](https://arxiv.org/abs/2310.00031)][[Website](https://www.vision.caltech.edu/tadp/)]
* **InstaGen**: "InstaGen: Enhancing Object Detection by Training on Synthetic Dataset", arXiv, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2402.05937)][[Code (in construction)](https://github.com/fcjian/InstaGen)][[Website](https://fcjian.github.io/InstaGen/)]

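Most of the open-vocabulary detectors listed in this section (e.g., RegionCLIP, OWL-ST, Grounding-DINO, YOLO-World) expose the same interface at inference time: free-form text prompts in, scored boxes out. Below is a minimal sketch of that interface, assuming the Hugging Face `transformers` zero-shot-object-detection pipeline with a public OWL-ViT-style checkpoint; the model name, image path, prompts, and threshold are illustrative assumptions and are not tied to any specific paper above.

```python
# Minimal open-vocabulary detection sketch: prompt with text, get scored boxes.
# Assumes: pip install transformers torch pillow; checkpoint name and inputs are illustrative.
from PIL import Image
from transformers import pipeline

detector = pipeline(
    task="zero-shot-object-detection",
    model="google/owlvit-base-patch32",  # any OWL-ViT-style checkpoint works here
)

image = Image.open("street.jpg")  # placeholder input image
prompts = ["a pedestrian", "a traffic light", "a bicycle"]  # free-form, open-vocabulary queries

for det in detector(image, candidate_labels=prompts, threshold=0.1):
    box = det["box"]  # dict with pixel coords: xmin, ymin, xmax, ymax
    print(det["label"], round(det["score"], 2),
          (box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
```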
[[Back to Overview](#overview)]

## Segmentation
### Semantic Segmentation
* **SETR**: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2012.15840)][[PyTorch](https://github.com/fudan-zvg/SETR)][[Website](https://fudan-zvg.github.io/SETR/)]
* **TrSeg**: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (*Korea University*). [[Paper](https://www.sciencedirect.com/science/article/abs/pii/S016786552100163X)][[PyTorch](https://github.com/youngsjjn/TrSeg)]
* **CWT**: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2108.03032)][[PyTorch](https://github.com/zhiheLu/CWT-for-FSS)]
* **Segmenter**: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (*INRIA*). [[Paper](https://arxiv.org/abs/2105.05633)][[PyTorch](https://github.com/rstrudel/segmenter)]
* **UN-EPT**: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2107.14209)][[PyTorch](https://github.com/amazon-research/unified-ept)]
* **FTN**: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.04108)]
* **SegFormer**: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2105.15203)][[PyTorch](https://github.com/NVlabs/SegFormer)]
* **MaskFormer**: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 (*UIUC + Facebook*). [[Paper](https://arxiv.org/abs/2107.06278)][[Website](https://bowenc0221.github.io/maskformer/)]
* **OffRoadTranSeg**: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (*IISER, India*). [[Paper](https://arxiv.org/abs/2106.13963)]
* **TRFS**: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (*ETHZ*). [[Paper](https://arxiv.org/abs/2108.02266)]
* **Flying-Guide-Dog**: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (*KIT, Germany*). [[Paper](https://arxiv.org/abs/2108.07007)][[Code (in construction)](https://github.com/EckoTan0804/flying-guide-dog)]
* **VSPW**: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (*Xiaomi*). [[Paper](https://arxiv.org/abs/2109.01316)]
* **SDTP**: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (*?*). [[Paper](https://arxiv.org/abs/2109.08963)]
* **TopFormer**: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2204.05525)][[PyTorch](https://github.com/hustvl/TopFormer)]
* **HRViT**: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2111.01236)][[PyTorch](https://github.com/facebookresearch/HRViT)]
* **GReaT**: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2209.09545)]
* **SegDeformer**: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (*Shanghai Jiao Tong + Huawei*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/383_ECCV_2022_paper.php)][[PyTorch](https://github.com/lygsbw/segdeformer)]
* **PAUMER**: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 (*Idiap, Switzerland*). [[Paper](https://bmvc2022.mpi-inf.mpg.de/737/)]
* **SegViT**: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2210.05844)][[PyTorch](https://github.com/zbwxp/SegVit)]
* **RTFormer**: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2210.07124)][[Paddle](https://github.com/PaddlePaddle/PaddleSeg)]
* **SegNeXt**: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2209.08575)]
* **Lawin**: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2201.01615)][[PyTorch](https://github.com/yan-hao-tian/lawin)]
* **PFT**: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (*CUHK + SenseTime*). [[Paper](https://arxiv.org/abs/2201.04019)]
* **DFlatFormer**: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (*OPPO*). [[Paper](https://arxiv.org/abs/2201.09139)]
* **FeSeFormer**: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2203.14124)]
* **StructToken**: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2203.12612)]
* **HILA**: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (*University of Toronto*). [[Paper](https://arxiv.org/abs/2207.02126)][[Website](https://www.cs.toronto.edu/~garyleung/hila/)][[PyTorch](https://github.com/fidler-lab/hila)]
* **HLG**: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2207.09339)][[PyTorch](https://github.com/fudan-zvg/SETR)]
* **SSformer**: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (*Nanjing University of Aeronautics and Astronautics*). [[Paper](https://arxiv.org/abs/2208.02034)][[PyTorch](https://github.com/shiwt03/SSformer)]
* **NamedMask**: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2209.11228)][[PyTorch](https://github.com/NoelShin/namedmask)][[Website](https://www.robots.ox.ac.uk/~vgg/research/namedmask/)]
* **IncepFormer**: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 (*Nanjing University of Aeronautics and Astronautics*). [[Paper](https://arxiv.org/abs/2212.03035)][[PyTorch](https://github.com/shendu0321/IncepFormer)]
* **SeaFormer**: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.13156)]
* **PPL**: "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 (*Yonsei*). [[Paper](https://arxiv.org/abs/2304.00779)]
* **AFF**: "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2304.12406)]
* **CTS**: "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 (*Eindhoven University of Technology, Netherlands*). [[Paper](https://arxiv.org/abs/2306.02095)][[PyTorch](https://github.com/tue-mps/cts-segmenter)][[Website](https://tue-mps.github.io/CTS/)]
* **TSG**: "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2205.07056)]
* **FASeg**: "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2204.01244)][[PyTorch](https://github.com/ziplab/FASeg)]
* **HFD-BSD**: "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2307.12574)]
* **DToP**: "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", ICCV, 2023 (*South China University of Technology + The University of Adelaide*). [[Paper](https://arxiv.org/abs/2308.01045)]
* **FreeMask**: "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models", NeurIPS, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2310.15160)][[PyTorch](https://github.com/LiheYoung/FreeMask)]
* **AiluRus**: "AiluRus: A Scalable ViT Framework for Dense Prediction", NeurIPS, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2311.01197)][[Code (in construction)](https://github.com/caddyless/AiluRus)]
* **SegViTv2**: "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2306.06289)][[PyTorch](https://github.com/zbwxp/SegVit)]
* **DoViT**: "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.01944)]
* **CFT**: "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2308.05581)]
* **ICPC**: "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.07078)]
* **Superpixel-Association**: "Superpixel Transformers for Efficient Semantic Segmentation", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2309.16889)]
* **PlainSeg**: "Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2310.12755)][[PyTorch](https://github.com/ydhongHIT/PlainSeg)]
* **SCTNet**: "SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation", AAAI, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2312.17071)][[Code (in construction)](https://github.com/xzz777/SCTNet)]
* **?**: "Region-Based Representations Revisited", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2402.02352)]

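At inference time, most of the transformer segmenters above (SETR, Segmenter, SegFormer, MaskFormer, ...) follow the same recipe: encode the image, predict class logits at reduced resolution, then upsample to the input size. A minimal sketch of that recipe, assuming the Hugging Face `transformers` port of SegFormer; the checkpoint name and image path are illustrative assumptions.

```python
# Minimal semantic-segmentation inference sketch with a SegFormer-style model.
# Assumes: pip install transformers torch pillow; checkpoint name and image path are illustrative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"  # assumed public checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt).eval()

image = Image.open("scene.jpg")  # placeholder input
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits             # (1, num_classes, H/4, W/4)

logits = torch.nn.functional.interpolate(        # upsample back to the input resolution
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
pred = logits.argmax(dim=1)[0]                   # (H, W) class index per pixel
print(pred.shape, pred.unique())
```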
[[Back to Overview](#overview)]

### Depth Estimation
* **DPT**: "Vision Transformers for Dense Prediction", ICCV, 2021 (*Intel*). [[Paper](https://arxiv.org/abs/2103.13413)][[PyTorch](https://github.com/intel-isl/DPT)]
* **TransDepth**: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (*Harbin Institute of Technology + University of Trento*). [[Paper](https://arxiv.org/abs/2103.12091)][[PyTorch](https://github.com/ygjwd12345/TransDepth)]
* **ASTransformer**: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (*USTC*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0244.pdf)][[PyTorch](https://github.com/WJ-Chang-42/ASTransformer)]
* **MT-SfMLearner**: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAPP, 2022 (*NavInfo Europe, Netherlands*). [[Paper](https://arxiv.org/abs/2202.03131)]
* **DepthFormer**: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (*Toyota*). [[Paper](https://arxiv.org/abs/2204.07616)]
* **GuideFormer**: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (*Agency for Defense Development, Korea*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Rho_GuideFormer_Transformers_for_Image_Guided_Depth_Completion_CVPR_2022_paper.html)]
* **SparseFormer**: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2206.04557)]
* **DEST**: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2204.13791)]
* **MonoViT**: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (*University of Bologna, Italy*). [[Paper](https://arxiv.org/abs/2208.03543)][[PyTorch](https://github.com/zxcqlf/MonoViT)]
* **Spike-Transformer**: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (*Peking University*). [[Paper]()][[PyTorch](https://github.com/Leozhangjiyuan/MDE-SpikingCamera)]
* **?**: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 (*IIT Madras*). [[Paper](https://arxiv.org/abs/2211.11066)]
* **GLPanoDepth**: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2202.02796)]
* **DepthFormer**: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2203.14211)][[PyTorch](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox)]
* **BinsFormer**: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.00987)][[PyTorch](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox)]
* **SideRT**: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2204.13892)]
* **MonoFormer**: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (*DGIST, Korea*). [[Paper](https://arxiv.org/abs/2205.11083)]
* **Depthformer**: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (*Indian Institute of Technology Delhi*). [[Paper](https://arxiv.org/abs/2207.04535)]
* **TODE-Trans**: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (*USTC*). [[Paper](https://arxiv.org/abs/2209.08455)][[Code (in construction)](https://github.com/yuchendoudou/TODE)]
* **ObjCAViT**: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 (*ICL*). [[Paper](https://arxiv.org/abs/2211.17232)]
* **ROIFormer**: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 (*OPPO*). [[Paper](https://arxiv.org/abs/2212.05729)]
* **TST**: "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2306.05682)]
* **CompletionFormer**: "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 (*University of Bologna, Italy*). [[Paper](https://arxiv.org/abs/2304.13030)][[PyTorch](https://github.com/youmi-zym/CompletionFormer)][[Website](https://youmi-zym.github.io/projects/CompletionFormer/)]
* **Lite-Mono**: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 (*University of Twente, Netherlands*). [[Paper](https://arxiv.org/abs/2211.13202)][[PyTorch](https://github.com/noahzn/Lite-Mono)]
* **EGformer**: "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", ICCV, 2023 (*SNU*). [[Paper](https://arxiv.org/abs/2304.07803)]
* **ZeroDepth**: "Towards Zero-Shot Scale-Aware Monocular Depth Estimation", ICCV, 2023 (*Toyota*). [[Paper](https://arxiv.org/abs/2306.17253)][[PyTorch](https://github.com/tri-ml/vidar)][[Website](https://sites.google.com/view/tri-zerodepth)]
* **Win-Win**: "Win-Win: Training High-Resolution Vision Transformers from Two Windows", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2310.00632)]
* **?**: "Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation", WACV, 2024 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2311.01034)]
* **DeCoTR**: "DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions", CVPR, 2024 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2403.12202)]
* **Depth-Anything**: "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data", arXiv, 2024 (*TikTok*). [[Paper](https://arxiv.org/abs/2401.10891)][[PyTorch](https://github.com/LiheYoung/Depth-Anything)][[Website](https://depth-anything.github.io/)]

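The monocular depth models above (DPT, the DepthFormer variants, Depth-Anything, ...) take a single image in and produce a dense depth map out. A minimal inference sketch, assuming the Hugging Face `transformers` depth-estimation pipeline with the public DPT checkpoint; the model name and image path are illustrative assumptions.

```python
# Minimal monocular depth-estimation sketch with a DPT-style model.
# Assumes: pip install transformers torch pillow; checkpoint name and image path are illustrative.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(task="depth-estimation", model="Intel/dpt-large")

image = Image.open("room.jpg")            # placeholder input
result = depth_estimator(image)

depth_map = result["depth"]               # PIL image of per-pixel relative depth
depth_tensor = result["predicted_depth"]  # raw torch tensor before resizing
depth_map.save("room_depth.png")
print(depth_tensor.shape)
```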
[[Back to Overview](#overview)]

### Object Segmentation
* **SOTR**: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (*China Agricultural University*). [[Paper](https://arxiv.org/abs/2108.06747)][[PyTorch](https://github.com/easton-cau/SOTR)]
* **Trans4Trans**: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2107.03172)][[Code (in construction)](https://github.com/jamycheung/Trans4Trans)]
* **Trans2Seg**: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (*HKU + SenseTime*). [[Paper](https://arxiv.org/abs/2101.08461)][[PyTorch](https://github.com/xieenze/Trans2Seg)]
* **SOIT**: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (*Hikvision*). [[Paper](https://arxiv.org/abs/2112.11037)][[PyTorch](https://github.com/hikvision-research/opera)]
* **CAST**: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (*Berkeley*). [[Paper](https://arxiv.org/abs/2210.00314)]
* **?**: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 (*Aalto University, Finland*). [[Paper](https://arxiv.org/abs/2210.14139)]
* **MSMFormer**: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 (*UT Dallas*). [[Paper](https://arxiv.org/abs/2211.11679)][[PyTorch](https://github.com/YoungSean/UnseenObjectsWithMeanShift)]

[[Back to Overview](#overview)]

### Other Segmentation Tasks
* Any-X/Every-X:
* **SAM**: "Segment Anything", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.02643)][[PyTorch](https://github.com/facebookresearch/segment-anything)][[Website](https://segment-anything.com/)]
* **SEEM**: "Segment Everything Everywhere All at Once", NeurIPS, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2304.06718)][[PyTorch](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)]
* **HQ-SAM**: "Segment Anything in High Quality", NeurIPS, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2306.01567)][[PyTorch](https://github.com/SysCV/SAM-HQ)]
* **?**: "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 (*UCSB*). [[Paper](https://arxiv.org/abs/2305.06422)]
* **?**: "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2305.08196)]
* **SAD**: "SAD: Segment Any RGBD", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2305.14207)][[PyTorch](https://github.com/Jun-CEN/SegmentAnyRGBD)]
* **?**: "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 (*Kyung Hee University, Korea*). [[Paper](https://arxiv.org/abs/2306.06211)]
* **?**: "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2306.07713)]
* **FastSAM**: "Fast Segment Anything", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2306.12156)][[PyTorch](https://github.com/CASIA-IVA-Lab/FastSAM)]
* **MobileSAM**: "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2306.14289)][[PyTorch](https://github.com/ChaoningZhang/MobileSAM)]
* **Semantic-SAM**: "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2307.04767)][[Code (in construction)](https://github.com/UX-Decoder/Semantic-SAM)]
* **Follow-Anything**: "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2308.05737)]
* **DINOv**: "Visual In-Context Prompting", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2311.13601)][[Code (in construction)](https://github.com/UX-Decoder/DINOv)]
* **Stable-SAM**: "Stable Segment Anything Model", arXiv, 2023 (*Kuaishou*). [[Paper](https://arxiv.org/abs/2311.15776)][[Code (in construction)](https://github.com/fanq15/Stable-SAM)]
* **EfficientSAM**: "EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.00863)]
* **EdgeSAM**: "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2312.06660)][[PyTorch](https://github.com/chongzhou96/EdgeSAM)][[Website](https://mmlab-ntu.github.io/project/edgesam/)]
* **RepViT-SAM**: "RepViT-SAM: Towards Real-Time Segmenting Anything", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.05760)][[PyTorch](https://github.com/THU-MIG/RepViT)]
* **SlimSAM**: "0.1% Data Makes Segment Anything Slim", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2312.05284)][[PyTorch](https://github.com/czg1225/SlimSAM)]
* **FIND**: "Interfacing Foundation Models' Embeddings", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2312.07532)][[PyTorch (in construction)](https://github.com/UX-Decoder/FIND)][[Website](https://x-decoder-vl.github.io/)]
* **SqueezeSAM**: "SqueezeSAM: User-friendly mobile interactive segmentation", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.06736)]
* **TAP**: "Tokenize Anything via Prompting", arXiv, 2023 (*BAAI*). [[Paper](https://arxiv.org/abs/2312.09128)][[PyTorch](https://github.com/baaivision/tokenize-anything)]
* **MobileSAMv2**: "MobileSAMv2: Faster Segment Anything to Everything", arXiv, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2312.09579)][[PyTorch](https://github.com/ChaoningZhang/MobileSAM)]
* **TinySAM**: "TinySAM: Pushing the Envelope for Efficient Segment Anything Model", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2312.13789)][[PyTorch](https://github.com/xinghaochen/TinySAM)]
* **Conv-LoRA**: "Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model", ICLR, 2024 (*Amazon*). [[Paper](https://arxiv.org/abs/2401.17868)][[PyTorch](https://github.com/autogluon/autogluon)]
* **PerSAM**: "Personalize Segment Anything Model with One Shot", ICLR, 2024 (*CUHK*). [[Paper](https://arxiv.org/abs/2305.03048)][[PyTorch](https://github.com/ZrrSkywalker/Personalize-SAM)]
* **VRP-SAM**: "VRP-SAM: SAM with Visual Reference Prompt", CVPR, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2402.17726)]
* **UAD**: "Unsegment Anything by Simulating Deformation", CVPR, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2404.02585)][[PyTorch](https://github.com/jiahaolu97/anything-unsegmentable)]
* **ASAM**: "ASAM: Boosting Segment Anything Model with Adversarial Tuning", CVPR, 2024 (*vivo*). [[Paper](https://arxiv.org/abs/2405.00256)][[PyTorch](https://github.com/luckybird1994/ASAM)][[Website](https://asam2024.github.io/)]
* **PTQ4SAM**: "PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024 (*Beihang*). [[Paper](https://arxiv.org/abs/2405.03144)][[PyTorch](https://github.com/chengtao-lv/PTQ4SAM)]
* **BA-SAM**: "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model", arXiv, 2024 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2401.02317)]
* **OV-SAM**: "Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2401.02955)][[PyTorch](https://github.com/HarborYuan/ovsam)][[Website](https://www.mmlab-ntu.com/project/ovsam/)]
* **SSPrompt**: "Learning to Prompt Segment Anything Models", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2401.04651)]
* **RAP-SAM**: "RAP-SAM: Towards Real-Time All-Purpose Segment Anything", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2401.10228)][[PyTorch](https://github.com/xushilin1/RAP-SAM/)][[Website](https://xushilin1.github.io/rap_sam/)]
* **PA-SAM**: "PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation", arXiv, 2024 (*OPPO*). [[Paper](https://arxiv.org/abs/2401.13051)][[PyTorch](https://github.com/xzz2/pa-sam)]
* **Grounded-SAM**: "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks", arXiv, 2024 (*IDEA*). [[Paper](https://arxiv.org/abs/2401.14159)][[PyTorch](https://github.com/IDEA-Research/Grounded-Segment-Anything)]
* **EfficientViT-SAM**: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2402.05008)][[PyTorch](https://github.com/mit-han-lab/efficientvit)]
* **DeiSAM**: "DeiSAM: Segment Anything with Deictic Prompting", arXiv, 2024 (*TU Darmstadt, Germany*). [[Paper](https://arxiv.org/abs/2402.14123)]
* **CAT-SAM**: "CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segment Anything Model", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2402.03631)][[PyTorch (in construction)](https://github.com/weihao1115/cat-sam)][[Website](https://xiaoaoran.github.io/projects/CAT-SAM)]
* **BLO-SAM**: "BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM", arXiv, 2024 (*UCSD*). [[Paper](https://arxiv.org/abs/2402.16338)][[PyTorch](https://github.com/importZL/BLO-SAM)]
* **P2SAM**: "Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation", arXiv, 2024 (*UMich*). [[Paper](https://arxiv.org/abs/2403.05433)]
* **RA**: "Practical Region-level Attack against Segment Anything Models", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2404.08255)]
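
The SAM-style models above share a promptable interface: embed the image once, then decode masks from point or box prompts; the lightweight variants (MobileSAM, EfficientSAM, EdgeSAM, ...) largely mirror it, and the same package's `SamAutomaticMaskGenerator` produces the prompt-free "segment everything" masks. A minimal sketch with the original `segment_anything` package; the checkpoint path, image path, and click coordinates are placeholders.

```python
# Minimal promptable-segmentation sketch with the segment_anything package.
# Assumes: pip install segment-anything opencv-python torch; paths and coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # HxWx3 RGB array
predictor.set_image(image)                        # one-time image embedding

point = np.array([[500, 375]])                    # (x, y) click prompt, placeholder coordinates
label = np.array([1])                             # 1 = foreground click, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
print(masks.shape, scores)                        # (3, H, W) candidate masks + predicted IoUs
```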
* Vision-Language:
* **LSeg**: "Language-driven Semantic Segmentation", ICLR, 2022 (*Cornell*). [[Paper](https://arxiv.org/abs/2201.03546)][[PyTorch](https://github.com/isl-org/lang-seg)]
* **ZegFormer**: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2112.07910)][[PyTorch](https://github.com/dingjiansw101/ZegFormer)]
* **CLIPSeg**: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (*University of Göttingen, Germany*). [[Paper](https://arxiv.org/abs/2112.10003)][[PyTorch](https://github.com/timojl/clipseg)]
* **DenseCLIP**: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2112.01518)][[PyTorch](https://github.com/raoyongming/DenseCLIP)][[Website](https://denseclip.ivg-research.xyz/)]
* **GroupViT**: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2202.11094)][[Website](https://jerryxu.net/GroupViT/)][[PyTorch](https://github.com/NVlabs/GroupViT)]
* **MaskCLIP**: "Extract Free Dense Labels from CLIP", ECCV, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2112.01071)][[PyTorch](https://github.com/chongzhou96/MaskCLIP)][[Website](https://www.mmlab-ntu.com/project/maskclip/)]
* **ViewCo**: "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2302.10307)][[Code (in construction)](https://github.com/pzhren/ViewCo)]
* **LMSeg**: "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2302.13495)]
* **VL-Fields**: "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 (*University of Edinburgh, UK*). [[Paper](https://arxiv.org/abs/2305.12427)][[Website](https://tsagkas.github.io/vl-fields/)]
* **X-Decoder**: "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2212.11270)][[PyTorch](https://github.com/microsoft/X-Decoder)][[Website](https://x-decoder-vl.github.io/)]
* **IFSeg**: "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2303.14396)][[PyTorch](https://github.com/alinlab/ifseg)]
* **SAZS**: "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2304.08491)][[PyTorch](https://github.com/Liuxinyv/SAZS)]
* **CLIP-S4**: "CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 (*Bosch*). [[Paper](https://arxiv.org/abs/2305.01040)]
* **D2Zero**: "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2305.13173)][[Code (in construction)](https://github.com/heshuting555/D2Zero)][[Website](https://henghuiding.github.io/D2Zero/)]
* **PADing**: "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2306.11087)][[PyTorch](https://github.com/heshuting555/PADing)][[Website](https://henghuiding.github.io/PADing/)]
* **LD-ZNet**: "LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2303.12343)][[PyTorch](https://github.com/koutilya-pnvr/LD-ZNet)][[Website](https://koutilya-pnvr.github.io/LD-ZNet/)]
* **MAFT**: "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation", NeurIPS, 2023 (*Picsart*). [[Paper](https://arxiv.org/abs/2310.00240)][[PyTorch](https://github.com/jiaosiyu1999/MAFT)]
* **PGSeg**: "Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation", NeurIPS, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2310.19001)][[PyTorch](https://github.com/Ferenas/PGSeg)]
* **MESS**: "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (*IBM*). [[Paper](https://arxiv.org/abs/2306.15521)][[PyTorch](https://github.com/blumenstiel/MESS)][[Website](https://blumenstiel.github.io/mess-benchmark/)]
* **ZegOT**: "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2301.12171)]
* **SimCon**: "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2302.03432)]
* **DiffusionSeg**: "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2303.09813)]
* **ASCG**: "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2304.01114)]
* **ClsCLIP**: "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 (*Eastern Institute for Advanced Study, China*). [[Paper](https://arxiv.org/abs/2304.06212)]
* **CLIPTeacher**: "CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation", arXiv, 2023 (*Nagoya University*). [[Paper](https://arxiv.org/abs/2310.02296)]
* **SAM-CLIP**: "SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2310.15308)]
* **GEM**: "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", arXiv, 2023 (*University of Bonn, Germany*). [[Paper](https://arxiv.org/abs/2312.00878)][[PyTorch](https://github.com/WalBouss/GEM)]
* **CaR**: "CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.07661)][[Code (in construction)](https://github.com/kevin-ssy/CLIP_as_RNN)][[Website](https://torrvision.com/clip_as_rnn/)]
* **SPT**: "Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation", AAAI, 2024 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2312.12754)][[PyTorch (in construction)](https://github.com/clearxu/SPT)]
* **FMbSeg**: "Annotation Free Semantic Segmentation with Vision Foundation Models", arXiv, 2024 (*Toyota*). [[Paper](https://arxiv.org/abs/2403.09307)]
* Open-World/Vocabulary (see the sketch after this sub-list):
* **ViL-Seg**: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.08455)]
* **OVSS**: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2112.14757)][[PyTorch](https://github.com/MendelXu/zsseg.baseline)]
* **OpenSeg**: "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2112.12143)]
* **Fusioner**: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://arxiv.org/abs/2210.15138)][[Website](https://yyh-rain-song.github.io/Fusioner_webpage/)]
* **OVSeg**: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2210.04150)][[PyTorch](https://github.com/facebookresearch/ov-seg)][[Website](https://jeff-liangf.github.io/projects/ovseg/)]
* **ZegCLIP**: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2212.03588)][[PyTorch](https://github.com/ZiqinZhou66/ZegCLIP)]
* **TCL**: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 (*Kakao*). [[Paper](https://arxiv.org/abs/2212.00785)][[PyTorch](https://github.com/kakaobrain/tcl)]
* **ODISE**: "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2303.04803)][[PyTorch](https://github.com/NVlabs/ODISE)][[Website](https://jerryxu.net/ODISE/)]
* **Mask-free-OVIS**: "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 (*Salesforce*). [[Paper](https://arxiv.org/abs/2303.16891)][[PyTorch (in construction)](https://github.com/Vibashan/Maskfree-OVIS)]
* **FreeSeg**: "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.17225)]
* **SAN**: "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2302.12242)][[PyTorch](https://github.com/MendelXu/SAN)]
* **OVSegmentor**: "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2301.09121)][[PyTorch](https://github.com/Jazzcharles/OVSegmentor/)][[Website](https://jazzcharles.github.io/OVSegmentor/)]
* **PACL**: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2212.04994)]
* **MaskCLIP**: "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 (*UCSD*). [[Paper](https://arxiv.org/abs/2208.08984)][[Website](https://maskclip.github.io/)]
* **SegCLIP**: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2211.14813)][[PyTorch](https://github.com/ArrowLuo/SegCLIP)]
* **SWORD**: "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2308.04206)]
* **Grounded-Diffusion**: "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2301.05221)][[PyTorch](https://github.com/Lipurple/Grounded-Diffusion)][[Website](https://lipurple.github.io/Grounded_Diffusion/)]
* **SegPrompt**: "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.06531)][[PyTorch](https://github.com/aim-uofa/SegPrompt)]
* **CGG**: "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2301.00805)][[PyTorch](https://github.com/jzwu48033552/betrayed-by-captions)][[Website](https://www.mmlab-ntu.com/project/betrayed_caption/index.html)]
* **OpenSeeD**: "A Simple Framework for Open-Vocabulary Segmentation and Detection", ICCV, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.08131)][[PyTorch](https://github.com/IDEA-Research/OpenSeeD)]
* **OPSNet**: "Open-vocabulary Panoptic Segmentation with Embedding Modulation", ICCV, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2303.11324)]
* **GKC**: "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", ICCV, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.09181)]
* **ZeroSeg**: "Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only", ICCV, 2023 (*Meta*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Chen_Exploring_Open-Vocabulary_Semantic_Segmentation_from_CLIP_Vision_Encoder_Distillation_Only_ICCV_2023_paper.html)]
* **MasQCLIP**: "MasQCLIP for Open-Vocabulary Universal Image Segmentation", ICCV, 2023 (*UCSD*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Xu_MasQCLIP_for_Open-Vocabulary_Universal_Image_Segmentation_ICCV_2023_paper.html)][[PyTorch](https://github.com/mlpc-ucsd/MasQCLIP)][[Website](https://masqclip.github.io/)]
* **VLPart**: "Going Denser with Open-Vocabulary Part Segmentation", ICCV, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2305.11173)][[PyTorch](https://github.com/facebookresearch/VLPart)]
* **DeOP**: "Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network", ICCV, 2023 (*Meituan*). [[Paper](https://arxiv.org/abs/2304.01198)][[PyTorch](https://github.com/CongHan0808/DeOP)]
* **MixReorg**: "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2308.04829)]
* **OV-PARTS**: "OV-PARTS: Towards Open-Vocabulary Part Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2310.05107)][[PyTorch](https://github.com/OpenRobotLab/OV_PARTS)]
* **HIPIE**: "Hierarchical Open-vocabulary Universal Image Segmentation", NeurIPS, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2307.00764)][[PyTorch](https://github.com/berkeley-hipie/HIPIE)][[Website](http://people.eecs.berkeley.edu/~xdwang/projects/HIPIE/)]
* **?**: "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", NeurIPS, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2309.00096)]
* **FC-CLIP**: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", NeurIPS, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2308.02487)][[PyTorch](https://github.com/bytedance/fc-clip)]
* **WLSegNet**: "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 (*IIT, New Delhi*). [[Paper](https://arxiv.org/abs/2302.14163)]
* **CAT-Seg**: "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*Korea University*). [[Paper](https://arxiv.org/abs/2303.11797)][[PyTorch](https://github.com/KU-CVLAB/CAT-Seg)][[Website](https://ku-cvlab.github.io/CAT-Seg/)]
* **MVP-SEG**: "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*Xiaohongshu, China*). [[Paper](https://arxiv.org/abs/2304.06957)]
* **TagCLIP**: "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2304.07547)]
* **OVDiff**: "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2306.09316)][[Website](https://www.robots.ox.ac.uk/~vgg/research/ovdiff/)]
* **UOVN**: "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2307.08238)]
* **CLIP-DIY**: "CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free", arXiv, 2023 (*Warsaw University of Technology, Poland*). [[Paper](https://arxiv.org/abs/2309.14289)]
* **Entity**: "Rethinking Evaluation Metrics of Open-Vocabulary Segmentation", arXiv, 2023 (*Harbin Engineering University*). [[Paper](https://arxiv.org/abs/2311.03352)][[PyTorch](https://github.com/qqlu/Entity)]
* **OSM**: "Towards Open-Ended Visual Recognition with Large Language Model", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2311.08400)][[PyTorch](https://github.com/bytedance/OmniScient-Model)]
* **SED**: "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*Tianjin*). [[Paper](https://arxiv.org/abs/2311.15537)][[PyTorch (in construction)](https://github.com/xb534/SED)]
* **PnP-OVSS**: "Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2311.17095)]
* **SCLIP**: "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference", arXiv, 2023 (*JHU*). [[Paper](https://arxiv.org/abs/2312.01597)]
* **GranSAM**: "Towards Granularity-adjusted Pixel-level Semantic Annotation", arXiv, 2023 (*UC Riverside*). [[Paper](https://arxiv.org/abs/2312.02420)]
* **Sambor**: "Boosting Segment Anything Model Towards Open-Vocabulary Learning", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2312.03628)][[Code (in construction)](https://github.com/ucas-vg/Sambor)]
* **SCAN**: "Open-Vocabulary Segmentation with Semantic-Assisted Calibration", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.04089)][[Code (in construction)](https://github.com/workforai/SCAN)]
* **Self-Seg**: "Self-Guided Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*UvA*). [[Paper](https://arxiv.org/abs/2312.04539)]
* **OpenSD**: "OpenSD: Unified Open-Vocabulary Segmentation and Detection", arXiv, 2023 (*OPPO*). [[Paper](https://arxiv.org/abs/2312.06703)]
* **CLIP-DINOiser**: "CLIP-DINOiser: Teaching CLIP a few DINO tricks", arXiv, 2023 (*Warsaw University of Technology, Poland*). [[Paper](https://arxiv.org/abs/2312.12359)][[PyTorch](https://github.com/wysoczanska/clip_dinoiser)]
* **TagAlign**: "TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification", arXiv, 2023 (*Ant Group*). [[Paper](https://arxiv.org/abs/2312.14149)][[PyTorch](https://github.com/Qinying-Liu/TagAlign)][[Website](https://qinying-liu.github.io/Tag-Align/)]
* **OVFoodSeg**: "OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation", CVPR, 2024 (*Singapore Management University (SMU)*). [[Paper](https://arxiv.org/abs/2404.01409)]
* **FreeDA**: "Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation", CVPR, 2024 (*University of Modena and Reggio Emilia (UniMoRe), Italy*). [[Paper](https://arxiv.org/abs/2404.06542)][[Website](https://aimagelab.github.io/freeda/)]
* **S-Seg**: "Exploring Simple Open-Vocabulary Semantic Segmentation", arXiv, 2024 (*Oxford*). [[Paper](https://arxiv.org/abs/2401.12217)][[Code (in construction)](https://github.com/zlai0/S-Seg)]
* **PosSAM**: "PosSAM: Panoptic Open-vocabulary Segment Anything", arXiv, 2024 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2403.09620)][[Code (in construction)](https://github.com/Vibashan/PosSAM)][[Website](https://vibashan.github.io/possam-web/)]
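
Many of the open-vocabulary methods above share a two-stage recipe: class-agnostic mask proposals are scored against text embeddings of arbitrary category names. The snippet below is a minimal, self-contained sketch of that recipe in plain PyTorch; random tensors stand in for the pretrained CLIP image/text encoders and the mask-proposal network, and all names and shapes are illustrative placeholders, not any paper's API.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; real methods use CLIP encoders and a mask-proposal decoder.
num_masks, embed_dim, H, W = 10, 512, 64, 64
class_names = ["cat", "grass", "sky"]            # arbitrary vocabulary at test time

pixel_feats = F.normalize(torch.randn(embed_dim, H, W), dim=0)                # CLIP-like dense features
text_embeds = F.normalize(torch.randn(len(class_names), embed_dim), dim=-1)   # encoded class names
proposals = (torch.rand(num_masks, H, W) > 0.5).float()                       # binary mask proposals

# Mask pooling: average the dense features inside each proposal, then
# classify every proposal by cosine similarity to the text embeddings.
area = proposals.flatten(1).sum(-1).clamp(min=1.0)                            # (N,)
mask_embeds = torch.einsum("nhw,dhw->nd", proposals, pixel_feats) / area[:, None]
mask_embeds = F.normalize(mask_embeds, dim=-1)

logits = 100.0 * mask_embeds @ text_embeds.t()                                # (N, C), CLIP-style scale
labels = logits.argmax(dim=-1)
print([class_names[i] for i in labels.tolist()])
```
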
* LLM-based:
* **LISA**: "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.00692)][[PyTorch](https://github.com/dvlab-research/LISA)]
* **PixelLM**: "PixelLM: Pixel Reasoning with Large Multimodal Model", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2312.02228)][[Code (in construction)](https://github.com/MaverickRen/PixelLM)][[Website](https://pixellm.github.io/)]
* **PixelLLM**: "Pixel Aligned Language Models", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.09237)][[Website](https://jerryxu.net/PixelLLM/)]
* **GSVA**: "GSVA: Generalized Segmentation via Multimodal Large Language Models", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.10103)]
* **LISA++**: "An Improved Baseline for Reasoning Segmentation with Large Language Model", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2312.17240)]
* **GROUNDHOG**: "GROUNDHOG: Grounding Large Language Models to Holistic Segmentation", CVPR, 2024 (*Amazon*). [[Paper](https://arxiv.org/abs/2402.16846)][[Website](https://groundhog-mllm.github.io/)]
* **PSALM**: "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model", arXiv, 2024 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2403.14598)][[PyTorch](https://github.com/zamling/PSALM)]
* **LLaVASeg**: "Empowering Segmentation Ability to Multi-modal Large Language Models", arXiv, 2024 (*vivo*). [[Paper](https://arxiv.org/abs/2403.14141)]
* **LaSagnA**: "LaSagnA: Language-based Segmentation Assistant for Complex Queries", arXiv, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2404.08506)][[PyTorch](https://github.com/congvvc/LaSagnA)]
* Universal Segmentation (see the sketch after this sub-list):
* **K-Net**: "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2106.14855)][[PyTorch](https://github.com/ZwwWayne/K-Net/)]
* **Mask2Former**: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2112.01527)][[PyTorch](https://github.com/facebookresearch/Mask2Former)][[Website](https://bowenc0221.github.io/mask2former/)]
* **MP-Former**: "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.07336)][[Code (in construction)](https://github.com/IDEA-Research/MP-Former)]
* **OneFormer**: "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 (*Oregon*). [[Paper](https://arxiv.org/abs/2211.06220)][[PyTorch](https://github.com/SHI-Labs/OneFormer)][[Website](https://praeclarumjj3.github.io/oneformer/)]
* **UNINEXT**: "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.06674)][[PyTorch](https://github.com/MasterBin-IIAU/UNINEXT)]
* **ClustSeg**: "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 (*Rochester Institute of Technology*). [[Paper](https://arxiv.org/abs/2305.02187)]
* **DaTaSeg**: "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.01736)]
* **DFormer**: "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2306.03437)][[Code (in construction)](https://github.com/cp3wan/DFormer)]
* **?**: "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 (*OMRON SINIC X, Japan*). [[Paper](https://arxiv.org/abs/2307.02862)]
* **Mask2Anomaly**: "Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation", arXiv, 2023 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2309.04573)]
* **SegGen**: "SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis", arXiv, 2023 (*Adobe*). [[Paper](https://arxiv.org/abs/2311.03355)][[Code (in construction)](https://github.com/prismformore/seggen)][[Website](https://seggenerator.github.io/)]
* **PolyMaX**: "PolyMaX: General Dense Prediction with Mask Transformer", WACV, 2024 (*Google*). [[Paper](https://arxiv.org/abs/2311.05770)][[Tensorflow](https://github.com/google-research/deeplab2)]
* **PEM**: "PEM: Prototype-based Efficient MaskFormer for Image Segmentation", CVPR, 2024 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2402.19422)][[Code (in construction)](https://github.com/NiccoloCavagnero/PEM)]
* **OMG-Seg**: "OMG-Seg: Is One Model Good Enough For All Segmentation?", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2401.10229)][[PyTorch](https://github.com/lxtGH/OMG-Seg)][[Website](https://lxtgh.github.io/project/omg_seg/)]
* **Uni-OVSeg**: "Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision", arXiv, 2024 (*University of Sydney*). [[Paper](https://arxiv.org/abs/2402.08960)][[PyTorch (in construction)](https://github.com/DerrickWang005/Uni-OVSeg.pytorch)]
* **PRO-SCALE**: "Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation", arXiv, 2024 (*NEC*). [[Paper](https://arxiv.org/abs/2404.14657)]
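
Most of the universal segmentation models above (K-Net, Mask2Former, OneFormer, kMaX-DeepLab, ...) share the same mask-classification output format: a fixed set of learned queries, each predicting a class distribution and a mask. The sketch below shows, in plain PyTorch, how such outputs are typically decoded into semantic and panoptic-style predictions; random tensors replace a trained model, and the confidence handling is illustrative rather than any specific paper's post-processing.

```python
import torch

# Toy mask-classification outputs: Q queries, C classes plus a "no object" slot.
Q, C, H, W = 100, 19, 128, 128
class_logits = torch.randn(Q, C + 1)                 # last column = "no object"
mask_logits = torch.randn(Q, H, W)

# Semantic decoding: marginalize the per-query masks over their class distributions.
class_probs = class_logits.softmax(dim=-1)[:, :-1]   # drop "no object", (Q, C)
mask_probs = mask_logits.sigmoid()                   # (Q, H, W)
semseg = torch.einsum("qc,qhw->chw", class_probs, mask_probs)
semantic_pred = semseg.argmax(dim=0)                 # (H, W)

# Panoptic-style decoding: keep confident queries and assign each pixel to the
# highest-scoring surviving query (median stands in for a fixed threshold here).
scores, labels = class_probs.max(dim=-1)             # (Q,), (Q,)
keep = scores >= scores.median()
pixel_owner = (mask_probs[keep] * scores[keep, None, None]).argmax(dim=0)   # (H, W)
print(semantic_pred.shape, pixel_owner.shape)
```
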
* Multi-Modal:
* **UCTNet**: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (*Lehigh University, Pennsylvania*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/7082_ECCV_2022_paper.php)]
* **CMX**: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2203.04838)][[PyTorch](https://github.com/huaaaliu/RGBX_Semantic_Segmentation)]
* **DeLiVER**: "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2303.01480)][[PyTorch](https://github.com/jamycheung/DELIVER)][[Website](https://jamycheung.github.io/DELIVER.html)]
* **DFormer**: "DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation", arXiv, 2023 (*Nankai University*). [[Paper](https://arxiv.org/abs/2309.09668)][[PyTorch](https://github.com/VCIP-RGBD/DFormer)]
* **Sigma**: "Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation", arXiv, 2024 (*CMU*). [[Paper](https://arxiv.org/abs/2404.04256)][[PyTorch](https://github.com/zifuwan/Sigma)]
* Panoptic Segmentation:
* **MaX-DeepLab**: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2012.00759)][[PyTorch (conradry)](https://github.com/conradry/max-deeplab)]
* **SIAin**: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (*SI Analytics, South Korea*). [[Paper](https://arxiv.org/abs/2110.04009)]
* **VPS-Transformer**: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (*Technical University of Cluj-Napoca, Romania*). [[Paper](https://openaccess.thecvf.com/content/WACV2022/html/Petrovai_Time-Space_Transformers_for_Video_Panoptic_Segmentation_WACV_2022_paper.html)]
* **CMT-DeepLab**: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2206.08948)]
* **Panoptic-SegFormer**: "Panoptic SegFormer", CVPR, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2109.03814)][[PyTorch](https://github.com/zhiqi-li/Panoptic-SegFormer)]
* **kMaX-DeepLab**: "k-means Mask Transformer", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2207.04044)][[Tensorflow](https://github.com/google-research/deeplab2)]
* **Panoptic-PartFormer**: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (*Peking*). [[Paper](https://arxiv.org/abs/2204.04655)][[PyTorch](https://github.com/lxtGH/Panoptic-PartFormer)]
* **CoMFormer**: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 (*Sorbonne Université, France*). [[Paper](https://arxiv.org/abs/2211.13999)]
* **YOSO**: "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2303.14651)][[PyTorch](https://github.com/hujiecpp/YOSO)]
* **Pix2Seq-D**: "A Generalist Framework for Panoptic Segmentation of Images and Videos", ICCV, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2210.06366)][[Tensorflow2](https://github.com/google-research/pix2seq)]
* **DeepDPS**: "Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning", ICCV, 2023 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2307.14786)][[Code (in construction)](https://github.com/jwh97nn/DeepDPS)]
* **ReMaX**: "ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.17319)][[Tensorflow2](https://github.com/google-research/deeplab2)]
* **PanopticPartFormer++**: "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2204.04655)][[PyTorch](https://github.com/lxtGH/Panoptic-PartFormer)]
* **MaXTron**: "MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2311.18537)]
* **ECLIPSE**: "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning", CVPR, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2403.20126)][[Code (in construction)](https://github.com/clovaai/ECLIPSE)]
* Instance Segmentation:
* **ISTR**: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2105.00637)][[PyTorch](https://github.com/hujiecpp/ISTR)]
* **Mask-Transfiner**: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2111.13673)][[PyTorch](https://github.com/SysCV/transfiner)][[Website](https://www.vis.xyz/pub/transfiner/)]
* **BoundaryFormer**: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (*UCSD*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Lazarow_Instance_Segmentation_With_Mask-Supervised_Polygonal_Boundary_Transformers_CVPR_2022_paper.html)]
* **PPT**: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2206.10845)]
* **TOIST**: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2210.10775)][[PyTorch](https://github.com/AIR-DISCOVER/TOIST)]
* **MAL**: "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2301.03992)][[PyTorch](https://github.com/NVlabs/mask-auto-labeler)]
* **FastInst**: "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2303.08594)][[PyTorch](https://github.com/junjiehe96/FastInst)]
* **SP**: "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 (*Northwestern Polytechnical University, China*). [[Paper](https://arxiv.org/abs/2302.01171)]
* **X-Paste**: "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2212.03863)][[PyTorch](https://github.com/yoctta/XPaste)]
* **DynaMITe**: "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", ICCV, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2304.06668)][[PyTorch](https://github.com/sabarim/dynamite/)][[Website](https://sabarim.github.io/dynamite/)]
* **Mask-Frozen-DETR**: "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.03747)]
* Optical Flow:
* **CRAFT**: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (*A\*STAR, Singapore*). [[Paper](https://arxiv.org/abs/2203.16896)][[PyTorch](https://github.com/askerlee/craft)]
* **KPA-Flow**: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (*Megvii*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Luo_Learning_Optical_Flow_With_Kernel_Patch_Attention_CVPR_2022_paper.html)][[PyTorch (in construction)](https://github.com/megvii-research/KPAFlow)]
* **GMFlowNet**: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (*Rutgers*). [[Paper](https://arxiv.org/abs/2203.11335)][[PyTorch](https://github.com/xiaofeng94/GMFlowNet)]
* **FlowFormer**: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2203.16194)][[Website](https://drinkingcoder.github.io/publication/flowformer/)]
* **TransFlow**: "TransFlow: Transformer as Flow Learner", CVPR, 2023 (*Rochester Institute of Technology*). [[Paper](https://arxiv.org/abs/2304.11523)]
* **FlowFormer++**: "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.01237)]
* **FlowFormer**: "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2306.05442)]
* Panoramic Semantic Segmentation:
* **Trans4PASS**: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2203.01452)][[PyTorch](https://github.com/jamycheung/Trans4PASS)]
* **SGAT4PASS**: "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2306.03403)][[Code (in construction)](https://github.com/TencentARC/SGAT4PASS)]
* X-Shot (see the sketch after this sub-list):
* **CyCTR**: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (*University of Technology Sydney*). [[Paper](https://arxiv.org/abs/2106.02320)]
* **CATrans**: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2204.12817)]
* **VAT**: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2207.10866)][[PyTorch](https://github.com/Seokju-Cho/Volumetric-Aggregation-Transformer)][[Website](https://seokju-cho.github.io/VAT/)]
* **DCAMA**: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2207.08549)]
* **AAFormer**: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (*USTC*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/1397_ECCV_2022_paper.php)]
* **IPMT**: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2210.06780)][[PyTorch](https://github.com/LIUYUANWEI98/IPMT)]
* **TAFT**: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2202.06498)]
* **MSANet**: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (*AiV Research Group, Korea*). [[Paper](https://arxiv.org/abs/2206.09667)][[PyTorch](https://github.com/AIVResearch/MSANet)]
* **MuHS**: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 (*Zhejiang University*). [[Paper](https://openreview.net/forum?id=CGuvK3U09LH)]
* **VTM**: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2303.14969)][[PyTorch](https://github.com/GitGyun/visual_token_matching)]
* **SegGPT**: "SegGPT: Segmenting Everything In Context", ICCV, 2023 (*BAAI*). [[Paper](https://arxiv.org/abs/2304.03284)][[PyTorch](https://github.com/baaivision/Painter)]
* **AMFormer**: "Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation", NeurIPS, 2023 (*ISTC*). [[Paper](https://arxiv.org/abs/2311.17626)][[Code (in construction)](https://github.com/Wyxdm/AMNet)]
* **RefT**: "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.01156)][[Code (in construction)](https://github.com/hanyue1648/RefT)]
* **?**: "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2307.02003)]
* **SPINO**: "Few-Shot Panoptic Segmentation With Foundation Models", arXiv, 2023 (*University of Freiburg, Germany*). [[Paper](https://arxiv.org/abs/2309.10726)][[Website](http://spino.cs.uni-freiburg.de/)]
* **?**: "Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach", CVPR, 2024 (*UBC*). [[Paper](https://arxiv.org/abs/2404.11732)]
* **RefLDM-Seg**: "Explore In-Context Segmentation via Latent Diffusion Models", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2403.09616)][[Code (in construction)](https://github.com/wang-chaoyang/RefLDMSeg)][[Website](https://wang-chaoyang.github.io/project/refldmseg/)]
* **Chameleon**: "Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild", arXiv, 2024 (*KAIST*). [[Paper](https://arxiv.org/abs/2404.18459)]
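
A large fraction of the few-shot methods above build on the same support-to-query matching idea, whether through prototypes or dense cross-attention between support and query features. Below is a minimal prototype-matching sketch in plain PyTorch; random tensors stand in for backbone features, and the 0.5 threshold is purely illustrative, so it does not reproduce any specific paper's method.

```python
import torch
import torch.nn.functional as F

D, H, W = 256, 60, 60
support_feat = torch.randn(D, H, W)                  # backbone features of the support image
support_mask = (torch.rand(H, W) > 0.5).float()      # its annotated binary mask
query_feat = torch.randn(D, H, W)                    # backbone features of the query image

# Masked average pooling over the support foreground -> one class prototype.
prototype = (support_feat * support_mask).sum(dim=(1, 2)) / support_mask.sum().clamp(min=1.0)

# Cosine similarity between the prototype and every query location,
# thresholded into a foreground prediction for the novel class.
sim = F.cosine_similarity(query_feat, prototype[:, None, None].expand_as(query_feat), dim=0)
pred_mask = (sim > 0.5).float()                      # (H, W)
print(sim.shape, pred_mask.mean().item())
```
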
* X-Supervised (see the sketch after this sub-list):
* **MCTformer**: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (*The University of Western Australia*). [[Paper](https://arxiv.org/abs/2203.02891)][[Code (in construction)](https://github.com/xulianuwa/MCTformer)]
* **AFA**: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2203.02664)][[PyTorch](https://github.com/rulixiang/afa)]
* **HSG**: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (*Berkeley*). [[Paper](https://arxiv.org/abs/2204.11432)][[PyTorch](https://github.com/twke18/HSG)]
* **CLIMS**: "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 (*Shenzhen University*). [[Paper](https://arxiv.org/abs/2203.02668)][[PyTorch](https://github.com/CVI-SZU/CLIMS)]
* **?**: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (*Université Paris-Saclay, France*). [[Paper](https://arxiv.org/abs/2205.15173)]
* **SegSwap**: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (*École des Ponts ParisTech*). [[Paper](https://arxiv.org/abs/2110.15904)][[PyTorch](https://github.com/XiSHEN0220/SegSwap)][[Website](http://imagine.enpc.fr/~shenx/SegSwap/)]
* **ViT-PCM**: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (*Sapienza University, Italy*). [[Paper](https://arxiv.org/abs/2210.17400)][[Tensorflow](https://github.com/deepplants/ViT-PCM)]
* **TransFGU**: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2112.01515)][[PyTorch](https://github.com/damo-cv/TransFGU)]
* **TransCAM**: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*University of Toronto*). [[Paper](https://arxiv.org/abs/2203.07239)][[PyTorch](https://github.com/liruiwen/TransCAM)]
* **WegFormer**: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*Tongji University, China*). [[Paper](https://arxiv.org/abs/2203.08421)]
* **MaskDistill**: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (*KU Leuven*). [[Paper](https://arxiv.org/abs/2206.06363)][[PyTorch](https://github.com/wvangansbeke/MaskDistill)]
* **eX-ViT**: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*La Trobe University, Australia*). [[Paper](https://arxiv.org/abs/2207.05358)]
* **TCC**: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2209.02178)]
* **SemFormer**: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*Shenzhen University*). [[Paper](https://arxiv.org/abs/2210.14618)][[PyTorch](https://github.com/JLChen-C/SemFormer)]
* **CLIP-ES**: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2212.09506)][[PyTorch](https://github.com/linyq2117/CLIP-ES)]
* **ToCo**: "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2303.01267)][[PyTorch](https://github.com/rulixiang/ToCo)]
* **DPF**: "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2303.16890)][[PyTorch](https://github.com/cxx226/DPF)]
* **SemiCVT**: "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Huang_SemiCVT_Semi-Supervised_Convolutional_Vision_Transformer_for_Semantic_Segmentation_CVPR_2023_paper.html)]
* **AttentionShift**: "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 (*CAS*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Liao_AttentionShift_Iteratively_Estimated_Part-Based_Attention_Map_for_Pointly_Supervised_Instance_CVPR_2023_paper.html)]
* **MMCST**: "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 (*The University of Western Australia*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Xu_Learning_Multi-Modal_Class-Specific_Tokens_for_Weakly_Supervised_Dense_Object_Localization_CVPR_2023_paper.html)]
* **SimSeg**: "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 (*ByteDance*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Yi_A_Simple_Framework_for_Text-Supervised_Semantic_Segmentation_CVPR_2023_paper.html)][[Code (in construction)](https://github.com/muyangyi/SimSeg)]
* **SIM**: "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2303.08578)][[PyTorch (in construction)](https://github.com/lslrh/SIM)]
* **Point2Mask**: "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.01779)][[PyTorch](https://github.com/LiWentomng/Point2Mask)]
* **BoxSnake**: "BoxSnake: Polygonal Instance Segmentation with Box Supervision", ICCV, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2303.11630)]
* **QA-CLIMS**: "Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation", ACMMM, 2023 (*Shenzhen University*). [[Paper](https://arxiv.org/abs/2401.09883)][[Code (in construction)](https://github.com/CVI-SZU/QA-CLIMS)]
* **CoCu**: "Bridging Semantic Gaps for Language-Supervised Semantic Segmentation", NeurIPS, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2309.13505)][[PyTorch](https://github.com/xing0047/CoCu)]
* **APro**: "Label-efficient Segmentation via Affinity Propagation", NeurIPS, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2310.10533)][[PyTorch](https://github.com/CircleRadon/APro)][[Website](https://liwentomng.github.io/apro/)]
* **PaintSeg**: "PaintSeg: Training-free Segmentation via Painting", NeurIPS, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2305.19406)]
* **SmooSeg**: "SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation", NeurIPS, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2310.17874)][[PyTorch](https://github.com/mc-lan/SmooSeg)]
* **VLOSS**: "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2303.06547)]
* **MECPformer**: "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Tongji University*). [[Paper](https://arxiv.org/abs/2303.10689)][[Code (in construction)](https://github.com/ChunmengLiu1/MECPformer)]
* **WeakTr**: "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2304.01184)][[PyTorch](https://github.com/hustvl/WeakTr)]
* **SAM-WSSS**: "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 (*ANU*). [[Paper](https://arxiv.org/abs/2305.01586)]
* **?**: "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang University + Nankai University*). [[Paper](https://arxiv.org/abs/2305.01275)]
* **AReAM**: "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2305.03112)]
* **SEPL**: "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*OSU*). [[Paper](https://arxiv.org/abs/2305.05803)][[Code (in construction)](https://github.com/cskyl/SAM_WSSS)]
* **MIMIC**: "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 (*UW*). [[Paper](https://arxiv.org/abs/2306.15128)][[PyTorch](https://github.com/RAIVNLab/MIMIC)]
* **POLE**: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 (*ETS Montreal, Canada*). [[Paper](https://arxiv.org/abs/2307.00097)][[PyTorch](https://github.com/rB080/WSS_POLE)]
* **GD**: "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2308.02668)]
* **MCTformer+**: "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*The University of Western Australia*). [[Paper](https://arxiv.org/abs/2308.03005)][[PyTorch](https://github.com/xulianuwa/MCTformer)]
* **MMC**: "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2308.11448)]
* **CRATE**: "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2308.16271)][[PyTorch](https://github.com/Ma-Lab-Berkeley/CRATE)]
* **?**: "Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models", arXiv, 2023 (*Singapore Management University*). [[Paper](https://arxiv.org/abs/2310.13026)]
* **MCC**: "Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang Lab, China*). [[Paper](https://arxiv.org/abs/2305.08491)][[PyTorch](https://github.com/fwu11/MCC)]
* **CRATE**: "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2311.13110)][[PyTorch](https://github.com/Ma-Lab-Berkeley/CRATE)][[Website](https://ma-lab-berkeley.github.io/CRATE/)]
* **SAMS**: "Foundation Model Assisted Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2312.03585)]
* **SemiVL**: "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2311.16241)][[PyTorch](https://github.com/google-research/semivl)]
* **Self-reinforcement**: "Progressive Uncertain Feature Self-reinforcement for Weakly Supervised Semantic Segmentation", AAAI, 2024 (*Zhejiang Lab*). [[Paper](https://arxiv.org/abs/2312.08916)][[PyTorch](https://github.com/Jessie459/feature-self-reinforcement)]
* **FeatUp**: "FeatUp: A Model-Agnostic Framework for Features at Any Resolution", ICLR, 2024 (*MIT*). [[Paper](https://arxiv.org/abs/2403.10516)]
* **Zip-Your-CLIP**: "The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models", ICLR, 2024 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2404.11957)][[PyTorch](https://github.com/ChengShiest/Zip-Your-CLIP)]
* **SeCo**: "Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation", CVPR, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2402.18467)][[Code (in construction)](https://github.com/zwyang6/SeCo)]
* **AllSpark**: "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation", CVPR, 2024 (*HKUST*). [[Paper](https://arxiv.org/abs/2403.01818)][[PyTorch](https://github.com/xmed-lab/AllSpark)]
* **CPAL**: "Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation", CVPR, 2024 (*Monash University*). [[Paper](https://arxiv.org/abs/2403.07630)][[Code (in construction)](https://github.com/Barrett-python/CPAL)]
* **DuPL**: "DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation", CVPR, 2024 (*Shanghai University*). [[Paper](https://arxiv.org/abs/2403.11184)][[PyTorch](https://github.com/Wu0409/DuPL)]
* **CoDe**: "Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation", CVPR, 2024 (*NTU*). [[Paper](https://arxiv.org/abs/2404.04231)][[Code (in construction)](https://github.com/072jiajia/image-text-co-decomposition)]
* **SemPLeS**: "SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2401.11791)]
* **WeakSAM**: "WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition", arXiv, 2024 (*Huazhong University of Science & Technology (HUST)*). [[Paper](https://arxiv.org/abs/2402.14812)][[PyTorch](https://github.com/hustvl/WeakSAM)]
* **CoSA**: "Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation", arXiv, 2024 (*Lancaster University, UK*). [[Paper](https://arxiv.org/abs/2402.17891)][[Code (in construction)](https://github.com/youshyee/CoSA)]
* **CoBra**: "CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation", arXiv, 2024 (*Yonsei*). [[Paper](https://arxiv.org/abs/2403.08801)]
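
Most of the weakly supervised entries above (MCTformer, AFA, TransCAM, ToCo, WeakTr, ...) start from the same ingredient: class activation maps derived from a transformer classifier trained with image-level labels only, later refined into pseudo masks. The sketch below illustrates that first step in plain PyTorch; random tensors replace a trained ViT and its linear classifier, and the 14×14 grid, 0.4 threshold, and the ignore value of 255 are assumptions for illustration, not any paper's settings.

```python
import torch
import torch.nn.functional as F

grid, dim, num_classes = 14, 384, 20
patch_tokens = torch.randn(grid * grid, dim)        # ViT patch tokens for one image
cls_weight = torch.randn(num_classes, dim)          # weights of the image-level classifier
image_labels = torch.tensor([3, 7])                 # classes known present (image-level labels)

# Class activation maps: score every patch against every class, reshape to a grid,
# rectify and normalize per class, then upsample to image resolution.
cams = (patch_tokens @ cls_weight.t()).t().reshape(num_classes, grid, grid)
cams = F.relu(cams)
cams = cams / cams.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
cams = F.interpolate(cams[None], size=(224, 224), mode="bilinear", align_corners=False)[0]

# Keep only the classes given by the image-level labels and threshold the maps
# into a pseudo semantic mask (255 marks unassigned/ignored pixels here).
fg = cams[image_labels]                             # (num_labels, 224, 224)
pseudo = torch.where(fg.amax(dim=0) > 0.4,
                     image_labels[fg.argmax(dim=0)],
                     torch.full((), 255, dtype=torch.long))
print(pseudo.shape, pseudo.unique())
```
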
* Cross-Domain:
* **DAFormer**: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2111.14887)][[PyTorch](https://github.com/lhoyer/DAFormer)]
* **HGFormer**: "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2305.13031)][[Code (in construction)](https://github.com/dingjiansw101/HGFormer)]
* **UniDAformer**: "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2206.15083)]
* **MIC**: "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2212.01322)][[PyTorch](https://github.com/lhoyer/MIC)]
* **CDAC**: "CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation", ICCV, 2023 (*Boston*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Wang_CDAC_Cross-domain_Attention_Consistency_in_Transformer_for_Domain_Adaptive_Semantic_ICCV_2023_paper.html)][[PyTorch](https://github.com/wangkaihong/CDAC)]
* **EDAPS**: "EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation", ICCV, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2304.14291)][[PyTorch](https://github.com/susaha/edaps)]
* **PTDiffSeg**: "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2307.02138)][[Code (in construction)](https://github.com/ETHRuiGong/PTDiffSeg)]
* **Rein**: "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation", arXiv, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2312.04265)]
* Continual Learning:
* **TISS**: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2211.10253)]
* **Incrementer**: "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 (*University of Electronic Science and Technology of China*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Shang_Incrementer_Transformer_for_Class-Incremental_Semantic_Segmentation_With_Knowledge_Distillation_Focusing_CVPR_2023_paper.html)]
* Crack Detection:
* **CrackFormer**: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (*Nanjing University of Science and Technology*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_CrackFormer_Transformer_Network_for_Fine-Grained_Crack_Detection_ICCV_2021_paper.html)]
* Camouflaged/Concealed Object:
* **UGTR**: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (*Group42, Abu Dhabi*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Yang_Uncertainty-Guided_Transformer_Reasoning_for_Camouflaged_Object_Detection_ICCV_2021_paper.html)][[PyTorch](https://github.com/fanyang587/UGTR)]
* **COD**: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (*Anhui University, China*). [[Paper](https://arxiv.org/abs/2205.10579)][[Code (in construction)](https://github.com/liuzywen/COD)]
* **OSFormer**: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2207.02255)][[PyTorch](https://github.com/PJLallen/OSFormer)]
* **FSPNet**: "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 (*Sichuan Changhong Electric, China*). [[Paper](https://arxiv.org/abs/2303.14816)][[PyTorch](https://github.com/ZhouHuang23/FSPNet)][[Website](https://tzxiang.github.io/project/COD-FSPNet/index.html)]
* **MFG**: "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2305.11003)][[Code (in construction)](https://github.com/ChunmingHe/WS-SAM)]
* Background Separation:
* **TransBlast**: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (*University of British Columbia*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021W/RSLCV/html/Osman_TransBlast_Self-Supervised_Learning_Using_Augmented_Subspace_With_Transformer_for_BackgroundForeground_ICCVW_2021_paper.html)]
* Scene Understanding:
* **BANet**: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2106.12413)]
* **Cerberus-Transformer**: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2111.12608)][[PyTorch](https://github.com/OPEN-AIR-SUN/Cerberus)]
* **IRISformer**: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (*UCSD*). [[Paper](https://arxiv.org/abs/2206.08423)][[Code (in construction)](https://github.com/ViLab-UCSD/IRISformer)]
* 3D Segmentation:
* **Stratified-Transformer**: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2203.14508)][[PyTorch](https://github.com/dvlab-research/Stratified-Transformer)]
* **CodedVTR**: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2203.09887)]
* **M2F3D**: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (*RWTH Aachen University, Germany*). [[Paper](https://jonasschult.github.io/Mask3D/assets/workshop_paper.pdf)][[Website](https://jonasschult.github.io/Mask3D/)]
* **3DSeg**: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 (*The University of Tokyo*). [[Paper](https://openreview.net/forum?id=4dZeBJ83oxk)]
* **Analogical-Network**: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 (*CMU*). [[Paper](https://openreview.net/forum?id=SRIQZTh0IK)]
* **VoxFormer**: "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2302.12251)][[PyTorch](https://github.com/NVlabs/VoxFormer)]
* **GrowSP**: "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2305.16404)][[PyTorch](https://github.com/vLAR-group/GrowSP)]
* **RangeViT**: "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 (*Valeo.ai, France*). [[Paper](https://arxiv.org/abs/2301.10222)][[Code (in construction)](https://github.com/valeoai/rangevit)]
* **MeshFormer**: "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 (*University of Macau*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Wong_Heat_Diffusion_Based_Multi-Scale_and_Geometric_Structure-Aware_Transformer_for_Mesh_CVPR_2023_paper.html)]
* **MSeg3D**: "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2303.08600)][[PyTorch](https://github.com/jialeli1/lidarseg3d)]
* **SGVF-SVFE**: "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2307.10782)]
* **SVQNet**: "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2308.13323)]
* **MAF-Transformer**: "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2309.01692)][[PyTorch](https://github.com/dvlab-research/Mask-Attention-Free-Transformer)]
* **UniSeg**: "UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2309.05573)][[PyTorch](https://github.com/PJLab-ADG/PCSeg)]
* **MIT**: "2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision", ICCV, 2023 (*NTU*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Yang_2D-3D_Interlaced_Transformer_for_Point_Cloud_Segmentation_with_Scene-Level_Supervision_ICCV_2023_paper.html)]
* **CVSformer**: "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", ICCV, 2023 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2307.07938)]
* **SPT**: "Efficient 3D Semantic Segmentation with Superpoint Transformer", ICCV, 2023 (*Univ Gustave Eiffel, France*). [[Paper](https://arxiv.org/abs/2306.08045)][[PyTorch](https://github.com/drprojects/superpoint_transformer)]
* **SATR**: "SATR: Zero-Shot Semantic Segmentation of 3D Shapes", ICCV, 2023 (*KAUST*). [[Paper](https://arxiv.org/abs/2304.04909)][[PyTorch](https://github.com/Samir55/SATR)][[Website](https://samir55.github.io/SATR/)]
* **3D-OWIS**: "3D Indoor Instance Segmentation in an Open-World", NeurIPS, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2309.14338)]
* **SA3D**: "Segment Anything in 3D with NeRFs", NeurIPS, 2023 (*SJTU*). [[Paper](https://arxiv.org/abs/2304.12308)][[PyTorch](https://github.com/Jumpat/SegmentAnythingin3D)][[Website](https://jumpat.github.io/SA3D/)]
* **Contrastive-Lift**: "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", NeurIPS, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2306.04633)][[PyTorch](https://github.com/yashbhalgat/Contrastive-Lift)][[Website](https://www.robots.ox.ac.uk/~vgg/research/contrastive-lift/)]
* **P3Former**: "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.13509)][[Code (in construction)](https://github.com/SmartBot-PJLab/P3Former)]
* **UnScene3D**: "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 (*TUM*). [[Paper](https://arxiv.org/abs/2303.14541)][[Website](https://rozdavid.github.io/unscene3d)]
* **CNS**: "Towards Label-free Scene Understanding by Vision Foundation Models", NeurIPS, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2306.03899)][[Code (in construction)](https://github.com/runnanchen/Label-Free-Scene-Understanding)]
* **DCTNet**: "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 (*University of Waterloo, Canada*). [[Paper](https://arxiv.org/abs/2306.08073)]
* **Symphonies**: "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 (*Horizon Robotics*). [[Paper](https://arxiv.org/abs/2306.15670)][[PyTorch](https://github.com/hustvl/Symphonies)]
* **TFS3D**: "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.12961)][[PyTorch](https://github.com/yangyangyang127/TFS3D)]
* **CIP-WPIS**: "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 (*Australian National University*). [[Paper](https://arxiv.org/abs/2309.00828)]
* **?**: "SAM-guided Unsupervised Domain Adaptation for 3D Segmentation", arXiv, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2310.08820)]
* **CSF**: "Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2311.01989)]
* **?**: "Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2311.14665)]
* **OneFormer3D**: "OneFormer3D: One Transformer for Unified Point Cloud Segmentation", arXiv, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2311.14405)]
* **SAGA**: "Segment Any 3D Gaussians", arXiv, 2023 (*SJTU*). [[Paper](https://arxiv.org/abs/2312.00860)][[Code (in construction)](https://github.com/Jumpat/SegAnyGAussians)][[Website](https://jumpat.github.io/SAGA/)]
* **SANeRF-HQ**: "SANeRF-HQ: Segment Anything for NeRF in High Quality", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2312.01531)][[Code (in construction)](https://github.com/lyclyc52/SANeRF-HQ)][[Website](https://lyclyc52.github.io/SANeRF-HQ/)]
* **SAM-Graph**: "SAM-guided Graph Cut for 3D Instance Segmentation", arXiv, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2312.08372)][[Code (in construction)](https://github.com/zju3dv/SAM-Graph)][[Website](https://zju3dv.github.io/sam_graph/)]
* **SAI3D**: "SAI3D: Segment Any Instance in 3D Scenes", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2312.11557)]
* **COSeg**: "Rethinking Few-shot 3D Point Cloud Semantic Segmentation", CVPR, 2024 (*ETHZ*). [[Paper](https://arxiv.org/abs/2403.00592)][[Code (in construction)](https://github.com/ZhaochongAn/COSeg)]
* **CSC**: "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception", CVPR, 2024 (*East China Normal University*). [[Paper](https://arxiv.org/abs/2405.07201)][[Code (in construction)](https://github.com/chenhaomingbob/CSC)]
* Multi-Task:
* **InvPT**: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2203.07997)][[PyTorch](https://github.com/prismformore/Multi-Task-Transformer)]
* **MTFormer**: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (*CUHK*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/1353_ECCV_2022_paper.php)]
* **MQTransformer**: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2205.14354)]
* **DeMT**: "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2301.03461)][[PyTorch](https://github.com/yangyangxu0/DeMT)]
* **TaskPrompter**: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 (*HKUST*). [[Paper](https://openreview.net/forum?id=-CwPopPJda)][[PyTorch (in construction)](https://github.com/prismformore/Multi-Task-Transformer)]
* **AiT**: "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2301.02229)][[PyTorch](https://github.com/SwinTransformer/AiT)]
* **InvPT++**: "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2306.04842)]
* **DeMTG**: "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2308.05721)][[PyTorch](https://github.com/yangyangxu0/DeMTG)]
* **SRT**: "Sub-token ViT Embedding via Stochastic Resonance Transformers", arXiv, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2310.03967)]
* **MLoRE**: "Multi-Task Dense Prediction via Mixture of Low-Rank Experts", CVPR, 2024 (*vivo*). [[Paper](https://arxiv.org/abs/2403.17749)]
* **ODIN**: "ODIN: A Single Model for 2D and 3D Perception", arXiv, 2024 (*CMU*). [[Paper](https://arxiv.org/abs/2401.02416)][[Code (in construction)](https://github.com/ayushjain1144/odin)][[Website](https://odin-seg.github.io/)]
* **LiFT**: "LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors", arXiv, 2024 (*Maryland*). [[Paper](https://arxiv.org/abs/2403.14625)]
* Forecasting:
* **DiffAttn**: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (*UIUC*). [[Paper](https://arxiv.org/abs/2204.07157)][[Code (in construction)](https://github.com/cgraber/psf-diffattn)]
* LiDAR:
* **HelixNet**: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (*CNRS, France*). [[Paper](https://arxiv.org/abs/2206.08194)][[Website](https://romainloiseau.fr/helixnet/)][[PyTorch](https://github.com/romainloiseau/Helix4D)]
* **Gaussian-Radar-Transformer**: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 (*University of Bonn, Germany*). [[Paper](https://arxiv.org/abs/2212.03690)]
* **MOST**: "Lidar Panoptic Segmentation and Tracking without Bells and Whistles", IROS, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2310.12464)][[PyTorch](https://github.com/abhinavagarwalla/most-lps)]
* **4D-Former**: "4D-Former: Multimodal 4D Panoptic Segmentation", CoRL, 2023 (*Waabi, Canada*). [[Paper](https://arxiv.org/abs/2311.01520)][[Website](https://waabi.ai/4d-former/)]
* **MASK4D**: "MASK4D: Mask Transformer for 4D Panoptic Segmentation", arXiv, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2309.16133)]
* Co-Segmentation:
* **ReCo**: "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2206.07045)][[PyTorch](https://github.com/NoelShin/reco)][[Website](https://www.robots.ox.ac.uk/~vgg/research/reco/)]
* **DINO-ViT-feature**: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (*Weizmann Institute of Science, Israel*). [[Paper](https://arxiv.org/abs/2112.05814)][[PyTorch](https://github.com/ShirAmir/dino-vit-features)][[Website](https://dino-vit-features.github.io/)]
* **LCCo**: "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2308.11506)]
* Top-Down Semantic Segmentation:
* **Trans4Map**: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2207.06205)]
* Surface Normal:
* **Normal-Transformer**: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 (*University of Technology Sydney*). [[Paper](https://arxiv.org/abs/2211.10580)]
* Applications:
* **FloodTransformer**: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (*BITS Pilani, India*). [[Paper](https://arxiv.org/abs/2210.04218)]
* Diffusion:
* **VPD**: "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2303.02153)][[PyTorch](https://github.com/wl-zhao/VPD)][[Website](https://vpd.ivg-research.xyz/)]
* **Dataset-Diffusion**: "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation", NeurIPS, 2023 (*VinAI, Vietnam*). [[Paper](https://arxiv.org/abs/2309.14303)][[PyTorch](https://github.com/VinAIResearch/Dataset-Diffusion)][[Website](https://dataset-diffusion.github.io/)]
* **SegRefiner**: "SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process", NeurIPS, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2312.12425)][[PyTorch](https://github.com/MengyuWang826/SegRefiner)]
* **DatasetDM**: "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models", NeurIPS, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.06160)][[PyTorch](https://github.com/showlab/DatasetDM)][[Website](https://weijiawu.github.io/DatasetDM_page/)]
* **DiffSeg**: "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2308.12469)]
* **DiffSegmenter**: "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2309.02773)]
* **?**: "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2309.04109)]
* **LDMSeg**: "A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting", arXiv, 2024 (*Segments.ai, Belgium*). [[Paper](https://arxiv.org/abs/2401.10227)][[PyTorch](https://github.com/segments-ai/latent-diffusion-segmentation)]
* Low-Level Structure Segmentation:
* **EVP**: "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023. (*Tencent*). [[Paper](https://arxiv.org/abs/2303.10883)][[PyTorch](https://github.com/NiFangBaAGe/Explict-Visual-Prompt)]
* **EVP**: "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2305.18476)][[PyTorch](https://github.com/NiFangBaAGe/Explict-Visual-Prompt)]
* **EmerDiff**: "EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2401.11739)][[Website](https://kmcode1.github.io/Projects/EmerDiff/)]
* Zero-Guidance Segmentation:
* **zero-guide-seg**: "Zero-guidance Segmentation Using Zero Segment Labels", arXiv, 2023 (*VISTEC, Thailand*). [[Paper](https://arxiv.org/abs/2303.13396)][[Website](https://zero-guide-seg.github.io/)]
* Part Segmentation:
* **OPS**: "Towards Open-World Segmentation of Parts", CVPR, 2023 (*Adobe*). [[Paper](https://arxiv.org/abs/2305.16804)][[PyTorch](https://github.com/tydpan/OpenPartSeg)]
* **PartDistillation**: "PartDistillation: Learning Parts from Instance Segmentation", CVPR, 2023 (*Meta*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Cho_PartDistillation_Learning_Parts_From_Instance_Segmentation_CVPR_2023_paper.html)]
* Entity Segmentation:
* **AIMS**: "AIMS: All-Inclusive Multi-Level Segmentation", NeurIPS, 2023 (*UC Merced*). [[Paper](https://arxiv.org/abs/2305.17768)][[PyTorch](https://github.com/dvlab-research/Entity)]
* **SOHES**: "SOHES: Self-supervised Open-world Hierarchical Entity Segmentation", ICLR, 2024 (*Adobe*). [[Paper](https://arxiv.org/abs/2404.12386)][[Website](https://sohes.github.io/)]
* Evaluation:
* **?**: "Robustness Analysis on Foundational Segmentation Models", arXiv, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2306.09278)][[PyTorch](https://github.com/DeepLearningRobustnessStudies/SegmetationRobustness)]
* Interactive Segmentation:
* **InterFormer**: "InterFormer: Real-time Interactive Image Segmentation", ICCV, 2023 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2304.02942)][[PyTorch](https://github.com/YouHuang67/InterFormer)]
* **SimpleClick**: "SimpleClick: Interactive Image Segmentation with Simple Vision Transformers", ICCV, 2023 (*UNC*). [[Paper](https://arxiv.org/abs/2210.11006)][[PyTorch](https://github.com/uncbiag/SimpleClick)]
* **iCMFormer**: "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 (*University of Twente, Netherlands*). [[Paper](https://arxiv.org/abs/2307.02280)][[Code (in construction)](https://github.com/lik1996/iCMFormer)]
* **MFP**: "MFP: Making Full Use of Probability Maps for Interactive Image Segmentation", CVPR, 2024 (*Korea University*). [[Paper](https://arxiv.org/abs/2404.18448)][[Code (in construction)](https://github.com/cwlee00/MFP)]
* **GraCo**: "GraCo: Granularity-Controllable Interactive Segmentation", CVPR, 2024 (*Peking*). [[Paper](https://arxiv.org/abs/2405.00587)][[Website](https://zhao-yian.github.io/GraCo/)]
* Amodal Segmentation:
* **AISFormer**: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (*University of Arkansas*). [[Paper](https://arxiv.org/abs/2210.06323)][[PyTorch](https://github.com/UARK-AICV/AISFormer)]
* **C2F-Seg**: "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.16825)][[Code (in construction)](https://github.com/JianxGao/C2F-Seg)][[Website](https://jianxgao.github.io/C2F-Seg/)]
* **EoRaS**: "Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2309.13248)][[Code (in construction)](https://github.com/kfan21/EoRaS)]
* **MP3D-Amodal**: "Amodal Ground Truth and Completion in the Wild", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2312.17247)][[Website (in construction)](https://www.robots.ox.ac.uk/~vgg/research/amodal/)]
* Anomaly Segmentation:
* **Mask2Anomaly**: "Unmasking Anomalies in Road-Scene Segmentation", ICCV, 2023 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2307.13316)][[PyTorch](https://github.com/shyam671/Mask2Anomaly-Unmasking-Anomalies-in-Road-Scene-Segmentation)]
* In-Context Segmentation:
* **SEGIC**: "SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2311.14671)][[Code (in construction)](https://github.com/MengLcool/SEGIC)]

[[Back to Overview](#overview)]

## Video (High-level)
### Action Recognition
* RGB mainly:
* **Action Transformer**: "Video Action Transformer Network", CVPR, 2019 (*DeepMind*). [[Paper](https://arxiv.org/abs/1812.02707)][[Code (ppriyank)](https://github.com/ppriyank/Video-Action-Transformer-Network-Pytorch-)]
* **ViViT-Ensemble**: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2106.05058)]
* **TimeSformer**: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2102.05095)][[PyTorch (lucidrains)](https://github.com/lucidrains/TimeSformer-pytorch)]
* **MViT**: "Multiscale Vision Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.11227)][[PyTorch](https://github.com/facebookresearch/SlowFast)]
* **VidTr**: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2104.11746)][[PyTorch](https://github.com/amazon-research/gluonmm)]
* **ViViT**: "ViViT: A Video Vision Transformer", ICCV, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2103.15691)][[PyTorch (rishikksh20)](https://github.com/rishikksh20/ViViT-pytorch)]
* **VTN**: "Video Transformer Network", ICCVW, 2021 (*Theator*). [[Paper](https://arxiv.org/abs/2102.00719)][[PyTorch](https://github.com/bomri/SlowFast/tree/master/projects/vtn)]
* **TokShift**: "Token Shift Transformer for Video Classification", ACMMM, 2021 (*CUHK*). [[Paper](https://arxiv.org/abs/2108.02432)][[PyTorch](https://github.com/VideoNetworks/TokShift-Transformer)]
* **Motionformer**: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.05392)][[PyTorch](https://github.com/facebookresearch/Motionformer)][[Website](https://facebookresearch.github.io/Motionformer/)]
* **X-ViT**: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (*Samsung*). [[Paper](https://arxiv.org/abs/2106.05968)][[PyTorch](https://github.com/1adrianb/video-transformers)]
* **SCT**: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (*Kuaishou*). [[Paper](https://arxiv.org/abs/2108.11575)]
* **RSANet**: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (*POSTECH*). [[Paper](https://arxiv.org/abs/2111.01673)][[PyTorch](https://github.com/KimManjin/RSA)][[Website](http://cvlab.postech.ac.kr/research/RSA/)]
* **STAM**: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2103.13915)][[Code](https://github.com/Alibaba-MIIL/STAM)]
* **GAT**: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (*Samsung*). [[Paper](https://arxiv.org/abs/2103.10043)]
* **TokenLearner**: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.11297)]
* **VLF**: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (*The University of Sheffield*). [[Paper](https://arxiv.org/abs/2107.00451)]
* **UniFormer**: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (*CAS + SenseTime*). [[Paper](https://arxiv.org/abs/2201.04676)][[PyTorch](https://github.com/Sense-X/UniFormer)]
* **Video-Swin**: "Video Swin Transformer", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.13230)][[PyTorch](https://github.com/SwinTransformer/Video-Swin-Transformer)]
* **DirecFormer**: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (*University of Arkansas*). [[Paper](https://arxiv.org/abs/2203.10233)][[Code (in construction)](https://github.com/uark-cviu/DirecFormer)]
* **DVT**: "Deformable Video Transformer", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2203.16795)]
* **MeMViT**: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2201.08383)]
* **MLP-3D**: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2206.06292)][[PyTorch (in construction)](https://github.com/ZhaofanQiu/MLP-3D)]
* **RViT**: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (*TCL Corporate Research, HK*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Yang_Recurring_the_Transformer_for_Video_Action_Recognition_CVPR_2022_paper.html)]
* **SIFA**: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2206.06931)][[PyTorch](https://github.com/FuchenUSTC/SIFA)]
* **MViTv2**: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2112.01526)][[PyTorch](https://github.com/facebookresearch/mvit)]
* **MTV**: "Multiview Transformers for Video Recognition", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2201.04288)][[Tensorflow](https://github.com/google-research/scenic/tree/main/scenic/projects/mtv)]
* **ORViT**: "Object-Region Video Transformers", CVPR, 2022 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2110.06915)][[Website](https://roeiherz.github.io/ORViT/)]
* **TIME**: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2207.09067)][[PyTorch](https://github.com/alinlab/temporal-selfsupervision)]
* **TPS**: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2207.13259)][[PyTorch](https://github.com/MartinXM/TPS)]
* **DualFormer**: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2112.04674)][[PyTorch](https://github.com/sail-sg/dualformer)]
* **STTS**: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2111.11591)][[PyTorch](https://github.com/wangjk666/STTS)]
* **Turbo**: "Turbo Training with Token Dropout", BMVC, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2210.04889)]
* **MultiTrain**: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2209.12362)][[Code (in construction)](https://github.com/JunweiLiang/MultiTrain)]
* **SViT**: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2206.06346)][[Website](https://eladb3.github.io/SViT/)]
* **ST-Adapter**: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2206.13559)][[Code (in construction)](https://github.com/linziyi96/st-adapter)]
* **ATA**: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.00132)]
* **AIA**: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2204.09303)][[PyTorch](https://github.com/haoyanbin918/Attention-in-Attention)]
* **MSCA**: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (*Nagoya Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.00452)]
* **VAST**: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2208.11108)]
* **Video-MobileFormer**: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.12257)]
* **MAM2**: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2210.05234)]
* **?**: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2210.08164)]
* **STAN**: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (*The University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2208.01753)]
* **PatchBlender**: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 (*Mila*). [[Paper](https://arxiv.org/abs/2211.14449)]
* **DualPath**: "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2303.09857)][[PyTorch (in construction)](https://github.com/park-jungin/DualPath)]
* **S-ViT**: "Streaming Video Model", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2303.17228)][[Code (in construction)](https://github.com/yuzhms/Streaming-Video-Model)]
* **TubeViT**: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.03229)]
* **AdaMAE**: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 (*JHU*). [[Paper](https://arxiv.org/abs/2211.09120)][[PyTorch](https://github.com/wgcban/adamae)]
* **ObjectViViT**: "How can objects help action recognition?", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.11726)]
* **SMViT**: "Simple MViT: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (*Meta*). [[Paper](https://omidpoursaeed.github.io/publication/smvit/)]
* **Hiera**: "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2306.00989)][[PyTorch](https://github.com/facebookresearch/hiera)]
* **Video-FocalNet**: "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2307.06947)][[PyTorch](https://github.com/TalalWasim/Video-FocalNets)][[Website](https://talalwasim.github.io/Video-FocalNets/)]
* **ATM**: "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2307.08908)][[Code (in construction)](https://github.com/whwu95/ATM)]
* **STA**: "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2308.04549)]
* **Helping-Hands**: "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2308.07918)][[PyTorch](https://github.com/Chuhanxx/helping_hand_for_egocentric_videos)]
* **SUM-L**: "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 (*University of Delaware*). [[Paper](https://arxiv.org/abs/2308.11489)][[Code (in construction)](https://github.com/wqtwjt1996/SUM-L)]
* **BEAR**: "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", ICCV, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2303.13505)][[GitHub](https://github.com/AndongDeng/BEAR)]
* **UniFormerV2**: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.09552)][[PyTorch](https://github.com/OpenGVLab/UniFormerV2)]
* **CAST**: "CAST: Cross-Attention in Space and Time for Video Action Recognition", NeurIPS, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2311.18825)][[PyTorch](https://github.com/KHU-VLL/CAST)][[Website](https://jong980812.github.io/CAST.github.io/)]
* **PPMA**: "Learning Human Action Recognition Representations Without Real Humans", NeurIPS (Datasets and Benchmarks), 2023 (*IBM*). [[Paper](https://arxiv.org/abs/2311.06231)][[PyTorch](https://github.com/howardzh01/PPMA)]
* **SVT**: "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.00325)]
* **PLAR**: "Prompt Learning for Action Recognition", arXiv, 2023 (*Maryland*). [[Paper](https://arxiv.org/abs/2305.12437)]
* **SFA-ViViT**: "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.04822)]
* **TAdaConv**: "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2308.05787)][[PyTorch](https://github.com/alibaba-mmai-research/TAdaConv)]
* **ZeroI2V**: "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2310.01324)]
* **MV-Former**: "Multi-entity Video Transformers for Fine-Grained Video Representation Learning", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2311.10873)][[PyTorch](https://github.com/facebookresearch/video_rep_learning)]
* **GeoDeformer**: "GeoDeformer: Geometric Deformable Transformer for Action Recognition", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2311.17975)]
* **Early-ViT**: "Early Action Recognition with Action Prototypes", arXiv, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2312.06598)]
* **MCA**: "Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition", ICLR, 2024 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2403.09506)][[PyTorch](https://github.com/BeSpontaneous/MCA-pytorch)]
* **StructViT**: "Learning Correlation Structures for Vision Transformers", CVPR, 2024 (*POSTECH*). [[Paper](https://arxiv.org/abs/2404.03924)]
* **VideoMamba**: "VideoMamba: State Space Model for Efficient Video Understanding", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.06977)][[PyTorch](https://github.com/OpenGVLab/VideoMamba)]
* **Video-Mamba-Suite**: "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.09626)][[PyTorch](https://github.com/OpenGVLab/video-mamba-suite)]
* Depth:
* **Trear**: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (*Tianjin University*). [[Paper](https://ieeexplore.ieee.org/document/9312201)]
* Pose/Skeleton:
* **ST-TR**: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (*Polytechnic University of Milan*). [[Paper](https://arxiv.org/abs/2012.06399)]
* **AcT**: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2107.00606)][[Code (in construction)](https://github.com/FedericoAngelini/MPOSE2021_Dataset)]
* **STAR**: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (*UCLA*). [[Paper](https://arxiv.org/abs/2107.07089)]
* **GCsT**: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2109.02860)]
* **GL-Transformer**: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (*Seoul National University*). [[Paper](https://arxiv.org/abs/2207.06101)][[PyTorch](https://github.com/Boeun-Kim/GL-Transformer)]
* **?**: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (*University of Delaware*). [[Paper](https://arxiv.org/abs/2208.01161)]
* **FG-STFormer**: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (*Zhengzhou University*). [[Paper](https://arxiv.org/abs/2210.02693)]
* **STTFormer**: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (*Xidian University*). [[Paper](https://arxiv.org/abs/2201.02849)][[Code (in construction)](https://github.com/heleiqiu/STTFormer)]
* **ProFormer**: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2202.11423)][[PyTorch](https://github.com/KPeng9510/ProFormer)]
* **?**: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2206.15002)]
* **HyperSA**: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 (*University of Mannheim, Germany*). [[Paper](https://arxiv.org/abs/2211.09590)]
* **STAR-Transformer**: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (*Keimyung University, Korea*). [[Paper](https://arxiv.org/abs/2210.07503)]
* **STMT**: "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2303.18177)][[Code (in construction)](https://github.com/zgzxy001/STMT)]
* **SkeletonMAE**: "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2307.08476)][[Code (in construction)](https://github.com/HongYan1123/SkeletonMAE)]
* **MAMP**: "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2308.07092)][[PyTorch](https://github.com/maoyunyao/MAMP)]
* **LAC**: "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 (*INRIA*). [[Paper](https://arxiv.org/abs/2308.14500)][[Website](https://walker1126.github.io/LAC/)]
* **SkeleTR**: "SkeleTR: Towards Skeleton-based Action Recognition in the Wild", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2309.11445)]
* **PCM3**: "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2308.03975)][[Website](https://jhang2020.github.io/Projects/PCM3/PCM3.html)]
* **PoseAwareVT**: "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2306.09331)][[PyTorch](https://github.com/dominickrei/PoseAwareVT)]
* **HandFormer**: "On the Utility of 3D Hand Poses for Action Recognition", arXiv, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2403.09805)][[Code (in construction)](https://github.com/s-shamil/HandFormer)][[Website](https://s-shamil.github.io/HandFormer/)]
* **SkateFormer**: "SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition", arXiv, 2024 (*KAIST*). [[Paper](https://arxiv.org/abs/2403.09508)][[Code (in construction)](https://github.com/KAIST-VICLab/SkateFormer)][[Website](https://jeonghyeokdo.github.io/SkateFormer_site/)]
* Multi-modal:
* **MBT**: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2107.00135)]
* **MM-ViT**: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (*OPPO*). [[Paper](https://arxiv.org/abs/2108.09322)]
* **MMT-NCRC**: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (*UCF*). [[Paper](https://arxiv.org/abs/2204.04564)][[Code (in construction)](https://github.com/Momilijaz96/MMT_for_NCRC)]
* **M&M**: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2206.09852)]
* **VT-CE**: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2208.01897)]
* **Hi-TRS**: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (*Rutgers*). [[Paper](https://arxiv.org/abs/2207.09644)][[PyTorch](https://github.com/yuxiaochen1103/Hi-TRS)]
* **MVFT**: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2202.12949)]
* **MOV**: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2207.07646)]
* **3Mformer**: "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 (*ANU*). [[Paper](https://arxiv.org/abs/2303.14474)]
* **UMT**: "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2305.01233)]
* **?**: "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 (*KU Leuven*). [[Paper](https://arxiv.org/abs/2307.07483)]
* **MotionBERT**: "MotionBERT: Unified Pretraining for Human Motion Analysis", ICCV, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2210.06551)][[PyTorch](https://github.com/Walter0807/MotionBERT)][[Website](https://motionbert.github.io/)]
* **TIM**: "TIM: A Time Interval Machine for Audio-Visual Action Recognition", CVPR, 2024 (*University of Bristol + Oxford*). [[Paper](https://arxiv.org/abs/2404.05559)][[PyTorch](https://github.com/JacobChalk/TIM)][[Website](https://jacobchalk.github.io/TIM-Project/)]
* Group Activity:
* **GroupFormer**: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2108.12630)]
* **?**: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (*Hitachi*). [[Paper](https://arxiv.org/abs/2207.05254)]
* **GAFL**: "Learning Group Activity Features Through Person Attribute Prediction", CVPR, 2024 (*Toyota Technological Institute, Japan*). [[Paper](https://arxiv.org/abs/2403.02753)]
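
Many of the video backbones listed above (e.g., TimeSformer, ViViT, and X-ViT) factorize attention over time and space instead of attending over all space-time tokens jointly. The PyTorch sketch below illustrates that divided space-time attention pattern; the block layout, dimensions, and hyper-parameters are illustrative assumptions, not any paper's reference implementation.

```python
# Minimal sketch of divided (factorized) space-time self-attention.
# All shapes and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention across frames, then spatial attention within each frame."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) tokens from a patch-embedded video clip
        b, t, p, d = x.shape

        # Temporal attention: each spatial location attends over its T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        nt = self.norm1(xt)
        xt = xt + self.temporal_attn(nt, nt, nt)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends over its P patches.
        xs = x.reshape(b * t, p, d)
        ns = self.norm2(xs)
        xs = xs + self.spatial_attn(ns, ns, ns)[0]
        x = xs.reshape(b, t, p, d)

        # Standard feed-forward block applied to every token.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    clip_tokens = torch.randn(2, 8, 196, 256)  # 2 clips, 8 frames, 14x14 patches
    print(DividedSpaceTimeBlock()(clip_tokens).shape)  # torch.Size([2, 8, 196, 256])
```

Compared with joint space-time attention over all T×P tokens, this factorization reduces the attention cost from O((TP)^2) to O(T^2·P + P^2·T) per block, which is the main reason it appears in so many of the entries above.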

[[Back to Overview](#overview)]

### Action Detection/Localization
* **OadTR**: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2106.11149)][[PyTorch](https://github.com/wangxiang1230/OadTR)]
* **RTD-Net**: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2102.01894)][[PyTorch](https://github.com/MCG-NJU/RTD-Action)]
* **FS-TAL**: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2110.10552)][[PyTorch](https://github.com/sauradip/fewshotQAT)]
* **LSTR**: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2107.03377)][[PyTorch](https://github.com/amazon-research/long-short-term-transformer)][[Website](https://xumingze0308.github.io/projects/lstr/)]
* **ATAG**: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2103.16024)]
* **TAPG-Transformer**: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2105.12043)]
* **TadTR**: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2106.10271)][[Code (in construction)](https://github.com/xlliu7/TadTR)]
* **Vidpress-Soccer**: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.14447)][[GitHub](https://github.com/baidu-research/vidpress-sports)]
* **MS-TCT**: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (*INRIA*). [[Paper](https://arxiv.org/abs/2112.03902)][[PyTorch](https://github.com/dairui01/MS-TCT)]
* **UGPT**: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (*Rensselaer Polytechnic Institute, NY*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Guo_Uncertainty-Guided_Probabilistic_Transformer_for_Complex_Action_Recognition_CVPR_2022_paper.html)]
* **TubeR**: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2104.00969)]
* **DDM-Net**: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2112.04771)][[PyTorch](https://github.com/MCG-NJU/DDM)]
* **?**: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2207.03038)][[PyTorch](https://github.com/GX77/Dual-Stream-Transformer-for-Generic-Event-Boundary-Captioning)]
* **?**: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (*Renmin University of China*). [[Paper](https://arxiv.org/abs/2208.05375)]
* **EAMAT**: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2205.05854)][[Code (in construction)](https://github.com/shuoyang129/EAMAT)]
* **STPT**: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2207.10448)]
* **TeSTra**: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2209.09236)][[PyTorch](https://github.com/zhaoyue-zephyrus/TeSTra)]
* **TALLFormer**: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (*UNC*). [[Paper](https://arxiv.org/abs/2204.01680)][[PyTorch](https://github.com/klauscc/TALLFormer)]
* **?**: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (*Rensselaer Polytechnic Institute, NY*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/2138_ECCV_2022_paper.php)]
* **ActionFormer**: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (*UW-Madison*). [[Paper](https://arxiv.org/abs/2202.07925)][[PyTorch](https://github.com/happyharrycn/actionformer_release)]
* **ActionFormer**: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 (*UW-Madison*). [[Paper](https://arxiv.org/abs/2211.09074)][[Pytorch](https://github.com/happyharrycn/actionformer_release)]
* **CoOadTR**: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (*Aarhus University, Denmark*). [[Paper](https://arxiv.org/abs/2201.06268)][[PyTorch](https://github.com/LukasHedegaard/continual-transformers)]
* **Temporal-Perceiver**: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2203.00307)]
* **LocATe**: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (*Stanford*). [[Paper](https://arxiv.org/abs/2203.10719)]
* **HTNet**: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2207.09662)]
* **AdaPerFormer**: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2208.11908)]
* **CWC-Trans**: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2208.14209)]
* **HIT**: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (*NTHU*). [[Paper](https://arxiv.org/abs/2210.12686)][[PyTorch](https://github.com/joslefaure/HIT)]
* **LART**: "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.01199)][[Website](http://people.eecs.berkeley.edu/~jathushan/LART/)]
* **TranS4mer**: "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 (*Comcast*). [[Paper](https://arxiv.org/abs/2212.14427)]
* **TTM**: "Token Turing Machines", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2211.09119)][[JAX](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing)]
* **?**: "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2303.17285)]
* **Self-DETR**: "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 (*Sungkyunkwan University, Korea*). [[Paper](https://arxiv.org/abs/2308.10570)]
* **UnLoc**: "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2308.11062)][[JAX](https://github.com/google-research/scenic)]
* **EVAD**: "Efficient Video Action Detection with Token Dropout and Context Refinement", ICCV, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2304.08451)][[PyTorch](https://github.com/MCG-NJU/EVAD)]
* **MS-DETR**: "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2305.18969)][[PyTorch](https://github.com/K-Nick/MS-DETR)]
* **STAR**: "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.12160)]
* **DiffTAD**: "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2303.14863)][[PyTorch (in construction)](https://github.com/sauradip/DiffusionTAD)]
* **MNA-ZBD**: "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 (*Renmin University of China*). [[Paper](https://arxiv.org/abs/2307.10567)]
* **PAT**: "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2308.05051)]
* **ViT-TAD**: "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos", arXiv, 2023 (*Nanjing University (NJU)*). [[Paper](https://arxiv.org/abs/2312.01897)]
* **Cafe**: "Towards More Practical Group Activity Detection: A New Benchmark and Model", arXiv, 2023 (*POSTECH*). [[Paper](https://arxiv.org/abs/2312.02878)][[PyTorch](https://github.com/dk-kim/CAFE_codebase)][[Website](https://dk-kim.github.io/CAFE/)]
* **?**: "Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization", arXiv, 2023 (*Queen Mary, UK*). [[Paper](https://arxiv.org/abs/2312.17686)]
* **SMAST**: "A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection", TPAMI, 2024 (*University of Virginia*). [[Paper](https://arxiv.org/abs/2405.08204)]
* **OV-STAD**: "Open-Vocabulary Spatio-Temporal Action Detection", arXiv, 2024 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2405.10832)]
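
Several of the online detection methods above (e.g., OadTR, LSTR, and TeSTra) classify the ongoing action from a bounded memory of past frame features. The PyTorch sketch below shows that streaming pattern in its simplest form; the buffer size, feature dimension, class count, and single cross-attention read are illustrative assumptions, not any paper's actual architecture.

```python
# Minimal sketch of memory-based online action detection: a learnable query
# attends over a sliding window of past frame features at every time step.
from collections import deque

import torch
import torch.nn as nn


class OnlineActionDetector(nn.Module):
    """Classifies the ongoing action at each step from a bounded memory of frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 21, memory_size: int = 64):
        super().__init__()
        self.memory = deque(maxlen=memory_size)            # sliding buffer of past frame features
        self.query = nn.Parameter(torch.zeros(1, 1, dim))  # learnable "current action" query
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    @torch.no_grad()
    def step(self, frame_feature: torch.Tensor) -> torch.Tensor:
        # frame_feature: (dim,) feature of the newest frame from any frozen backbone
        self.memory.append(frame_feature)
        mem = torch.stack(list(self.memory)).unsqueeze(0)  # (1, T<=memory_size, dim)
        attended, _ = self.read(self.query, mem, mem)      # the query reads the memory
        return self.classifier(attended.squeeze(1))        # (1, num_classes) per-frame logits


if __name__ == "__main__":
    detector = OnlineActionDetector()
    for _ in range(10):                                    # simulate a feature stream
        logits = detector.step(torch.randn(256))
    print(logits.shape)                                    # torch.Size([1, 21])
```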

[[Back to Overview](#overview)]

### Action Prediction/Anticipation
* **AVT**: "Anticipative Video Transformer", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.02036)][[PyTorch](https://github.com/facebookresearch/AVT)][[Website](https://facebookresearch.github.io/AVT/)]
* **TTPP**: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2003.03530)]
* **HORST**: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2104.08665)][[PyTorch](https://github.com/CorcovadoMing/HORST)]
* **?**: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2107.08579)]
* **FUTR**: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (*POSTECH*). [[Paper](https://arxiv.org/abs/2205.14022)]
* **VPTR**: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (*Polytechnique Montreal, Canada*). [[Paper](https://arxiv.org/abs/2203.15836)][[PyTorch](https://github.com/XiYe20/VPTR)]
* **Earthformer**: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2207.05833)]
* **InAViT**: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2211.14154)]
* **VPTR**: "Video Prediction by Efficient Transformers", IVC, 2022 (*Polytechnique Montreal, Canada*). [[Paper](https://arxiv.org/abs/2212.06026)][[Pytorch](https://github.com/XiYe20/VPTR)]
* **AFFT**: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2210.12649)][[Code (in construction)](https://github.com/zeyun-zhong/AFFT)]
* **GliTr**: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 (*McGill University, Canada*). [[Paper](https://arxiv.org/abs/2210.13605)]
* **RAFTformer**: "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 (*Honda*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Girase_Latency_Matters_Real-Time_Action_Forecasting_Transformer_CVPR_2023_paper.html)]
* **AdamsFormer**: "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 (*Honda*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Chi_AdamsFormer_for_Spatial_Action_Localization_in_the_Future_CVPR_2023_paper.html)]
* **TemPr**: "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 (*University of Bristol*). [[Paper](https://arxiv.org/abs/2204.13340)][[PyTorch](https://github.com/alexandrosstergiou/progressive-action-prediction)][[Website](https://alexandrosstergiou.github.io/project_pages/TemPr/index.html)]
* **MAT**: "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2308.07893)][[PyTorch](https://github.com/Echo0125/Memory-and-Anticipation-Transformer)]
* **SwinLSTM**: "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 (*Hainan University*). [[Paper](https://arxiv.org/abs/2308.09891)][[PyTorch](https://github.com/SongTang-x/SwinLSTM)]
* **MVP**: "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 (*Boston*). [[Paper](https://arxiv.org/abs/2307.12854)]
* **DiffAnt**: "DiffAnt: Diffusion Models for Action Anticipation", arXiv, 2023 (*Karlsruhe Institute of Technology (KIT), Germany*). [[Paper](https://arxiv.org/abs/2311.15991)]
* **LALM**: "LALM: Long-Term Action Anticipation with Language Models", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2311.17944)]
* **?**: "Learning from One Continuous Video Stream", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2312.00598)]
* **ObjectPrompt**: "Object-centric Video Representation for Long-term Action Anticipation", WACV, 2024 (*Honda*). [[Paper](https://arxiv.org/abs/2311.00180)][[Code (in construction)](https://github.com/brown-palm/ObjectPrompt)]

[[Back to Overview](#overview)]

### Video Object Segmentation
* **GC**: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (*Tencent*). [[Paper](https://arxiv.org/abs/2001.11243)]
* **SSTVOS**: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (*Modiface*). [[Paper](https://arxiv.org/abs/2101.08833)][[Code (in construction)](https://github.com/dukebw/SSTVOS)]
* **JOINT**: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2108.03679)][[PyTorch](https://github.com/maoyunyao/JOINT)]
* **AOT**: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2106.02638)][[PyTorch (yoxu515)](https://github.com/yoxu515/aot-benchmark)][[Code (in construction)](https://github.com/z-x-yang/AOT)]
* **TransVOS**: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2106.00588)]
* **SITVOS**: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2112.13983)]
* **HODOR**: "Differentiable Soft-Masked Attention", CVPRW, 2022 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2206.00182)]
* **BATMAN**: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.01159)]
* **DeAOT**: "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2210.09782)][[PyTorch](https://github.com/z-x-yang/AOT)]
* **AOT**: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2203.11442)][[PyTorch](https://github.com/yoxu515/aot-benchmark)]
* **MED-VT**: "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 (*York University*). [[Paper](https://arxiv.org/abs/2304.05930)][[Website](https://rkyuca.github.io/medvt/)]
* **?**: "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 (*Shanghai Jiao Tong University (SJTU)*). [[Paper](https://arxiv.org/abs/2304.06211)]
* **Isomer**: "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2308.06693)][[PyTorch](https://github.com/DLUT-yyc/Isomer)]
* **SimVOS**: "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.09903)]
* **MITS**: "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2308.13266)][[PyTorch](https://github.com/yoxu515/MITS)]
* **VIPMT**: "Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation", ICCV, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2309.11160)][[Code (in construction)](https://github.com/nankepan/VIPMT)]
* **MOSE**: "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2302.01872)][[GitHub](https://github.com/henghuiding/MOSE-api)][[Website](https://henghuiding.github.io/MOSE/)]
* **LVOS**: "LVOS: A Benchmark for Long-term Video Object Segmentation", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2211.10181)][[GitHub](https://github.com/LingyiHongfd/LVOS)][[Website](https://lingyihongfd.github.io/lvos.github.io/)]
* **JointFormer**: "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2308.13505)]
* **PanoVOS**: "PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2309.12303)][[Code (in construction)](https://github.com/shilinyan99/PanoVOS)][[Website](https://shilinyan99.github.io/PanoVOS/index_pano.html)]
* **Cutie**: "Putting the Object Back into Video Object Segmentation", arXiv, 2023 (*UIUC*). [[Paper](https://arxiv.org/abs/2310.12982)][[PyTorch](https://github.com/hkchengrex/Cutie)][[Website](https://hkchengrex.com/Cutie/)]
* **M3T**: "M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking", arXiv, 2023 (*UBC*). [[Paper](https://arxiv.org/abs/2312.08514)]
* **?**: "Appearance-based Refinement for Object-Centric Motion Segmentation", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2312.11463)]
* **DATTT**: "Depth-aware Test-Time Training for Zero-shot Video Object Segmentation", CVPR, 2024 (*University of Macau*). [[Paper](https://arxiv.org/abs/2403.04258)][[PyTorch](https://github.com/NiFangBaAGe/DATTT)][[Website](https://nifangbaage.github.io/DATTT/)]
* **LLE-VOS**: "Event-assisted Low-Light Video Object Segmentation", CVPR, 2024 (*USTC*). [[Paper](https://arxiv.org/abs/2404.01945)]
* **Point-VOS**: "Point-VOS: Pointing Up Video Object Segmentation", arXiv, 2024 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2402.05917)][[Website](https://pointvos.github.io/)]
* **MAVOS**: "Efficient Video Object Segmentation via Modulated Cross-Attention Memory", arXiv, 2024 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2403.17937)][[Code (in construction)](https://github.com/Amshaker/MAVOS)]
* **STMA**: "Spatial-Temporal Multi-level Association for Video Object Segmentation", arXiv, 2024 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2404.06265)]
* **Flow-SAM**: "Moving Object Segmentation: All You Need Is SAM (and Flow)", arXiv, 2024 (*Oxford*). [[Paper](https://arxiv.org/abs/2404.12389)][[Website](https://www.robots.ox.ac.uk/~vgg/research/flowsam/)]
* **LVOSv2**: "LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation", arXiv, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2404.19326)][[GitHub](https://github.com/LingyiHongfd/LVOS)][[Website](https://lingyihongfd.github.io/lvos.github.io/)]
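
Most attention-based VOS methods above (e.g., the AOT/DeAOT and Cutie families) propagate masks by letting current-frame pixels read from a memory of past frames via attention. A minimal PyTorch sketch of such a memory read follows; the key/value dimensions and the single-head dot-product formulation are simplifying assumptions.

```python
# Minimal sketch of an attention-based memory read for mask propagation:
# current-frame pixels query a memory of past-frame keys/values.
import torch
import torch.nn.functional as F


def memory_read(query_key: torch.Tensor, memory_key: torch.Tensor, memory_value: torch.Tensor) -> torch.Tensor:
    """query_key: (HW, Ck); memory_key: (T*HW, Ck); memory_value: (T*HW, Cv)."""
    affinity = query_key @ memory_key.t() / query_key.shape[-1] ** 0.5  # (HW, T*HW) similarity
    weights = F.softmax(affinity, dim=-1)                               # soft space-time correspondence
    return weights @ memory_value                                       # (HW, Cv) propagated mask features


if __name__ == "__main__":
    hw, ck, cv, t = 24 * 24, 64, 128, 4
    readout = memory_read(torch.randn(hw, ck), torch.randn(t * hw, ck), torch.randn(t * hw, cv))
    print(readout.shape)  # torch.Size([576, 128]) -> decoded into the current-frame mask
```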

[[Back to Overview](#overview)]

### Video Instance Segmentation
* **VisTR**: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (*Meituan*). [[Paper](https://arxiv.org/abs/2011.14503)][[PyTorch](https://github.com/Epiphqny/VisTR)]
* **IFC**: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2106.03299)][[PyTorch](https://github.com/sukjunhwang/IFC)]
* **Deformable-VisTR**: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (*University at Buffalo*). [[Paper](https://arxiv.org/abs/2203.06318)][[Code (in construction)](https://github.com/skrya/DefVIS)]
* **TeViT**: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2204.08412)][[PyTorch](https://github.com/hustvl/TeViT)]
* **GMP-VIS**: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (*Shandong University*). [[Paper](https://paperswithcode.com/paper/a-graph-matching-perspective-with)]
* **VMT**: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2207.14012)][[GitHub](https://github.com/SysCV/vmt)][[Website](https://www.vis.xyz/pub/vmt/)]
* **SeqFormer**: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2112.08275)][[PyTorch](https://github.com/wjf5203/SeqFormer)]
* **MS-STS**: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2203.13253)][[PyTorch](https://github.com/OmkarThawakar/MSSTS-VIS)]
* **MinVIS**: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2208.02245)][[PyTorch](https://github.com/NVlabs/MinVIS)]
* **VITA**: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2206.04403)][[PyTorch](https://github.com/sukjunhwang/VITA)]
* **IFR**: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.07011)]
* **DeVIS**: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2207.11103)][[PyTorch](https://github.com/acaelles97/DeVIS)]
* **InstanceFormer**: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (*Ludwig Maximilian University of Munich*). [[Paper](https://arxiv.org/abs/2208.10547)][[Code (in construction)](https://github.com/rajatkoner08/InstanceFormer)]
* **MaskFreeVIS**: "Mask-Free Video Instance Segmentation", CVPR, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2303.15904)][[PyTorch](https://github.com/SysCV/MaskFreeVis)]
* **MDQE**: "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 (*Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2303.14395)][[PyTorch](https://github.com/MinghanLi/MDQE_CVPR2023)]
* **GenVIS**: "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 (*Yonsei*). [[Paper](https://arxiv.org/abs/2211.08834)][[PyTorch](https://github.com/miranheo/GenVIS)]
* **CTVIS**: "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2307.12616)][[PyTorch](https://github.com/KainingYing/CTVIS)]
* **TCOVIS**: "TCOVIS: Temporally Consistent Online Video Instance Segmentation", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2309.11857)][[Code (in construction)](https://github.com/jun-long-li/TCOVIS)]
* **DVIS**: "DVIS: Decoupled Video Instance Segmentation Framework", ICCV, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2306.03413)][[PyTorch](https://github.com/zhang-tao-whu/DVIS)]
* **TMT-VIS**: "TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation", NeurIPS, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2312.06630)][[Code (in construction)](https://github.com/rkzheng99/TMT-VIS)]
* **BoxVIS**: "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 (*Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2303.14618)][[Code (in construction)](https://github.com/MinghanLi/BoxVIS)]
* **OW-VISFormer**: "Video Instance Segmentation in an Open-World", arXiv, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2304.01200)][[Code (in construction)](https://github.com/OmkarThawakar/OWVISFormer)]
* **GRAtt-VIS**: "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 (*LMU Munich*). [[Paper](https://arxiv.org/abs/2305.17096)][[Code (in construction)](https://github.com/Tanveer81/GRAttVIS)]
* **RefineVIS**: "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2306.04774)]
* **VideoCutLER**: "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2308.14710)][[PyTorch](https://github.com/facebookresearch/CutLER/tree/main/videocutler)]
* **NOVIS**: "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 (*TUM*). [[Paper](https://arxiv.org/abs/2308.15266)]
* **VISAGE**: "VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement", arXiv, 2023 (*Yonsei*). [[Paper](https://arxiv.org/abs/2312.04885)][[Code (in construction)](https://github.com/KimHanjung/VISAGE)]
* **OW-VISCap**: "OW-VISCap: Open-World Video Instance Segmentation and Captioning", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2404.03657)][[Website](https://anwesachoudhuri.github.io/OpenWorldVISCap/)]
* **PointVIS**: "What is Point Supervision Worth in Video Instance Segmentation?", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2404.01990)]

[[Back to Overview](#overview)]

### Other Video Tasks
* Action Segmentation
* **ASFormer**: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (*Peking University*). [[Paper](https://arxiv.org/abs/2110.08568)][[PyTorch](https://github.com/ChinaYi/ASFormer)]
* **Bridge-Prompt**: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2203.14104)][[PyTorch](https://github.com/ttlmh/Bridge-Prompt)]
* **SC-Transformer++**: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2206.12634)][[Code (in construction)](https://github.com/lufficc/SC-Transformer)]
* **UVAST**: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (*Bosch*). [[Paper](https://arxiv.org/abs/2209.00638)][[PyTorch](https://github.com/boschresearch/UVAST)]
* **?**: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2201.05675)]
* **CETNet**: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (*Shijiazhuang Tiedao University*). [[Paper](https://arxiv.org/abs/2205.09445)]
* **EUT**: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2205.13425)]
* **SC-Transformer**: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2206.02985)]
* **DXFormer**: "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2305.11365)][[Website (in construction)](https://www.sail-nu.com/dxformer)]
* **LTContext**: "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 (*University of Bonn*). [[Paper](https://arxiv.org/abs/2308.11358)][[PyTorch](https://github.com/LTContext/LTContext)]
* **DiffAct**: "Diffusion Action Segmentation", ICCV, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2303.17959)][[PyTorch](https://github.com/Finspire13/DiffAct)]
* **TST**: "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 (*Shanghai Tech*). [[Paper](https://arxiv.org/abs/2302.13074)]
* Video X Segmentation:
* **STT**: "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 (*Shanghai Jiao Tong*). [[Paper](https://dl.acm.org/doi/abs/10.1145/3474085.3475409)]
* **CFFM**: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (*ETH Zurich*). [[Paper](https://arxiv.org/abs/2204.03330)][[PyTorch](https://github.com/GuoleiSun/VSS-CFFM)]
* **TF-DL**: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2205.15361)]
* **Video-K-Net**: "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2204.04656)][[PyTorch](https://github.com/lxtGH/Video-K-Net)]
* **MRCFA**: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (*ETH Zurich*). [[Paper](https://arxiv.org/pdf/2207.10436)][[PyTorch](https://github.com/GuoleiSun/VSS-MRCFA)]
* **PolyphonicFormer**: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2112.02582)][[Code (in construction)](https://github.com/HarborYuan/PolyphonicFormer)]
* **?**: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (*Technical University of Cluj-Napoca, Romania*). [[Paper](https://arxiv.org/abs/2210.03546)]
* **CAROQ**: "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 (*UIUC*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Choudhuri_Context-Aware_Relative_Object_Queries_To_Unify_Video_Instance_and_Panoptic_CVPR_2023_paper.html)][[PyTorch](https://github.com/AnwesaChoudhuri/CAROQ)][[Website](https://anwesachoudhuri.github.io/ContextAwareRelativeObjectQueries/)]
* **TarViS**: "TarViS: A Unified Approach for Target-based Video Segmentation", CVPR, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2301.02657)][[PyTorch](https://github.com/Ali2500/TarViS)]
* **MEGA**: "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2308.11185)]
* **DEVA**: "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 (*UIUC*). [[Paper](https://arxiv.org/abs/2309.03903)][[PyTorch](https://github.com/hkchengrex/Tracking-Anything-with-DEVA)][[Website](https://hkchengrex.com/Tracking-Anything-with-DEVA/)]
* **Tube-Link**: "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2303.12782)][[PyTorch](https://github.com/lxtGH/Tube-Link)]
* **THE-Mask**: "Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation", BMVC, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2309.08020)][[Code (in construction)](https://github.com/ZhaochongAn/THE-Mask)]
* **MPVSS**: "Mask Propagation for Efficient Video Semantic Segmentation", NeurIPS, 2023 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2310.18954)][[Code (in construction)](https://github.com/ziplab/MPVSS)]
* **Video-kMaX**: "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.04694)]
* **SAM-PT**: "Segment Anything Meets Point Tracking", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2307.01197)][[Code (in construction)](https://github.com/SysCV/sam-pt)]
* **TTT-MAE**: "Test-Time Training on Video Streams", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2307.05014)][[Website](https://video-ttt.github.io/)]
* **UniVS**: "UniVS: Unified and Universal Video Segmentation with Prompts as Queries", CVPR, 2024 (*OPPO*). [[Paper](https://arxiv.org/abs/2402.18115)][[PyTorch](https://github.com/MinghanLi/UniVS)][[Website](https://sites.google.com/view/unified-video-seg-univs)]
* **DVIS++**: "DVIS++: Improved Decoupled Framework for Universal Video Segmentation", arXiv, 2024 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2312.13305)][[PyTorch](https://github.com/zhang-tao-whu/DVIS_Plus)]
* **SAM-PD**: "SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising", arXiv, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2403.04194)][[PyTorch (in construction)](https://github.com/infZhou/SAM-PD)]
* **OneVOS**: "OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework", arXiv, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2403.08682)]
* Video Object Detection:
* **TransVOD**: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (*Shanghai Jiao Tong + SenseTime*). [[Paper](https://arxiv.org/abs/2105.10920)][[Code (in construction)](https://github.com/SJTU-LuHe/TransVOD)]
* **MODETR**: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (*Valeo, Egypt*). [[Paper](https://arxiv.org/abs/2106.11422)]
* **ST-MTL**: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (*Valeo, Egypt*). [[Paper](https://arxiv.org/abs/2106.11401)]
* **ST-DETR**: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (*Valeo, Egypt*). [[Paper](https://arxiv.org/abs/2107.05887)]
* **PTSEFormer**: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://arxiv.org/abs/2209.02242)][[PyTorch](https://github.com/Hon-Wong/PTSEFormer)]
* **TransVOD**: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (*Shanghai Jiao Tong + SenseTime*). [[Paper](https://arxiv.org/abs/2201.05047)]
* **?**: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (*Zenseact, Sweden*). [[Paper](https://arxiv.org/abs/2204.10321)]
* **ClipVID**: "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 (*University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2308.07737)][[Code (in construction)](https://github.com/bladewaltz1/clipvid)]
* **OCL**: "Unsupervised Open-Vocabulary Object Localization in Videos", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2309.09858)]
* **CETR**: "Context Enhanced Transformer for Single Image Object Detection", AAAI, 2024 (*Korea University*). [[Paper](https://arxiv.org/abs/2312.14492)][[Code (in construction)](https://github.com/KU-CVLAB/CETR)][[Website](https://ku-cvlab.github.io/CETR/)]
* Dense Video Tasks (Detection + Segmentation):
* **TDViT**: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (*Queen's University Belfast, UK*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/5559_ECCV_2022_paper.php)][[Code (in construction)](https://github.com/guanxiongsun/TDViT)]
* **FAQ**: "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 (*UCF*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Cui_Feature_Aggregated_Queries_for_Transformer-Based_Video_Object_Detectors_CVPR_2023_paper.html)][[PyTorch](https://github.com/YimingCuiCuiCui/FAQ)]
* **Video-OWL-ViT**: "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2308.11093)]
* Video Retrieval:
* **SVRTN**: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2104.07993)]
* Video Hashing:
* **BTH**: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (*Tsinghua*). [[Paper](https://openaccess.thecvf.com/content/CVPR2021/html/Li_Self-Supervised_Video_Hashing_via_Bidirectional_Transformers_CVPR_2021_paper.html)][[PyTorch](https://github.com/Lily1994/BTH)]
* Video-Language:
* **ActionCLIP**: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2109.08472)][[PyTorch](https://github.com/sallymmx/ActionCLIP)]
* **?**: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (*Shanghai Jiao Tong + Oxford*). [[Paper](https://arxiv.org/abs/2112.04478)][[PyTorch](https://github.com/ju-chen/Efficient-Prompt)][[Website](https://ju-chen.github.io/efficient-prompt/)]
* **X-CLIP**: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.02816)][[PyTorch](https://github.com/microsoft/VideoX/tree/master/X-CLIP)]
* **EVL**: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2208.03550)][[PyTorch (in construction)](https://github.com/OpenGVLab/efficient-video-recognition)]
* **STALE**: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2207.08184)][[Code (in construction)](https://github.com/sauradip/STALE)]
* **?**: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 (*Beijing Laboratory of Intelligent Information Technology*). [[Paper](https://arxiv.org/abs/2211.12030)]
* **VLG**: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2212.01638)]
* **InternVideo**: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2212.03191)][[Code (in construction)](https://github.com/OpenGVLab/InternVideo)][[Website](https://opengvlab.shlab.org.cn/home)]
* **PromptonomyViT**: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 (*Tel Aviv + IBM*). [[Paper](https://arxiv.org/abs/2212.04821)]
* **MUPPET**: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2211.14905)][[Code (in construction)](https://github.com/sauradip/MUPPET)]
* **MovieCLIP**: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (*USC*). [[Paper](https://arxiv.org/abs/2210.11065)][[Website](https://sail.usc.edu/~mica/MovieCLIP/)]
* **TranZAD**: "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 (*UC Riverside*). [[Paper](https://openaccess.thecvf.com/content/WACV2023/html/Nag_Semantics_Guided_Contrastive_Learning_of_Transformers_for_Zero-Shot_Temporal_Activity_WACV_2023_paper.html)]
* **Text4Vis**: "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2207.01297)][[PyTorch](https://github.com/whwu95/Text4Vis)]
* **AIM**: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2302.03024)][[PyTorch](https://github.com/taoyang1122/adapt-image-models)][[Website](https://adapt-image-models.github.io/)]
* **ViFi-CLIP**: "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2212.03640)][[PyTorch](https://github.com/muzairkhattak/ViFi-CLIP)]
* **LaViLa**: "Learning Video Representations from Large Language Models", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2212.04501)][[PyTorch](https://github.com/facebookresearch/LaViLa)][[Website](https://facebookresearch.github.io/LaViLa/)]
* **TVP**: "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 (*Intel*). [[Paper](https://arxiv.org/abs/2303.04995)]
* **Vita-CLIP**: "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2304.03307)][[PyTorch](https://github.com/TalalWasim/Vita-CLIP)]
* **STAN**: "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2301.11116)][[PyTorch](https://github.com/farewellthree/STAN)]
* **CBP-VLP**: "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2212.09335)]
* **BIKE**: "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2301.00182)][[PyTorch](https://github.com/whwu95/BIKE)]
* **HierVL**: "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.02311)][[PyTorch](https://github.com/facebookresearch/HierVL)]
* **?**: "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2301.02074)][[PyTorch](https://github.com/bpiyush/TestOfTime)][[Website](https://bpiyush.github.io/testoftime-website/index.html)]
* **Open-VCLIP**: "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2302.00624)][[PyTorch](https://github.com/wengzejia1/Open-VCLIP)]
* **ILA**: "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2304.10465)][[PyTorch](https://github.com/Francis-Rings/ILA)]
* **OV2Seg**: "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2304.01715)][[PyTorch](https://github.com/haochenheheda/LVVIS)]
* **DiST**: "Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2309.07911)][[PyTorch](https://github.com/alibaba-mmai-research/DiST)]
* **GAP**: "Generative Action Description Prompts for Skeleton-based Action Recognition", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2208.05318)][[PyTorch](https://github.com/MartinXM/GAP)]
* **MAXI**: "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", ICCV, 2023 (*Graz University of Technology, Austria*). [[Paper](https://arxiv.org/abs/2303.08914)][[PyTorch](https://github.com/wlin-at/MAXI)]
* **?**: "Language as the Medium: Multimodal Video Classification through text only", ICCVW, 2023 (*Unitary, UK*). [[Paper](https://arxiv.org/abs/2309.10783)]
* **MAP**: "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2308.04828)]
* **OTI**: "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2308.06897)][[Code (in construction)](https://github.com/sweetorangezhuyan/mm2023_oti)]
* **Symbol-LLM**: "Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning", NeurIPS, 2023 (*Shanghai Jiao Tong University (SJTU)*). [[Paper](https://arxiv.org/abs/2311.17365)][[Code (in construction)](https://github.com/enlighten0707/Symbol-LLM)][[Website](https://mvig-rhos.com/symbol_llm)]
* **OAP-AOP**: "Opening the Vocabulary of Egocentric Actions", NeurIPS, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2308.11488)][[PyTorch (in construction)](https://github.com/dibschat/openvocab-egoAR)][[Website](https://dibschat.github.io/openvocab-egoAR/)]
* **CLIP-FSAR**: "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2303.02982)][[PyTorch](https://github.com/alibaba-mmai-research/CLIP-FSAR)]
* **?**: "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2303.11732)]
* **VicTR**: "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.02560)]
* **OpenVIS**: "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2305.16835)]
* **ALGO**: "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 (*Oklahoma State University*). [[Paper](https://arxiv.org/abs/2305.16602)]
* **?**: "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.10596)]
* **MSQNet**: "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2307.10763)][[Code (in construction)](https://github.com/mondalanindya/MSQNet)]
* **AVION**: "Training a Large Video Model on a Single Machine in a Day", arXiv, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2309.16669)][[PyTorch](https://github.com/zhaoyue-zephyrus/AVION)]
* **Open-VCLIP**: "Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2310.05010)][[PyTorch](https://github.com/wengzejia1/Open-VCLIP)]
* **Videoprompter**: "Videoprompter: an ensemble of foundational models for zero-shot video understanding", arXiv, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2310.15324)]
* **MM-VID**: "MM-VID: Advancing Video Understanding with GPT-4V(vision)", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2310.19773)][[Website](https://multimodal-vid.github.io/)]
* **Chat-UniVi**: "Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2311.08046)]
* **Side4Video**: "Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2311.15769)][[Code (in construction)](https://github.com/HJYao00/Side4Video)]
* **ALT**: "Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2311.15619)]
* **MM-Narrator**: "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2311.17435)][[Website](https://mm-narrator.github.io/)]
* **Spacewalk-18**: "Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains", arXiv, 2023 (*Brown*). [[Paper](https://arxiv.org/abs/2311.18773)][[Website](https://brown-palm.github.io/Spacewalk-18/)]
* **OST**: "OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition", arXiv, 2023 (*Hunan University (HNU)*). [[Paper](https://arxiv.org/abs/2312.00096)][[Code (in construction)](https://github.com/tomchen-ctj/OST)][[Website](https://tomchen-ctj.github.io/OST/)]
* **AP-CLIP**: "Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition", arXiv, 2023 (*Xi'an Jiaotong*). [[Paper](https://arxiv.org/abs/2312.02226)]
* **EZ-CLIP**: "EZ-CLIP: Efficient Zeroshot Video Action Recognition", arXiv, 2023 (*Østfold University College, Norway*). [[Paper](https://arxiv.org/abs/2312.08010)][[PyTorch (in construction)](https://github.com/Shahzadnit/EZ-CLIP)]
* **M2-CLIP**: "M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition", AAAI, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2401.11649)]
* **FROSTER**: "FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition", ICLR, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2402.03241)][[PyTorch](https://github.com/Visual-AI/FROSTER)][[Website](https://visual-ai.github.io/froster/)]
* **LaIAR**: "Language Model Guided Interpretable Video Action Reasoning", CVPR, 2024 (*Xidian University*). [[Paper](https://arxiv.org/abs/2404.01591)][[Code (in construction)](https://github.com/NingWang2049/LaIAR)]
* **BriVIS**: "Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation", arXiv, 2024 (*Peking*). [[Paper](https://arxiv.org/abs/2401.09732)][[PyTorch (in construction)](https://github.com/sennnnn/OpenVIS)]
* **ActionHub**: "ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition", arXiv, 2024 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2401.11654)]
* **ZERO**: "Zero Shot Open-ended Video Inference", arXiv, 2024 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2401.12471)]
* **SATA**: "Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition", arXiv, 2024 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2403.01560)][[Code (in construction)](https://github.com/KunyuLin/XOV-Action/)]
* **CLIP-VIS**: "CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.12455)][[PyTorch](https://github.com/zwq456/CLIP-VIS)]
* X-supervised Learning:
* **LSTCL**: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.09212)]
* **SVT**: "Self-supervised Video Transformer", CVPR, 2022 (*Stony Brook*). [[Paper](https://arxiv.org/abs/2112.01514)][[PyTorch](https://github.com/kahnchana/svt)][[Website](https://kahnchana.github.io/svt/)]
* **BEVT**: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2112.01529)][[PyTorch](https://github.com/xyzforever/BEVT)]
* **SCVRL**: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2205.11710)]
* **VIMPAC**: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (*UNC*). [[Paper](https://arxiv.org/abs/2106.11250)][[PyTorch](https://github.com/airsplay/vimpac)]
* **?**: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.12795)]
* **VideoMAE**: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2203.12602)][[PyTorch](https://github.com/MCG-NJU/VideoMAE)]
* **MAE-ST**: "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2205.09113)][[PyTorch](https://github.com/facebookresearch/mae_st)]
* **?**: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2209.07474)]
* **MaskViT**: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2206.11894)][[Code (in construction)](https://github.com/agrimgupta92/maskvit)][[Website](https://maskedvit.github.io/)]
* **WeakSVR**: "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2303.12370)][[PyTorch](https://github.com/svip-lab/WeakSVR)]
* **VideoMAE-V2**: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.16727)][[PyTorch](https://github.com/OpenGVLab/VideoMAEv2)]
* **SVFormer**: "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2211.13222)][[PyTorch](https://github.com/ChenHsing/SVFormer)]
* **OmniMAE**: "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2206.08356)][[PyTorch](https://github.com/facebookresearch/omnivore)]
* **MVD**: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2212.04500)][[PyTorch](https://github.com/ruiwang2021/mvd)]
* **MME**: "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2210.06096)][[PyTorch](https://github.com/XinyuSun/MME)]
* **MGMAE**: "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2308.10794)]
* **MGM**: "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2308.12962)]
* **TimeT**: "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", ICCV, 2023 (*UvA*). [[Paper](https://arxiv.org/abs/2308.11796)][[PyTorch](https://github.com/SMSD75/Timetuning)]
* **LSS**: "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", NeurIPS, 2023 (*Stony Brook*). [[Paper](https://arxiv.org/abs/2307.10922)]
* **VITO**: "Self-supervised video pretraining yields human-aligned visual representations", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2210.06433)]
* **SiamMAE**: "Siamese Masked Autoencoders", NeurIPS, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2305.14344)][[Website](https://siam-mae-video.github.io/)]
* **ViC-MAE**: "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 (*Rice University*). [[Paper](https://arxiv.org/abs/2303.12001)]
* **LSTA**: "Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation", arXiv, 2023 (*Hangzhou Dianzi University*). [[Paper](https://arxiv.org/abs/2309.11707)]
* **DoRA**: "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video", arXiv, 2023 (*INRIA*). [[Paper](https://arxiv.org/abs/2310.08584)]
* **AMD**: "Asymmetric Masked Distillation for Pre-Training Small Foundation Models", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2311.03149)]
* **SSL-UVOS**: "Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2311.17893)]
* **NMS**: "No More Shortcuts: Realizing the Potential of Temporal Self-Supervision", AAAI, 2024 (*Adobe*). [[Paper](https://arxiv.org/abs/2312.13008)][[Website](https://daveishan.github.io/nms-webpage/)]
* **VideoMAC**: "VideoMAC: Video Masked Autoencoders Meet ConvNets", CVPR, 2024 (*Nanjing University of Science and Technology*). [[Paper](https://arxiv.org/abs/2402.19082)][[PyTorch](https://github.com/NUST-Machine-Intelligence-Laboratory/VideoMAC)]
* **GPM**: "Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention", arXiv, 2024 (*HKUST*). [[Paper](https://arxiv.org/abs/2401.13937)]
* **MV2MAE**: "MV2MAE: Multi-View Video Masked Autoencoders", arXiv, 2024 (*Amazon*). [[Paper](https://arxiv.org/abs/2401.15900)]
* **V-JEPA**: "Revisiting Feature Prediction for Learning Visual Representations from Video", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2404.08471)][[PyTorch](https://github.com/facebookresearch/jepa)][[Website](https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/)]
* Transfer Learning/Adaptation:
* **APT**: "Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling", FG, 2024 (*JHU*). [[Paper](https://arxiv.org/abs/2403.06978)][[PyTorch](https://github.com/wgcban/apt)]
* X-shot:
* **ResT**: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2205.01657)]
* **ViSET**: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (*University of South Florida*). [[Paper](https://arxiv.org/abs/2203.05156)]
* **REST**: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2209.15000)]
* **MoLo**: "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2304.00946)][[Code (in construction)](https://github.com/alibaba-mmai-research/MoLo)]
* **MA-CLIP**: "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.01532)]
* **SA-CT**: "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.07119)]
* **CapFSAR**: "Few-shot Action Recognition with Captioning Foundation Models", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2310.10125)]
* Multi-Task:
* **EgoPack**: "A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives", CVPR, 2024 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2403.03037)][[PyTorch (in construction)](https://github.com/sapeirone/EgoPack)][[Website](https://sapeirone.github.io/EgoPack/)]
* Anomaly Detection:
* **CT-D2GAN**: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (*NEC*). [[Paper](https://arxiv.org/abs/2107.13720)]
* **ADTR**: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", ICONIP, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://arxiv.org/abs/2209.01816)]
* **SSMCTB**: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (*UCF*). [[Paper](https://arxiv.org/abs/2209.12148)][[Code (in construction)](https://github.com/ristea/ssmctb)]
* **?**: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2206.08568)]
* **CLIP-TSA**: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", ICIP, 2023 (*University of Arkansas*). [[Paper](https://arxiv.org/abs/2212.05136)]
* **?**: "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 (*Konica Minolta, Japan*). [[Paper](https://arxiv.org/abs/2303.15167)]
* **TPWNG**: "Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection", CVPR, 2024 (*Xidian University*). [[Paper](https://arxiv.org/abs/2404.08531)]
* Relation Detection:
* **VidVRD**: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2108.08669)][[PyTorch](https://github.com/Dawn-LX/VidVRD-tracklets)]
* **VRDFormer**: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (*Renmin University of China*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zheng_VRDFormer_End-to-End_Video_Visual_Relation_Detection_With_Transformers_CVPR_2022_paper.html)][[Code (in construction)](https://github.com/zhengsipeng/VRDFormer_VRD)]
* **VidSGG-BIG**: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2112.04222)][[PyTorch](https://github.com/Dawn-LX/VidSGG-BIG)]
* **RePro**: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2302.00268)][[PyTorch (in construction)](https://github.com/Dawn-LX/OpenVoc-VidVRD)]
* Saliency Prediction:
* **STSANet**: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (*Shanghai University*). [[Paper](https://arxiv.org/abs/2108.10696)]
* **UFO**: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2203.04708)][[PyTorch](https://github.com/suyukun666/UFO)]
* **DMT**: "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2305.00514)][[PyTorch](https://github.com/dragonlee258079/DMT)]
* **CASP-Net**: "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective", CVPR, 2023 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2303.06357)]
* Video Inpainting Detection:
* **FAST**: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (*Tsinghua University*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Yu_Frequency-Aware_Spatiotemporal_Transformers_for_Video_Inpainting_Detection_ICCV_2021_paper.html)]
* Driver Activity:
* **TransDARC**: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2203.00927)]
* **?**: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (*Jericho High School, NY*). [[Paper](https://arxiv.org/abs/2207.12148)]
* **ViT-DD**: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (*Purdue*). [[Paper](https://arxiv.org/abs/2209.09178)][[PyTorch (in construction)](https://github.com/PurdueDigitalTwin/ViT-DD)]
* Video Alignment:
* **DGWT**: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (*University of New South Wales, Australia*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0993.pdf)]
* Sport-related:
* **Skating-Mixer**: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2203.03990)]
* Action Counting:
* **TransRAC**: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2204.01018)][[PyTorch](https://github.com/SvipRepetitionCounting/TransRAC)][[Website](https://svip-lab.github.io/dataset/RepCount_dataset.html)]
* **PoseRAC**: "PoseRAC: Pose Saliency Transformer for Repetitive Action Counting", arXiv, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2303.08450)][[PyTorch](https://github.com/MiracleDance/PoseRAC)]
* Action Quality Assessment:
* **?**: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2207.09270)]
* **?**: "Action Quality Assessment using Transformers", arXiv, 2022 (*USC*). [[Paper](https://arxiv.org/abs/2207.12318)]
* Human Interaction:
* **IGFormer**: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (*The University of Melbourne*). [[Paper](https://arxiv.org/abs/2207.12100)]
* Cross-Domain:
* **UDAVT**: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (*University of Trento*). [[Paper](https://arxiv.org/abs/2207.12842)][[Code (in construction)](https://github.com/vturrisi/UDAVT)]
* **AutoLabel**: "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 (*University of Trento*). [[Paper](https://arxiv.org/abs/2304.01110)][[PyTorch](https://github.com/gzaraunitn/autolabel)]
* **DALL-V**: "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 (*University of Trento*). [[Paper](https://arxiv.org/abs/2308.09139)][[PyTorch](https://github.com/giaczara/dallv)]
* Multi-Camera Editing:
* **TC-Transformer**: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2210.08737)]
* Instructional/Procedural Video:
* **ProcedureVRL**: "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2303.17839)]
* **Paprika**: "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 (*Salesforce*). [[Paper](https://arxiv.org/abs/2303.18230)][[PyTorch](https://github.com/salesforce/paprika)]
* **StepFormer**: "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2304.13265)]
* **E3P**: "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2308.08885)]
* **VLaMP**: "Pretrained Language Models as Visual Planners for Human Assistance", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.09179)]
* **VINA**: "Learning to Ground Instructional Articles in Videos through Narrations", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2306.03802)][[Website](https://eval.ai/web/challenges/challenge-page/2082/overview)]
* **PREGO**: "PREGO: online mistake detection in PRocedural EGOcentric videos", CVPR, 2024 (*Sapienza University of Rome, Italy*). [[Paper](https://arxiv.org/abs/2404.01933)][[Code (in construction)](https://github.com/aleflabo/PREGO)]
* Continual Learning:
* **PIVOT**: "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 (*KAUST*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Villa_PIVOT_Prompting_for_Video_Continual_Learning_CVPR_2023_paper.html)]
* 3D:
* **MaST-Pre**: "Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos", ICCV, 2023 (*CloudWalk, China*). [[Paper](https://arxiv.org/abs/2308.09245)][[PyTorch](https://github.com/JohnsonSign/MaST-Pre)]
* **EPIC-Fields**: "EPIC Fields: Marrying 3D Geometry and Video Understanding", NeurIPS, 2023 (*Oxford + Bristol*). [[Paper](https://arxiv.org/abs/2306.08731)][[Website](https://epic-kitchens.github.io/epic-fields/)]
* Audio-Video:
* **AVGN**: "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2308.09322)]
* Event Camera:
* **EventTransAct**: "EventTransAct: A video transformer-based framework for Event-camera based action recognition", IROS, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2308.13711)][[PyTorch](https://github.com/tristandb8/EventTransAct)][[Website](https://tristandb8.github.io/EventTransAct_webpage/)]
* Long Video:
* **EgoSchema**: "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding", NeurIPS, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2308.09126)][[PyTorch](https://github.com/egoschema/EgoSchema)][[Website](https://egoschema.github.io/)]
* **KTS**: "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2309.11569)]
* **TCR**: "Text-Conditioned Resampler For Long Form Video Understanding", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.11897)]
* **MC-ViT**: "Memory Consolidation Enables Long-Context Video Understanding", arXiv, 2024 (*DeepMind*). [[Paper](https://arxiv.org/abs/2402.05861)]
* **VideoAgent**: "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", arXiv, 2024 (*Stanford*). [[Paper](https://arxiv.org/abs/2403.10517)]
* Video Story:
* **YouTube-News-Timeline**: "Video Timeline Modeling For News Story Understanding", NeurIPS (Datasets and Benchmarks), 2023 (*Google*). [[Paper](https://arxiv.org/abs/2309.13446)][[GitHub](https://github.com/google-research/google-research/tree/master/video_timeline_modeling)]
* Analysis:
* **VTCD**: "Understanding Video Transformers via Universal Concept Discovery", arXiv, 2024 (*Toyota*). [[Paper](https://arxiv.org/abs/2401.10831)][[Website](https://yorkucvil.github.io/VTCD/)]

[[Back to Overview](#overview)]

---

## References
* Online Resources:
* [Papers with Code](https://paperswithcode.com/methods/category/vision-transformer)
* [Transformer tutorial (Lucas Beyer)](http://lucasb.eyer.be/transformer)
* [CS25: Transformers United (Course @ Stanford)](https://web.stanford.edu/class/cs25/)
* [The Annotated Transformer (Blog)](http://nlp.seas.harvard.edu/annotated-transformer/)
* [3D Vision with Transformers (GitHub)](https://github.com/lahoud/3d-vision-transformers)
* [Networks Beyond Attention (GitHub)](https://github.com/FocalNet/Networks-Beyond-Attention)
* [Practical Introduction to Transformers (GitHub)](https://github.com/IbrahimSobh/Transformers)
* [Awesome Transformer Architecture Search (GitHub)](https://github.com/automl/awesome-transformer-search)
* [Transformer-in-Vision (GitHub)](https://github.com/DirtyHarryLYL/Transformer-in-Vision)
* [Awesome Visual-Transformer (GitHub)](https://github.com/dk-liang/Awesome-Visual-Transformer)
* [Awesome Transformer for Vision Resources List (GitHub)](https://github.com/lijiaman/awesome-transformer-for-vision)
* [Transformer-in-Computer-Vision (GitHub)](https://github.com/Yangzhangcst/Transformer-in-Computer-Vision)
* [Transformer Tutorial in ICASSP 2022](https://transformer-tutorial.github.io/icassp2022/)