Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, code, and related websites.
https://github.com/cmhungsteve/Awesome-Transformer-Attention

List: Awesome-Transformer-Attention

attention-mechanism attention-mechanisms awesome-list computer-vision deep-learning detr papers self-attention transformer transformer-architecture transformer-awesome transformer-cv transformer-models transformer-with-cv transformers vision-transformer visual-transformer vit

README

# Ultimate-Awesome-Transformer-Attention [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

This repo contains a comprehensive paper list of **Vision Transformer & Attention**, including papers, code, and related websites.

This list is maintained by [Min-Hung Chen](https://minhungchen.netlify.app/). (*Actively* updated)

If you find any missing papers, **feel free to [*create a pull request*](https://github.com/cmhungsteve/Awesome-Transformer-Attention/blob/main/How-to-PR.md), [*open an issue*](https://github.com/cmhungsteve/Awesome-Transformer-Attention/issues/new), or [*email* me](mailto:[email protected])**.

Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider **[citing](#citation)** and **★STARing** this list.

Feel free to share this list with others!

**[Update: January, 2024]** Added all the related papers from *NeurIPS 2023*!

**[Update: December, 2023]** Added all the related papers from *ICCV 2023*!

**[Update: September, 2023]** Split the multi-modal paper list out into [README_multimodal.md](README_multimodal.md)!

**[Update: June, 2023]** Added all the related papers from *ICML 2023*!

**[Update: June, 2023]** Added all the related papers from *CVPR 2023*!

**[Update: February, 2023]** Added all the related papers from *ICLR 2023*!

**[Update: December, 2022]** Added attention-free papers from [Networks Beyond Attention (GitHub)](https://github.com/FocalNet/Networks-Beyond-Attention) made by [Jianwei Yang](https://github.com/jwyang)

**[Update: November, 2022]** Added all the related papers from *NeurIPS 2022*!

**[Update: October, 2022]** Split the 2nd half of the paper list out into [README_2.md](README_2.md)!

**[Update: October, 2022]** Added all the related papers from *ECCV 2022*!

**[Update: September, 2022]** Added the [Transformer tutorial slides](http://lucasb.eyer.be/transformer) made by [Lucas Beyer](https://twitter.com/giffmana)!

**[Update: June, 2022]** Added all the related papers from *CVPR 2022*!

---
## Overview

- [Citation](#citation)
- [Survey](#survey)
- [Image Classification / Backbone](#image-classification--backbone)
    - [Replace Conv w/ Attention](#replace-conv-w-attention)
        - [Pure Attention](#pure-attention)
        - [Conv-stem + Attention](#conv-stem--attention)
        - [Conv + Attention](#conv--attention)
    - [Vision Transformer](#vision-transformer)
        - [General Vision Transformer](#general-vision-transformer)
        - [Efficient Vision Transformer](#efficient-vision-transformer)
        - [Conv + Transformer](#conv--transformer)
        - [Training + Transformer](#training--transformer)
        - [Robustness + Transformer](#robustness--transformer)
        - [Model Compression + Transformer](#model-compression--transformer)
    - [Attention-Free](#attention-free)
        - [MLP-Series](#mlp-series)
        - [Other Attention-Free](#other-attention-free)
    - [Analysis for Transformer](#analysis-for-transformer)
- [Detection](#detection)
    - [Object Detection](#object-detection)
    - [3D Object Detection](#3d-object-detection)
    - [Multi-Modal Detection](#multi-modal-detection)
    - [HOI Detection](#hoi-detection)
    - [Salient Object Detection](#salient-object-detection)
    - [Other Detection Tasks](#other-detection-tasks)
- [Segmentation](#segmentation)
    - [Semantic Segmentation](#semantic-segmentation)
    - [Depth Estimation](#depth-estimation)
    - [Object Segmentation](#object-segmentation)
    - [Other Segmentation Tasks](#other-segmentation-tasks)
- [Video (High-level)](#video-high-level)
    - [Action Recognition](#action-recognition)
    - [Action Detection/Localization](#action-detectionlocalization)
    - [Action Prediction/Anticipation](#action-predictionanticipation)
    - [Video Object Segmentation](#video-object-segmentation)
    - [Video Instance Segmentation](#video-instance-segmentation)
    - [Other Video Tasks](#other-video-tasks)
- [References](#references)

------ (The following papers have been moved to [README_multimodal.md](README_multimodal.md)) ------

- [Multi-Modality](README_multimodal.md#multi-modality)
    - [Visual Captioning](README_multimodal.md#visual-captioning)
    - [Visual Question Answering](README_multimodal.md#visual-question-answering)
    - [Visual Grounding](README_multimodal.md#visual-grounding)
    - [Multi-Modal Representation Learning](README_multimodal.md#multi-modal-representation-learning)
    - [Multi-Modal Retrieval](README_multimodal.md#multi-modal-retrieval)
    - [Multi-Modal Generation](README_multimodal.md#multi-modal-generation)
    - [Prompt Learning/Tuning](README_multimodal.md#prompt-learningtuning)
    - [Visual Document Understanding](README_multimodal.md#visual-document-understanding)
    - [Other Multi-Modal Tasks](README_multimodal.md#other-multi-modal-tasks)

------ (The following papers have been moved to [README_2.md](README_2.md)) ------

- [Other High-level Vision Tasks](README_2.md#other-high-level-vision-tasks)
    - [Point Cloud / 3D](README_2.md#point-cloud--3d)
    - [Pose Estimation](README_2.md#pose-estimation)
    - [Tracking](README_2.md#tracking)
    - [Re-ID](README_2.md#re-id)
    - [Face](README_2.md#face)
    - [Scene Graph](README_2.md#scene-graph)
    - [Neural Architecture Search](README_2.md#neural-architecture-search)
    - [Transfer / X-Supervised / X-Shot / Continual Learning](README_2.md#transfer--x-supervised--x-shot--continual-learning)
- [Low-level Vision Tasks](README_2.md#low-level-vision-tasks)
    - [Image Restoration](README_2.md#image-restoration)
    - [Video Restoration](README_2.md#video-restoration)
    - [Inpainting / Completion / Outpainting](README_2.md#inpainting--completion--outpainting)
    - [Image Generation](README_2.md#image-generation)
    - [Video Generation](README_2.md#video-generation)
    - [Transfer / Translation / Manipulation](README_2.md#transfer--translation--manipulation)
    - [Other Low-Level Tasks](README_2.md#other-low-level-tasks)
- [Reinforcement Learning](README_2.md#reinforcement-learning)
    - [Navigation](README_2.md#navigation)
    - [Other RL Tasks](README_2.md#other-rl-tasks)
- [Medical](README_2.md#medical)
    - [Medical Segmentation](README_2.md#medical-segmentation)
    - [Medical Classification](README_2.md#medical-classification)
    - [Medical Detection](README_2.md#medical-detection)
    - [Medical Reconstruction](README_2.md#medical-reconstruction)
    - [Medical Low-Level Vision](README_2.md#medical-low-level-vision)
    - [Medical Vision-Language](README_2.md#medical-vision-language)
    - [Medical Others](README_2.md#medical-others)
- [Other Tasks](README_2.md#other-tasks)
- [Attention Mechanisms in Vision/NLP](README_2.md#attention-mechanisms-in-visionnlp)
    - [Attention for Vision](README_2.md#attention-for-vision)
    - [NLP](README_2.md#attention-for-nlp)
    - [Both](README_2.md#attention-for-both)
    - [Others](README_2.md#attention-for-others)

---

## Citation
If you find this repository useful, please consider citing this list:
```
@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}
```

---

## Survey
* "A Survey on Multimodal Large Language Models for Autonomous Driving", WACVW, 2024 (*Purdue*). [[Paper](https://arxiv.org/abs/2311.12320)][[GitHub](https://github.com/IrohXu/Awesome-Multimodal-LLM-Autonomous-Driving)]
* "Efficient Multimodal Large Language Models: A Survey", arXiv, 2024 (*Tencent*). [[Paper](https://arxiv.org/abs/2405.10739)][[GitHub](https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey)]
* "From Sora What We Can See: A Survey of Text-to-Video Generation", arXiv, 2024 (*Newcastle University, UK*). [[Paper](https://arxiv.org/abs/2405.10674)][[GitHub](https://github.com/soraw-ai/Awesome-Text-to-Video-Generation)]
* "When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models", arXiv, 2024 (*Oxford*). [[Paper](https://arxiv.org/abs/2405.10255)][[GitHub](https://github.com/ActiveVisionLab/Awesome-LLM-3D)]
* "Foundation Models for Video Understanding: A Survey", arXiv, 2024 (*Aalborg University, Denmark*). [[Paper](https://arxiv.org/abs/2405.03770)][[GitHub](https://github.com/NeeluMadan/ViFM_Survey)]
* "Vision Mamba: A Comprehensive Survey and Taxonomy", arXiv, 2024 (*Chongqing University*). [[Paper](https://arxiv.org/abs/2405.04404)][[GitHub](https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy)]
* "Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond", arXiv, 2024 (*GigaAI, China*). [[Paper](https://arxiv.org/abs/2405.03520)][[GitHub](https://github.com/GigaAI-research/General-World-Models-Survey)]
* "Video Diffusion Models: A Survey", arXiv, 2024 (*Bielefeld University, Germany*). [[Paper](https://arxiv.org/abs/2405.03150)][[GitHub](https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models)]
* "Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras", arXiv, 2024 (*Lehigh + UPenn*). [[Paper](https://arxiv.org/abs/2404.18961)]
* "Hallucination of Multimodal Large Language Models: A Survey", arXiv, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2404.18930)][[GitHub](https://github.com/showlab/Awesome-MLLM-Hallucination)]
* "A Survey on Vision Mamba: Models, Applications and Challenges", arXiv, 2024 (*HKUST*). [[Paper](https://arxiv.org/abs/2404.18861)][[GitHub](https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models)]
* "State Space Model for New-Generation Network Alternative to Transformers: A Survey", arXiv, 2024 (*Anhui University*). [[Paper](https://arxiv.org/abs/2404.09516)][[GitHub](https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List)]
* "Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions", arXiv, 2024 (*IIT Patna*). [[Paper](https://arxiv.org/abs/2404.07214)]
* "From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2403.12027)][[GitHub](https://github.com/khuangaf/Awesome-Chart-Understanding)]
* "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey", arXiv, 2024 (*Northeastern*). [[Paper](https://arxiv.org/abs/2403.14608)]
* "Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation", arXiv, 2024 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2403.05131)]
* "Controllable Generation with Text-to-Image Diffusion Models: A Survey", arXiv, 2024 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2403.04279)][[GitHub](https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models)]
* "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models", arXiv, 2024 (*Lehigh University, Pennsylvania*). [[Paper](https://arxiv.org/abs/2402.17177)][[GitHub](https://github.com/lichao-sun/SoraReview)]
* "Large Multimodal Agents: A Survey", arXiv, 2024 (*CUHK*). [[Paper](https://arxiv.org/abs/2402.15116)][[GitHub](https://github.com/jun0wanan/awesome-large-multimodal-agents)]
* "Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey", arXiv, 2024 (*BIGAI*). [[Paper](https://arxiv.org/abs/2402.02242)][[GitHub](https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning)]
* "Vision-Language Navigation with Embodied Intelligence: A Survey", arXiv, 2024 (*Qufu Normal University, China*). [[Paper](https://arxiv.org/abs/2402.14304)]
* "The (R)Evolution of Multimodal Large Language Models: A Survey", arXiv, 2024 (*University of Modena and Reggio Emilia (UniMoRE), Italy*). [[Paper](https://arxiv.org/abs/2402.12451)]
* "Masked Modeling for Self-supervised Representation Learning on Vision and Beyond", arXiv, 2024 (*Westlake University, China*). [[Paper](https://arxiv.org/abs/2401.00897)][[GitHub](https://github.com/Lupin1998/Awesome-MIM)]
* "Transformer for Object Re-Identification: A Survey", arXiv, 2024 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2401.06960)]
* "Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities", arXiv, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2401.08045)][[GtiHub](https://github.com/zhanghm1995/Forge_VFM4AD)]
* "MM-LLMs: Recent Advances in MultiModal Large Language Models", arXiv, 2024 (*Tencent*). [[Paper](https://arxiv.org/abs/2401.13601)]
* "From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2401.15071)]
* "A Survey on Hallucination in Large Vision-Language Models", arXiv, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2402.00253)]
* "A Survey for Foundation Models in Autonomous Driving", arXiv, 2024 (*Motional, Massachusetts*). [[Paper](https://arxiv.org/abs/2402.01105)]
* "A Survey on Transformer Compression", arXiv, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2402.05964)]
* "Vision + Language Applications: A Survey", CVPRW, 2023 (*Ritsumeikan University, Japan*). [[Paper](https://arxiv.org/abs/2305.14598)][[GitHub](https://github.com/Yutong-Zhou-cv/Awesome-Text-to-Image)]
* "Multimodal Learning With Transformers: A Survey", TPAMI, 2023 (*Tsinghua & Oxford*). [[Paper](https://arxiv.org/abs/2206.06488)]
* "A Survey of Visual Transformers", TNNLS, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2111.06091)][[GitHub](https://github.com/arekavandi/Transformer-SOD)]
* "Video Understanding with Large Language Models: A Survey", arXiv, 2023 (*University of Rochester*). [[Paper](https://arxiv.org/abs/2312.17432)][[GitHub](https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding)]
* "Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2312.16602)]
* "A Survey of Reasoning with Foundation Models: Concepts, Methodologies, and Outlook", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2312.11562)][[GitHub](https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models)]
* "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2312.12436)][GitHub](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)]
* "Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey", arXiv, 2023 (*JHU*). [[Paper](https://arxiv.org/abs/2312.10163)]
* "Explainability of Vision Transformers: A Comprehensive Review and New Perspectives", arXiv, 2023 (*Institute for Research in Fundamental Sciences (IPM), Iran*). [[Paper](https://arxiv.org/abs/2311.06786)]
* "Vision-Language Instruction Tuning: A Review and Analysis", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2311.08172)][[GitHub (in construction)](https://github.com/palchenli/VL-Instruction-Tuning)]
* "Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability", arXiv, 2023 (*York University*). [[Paper](https://arxiv.org/abs/2310.12296)]
* "Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey", arXiv, 2023 (*valeo.ai, France*). [[Paper](https://arxiv.org/abs/2310.12904)][[GitHub](https://github.com/valeoai/Awesome-Unsupervised-Object-Localization)]
* "A Survey on Video Diffusion Models", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2310.10647)][[GitHub](https://github.com/ChenHsing/Awesome-Video-Diffusion-Models)]
* "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2309.17421)]
* "Multimodal Foundation Models: From Specialists to General-Purpose Assistants", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2309.10020)]
* "Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art", arXiv, 2023 (*University of Western Australia*). [[Paper](https://arxiv.org/abs/2309.04902)]
* "RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 (*University of Sydney*). [[Paper](https://arxiv.org/abs/2309.00810)]
* "A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2309.02031)]
* "From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 (*UESTC*). [[Paper](https://arxiv.org/abs/2308.05305)]
* "Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2307.13721)][[GitHub](https://github.com/awaisrauf/Awesome-CV-Foundational-Models)]
* "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2307.12980)]
* "Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 (*Xi'an Jiaotong University*). [[Paper](https://arxiv.org/abs/2307.11471)]
* "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2307.09220)]
* "Transformers in Reinforcement Learning: A Survey", arXiv, 2023 (*Mila*). [[Paper](https://arxiv.org/abs/2307.05979)]
* "Vision Language Transformers: A Survey", arXiv, 2023 (*Boise State University, Idaho*). [[Paper](https://arxiv.org/abs/2307.03254)]
* "Towards Open Vocabulary Learning: A Survey", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2306.15880)][[GitHub](https://github.com/jianzongwu/Awesome-Open-Vocabulary)]
* "Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2306.14895)]
* "A Survey on Multimodal Large Language Models", arXiv, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2306.13549)][[GitHub](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)]
* "2D Object Detection with Transformers: A Review", arXiv, 2023 (*German Research Center for Artificial Intelligence, Germany*). [[Paper](https://arxiv.org/abs/2306.04670)]
* "Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 (*Eldorado’s Institute of Technology, Brazil*). [[Paper](https://arxiv.org/abs/2305.11033)]
* "Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 (*NYU*). [[Paper](https://arxiv.org/abs/2305.05726)]
* "Visual Tuning", arXiv, 2023 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2305.06061)]
* "Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2305.04691)]
* "Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 (*University of Peradeniya, Sri Lanka*). [[Paper](https://arxiv.org/abs/2305.03273)]
* "A Review of Deep Learning for Video Captioning", arXiv, 2023 (*Deakin University, Australia*). [[Paper](https://arxiv.org/abs/2304.11431)]
* "Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2304.09854)][[GitHub](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer)]
* "Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 (*?*). [[Paper](https://arxiv.org/abs/2304.00685)][[GitHub (in construction)](https://github.com/jingyi0000/VLM_survey)]
* "Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2303.07909)]
* "Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 (*Berkeley + Google*). [[Paper](https://arxiv.org/abs/2303.04129)]
* "Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2301.03505)][[GitHub](https://github.com/mindflow-institue/Awesome-Transformer)]
* "Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (*IBM*). [[Paper](https://arxiv.org/abs/2302.08374)][[GitHub](https://github.com/badripatro/efficient360)]
* "Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (*Indian Institute of Information Technology*). [[Paper](https://arxiv.org/abs/2302.08641)]
* "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 (*Pengcheng Laboratory*). [[Paper](https://arxiv.org/abs/2302.10035)][[GitHub](https://github.com/wangxiao5791509/MultiModal_BigModels_Survey)]
* "A Survey on Visual Transformer", TPAMI, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2012.12556)]
* "Attention mechanisms in computer vision: A survey", Computational Visual Media, 2022 (*Tsinghua University, China*). [[Paper](https://arxiv.org/abs/2111.07624)][[Springer](https://link.springer.com/article/10.1007/s41095-022-0271-y)][[Github](https://github.com/MenghaoGuo/Awesome-Vision-Attentions)]
* "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (*NavInfo Europe, Netherlands*). [[Paper](https://arxiv.org/abs/2201.08683)]
* "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2204.07356)]
* "Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (*Covenant University, Nigeria*). [[Paper](https://arxiv.org/abs/2211.10043)]
* "A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (*Sejong University*). [[Paper](https://arxiv.org/abs/2211.06004)]
* "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.09263)]
* "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (*Illinois Institute of Technology, Chicago*). [[Paper](https://arxiv.org/abs/2210.02884)]
* "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (*Charles Sturt University, Australia*). [[Paper](https://arxiv.org/abs/2209.05700)]
* "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2202.09061)]
* "Transformers in Remote Sensing: A Survey", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2209.01206)][[Github](https://github.com/VIROBO-15/Transformer-in-Remote-Sensing)]
* "Medical image analysis based on transformer: A Review", arXiv, 2022 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2208.06643)]
* "3D Vision with Transformers: A Survey", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2208.04309)][[GitHub](https://github.com/lahoud/3d-vision-transformers)]
* "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (*NYCU*). [[Paper](https://arxiv.org/abs/2207.03041)]
* "Transformers in Medical Imaging: A Survey", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2201.09873)][[GitHub](https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging)]
* "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2206.06488)]
* "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2206.01136)]
* "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (*University of Waterloo*). [[Paper](https://arxiv.org/abs/2205.07417)]
* "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (*INESC TEC and University of Porto, Portugal*). [[Paper](https://arxiv.org/abs/2204.12406)]
* "Efficient Transformers: A Survey", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2009.06732)]
* "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2111.04060)]
* "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (*National University of Sciences and Technology (NUST), Pakistan*). [[Paper](https://arxiv.org/abs/2203.15269)]
* "Video Transformers: A Survey", arXiv, 2022 (*Universitat de Barcelona, Spain*). [[Paper](https://arxiv.org/abs/2201.05991)]
* "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2202.12165)]
* "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (*?*). [[Paper](https://arxiv.org/abs/2203.01536)]
* "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (*Xidian University*). [[Paper](https://arxiv.org/abs/2203.12944)]
* "Image Captioning In the Transformer Age", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2204.07374)][[GitHub](https://github.com/SjokerLily/awesome-image-captioning)]
* "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (*Fayoum University, Egypt*). [[Paper](https://arxiv.org/abs/2204.07756)]
* "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2101.01169)]
* "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (*Renmin University of China*). [[Paper](https://arxiv.org/abs/2109.09920)]
* "A Survey of Transformers", arXiv, 2021 (*Fudan*). [[Paper](https://arxiv.org/abs/2106.04554)]
* "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (*University of Kashmir, India*). [[Paper](https://arxiv.org/abs/2106.07550)]

[[Back to Overview](#overview)]

## Image Classification / Backbone
### Replace Conv w/ Attention
#### Pure Attention
* **LR-Net**: "Local Relation Networks for Image Recognition", ICCV, 2019 (*Microsoft*). [[Paper](https://arxiv.org/abs/1904.11491)][[PyTorch (gan3sh500)](https://github.com/gan3sh500/local-relational-nets)]
* **SASA**: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 (*Google*). [[Paper](https://arxiv.org/abs/1906.05909)][[PyTorch-1 (leaderj1001)](https://github.com/leaderj1001/Stand-Alone-Self-Attention)][[PyTorch-2 (MerHS)](https://github.com/MerHS/SASA-pytorch)]
* **Axial-Transformer**: "Axial Attention in Multidimensional Transformers", arXiv, 2019 (*Google*). [[Paper](https://openreview.net/forum?id=H1e5GJBtDr)][[PyTorch (lucidrains)](https://github.com/lucidrains/axial-attention)]
* **SAN**: "Exploring Self-attention for Image Recognition", CVPR, 2020 (*CUHK + Intel*). [[Paper](https://arxiv.org/abs/2004.13621)][[PyTorch](https://github.com/hszhao/SAN)]
* **Axial-DeepLab**: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 (*Google*). [[Paper](https://arxiv.org/abs/2003.07853)][[PyTorch](https://github.com/csrhddlam/axial-deeplab)]
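
The papers above replace the spatial aggregation of a convolution with self-attention computed over a small local neighborhood around each pixel. As a rough, hedged sketch of that idea (single-head, no relative position encoding, illustrative module name and hyperparameters; not code from any of the linked repositories):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Attend over a k x k neighborhood of each pixel instead of convolving it."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        self.q = nn.Conv2d(dim, dim, 1)           # 1x1 projections for query/key/value
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.scale = dim ** -0.5

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).view(B, C, 1, H * W)
        k, v = self.kv(x).chunk(2, dim=1)
        pad = self.k // 2
        # Gather the k*k neighborhood of keys/values around every spatial position.
        k = F.unfold(k, self.k, padding=pad).view(B, C, self.k * self.k, H * W)
        v = F.unfold(v, self.k, padding=pad).view(B, C, self.k * self.k, H * W)
        attn = ((q * k).sum(dim=1, keepdim=True) * self.scale).softmax(dim=2)
        out = (attn * v).sum(dim=2)                # weighted sum over the neighborhood
        return out.view(B, C, H, W)

# Drop-in replacement for the spatial mixing of a 3x3 conv block:
y = LocalSelfAttention2d(dim=64)(torch.randn(2, 64, 32, 32))   # -> (2, 64, 32, 32)
```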
#### Conv-stem + Attention
* **GSA-Net**: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (*Google*). [[Paper](https://arxiv.org/abs/2010.03019)][[PyTorch (lucidrains)](https://github.com/lucidrains/global-self-attention-network)]
* **HaloNet**: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2103.12731)][[PyTorch (lucidrains)](https://github.com/lucidrains/halonet-pytorch)]
* **CoTNet**: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (*JD*). [[Paper](https://arxiv.org/abs/2107.12292)][[PyTorch](https://github.com/JDAI-CV/CoTNet)]
* **HAT-Net**: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2106.03180)][[PyTorch (in construction)](https://github.com/yun-liu/HAT-Net)]
#### Conv + Attention
* **AA**: "Attention Augmented Convolutional Networks", ICCV, 2019 (*Google*). [[Paper](https://arxiv.org/abs/1904.09925)][[PyTorch (leaderj1001)](https://github.com/leaderj1001/Attention-Augmented-Conv2d)][[Tensorflow (titu1994)](https://github.com/titu1994/keras-attention-augmented-convs)]
* **GCNet**: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) (*Microsoft*). [[Paper](https://arxiv.org/abs/2012.13375)][[PyTorch](https://github.com/xvjiarui/GCNet)]
* **LambdaNetworks**: "LambdaNetworks: Modeling long-range Interactions without Attention", ICLR, 2021 (*Google*). [[Paper](https://openreview.net/forum?id=xTJEN-ggl1b)][[PyTorch-1 (lucidrains)](https://github.com/lucidrains/lambda-networks)][[PyTorch-2 (leaderj1001)](https://github.com/leaderj1001/LambdaNetworks)]
* **BoTNet**: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2101.11605)][[PyTorch-1 (lucidrains)](https://github.com/lucidrains/bottleneck-transformer-pytorch)][[PyTorch-2 (leaderj1001)](https://github.com/leaderj1001/BottleneckTransformers)]
* **GCT**: "Gaussian Context Transformer", CVPR, 2021 (*Zhejiang University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2021/html/Ruan_Gaussian_Context_Transformer_CVPR_2021_paper.html)]
* **CoAtNet**: "CoAtNet: Marrying Convolution and Attention for All Data Sizes", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.04803)]
* **ACmix**: "On the Integration of Self-Attention and Convolution", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2111.14556)][[PyTorch](https://github.com/LeapLabTHU/ACmix)]
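
The hybrid models above keep convolution for local features and add self-attention for global context. Below is a minimal sketch of that pattern, assuming a simple additive fusion of a conv branch and a global multi-head attention branch; the class name and the fusion choice are illustrative (AA, for instance, concatenates the two branches instead of adding them):

```python
import torch
import torch.nn as nn

class ConvPlusAttention(nn.Module):
    """Fuse a local convolution branch with a global self-attention branch."""
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)
        self.proj = nn.Conv2d(in_ch, out_ch, 1)    # lift input to the attention width

    def forward(self, x):                           # x: (B, C_in, H, W)
        conv_out = self.conv(x)                     # local features
        B, C, H, W = conv_out.shape
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)     # global features
        attn_out = attn_out.transpose(1, 2).view(B, C, H, W)
        return conv_out + attn_out                  # fuse local + global context

out = ConvPlusAttention(64, 128)(torch.randn(2, 64, 16, 16))   # -> (2, 128, 16, 16)
```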

[[Back to Overview](#overview)]

### Vision Transformer
#### General Vision Transformer
* **ViT**: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (*Google*). [[Paper](https://openreview.net/forum?id=YicbFdNTTy)][[Tensorflow](https://github.com/google-research/vision_transformer)][[PyTorch (lucidrains)](https://github.com/lucidrains/vit-pytorch)][[JAX (conceptofmind)](https://github.com/conceptofmind/vit-flax)]
* **Perceiver**: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (*DeepMind*). [[Paper](https://arxiv.org/abs/2103.03206)][[PyTorch (lucidrains)](https://github.com/lucidrains/perceiver-pytorch)]
* **PiT**: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (*NAVER*). [[Paper](https://arxiv.org/abs/2103.16302)][[PyTorch](https://github.com/naver-ai/pit)]
* **VT**: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (*Facebook*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Wu_Visual_Transformers_Where_Do_Transformers_Really_Belong_in_Vision_Models_ICCV_2021_paper.html)][[PyTorch (tahmid0007)](https://github.com/tahmid0007/VisualTransformers)]
* **PVT**: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2102.12122)][[PyTorch](https://github.com/whai362/PVT)]
* **iRPE**: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2107.14222)][[PyTorch](https://github.com/microsoft/Cream/tree/main/iRPE)]
* **CaiT**: "Going deeper with Image Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2103.17239)][[PyTorch](https://github.com/facebookresearch/deit)]
* **Swin-Transformer**: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2103.14030)][[PyTorch](https://github.com/microsoft/Swin-Transformer)][[PyTorch (berniwal)](https://github.com/berniwal/swin-transformer-pytorch)]
* **T2T-ViT**: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (*Yitu*). [[Paper](https://arxiv.org/abs/2101.11986)][[PyTorch](https://github.com/yitu-opensource/T2T-ViT)]
* **FFNBN**: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (*Microsoft*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021W/NeurArch/html/Yao_Leveraging_Batch_Normalization_for_Vision_Transformers_ICCVW_2021_paper.html)]
* **DPT**: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2107.14467)][[PyTorch](https://github.com/CASIA-IVA-Lab/DPT)]
* **Focal**: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2107.00641)][[PyTorch](https://github.com/microsoft/Focal-Transformer)]
* **XCiT**: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.09681)]
* **Twins**: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (*Meituan*). [[Paper](https://arxiv.org/abs/2104.13840)][[PyTorch](https://github.com/Meituan-AutoML/Twins)]
* **ARM**: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2110.15156)][[GitHub (in construction)](https://github.com/amazon-research/anti-aliasing-transformer)]
* **DVT**: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2105.15075)][[PyTorch](https://github.com/blackfeather-wang/Dynamic-Vision-Transformer)]
* **Aug-S**: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2106.15941)]
* **TNT**: "Transformer in Transformer", NeurIPS, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2103.00112)][[PyTorch](https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch)][[PyTorch (lucidrains)](https://github.com/lucidrains/transformer-in-transformer)]
* **ViTAE**: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2106.03348)][[PyTorch](https://github.com/Annbless/ViTAE)]
* **DeepViT**: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (*NUS + ByteDance*). [[Paper](https://arxiv.org/abs/2103.11886)][[Code](https://github.com/zhoudaquan/dvit_repo)]
* **So-ViT**: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2104.10935)][[PyTorch](https://github.com/jiangtaoxie/So-ViT)]
* **LV-ViT**: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (*ByteDance*). [[Paper](https://arxiv.org/abs/2104.10858)][[PyTorch](https://github.com/zihangJiang/TokenLabeling)]
* **NesT**: "Aggregating Nested Transformers", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2105.12723)][[Tensorflow](https://github.com/google-research/nested-transformer)]
* **KVT**: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2106.00515)]
* **Refined-ViT**: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2106.03714)][[PyTorch](https://github.com/zhoudaquan/Refiner_ViT)]
* **Shuffle-Transformer**: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2106.03650)]
* **CAT**: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (*KuaiShou*). [[Paper](https://arxiv.org/abs/2106.05786)][[PyTorch](https://github.com/linhezheng19/CAT)]
* **V-MoE**: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.05974)]
* **P2T**: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (*Nankai University*). [[Paper](https://arxiv.org/abs/2106.12011)]
* **PVTv2**: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2106.13797)][[PyTorch](https://github.com/whai362/PVT)]
* **LG-Transformer**: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (*IIAI, UAE*). [[Paper](https://arxiv.org/abs/2107.04735)]
* **ViP**: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (*Oxford*). [[Paper](https://arxiv.org/abs/2107.05790)]
* **Scaled-ReLU**: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2109.03810)]
* **LIT**: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (*Monash University*). [[Paper](https://arxiv.org/abs/2105.14217)][[PyTorch](https://github.com/zip-group/LIT)]
* **DTN**: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2112.02624)][[PyTorch (in construction)](https://github.com/wqshao126/DTN)]
* **RegionViT**: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (*MIT-IBM Watson*). [[Paper](https://arxiv.org/abs/2106.02689)][[PyTorch](https://github.com/ibm/regionvit)]
* **CrossFormer**: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2108.00154)][[PyTorch](https://github.com/cheerss/CrossFormer)]
* **?**: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (*UT Austin*). [[Paper](https://openreview.net/forum?id=O476oWmiNNp)]
* **ViT-G**: "Scaling Vision Transformers", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2106.04560)]
* **CSWin**: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2107.00652)][[PyTorch](https://github.com/microsoft/CSWin-Transformer)]
* **MPViT**: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2112.11010)][[PyTorch](https://github.com/youngwanLEE/MPViT)]
* **Diverse-ViT**: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2203.06345)][[PyTorch](https://github.com/VITA-Group/Diverse-ViT)]
* **DW-ViT**: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (*Dark Matter AI, China*). [[Paper](https://arxiv.org/abs/2203.12856)][[PyTorch (in construction)](https://github.com/pzhren/DW-ViT)]
* **MixFormer**: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2204.02557)][[Paddle](https://github.com/PaddlePaddle/PaddleClas)]
* **DAT**: "Vision Transformer with Deformable Attention", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2201.00520)][[PyTorch](https://github.com/LeapLabTHU/DAT)]
* **Swin-Transformer-V2**: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.09883)][[PyTorch](https://github.com/microsoft/Swin-Transformer)]
* **MSG-Transformer**: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (*Huazhong University of Science & Technology*). [[Paper](https://arxiv.org/abs/2105.15168)][[PyTorch](https://github.com/hustvl/MSG-Transformer)]
* **NomMer**: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2111.12994)][[PyTorch](https://github.com/TencentYoutuResearch/VisualRecognition-NomMer)]
* **Shunted**: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (*NUS*). [[Paper](https://arxiv.org/abs/2111.15193)][[PyTorch](https://github.com/OliverRensu/Shunted-Transformer)]
* **PyramidTNT**: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2201.00978)][[PyTorch](https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch)]
* **X-ViT**: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (*Kakao*). [[Paper](https://arxiv.org/abs/2205.13805)]
* **ReMixer**: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (*KAIST*). [[Paper](https://drive.google.com/file/d/1E6rXtj5h6tXiJR8Ae8u1vQcwyNyTZSVc/view)][[PyTorch](https://github.com/alinlab/remixer)]
* **UN**: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (*Hikvision*). [[Paper](https://arxiv.org/abs/2208.01313)][[Code (in construction)](https://github.com/hikvision-research/Unified-Normalization)]
* **Wave-ViT**: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2207.04978)][[PyTorch](https://github.com/YehLi/ImageNetModel)]
* **DaViT**: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2204.03645)][[PyTorch](https://github.com/dingmyu/davit)]
* **ScalableViT**: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2203.10790)]
* **MaxViT**: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2204.01697)][[Tensorflow](https://github.com/google-research/maxvit)]
* **VSA**: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2204.08446)][[PyTorch](https://github.com/ViTAE-Transformer/ViTAE-VSA)]
* **?**: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.01035)]
* **Ortho**: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (*CAS*). [[Paper](https://openreview.net/forum?id=GGtH47T31ZC)]
* **PerViT**: "Peripheral Vision Transformer", NeurIPS, 2022 (*POSTECH*). [[Paper](https://arxiv.org/abs/2206.06801)]
* **LITv2**: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (*Monash University*). [[Paper](https://arxiv.org/abs/2205.13213)][[PyTorch](https://github.com/zip-group/LITv2)]
* **BViT**: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2202.06268)]
* **O-ViT**: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (*East China Normal University*). [[Paper](https://arxiv.org/abs/2201.12133)]
* **MOA-Transformer**: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (*University of Kansas*). [[Paper](https://arxiv.org/abs/2201.12903)][[PyTorch](https://github.com/krushi1992/MOA-transformer)]
* **BOAT**: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (*Baidu + HKU*). [[Paper](https://arxiv.org/abs/2201.13027)]
* **ViTAEv2**: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2202.10108)]
* **HiP**: "Hierarchical Perceiver", arXiv, 2022 (*DeepMind*). [[Paper](https://arxiv.org/abs/2202.10890)]
* **PatchMerger**: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2202.12015)]
* **DGT**: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2203.03937)]
* **NAT**: "Neighborhood Attention Transformer", arXiv, 2022 (*Oregon*). [[Paper](https://arxiv.org/abs/2204.07143)][[PyTorch](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer)]
* **ASF-former**: "Adaptive Split-Fusion Transformer", arXiv, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2204.12196)][[PyTorch (in construction)](https://github.com/szx503045266/ASF-former)]
* **SP-ViT**: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2206.07662)]
* **EATFormer**: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2206.09325)]
* **LinGlo**: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (*TCL Research Wuhan*). [[Paper](https://arxiv.org/abs/2207.00188)]
* **Dual-ViT**: "Dual Vision Transformer", arXiv, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2207.04976)][[PyTorch](https://github.com/YehLi/ImageNetModel)]
* **MMA**: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (*Centre for Research and Technology Hellas, Greece*). [[Paper](https://arxiv.org/abs/2207.08569)]
* **MAFormer**: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2209.01620)]
* **AEWin**: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (*Southwest Jiaotong University*). [[Paper](https://arxiv.org/abs/2209.08726)]
* **GrafT**: "Grafting Vision Transformers", arXiv, 2022 (*Stony Brook*). [[Paper](https://arxiv.org/abs/2210.15943)]
* **?**: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2211.01785)]
* **LTH-ViT**: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (*Northeastern University, China*). [[Paper](https://arxiv.org/abs/2211.01484)]
* **TT**: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (*Hangzhou Dianzi University*). [[Paper](https://arxiv.org/abs/2211.06083)]
* **INTERN**: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2111.08687)][[Website](https://opengvlab.shlab.org.cn/)]
* **GGeM**: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (*NAVER*). [[Paper](https://arxiv.org/abs/2212.04114)]
* **GPViT**: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (*University of Edinburgh, Scotland + UCSD*). [[Paper](https://arxiv.org/abs/2212.06795)][[PyTorch](https://github.com/ChenhongyiYang/GPViT)]
* **CPVT**: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (*Meituan*). [[Paper](https://openreview.net/forum?id=3KWnuT-R1bh)][[Code (in construction)](https://github.com/Meituan-AutoML/CPVT)]
* **LipsFormer**: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2304.09856)][[Code (in construction)](https://github.com/IDEA-Research/LipsFormer)]
* **BiFormer**: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.08810)][[PyTorch](https://github.com/rayleizhu/BiFormer)]
* **AbSViT**: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2303.13043)][[PyTorch](https://github.com/bfshi/AbSViT)][[Website](https://sites.google.com/view/absvit)]
* **DependencyViT**: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2304.03282)][[Code (in construction)](https://github.com/dingmyu/DependencyViT)]
* **ResFormer**: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2212.00776)][[PyTorch (in construction)](https://github.com/ruitian12/resformer)]
* **SViT**: "Vision Transformer with Super Token Sampling", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.11167)]
* **PaCa-ViT**: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (*NC State*). [[Paper](https://arxiv.org/abs/2203.11987)][[PyTorch](https://github.com/iVMCL/PaCaViT)]
* **GC-ViT**: "Global Context Vision Transformers", ICML, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2206.09959)][[PyTorch](https://github.com/NVlabs/GCViT)]
* **MAGNETO**: "MAGNETO: A Foundation Transformer", ICML, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.06423)]
* **Fcaformer**: "Fcaformer: Forward Cross Attention in Hybrid Vision Transformer", ICCV, 2023 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2211.07198)][[PyTorch](https://github.com/hkzhang91/CabViT)]
* **SMT**: "Scale-Aware Modulation Meet Transformer", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2307.08579)][[PyTorch](https://github.com/AFeng-x/SMT)]
* **FLatten-Transformer**: "FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2308.00442)][[PyTorch](https://github.com/LeapLabTHU/FLatten-Transformer)]
* **Path-Ensemble**: "Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.06548)]
* **SG-Former**: "SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2308.12216)][[PyTorch](https://github.com/OliverRensu/SG-Former)]
* **SimPool**: "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?", ICCV, 2023 (*National Technical University of Athens*). [[Paper](https://arxiv.org/abs/2309.06891)]
* **LaPE**: "LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization", ICCV, 2023 (*Peking*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Yu_LaPE_Layer-adaptive_Position_Embedding_for_Vision_Transformers_with_Independent_Layer_ICCV_2023_paper.html)][[PyTorch](https://github.com/Ingrid725/LaPE)]
* **CB**: "Scratching Visual Transformer's Back with Uniform Attention", ICCV, 2023 (*NAVER*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Hyeon-Woo_Scratching_Visual_Transformers_Back_with_Uniform_Attention_ICCV_2023_paper.html)]
* **STL**: "Fully Attentional Networks with Self-emerging Token Labeling", ICCV, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2401.03844)][[PyTorch](https://github.com/NVlabs/STL)]
* **ClusterFormer**: "ClusterFormer: Clustering As A Universal Visual Learner", NeurIPS, 2023 (*Rochester Institute of Technology (RIT)*). [[Paper](https://arxiv.org/abs/2309.13196)]
* **SVT**: "Scattering Vision Transformer: Spectral Mixing Matters", NeurIPS, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2311.01310)][[PyTorch](https://github.com/badripatro/svt)][[Website](https://badripatro.github.io/svt/)]
* **CrossFormer++**: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2303.06908)][[PyTorch](https://github.com/cheerss/CrossFormer)]
* **QFormer**: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2303.15105)][[Code (in construction)](https://github.com/ViTAE-Transformer/QFormer)]
* **ViT-Calibrator**: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2304.04354)]
* **SpectFormer**: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2304.06446)][[PyTorch](https://github.com/badripatro/SpectFormers)][[Website](https://badripatro.github.io/SpectFormers/)]
* **UniNeXt**: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2304.13700)]
* **CageViT**: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2305.09924)]
* **?**: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (*UIUC*). [[Paper](https://arxiv.org/abs/2305.16316)]
* **2-D-SSM**: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2306.06635)][[PyTorch](https://github.com/ethanbar11/ssm_2d)]
* **NaViT**: "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2307.06304)]
* **DAT++**: "DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2309.01430)][[PyTorch](https://github.com/LeapLabTHU/DAT)]
* **?**: "Replacing softmax with ReLU in Vision Transformers", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2309.08586)]
* **RMT**: "RMT: Retentive Networks Meet Vision Transformers", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2309.11523)]
* **reg**: "Vision Transformers Need Registers", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2309.16588)]
* **ChannelViT**: "Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words", arXiv, 2023 (*Insitro, CA*). [[Paper](https://arxiv.org/abs/2309.16108)]
* **EViT**: "EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention", arXiv, 2023 (*Nankai University*). [[Paper](https://arxiv.org/abs/2310.06629)]
* **ViR**: "ViR: Vision Retention Networks", arXiv, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2310.19731)]
* **abs-win**: "Window Attention is Bugged: How not to Interpolate Position Embeddings", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2311.05613)]
* **FMViT**: "FMViT: A multiple-frequency mixing Vision Transformer", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2311.05707)][[Code (in construction)](https://github.com/tany0699/FMViT)]
* **GroupMixFormer**: "Advancing Vision Transformers with Group-Mix Attention", arXiv, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2311.15157)][[PyTorch](https://github.com/AILab-CVC/GroupMixFormer)]
* **PGT**: "Perceptual Group Tokenizer: Building Perception with Iterative Grouping", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2311.18296)]
* **SCHEME**: "SCHEME: Scalable Channel Mixer for Vision Transformers", arXiv, 2023 (*UCSD*). [[Paper](https://arxiv.org/abs/2312.00412)]
* **Agent-Attention**: "Agent Attention: On the Integration of Softmax and Linear Attention", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.08874)][[PyTorch](https://github.com/LeapLabTHU/Agent-Attention)]
* **ViTamin**: "ViTamin: Designing Scalable Vision Models in the Vision-Language Era", CVPR, 2024 (*ByteDance*). [[Paper](https://arxiv.org/abs/2404.02132)][[PyTorch](https://github.com/Beckschen/ViTamin)]
* **HIRI-ViT**: "HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs", TPAMI, 2024 (*HiDream.ai, China*). [[Paper](https://arxiv.org/abs/2403.11999)]
* **SPFormer**: "SPFormer: Enhancing Vision Transformer with Superpixel Representation", arXiv, 2024 (*JHU*). [[Paper](https://arxiv.org/abs/2401.02931)]
* **manifold-K**: "A Manifold Representation of the Key in Vision Transformers", arXiv, 2024 (*University of Oslo, Norway*). [[Paper](https://arxiv.org/abs/2402.00534)]
* **BiXT**: "Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers", arXiv, 2024 (*University of Melbourne*). [[Paper](https://arxiv.org/abs/2402.12138)]
* **VisionLLaMA**: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks", arXiv, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2403.00522)][[Code (in construction)](https://github.com/Meituan-AutoML/VisionLLaMA)]
* **xT**: "xT: Nested Tokenization for Larger Context in Large Images", arXiv, 2024 (*Berkeley*). [[Paper](https://arxiv.org/abs/2403.01915)]
* **ACC-ViT**: "ACC-ViT: Atrous Convolution's Comeback in Vision Transformers", arXiv, 2024 (*Purdue*). [[Paper](https://arxiv.org/abs/2403.04200)]
* **ViTAR**: "ViTAR: Vision Transformer with Any Resolution", arXiv, 2024 (*CAS*). [[Paper](https://arxiv.org/abs/2403.18361)]
* **iLLaMA**: "Adapting LLaMA Decoder to Vision Transformer", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2404.06773)]
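
Most backbones listed in this subsection build on the original ViT recipe: split the image into non-overlapping patches, linearly embed them, prepend a learnable class token, add position embeddings, and feed the sequence to a standard Transformer encoder. Here is a minimal, self-contained sketch of that recipe with illustrative names and sizes (not the reference implementation of any listed paper):

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    """Patch embedding + [CLS] token + standard Transformer encoder + linear head."""
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                           # number of patch tokens
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tok = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        tok = torch.cat([self.cls.expand(len(x), -1, -1), tok], 1) + self.pos
        return self.head(self.encoder(tok)[:, 0])          # classify from [CLS]

logits = ViTSketch()(torch.randn(2, 3, 224, 224))           # -> (2, 1000)
```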
#### Efficient Vision Transformer
* **DeiT**: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2012.12877)][[PyTorch](https://github.com/facebookresearch/deit)]
* **ConViT**: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2103.10697)][[Code](https://github.com/facebookresearch/convit)]
* **?**: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (*NavInfo Europe, Netherlands*). [[Paper](https://arxiv.org/abs/2106.16006)]
* **PS-ViT**: "Vision Transformer with Progressive Sampling", ICCV, 2021 (*CPII*). [[Paper](https://arxiv.org/abs/2108.01684)]
* **HVT**: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (*Monash University*). [[Paper](https://arxiv.org/abs/2103.10619)][[PyTorch](https://github.com/MonashAI/HVT)]
* **CrossViT**: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (*MIT-IBM*). [[Paper](https://arxiv.org/abs/2103.14899)][[PyTorch](https://github.com/IBM/CrossViT)]
* **ViL**: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2103.15358)][[PyTorch](https://github.com/microsoft/vision-longformer)]
* **Visformer**: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (*Beihang University*). [[Paper](https://arxiv.org/abs/2104.12533)][[PyTorch](https://github.com/danczs/Visformer)]
* **MultiExitViT**: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (*Aarhus University, Denmark*). [[Paper](https://arxiv.org/abs/2106.15183)][[Tensorflow](https://gitlab.au.dk/maleci/multiexitvit)]
* **SViTE**: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (*UT Austin*). [[Paper](https://arxiv.org/abs/2106.04533)][[PyTorch](https://github.com/VITA-Group/SViTE)]
* **DGE**: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (*Megvii*). [[Paper](https://papers.nips.cc/paper/2021/hash/2d969e2cee8cfa07ce7ca0bb13c7a36d-Abstract.html)][[PyTorch](https://github.com/StevenGrove/vtpack)]
* **GG-Transformer**: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (*JHU*). [[Paper](https://arxiv.org/abs/2106.02277)][[Code (in construction)](https://github.com/yucornetto/GG-Transformer)]
* **DynamicViT**: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2106.02034)][[PyTorch](https://github.com/raoyongming/DynamicViT)][[Website](https://dynamicvit.ivg-research.xyz/)]
* **ResT**: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2105.13677)][[PyTorch](https://github.com/wofmanaf/ResT)]
* **Adder-Transformer**: "Adder Attention for Vision Transformer", NeurIPS, 2021 (*Huawei*). [[Paper](https://proceedings.neurips.cc/paper/2021/hash/a57e8915461b83adefb011530b711704-Abstract.html)]
* **SOFT**: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (*Fudan*). [[Paper](https://arxiv.org/abs/2110.11945)][[PyTorch](https://github.com/fudan-zvg/SOFT)][[Website](https://fudan-zvg.github.io/SOFT/)]
* **IA-RED2**: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (*MIT-IBM*). [[Paper](https://arxiv.org/abs/2106.12620)][[Website](http://people.csail.mit.edu/bpan/ia-red/)]
* **LocalViT**: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (*ETHZ*). [[Paper](https://arxiv.org/abs/2104.05707)][[PyTorch](https://github.com/ofsoundof/LocalViT)]
* **CCT**: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (*University of Oregon*). [[Paper](https://arxiv.org/abs/2104.05704)][[PyTorch](https://github.com/SHI-Labs/Compact-Transformers)]
* **DiversePatch**: "Vision Transformers with Patch Diversification", arXiv, 2021 (*UT Austin + Facebook*). [[Paper](https://arxiv.org/abs/2104.12753)][[PyTorch](https://github.com/ChengyueGongR/PatchVisionTransformer)]
* **SL-ViT**: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (*Aarhus University*). [[Paper](https://arxiv.org/abs/2105.09121)]
* **?**: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (*Aarhus University, Denmark*). [[Paper](https://arxiv.org/abs/2106.15183)]
* **ViX**: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (*Indian Institute of Technology Bombay*). [[Paper](https://arxiv.org/abs/2107.02239)]
* **Transformer-LS**: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2107.02192)][[PyTorch](https://github.com/NVIDIA/transformer-ls)]
* **WideNet**: "Go Wider Instead of Deeper", arXiv, 2021 (*NUS*). [[Paper](https://arxiv.org/abs/2107.11817)]
* **Armour**: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (*Arm*). [[Paper](https://arxiv.org/abs/2108.01778)]
* **IPE**: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (*CUHK*). [[Paper](https://arxiv.org/abs/2108.13015)]
* **DS-Net++**: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (*Monash University*). [[Paper](https://arxiv.org/abs/2109.10060)][[PyTorch](https://github.com/changlin31/DS-Net)]
* **UFO-ViT**: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (*Kakao*). [[Paper](https://arxiv.org/abs/2109.14382)]
* **Evo-ViT**: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2108.01390)][[PyTorch](https://github.com/YifanXu74/Evo-ViT)]
* **PS-Attention**: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2112.14000)][[Paddle](https://github.com/BR-IDL/PaddleViT)]
* **ShiftViT**: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2201.10801)][[PyTorch](https://github.com/microsoft/SPACH)]
* **EViT**: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2202.07800)][[PyTorch](https://github.com/youweiliang/evit)]
* **QuadTree**: "QuadTree Attention for Vision Transformers", ICLR, 2022 (*Simon Fraser + Alibaba*). [[Paper](https://arxiv.org/abs/2201.02767)][[PyTorch](https://github.com/Tangshitao/QuadtreeAttention)]
* **Anti-Oversmoothing**: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2203.05962)][[PyTorch](https://github.com/VITA-Group/ViT-Anti-Oversmoothing)]
* **QnA**: "Learned Queries for Efficient Local Attention", CVPR, 2022 (*Tel-Aviv*). [[Paper](https://arxiv.org/abs/2112.11435)][[JAX](https://github.com/moabarar/qna)]
* **LVT**: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (*Adobe*). [[Paper](https://arxiv.org/abs/2112.10809)][[PyTorch](https://github.com/Chenglin-Yang/LVT)]
* **A-ViT**: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2112.07658)][[Website](https://a-vit.github.io/)]
* **PS-ViT**: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2106.02852)]
* **Rev-MViT**: "Reversible Vision Transformers", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2302.04869)][[PyTorch-1](https://github.com/karttikeya/minREV)][[PyTorch-2](https://github.com/facebookresearch/slowfast)]
* **AdaViT**: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2111.15668)]
* **DQS**: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (*Sorbonne Université, France*). [[Paper](https://arxiv.org/abs/2205.10873)]
* **ATS**: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.15667)][[Website](https://adaptivetokensampling.github.io/)]
* **EdgeViT**: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2205.03436)][[PyTorch](https://github.com/saic-fi/edgevit)]
* **SReT**: "Sliced Recursive Transformer", ECCV, 2022 (*CMU + MBZUAI*). [[Paper](https://arxiv.org/abs/2111.05297)][[PyTorch](https://github.com/szq0214/SReT)]
* **SiT**: "Self-slimmed Vision Transformer", ECCV, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2111.12624)][[PyTorch](https://github.com/Sense-X/SiT)]
* **DFvT**: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (*Alibaba*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/322_ECCV_2022_paper.php)]
* **M3ViT**: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2210.14793)][[PyTorch](https://github.com/VITA-Group/M3ViT)]
* **ResT-V2**: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2204.07366)][[PyTorch](https://github.com/wofmanaf/ResT)]
* **DeiT-Manifold**: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2107.01378)]
* **EfficientFormer**: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (*Snap*). [[Paper](https://arxiv.org/abs/2206.01191)][[PyTorch](https://github.com/snap-research/EfficientFormer)]
* **GhostNetV2**: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2211.12905)][[PyTorch](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv2_pytorch)]
* **?**: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (*Borealis AI, Canada*). [[Paper](https://arxiv.org/abs/2211.05187)]
* **TerViT**: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2201.08050)]
* **MT-ViT**: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2203.01587)]
* **ViT-P**: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (*Chongqing University of Technology*). [[Paper](https://arxiv.org/abs/2203.02358)]
* **CF-ViT**: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (*Xiamen University + Tencent*). [[Paper](https://arxiv.org/abs/2203.03821)][[PyTorch](https://github.com/ChenMnZ/CF-ViT)]
* **EIT**: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (*Academy of Military Sciences, China*). [[Paper](https://arxiv.org/abs/2203.07116)]
* **SepViT**: "SepViT: Separable Vision Transformer", arXiv, 2022 (*University of Electronic Science and Technology of China*). [[Paper](https://arxiv.org/abs/2203.15380)]
* **TRT-ViT**: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2205.09579)]
* **SuperViT**: "Super Vision Transformer", arXiv, 2022 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2205.11397)][[PyTorch](https://github.com/lmbxmu/SuperViT)]
* **Tutel**: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.03382)][[PyTorch](https://github.com/microsoft/tutel)]
* **SimA**: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (*Maryland + UC Davis*). [[Paper](https://arxiv.org/abs/2206.08898)][[PyTorch](https://github.com/UCDvision/sima)]
* **EdgeNeXt**: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2206.10589)][[PyTorch](https://github.com/mmaaz60/EdgeNeXt)]
* **VVT**: "Vicinity Vision Transformer", arXiv, 2022 (*Australian National University*). [[Paper](https://arxiv.org/abs/2206.10552)][[Code (in construction)](https://github.com/OpenNLPLab/Vicinity-Vision-Transformer)]
* **SOFT**: "Softmax-free Linear Transformers", arXiv, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2207.03341)][[PyTorch](https://github.com/fudan-zvg/SOFT)]
* **MaiT**: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2207.03006)]
* **LightViT**: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2207.05557)][[Code (in construction)](https://github.com/hunto/LightViT)]
* **Next-ViT**: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2207.05501)]
* **XFormer**: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/pdf/2207.07268.pdf)]
* **PatchDropout**: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (*KTH, Sweden*). [[Paper](https://arxiv.org/abs/2208.07220)]
* **ClusTR**: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2208.13138)]
* **DiNAT**: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (*University of Oregon*). [[Paper](https://arxiv.org/abs/2209.15001)][[PyTorch](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer)]
* **MobileViTv3**: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (*Micron*). [[Paper](https://arxiv.org/abs/2209.15159)][[PyTorch](https://github.com/micronDLA/MobileViTv3)]
* **ViT-LSLA**: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (*Southwest University*). [[Paper](https://arxiv.org/abs/2210.17115)]
* **Token-Pooling**: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (*Apple*). [[Paper](https://openaccess.thecvf.com/content/WACV2023/html/Marin_Token_Pooling_in_Vision_Transformers_for_Image_Classification_WACV_2023_paper.html)]
* **Tri-Level**: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2211.10801)][[Code (in construction)](https://github.com/ZLKong/Tri-Level-ViT)]
* **ViTCoD**: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2210.09573)]
* **ViTALiTy**: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (*Rice University*). [[Paper](https://arxiv.org/abs/2211.05109)]
* **HeatViT**: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2211.08110)]
* **ToMe**: "Token Merging: Your ViT But Faster", ICLR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2210.09461)][[PyTorch](https://github.com/facebookresearch/ToMe)]
* **HiViT**: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2205.14949)][[PyTorch](https://github.com/zhangxiaosong18/hivit)]
* **STViT**: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2303.08685)][[PyTorch](https://github.com/changsn/STViT-R)]
* **SparseViT**: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2303.17605)][[Website](https://sparsevit.mit.edu/)]
* **Slide-Transformer**: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2304.04237)][[Code (in construction)](https://github.com/LeapLabTHU/Slide-Transformer)]
* **RIFormer**: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2304.05659)][[PyTorch](https://github.com/open-mmlab/mmpretrain/tree/main/configs/riformer)][[Website](https://techmonsterwang.github.io/RIFormer/)]
* **EfficientViT**: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2305.07027)][[PyTorch](https://github.com/microsoft/Cream/tree/main/EfficientViT)]
* **Castling-ViT**: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2211.10526)]
* **ViT-Ti**: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (*UMich*). [[Paper](https://arxiv.org/abs/2211.16421)]
* **Sparsifiner**: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (*University of Toronto*). [[Paper](https://arxiv.org/abs/2303.13755)]
* **?**: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2211.11315)]
* **LTMP**: "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 (*Ghent University, Belgium*). [[Paper](https://arxiv.org/abs/2307.10780)][[PyTorch](https://github.com/Mxbonn/ltmp)][[Website](https://maxim.bonnaerens.com/publication/ltmp/)]
* **ReViT**: "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML PKDD, 2023 (*Midea Group, China*). [[Paper](https://arxiv.org/abs/2307.02092)]
* **EfficientViT**: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2205.14756)][[PyTorch](https://github.com/mit-han-lab/efficientvit)]
* **MPCViT**: "MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention", ICCV, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2211.13955)][[PyTorch](https://github.com/PKU-SEC-Lab/mpcvit)]
* **MST**: "Masked Spiking Transformer", ICCV, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2210.01208)]
* **EfficientFormerV2**: "Rethinking Vision Transformers for MobileNet Size and Speed", ICCV, 2023 (*Snap*). [[Paper](https://arxiv.org/abs/2212.08059)][[PyTorch](https://github.com/snap-research/EfficientFormer)]
* **DiffRate**: "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2305.17997)][[PyTorch](https://github.com/OpenGVLab/DiffRate)]
* **ElasticViT**: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2303.09730)]
* **FastViT**: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", ICCV, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2303.14189)][[PyTorch](https://github.com/apple/ml-fastvit)]
* **SeiT**: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", ICCV, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2303.11114)][[PyTorch](https://github.com/naver-ai/seit)]
* **TokenReduction**: "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 (*Aalborg University, Denmark*). [[Paper](https://arxiv.org/abs/2308.04657)][[PyTorch](https://github.com/JoakimHaurum/TokenReduction)][[Website](https://vap.aau.dk/tokens/)]
* **LGViT**: "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2308.00255)]
* **LBP-WHT**: "Efficient Low-rank Backpropagation for Vision Transformer Adaptation", NeurIPS, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2309.15275)]
* **FAT**: "Lightweight Vision Transformer with Bidirectional Interaction", NeurIPS, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2306.00396)][[PyTorch](https://github.com/qhfan/FAT)]
* **MCUFormer**: "MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2310.16898)][[PyTorch](https://github.com/liangyn22/MCUFormer)]
* **SoViT**: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2305.13035)]
* **CloFormer**: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2303.17803)]
* **Quadformer**: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2304.00287)][[Code (in construction)](https://github.com/TomerRonen34/mixed-resolution-vit)]
* **SparseFormer**: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2304.03768)][[Code (in construction)](https://github.com/showlab/sparseformer)]
* **EMO**: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.01146)][[PyTorch](https://github.com/zhangzjn/EMO)]
* **ByteFormer**: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2306.00238)]
* **?**: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (*Jilin University*). [[Paper](https://arxiv.org/abs/2306.04897)]
* **FasterViT**: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2306.06189)]
* **NextViT**: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2306.10875)]
* **SkipAt**: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2301.02240)]
* **MSViT**: "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2307.02321)]
* **DiT**: "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 (*Meituan*). [[Paper](https://arxiv.org/abs/2308.03409)][[Code (in construction)](https://github.com/Maycbj/DiT)]
* **?**: "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 (*German Research Center for Artificial Intelligence (DFKI)*). [[Paper](https://arxiv.org/abs/2308.09372)][[PyTorch](https://github.com/tobna/WhatTransformerToFavor)]
* **Mobile-V-MoEs**: "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2309.04354)]
* **PPT**: "PPT: Token Pruning and Pooling for Efficient Vision Transformers", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2310.01812)]
* **MatFormer**: "MatFormer: Nested Transformer for Elastic Inference", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2310.07707)]
* **SparseFormer**: "Bootstrapping SparseFormers from Vision Foundation Models", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2312.01987)][[PyTorch](https://github.com/showlab/sparseformer)]
* **GTP-ViT**: "GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation", WACV, 2024 (*CSIRO Data61, Australia*). [[Paper](https://arxiv.org/abs/2311.03035)][[PyTorch](https://github.com/Ackesnal/GTP-ViT)]
* **ToFu**: "Token Fusion: Bridging the Gap between Token Pruning and Token Merging", WACV, 2024 (*Samsung*). [[Paper](https://arxiv.org/abs/2312.01026)]
* **Cached-Transformer**: "Cached Transformers: Improving Transformers with Differentiable Memory Cache", AAAI, 2024 (*CUHK*). [[Paper](https://arxiv.org/abs/2312.12742)]
* **LF-ViT**: "LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition", AAAI, 2024 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2402.00033)][[PyTorch](https://github.com/edgeai1/LF-ViT)]
* **EfficientMod**: "Efficient Modulation for Vision Networks", ICLR, 2024 (*Microsoft*). [[Paper](https://arxiv.org/abs/2403.19963)][[PyTorch](https://github.com/ma-xu/EfficientMod)]
* **NOSE**: "MLP Can Be A Good Transformer Learner", CVPR, 2024 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2404.05657)][[PyTorch](https://github.com/sihaoevery/lambda_vit)]
* **SLAB**: "SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization", ICML, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2405.11582)][[PyTorch](https://github.com/xinghaochen/SLAB)]
* **S2**: "When Do We Not Need Larger Vision Models?", arXiv, 2024 (*Berkeley*). [[Paper](https://arxiv.org/abs/2403.13043)][[PyTorch](https://github.com/bfshi/scaling_on_scales)]
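Many of the efficient-ViT entries above (e.g. EViT, ICLR 2022; ToMe, ICLR 2023) reduce compute by keeping fewer tokens as depth increases. The sketch below is a deliberately simplified, hypothetical version of that idea, not the exact procedure of any single paper: keep the patch tokens that receive the most [CLS] attention and fuse the rest into a single attention-weighted token. The `keep_ratio` and the use of mean [CLS] attention as the importance score are assumptions.

```python
# Illustrative attention-based token reduction (simplified sketch, not a paper's method).
import torch

def reduce_tokens(tokens, cls_attn, keep_ratio=0.5):
    """tokens: (B, N, D) patch tokens; cls_attn: (B, N) [CLS]-to-patch attention scores."""
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    _, idx = cls_attn.topk(k, dim=1)                             # most-attended tokens
    keep = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(b, k, d))
    # fuse the discarded tokens into one token, weighted by their attention scores
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, idx, torch.zeros_like(idx, dtype=torch.bool))
    rest = tokens[mask].view(b, n - k, d)
    rest_w = cls_attn[mask].view(b, n - k, 1)
    fused = (rest * rest_w).sum(1, keepdim=True) / rest_w.sum(1, keepdim=True).clamp_min(1e-6)
    return torch.cat([keep, fused], dim=1)                       # (B, k + 1, D)

x = torch.randn(2, 196, 384)
attn = torch.rand(2, 196).softmax(dim=-1)
print(reduce_tokens(x, attn).shape)                              # torch.Size([2, 99, 384])
```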
#### Conv + Transformer
* **LeViT**: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.01136)][[PyTorch](https://github.com/facebookresearch/LeViT)]
* **CeiT**: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2103.11816)][[PyTorch (rishikksh20)](https://github.com/rishikksh20/CeiT)]
* **Conformer**: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2105.03889)][[PyTorch](https://github.com/pengzhiliang/Conformer)]
* **CoaT**: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (*UCSD*). [[Paper](https://arxiv.org/abs/2104.06399)][[PyTorch](https://github.com/mlpc-ucsd/CoaT)]
* **CvT**: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2103.15808)][[Code](https://github.com/leoxiaobin/CvT)]
* **ViTc**: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.14881)]
* **ConTNet**: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (*ByteDance*). [[Paper](https://arxiv.org/abs/2104.13497)][[PyTorch](https://github.com/yan-hao-tian/ConTNet)]
* **SPACH**: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2108.13002)]
* **MobileViT**: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (*Apple*). [[Paper](https://arxiv.org/abs/2110.02178)][[PyTorch](https://github.com/apple/ml-cvnets)]
* **CMT**: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2107.06263)]
* **Mobile-Former**: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2108.05895)][[PyTorch (in construction)](https://github.com/aaboys/mobileformer)]
* **TinyViT**: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.10666)][[PyTorch](https://github.com/microsoft/Cream/tree/main/TinyViT)]
* **CETNet**: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (*OPPO*). [[Paper](https://arxiv.org/abs/2207.13317)]
* **ParC-Net**: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2203.03952)][[PyTorch](https://github.com/hkzhang91/ParC-Net)]
* **?**: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2210.07240)][[PyTorch](https://github.com/hananshafi/vits-for-small-scale-datasets)]
* **DHVT**: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (*USTC*). [[Paper](https://arxiv.org/abs/2210.05958)][[Code (in construction)](https://github.com/ArieSeirack/DHVT)]
* **iFormer**: "Inception Transformer", NeurIPS, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2205.12956)][[PyTorch](https://github.com/sail-sg/iFormer)]
* **DenseDCT**: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (*University of Kansas*). [[Paper](https://arxiv.org/abs/2210.14319)]
* **CXV**: "Convolutional Xformers for Vision", arXiv, 2022 (*IIT Bombay*). [[Paper](https://arxiv.org/abs/2201.10271)][[PyTorch](https://github.com/pranavphoenix/CXV)]
* **ConvMixer**: "Patches Are All You Need?", arXiv, 2022 (*CMU*). [[Paper](https://arxiv.org/abs/2201.09792)][[PyTorch](https://github.com/locuslab/convmixer)]
* **MobileViTv2**: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (*Apple*). [[Paper](https://arxiv.org/abs/2206.02680)][[PyTorch](https://github.com/apple/ml-cvnets)]
* **UniFormer**: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2201.09450)][[PyTorch](https://github.com/Sense-X/UniFormer)]
* **MoCoViT**: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2205.12635)]
* **DynamicViT**: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2207.01580)][[PyTorch](https://github.com/raoyongming/DynamicViT)]
* **ConvFormer**: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (*National University of Defense Technology, China*). [[Paper](https://arxiv.org/abs/2209.07738)]
* **Fast-ParC**: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2210.04020)]
* **MetaFormer**: "MetaFormer Baselines for Vision", arXiv, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2210.13452)][[PyTorch](https://github.com/sail-sg/metaformer)]
* **STM**: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2211.05781)][[Code (in construction)](https://github.com/OpenGVLab/STM-Evaluation)]
* **ParCNetV2**: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (*Intellifusion, China*). [[Paper](https://arxiv.org/abs/2211.07157)]
* **VAN**: "Visual Attention Network", arXiv, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2202.09741)][[PyTorch](https://github.com/Visual-Attention-Network)]
* **SD-MAE**: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (*Hangzhou Dianzi University*). [[Paper](https://arxiv.org/abs/2212.05677)][[PyTorch (in construction)](https://github.com/Talented-Q/SDMAE)]
* **SATA**: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (*University of Kansas*). [[Paper](https://arxiv.org/abs/2210.12333)][[PyTorch (in construction)](https://github.com/xiangyu8/SATA)]
* **SparK**: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (*Bytedance*). [[Paper](https://openreview.net/forum?id=NRxydtWup1S)][[PyTorch](https://github.com/keyu-tian/SparK)]
* **MOAT**: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2210.01820)][[Tensorflow](https://github.com/google-research/deeplab2)]
* **InternImage**: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2211.05778)][[PyTorch](https://github.com/OpenGVLab/InternImage)]
* **SwiftFormer**: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", ICCV, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2303.15446)][[PyTorch](https://github.com/Amshaker/SwiftFormer)]
* **SCSC**: "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2308.07110)]
* **PSLT**: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2304.03481)][[Website](https://isee-ai.cn/wugaojie/PSLT.html)]
* **RepViT**: "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2307.09283)][[PyTorch](https://github.com/jameslahm/RepViT)]
* **?**: "Interpret Vision Transformers as ConvNets with Dynamic Convolutions", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2309.10713)]
* **UPDP**: "UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer", AAAI, 2024 (*AMD*). [[Paper](https://arxiv.org/abs/2401.06426)]
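A pattern shared by many entries in this subsection (e.g. CvT, MobileViT, CMT) is to pair a convolution for local inductive bias with self-attention for global token mixing. The block below is a minimal sketch of that pattern under assumed dimensions and layer ordering; it is illustrative only and does not reproduce any one paper's architecture.

```python
# Minimal conv + attention hybrid block (illustrative sketch).
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.local = nn.Sequential(                      # local mixing via depthwise conv
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                # x: (B, C, H, W) feature map
        x = x + self.local(x)                            # convolutional branch (residual)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)                 # global self-attention branch
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 64, 14, 14)
print(ConvAttnBlock()(feat).shape)                       # torch.Size([2, 64, 14, 14])
```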
#### Training + Transformer
* **iGPT**: "Generative Pretraining From Pixels", ICML, 2020 (*OpenAI*). [[Paper](http://proceedings.mlr.press/v119/chen20s.html)][[Tensorflow](https://github.com/openai/image-gpt)]
* **CLIP**: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (*OpenAI*). [[Paper](https://arxiv.org/abs/2103.00020)][[PyTorch](https://github.com/openai/CLIP)]
* **MoCo-V3**: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.02057)]
* **DINO**: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.14294)][[PyTorch](https://github.com/facebookresearch/dino)]
* **drloc**: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (*University of Trento*). [[Paper](https://arxiv.org/abs/2106.03746)][[PyTorch](https://github.com/yhlleo/VTs-Drloc)]
* **CARE**: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2110.05340)][[PyTorch](https://github.com/ChongjianGE/CARE)]
* **MST**: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2106.05656)]
* **SiT**: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (*University of Surrey*). [[Paper](https://arxiv.org/abs/2104.03602)][[PyTorch](https://github.com/Sara-Ahmed/SiT)]
* **MoBY**: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2105.04553)][[PyTorch](https://github.com/SwinTransformer/Transformer-SSL)]
* **?**: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (*Pune Institute of Computer Technology, India*). [[Paper](https://arxiv.org/abs/2110.05270)]
* **Annotations-1.3B**: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (*Pinterest*). [[Paper](https://arxiv.org/abs/2108.05887)]
* **BEiT**: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.08254)][[PyTorch](https://github.com/microsoft/unilm/tree/master/beit)]
* **EsViT**: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.09785)]
* **iBOT**: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2111.07832)][[PyTorch](https://github.com/bytedance/ibot)]
* **MaskFeat**: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2112.09133)]
* **AutoProg**: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2203.14509)][[Code (in construction)](https://github.com/changlin31/AutoProg)]
* **MAE**: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2111.06377)][[PyTorch](https://github.com/facebookresearch/mae)][[PyTorch (pengzhiliang)](https://github.com/pengzhiliang/MAE-pytorch)]
* **SimMIM**: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.09886)][[PyTorch](https://github.com/microsoft/SimMIM)]
* **SelfPatch**: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2206.07990)][[PyTorch](https://github.com/alinlab/SelfPatch)]
* **Bootstrapping-ViTs**: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2112.03552)][[PyTorch](https://github.com/zhfeing/Bootstrapping-ViTs-pytorch)]
* **TransMix**: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (*JHU*). [[Paper](https://arxiv.org/abs/2111.09833)][[PyTorch](https://github.com/Beckschen/TransMix)]
* **PatchRot**: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (*Arizona State*). [[Paper](https://drive.google.com/file/d/1ZHdBMa-MCx05Y0teqb0vmgiiYj8t5xBB/view)]
* **SplitMask**: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2112.10740)]
* **MC-SSL**: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2111.15340)]
* **RelViT**: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (*University of Padova, Italy*). [[Paper](https://arxiv.org/abs/2206.00481?context=cs)]
* **data2vec**: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2202.03555)][[PyTorch](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec)]
* **SSTA**: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (*Tencent*). [[Paper](https://proceedings.mlr.press/v162/wu22c.html)][[Code (in construction)](https://github.com/GlassyWu/SSTA)]
* **MP3**: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (*Apple*). [[Paper](https://arxiv.org/abs/2207.07611)][[PyTorch](https://github.com/arshadshk/Position-Prediction-Pretraining)]
* **CutMixSL**: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (*Yonsei University, Korea*). [[Paper](https://arxiv.org/abs/2207.00234)]
* **BootMAE**: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.07116)][[PyTorch](https://github.com/LightDXY/BootMAE)]
* **TokenMix**: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.08409)][[PyTorch](https://github.com/Sense-X/TokenMix)]
* **?**: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2207.10026)][[PyTorch](https://github.com/lkhl/tiny-transformers)]
* **HAT**: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2204.00993)][[PyTorch](https://github.com/jiawangbai/HAT)]
* **IDMM**: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2201.10728)]
* **AttMask**: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (*National Technical University of Athens*). [[Paper](https://arxiv.org/abs/2203.12719)][[PyTorch](https://github.com/gkakogeorgiou/attmask)]
* **SLIP**: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (*Berkeley + Meta*). [[Paper](https://arxiv.org/abs/2112.12750)][[Pytorch](https://github.com/facebookresearch/SLIP)]
* **mc-BEiT**: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (*Peking University*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/1197_ECCV_2022_paper.php)]
* **SL2O**: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (*UT Austin*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/2909_ECCV_2022_paper.php)][[PyTorch](https://github.com/VITA-Group/Scalable-L2O)]
* **TokenMixup**: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2210.07562)][[PyTorch](https://github.com/mlvlab/TokenMixup)]
* **PatchRot**: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (*Arizona State University*). [[Paper](https://arxiv.org/abs/2210.15722)]
* **GreenMIM**: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (*The University of Tokyo*). [[Paper](https://arxiv.org/abs/2205.13515)][[PyTorch](https://github.com/LayneH/GreenMIM)]
* **DP-CutMix**: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2210.15986)]
* **?**: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (*Google*). [[Paper](https://openreview.net/forum?id=4nPswr1KcP)][[Tensorflow](https://github.com/google-research/vision_transformer)][[PyTorch (rwightman)](https://github.com/rwightman/pytorch-image-models)]
* **PeCo**: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2111.12710)]
* **RePre**: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2201.06857)]
* **Beyond-Masking**: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2203.14313)][[Code (in construction)](https://github.com/sunsmarterjie/beyond_masking)]
* **Kronecker-Adaptation**: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.16329)]
* **DILEMMA**: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (*University of Bern, Switzerland*). [[Paper](https://arxiv.org/abs/2204.04788)]
* **DeiT-III**: "DeiT III: Revenge of the ViT", arXiv, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2204.07118)]
* **?**: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2205.01580)][[Tensorflow](https://github.com/google-research/big_vision)]
* **ConvMAE**: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2205.03892)][[PyTorch (in construction)](https://github.com/Alpha-VL/ConvMAE)]
* **UM-MAE**: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (*Nanjing University of Science and Technology*). [[Paper](https://arxiv.org/abs/2205.10063)][[PyTorch](https://github.com/implus/UM-MAE)]
* **GMML**: "GMML is All you Need", arXiv, 2022 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2205.14986)][[PyTorch](https://github.com/Sara-Ahmed/GMML)]
* **SIM**: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2206.01204)]
* **SupMAE**: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2205.14540)][[PyTorch](https://github.com/cmu-enyac/supmae)]
* **LoMaR**: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (*KAUST*). [[Paper](https://arxiv.org/abs/2206.00790)]
* **SAR**: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (*University of Trento, Italy*). [[Paper](https://arxiv.org/abs/2206.04636)]
* **ExtreMA**: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.04667)]
* **?**: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (*Nankai University*). [[Paper](https://arxiv.org/abs/2206.05184)]
* **?**: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (*Southwest Jiaotong University*). [[Paper](https://arxiv.org/abs/2206.04981)]
* **Jigsaw-ViT**: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (*KU Leuven, Belgium*). [[Paper](https://arxiv.org/abs/2207.11971)][[PyTorch](https://github.com/yingyichen-cyy/Nested-Co-teaching)][[Website](https://yingyichen-cyy.github.io/Jigsaw-ViT/)]
* **BEiT-v2**: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.06366)][[PyTorch](https://github.com/microsoft/unilm/tree/master/beit)]
* **MILAN**: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (*Princeton*). [[Paper](https://arxiv.org/abs/2208.06049)][[PyTorch (in construction)](https://github.com/zejiangh/MILAN)]
* **PSS**: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (*Franklin and Marshall College, Pennsylvania*). [[Paper](https://arxiv.org/abs/2208.09520)][[PyTorch](https://github.com/BradMcDanel/pss)]
* **dBOT**: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2209.03917)]
* **PatchErasing**: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2209.15006)]
* **Self-Distillation**: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2210.02871)]
* **AutoView**: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2210.08458)][[Code (in construction)](https://github.com/Trent-tangtao/AutoView)]
* **LOCA**: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2212.02400)]
* **FT-CLIP**: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2212.06138)][[Code (in construction)](https://github.com/LightDXY/FT-CLIP)]
* **MixPro**: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (*Beijing University of Chemical Technology*). [[Paper](https://arxiv.org/abs/2304.12043)][[PyTorch (in construction)](https://github.com/fistyee/MixPro)]
* **ConMIM**: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2205.09616)][[Pytorch](https://github.com/TencentARC/ConMIM)]
* **ccMIM**: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (*Shanghai Jiao Tong*). [[Paper](https://openreview.net/forum?id=A3sgyt4HWp)]
* **CIM**: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (*Microsoft*). [[Paper](https://openreview.net/forum?id=09hVcSDkea)]
* **MFM**: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (*NTU, Singapore*). [[Paper](https://openreview.net/forum?id=9-umxtNPx5E)][[Website](https://www.mmlab-ntu.com/project/mfm/index.html)]
* **Mask3D**: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2302.14746)]
* **VisualAtom**: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (*National Institute of Advanced Industrial Science and Technology (AIST), Japan*). [[Paper](https://arxiv.org/abs/2303.01112)][[PyTorch](https://github.com/masora1030/CVPR2023-FDSL-on-VisualAtom)][[Website](https://masora1030.github.io/Visual-Atoms-Pre-training-Vision-Transformers-with-Sinusoidal-Waves/)]
* **MixedAE**: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2303.17152)]
* **TBM**: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (*Singapore University of Technology and Design*). [[Paper](https://arxiv.org/abs/2304.04175)]
* **LGSimCLR**: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (*UMich*). [[Paper](https://arxiv.org/abs/2302.12248)][[PyTorch](https://github.com/mbanani/lgssl)]
* **DisCo-CLIP**: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2304.08480)][[PyTorch (in construction)](https://github.com/IDEA-Research/DisCo-CLIP)]
* **MaskCLIP**: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.12262)][[Code (in construction)](https://github.com/LightDXY/MaskCLIP)]
* **MAGE**: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2211.09117)][[PyTorch](https://github.com/LTH14/mage)]
* **MixMIM**: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2205.13137)][[PyTorch](https://github.com/Sense-X/MixMIM)]
* **iTPN**: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.12735)][[PyTorch](https://github.com/sunsmarterjie/iTPN)]
* **DropKey**: "DropKey for Vision Transformer", CVPR, 2023 (*Meitu*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Li_DropKey_for_Vision_Transformer_CVPR_2023_paper.html)]
* **FlexiViT**: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.08013)][[Tensorflow](https://github.com/google-research/big_vision)]
* **RA-CLIP**: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (*Alibaba*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Xie_RA-CLIP_Retrieval_Augmented_Contrastive_Language-Image_Pre-Training_CVPR_2023_paper.html)]
* **CLIPPO**: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.08045)][[JAX](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/clippo/README.md)]
* **DMAE**: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (*JHU + UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2208.12256)][[PyTorch](https://github.com/UCSC-VLAA/DMAE)]
* **HPM**: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2304.05919)][[PyTorch](https://github.com/Haochen-Wang409/HPM)]
* **LocalMIM**: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2303.05251)]
* **MaskAlign**: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2211.08887)][[PyTorch](https://github.com/OpenDriveLab/maskalign)]
* **RILS**: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.06958)][[Code (in construction)](https://github.com/hustvl/RILS)]
* **RelaxMIM**: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2208.04164)]
* **FDT**: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.14865)][[Code (in construction)](https://github.com/yuxiaochen1103/FDT)]
* **?**: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2206.01125)]
* **OpenCLIP**: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (*LAION*). [[Paper](https://arxiv.org/abs/2212.07143)][[PyTorch](https://github.com/LAION-AI/scaling-laws-openclip)]
* **DiHT**: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.02280)][[PyTorch](https://github.com/facebookresearch/diht)]
* **M3I-Pretraining**: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2211.09807)][[Code (in construction)](https://github.com/OpenGVLab/M3I-Pretraining)]
* **SN-Net**: "Stitchable Neural Networks", CVPR, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2302.06586)][[PyTorch](https://github.com/ziplab/SN-Net)]
* **MAE-Lite**: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2205.14443)][[PyTorch](https://github.com/wangsr126/mae-lite)]
* **ViT-22B**: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2302.05442)]
* **GHN-3**: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2303.04143)][[PyTorch](https://github.com/SamsungSAILMontreal/ghn3)]
* **A2MIM**: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (*Westlake University, China*). [[Paper](https://arxiv.org/abs/2205.13943)][[PyTorch](https://github.com/Westlake-AI/openmixup)]
* **PQCL**: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (*Alibaba*). [[Paper](https://openreview.net/forum?id=Si9pBgOGeD)][[PyTorch](https://github.com/Sherrylone/Query_Contrastive)]
* **DreamTeacher**: "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2307.07487)][[Website](https://research.nvidia.com/labs/toronto-ai/DreamTeacher/)]
* **OFDB**: "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 (*National Institute of Advanced Industrial Science and Technology (AIST), Japan*). [[Paper](https://arxiv.org/abs/2307.14710)][[PyTorch](https://github.com/ryoo-nakamura/OFDB/)]
* **MFF**: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2308.00261)][[PyTorch](https://github.com/open-mmlab/mmpretrain)]
* **TL-Align**: "Token-Label Alignment for Vision Transformers", ICCV, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2210.06455)][[PyTorch](https://github.com/Euphoria16/TL-Align)]
* **SMMix**: "SMMix: Self-Motivated Image Mixing for Vision Transformers", ICCV, 2023 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2212.12977)][[PyTorch](https://github.com/ChenMnZ/SMMix)]
* **DiffMAE**: "Diffusion Models as Masked Autoencoders", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.03283)][[Website](https://weichen582.github.io/diffmae.html)]
* **MAWS**: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2303.13496)][[PyTorch](https://github.com/facebookresearch/maws)]
* **CountBench**: "Teaching CLIP to Count to Ten", ICCV, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2302.12066)]
* **CLIPpy**: "Perceptual Grouping in Vision-Language Models", ICCV, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2210.09996)]
* **CiT**: "CiT: Curation in Training for Effective Vision-Language Data", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.02241)][[PyTorch](https://github.com/facebookresearch/CiT)]
* **I-JEPA**: "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.08243)]
* **EfficientTrain**: "EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2211.09703)][[PyTorch](https://github.com/LeapLabTHU/EfficientTrain)]
* **StableRep**: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.00984)][[PyTorch](https://github.com/google-research/syn-rep-learn)]
* **LaCLIP**: "Improving CLIP Training with Language Rewrites", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2305.20088)][[PyTorch](https://github.com/LijieFan/LaCLIP)]
* **DesCo**: "DesCo: Learning Object Recognition with Rich Language Descriptions", NeurIPS, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2306.14060)]
* **?**: "Stable and low-precision training for large-scale vision-language models", NeurIPS, 2023 (*UW*). [[Paper](https://arxiv.org/abs/2304.13013)]
* **CapPa**: "Image Captioners Are Scalable Vision Learners Too", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2306.07915)][[JAX](https://github.com/google-research/big_vision)]
* **IV-CL**: "Does Visual Pretraining Help End-to-End Reasoning?", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2307.08506)]
* **CLIPA**: "An Inverse Scaling Law for CLIP Training", NeurIPS, 2023 (*UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2305.07017)][[PyTorch](https://github.com/UCSC-VLAA/CLIPA)]
* **Hummingbird**: "Towards In-context Scene Understanding", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2306.01667)]
* **RevColV2**: "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", NeurIPS, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2309.01005)][[PyTorch](https://github.com/megvii-research/RevCol)]
* **ALIA**: "Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation", NeurIPS, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2305.16289)][[PyTorch](https://github.com/lisadunlap/ALIA)]
* **?**: "Improving Multimodal Datasets with Image Captioning", NeurIPS (Datasets and Benchmarks), 2023 (*UW*). [[Paper](https://arxiv.org/abs/2307.10350)]
* **CCViT**: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2303.04664)]
* **SoftCLIP**: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2303.17561)]
* **RECLIP**: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.06028)]
* **DINOv2**: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.07193)]
* **?**: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.13089)]
* **Filter**: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2305.05095)]
* **?**: "Improved baselines for vision-language pre-training", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2305.08675)]
* **3T**: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2305.16999)]
* **ADDP**: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (*CUHK + Tsinghua*). [[Paper](https://arxiv.org/abs/2306.05423)]
* **MOFI**: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2306.07952)]
* **MaPeT**: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (*UniMoRE, Italy*). [[Paper](https://arxiv.org/abs/2306.07346)][[PyTorch](https://github.com/aimagelab/MaPeT)]
* **RECO**: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.07196)]
* **CLIPA-v2**: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (*UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2306.15658)][[PyTorch](https://github.com/UCSC-VLAA/CLIPA)]
* **PatchMixing**: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (*Boston*). [[Paper](https://arxiv.org/abs/2306.17848)][[Website](https://arielnlee.github.io/PatchMixing/)]
* **SN-Netv2**: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2307.00154)][[PyTorch (in construction)](https://github.com/ziplab/SN-Netv2)]
* **CLIP-GPT**: "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 (*Dublin City University, Ireland*). [[Paper](https://arxiv.org/abs/2307.11661)]
* **FlexPredict**: "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2308.00566)]
* **Soft-MoE**: "From Sparse to Soft Mixtures of Experts", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2308.00951)]
* **DropPos**: "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", NeurIPS, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2309.03576)][[PyTorch](https://github.com/Haochen-Wang409/DropPos)]
* **MIRL**: "Masked Image Residual Learning for Scaling Deeper Vision Transformers", NeurIPS, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2309.14136)]
* **CMM**: "Investigating the Limitation of CLIP Models: The Worst-Performing Categories", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2310.03324)]
* **LC-MAE**: "Longer-range Contextualized Masked Autoencoder", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2310.13593)]
* **SILC**: "SILC: Improving Vision Language Pretraining with Self-Distillation", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2310.13355)]
* **CLIPTex**: "CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2310.14108)]
* **NxTP**: "Object Recognition as Next Token Prediction", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.02142)][[PyTorch](https://github.com/kaiyuyue/nxtp)]
* **?**: "Scaling Laws of Synthetic Images for Model Training ... for Now", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.04567)][[PyTorch](https://github.com/google-research/syn-rep-learn)]
* **SynCLR**: "Learning Vision from Models Rivals Learning Vision from Data", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.17742)][[PyTorch](https://github.com/google-research/syn-rep-learn)]
* **EWA**: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.06093)]
* **DTM**: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2401.00254)]
* **SSAT**: "Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders", WACV, 2024 (*UNC Charlotte*). [[Paper](https://arxiv.org/abs/2310.20704)][[Code (in construction)](https://github.com/dominickrei/Limited-data-vits)]
* **FEC**: "Neural Clustering based Visual Representation Learning", CVPR, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2403.17409)]
* **EfficientTrain++**: "EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training", TPAMI, 2024 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2405.08768)][[PyTorch](https://github.com/LeapLabTHU/EfficientTrain)]
* **DVT**: "Denoising Vision Transformers", arXiv, 2024 (*USC*). [[Paper](https://arxiv.org/abs/2401.02957)][[PyTorch](https://github.com/Jiawei-Yang/Denoising-ViT)][[Website](https://jiawei-yang.github.io/DenoisingViT/)]
* **AIM**: "Scalable Pre-training of Large Autoregressive Image Models", arXiv, 2024 (*Apple*). [[Paper](https://arxiv.org/abs/2401.08541)][[PyTorch](https://github.com/apple/ml-aim)]
* **DDM**: "Deconstructing Denoising Diffusion Models for Self-Supervised Learning", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2401.14404)]
* **CrossMAE**: "Rethinking Patch Dependence for Masked Autoencoders", arXiv, 2024 (*Berkeley*). [[Paper](https://arxiv.org/abs/2401.14391)][[PyTorch](https://github.com/TonyLianLong/CrossMAE)][[Website](https://crossmae.github.io/)]
* **IWM**: "Learning and Leveraging World Models in Visual Representation Learning", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2403.00504)]
* **?**: "Can Generative Models Improve Self-Supervised Representation Learning?", arXiv, 2024 (*Vector Institute*). [[Paper](https://arxiv.org/abs/2403.05966)]
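
Many of the language-image entries above (e.g., OpenCLIP, LaCLIP, CLIPA, SILC) build on the same symmetric contrastive (InfoNCE) objective over paired image and text embeddings. The snippet below is a minimal, illustrative PyTorch sketch of that loss; the embedding dimension, batch size, and temperature are assumptions for the example, not values taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from two encoders (assumed here).
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Toy usage with random features standing in for encoder outputs
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```
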
#### Robustness + Transformer
* **ViT-Robustness**: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2103.14586)]
* **SAGA**: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (*University of Connecticut*). [[Paper](https://arxiv.org/abs/2104.02610)]
* **?**: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (*KAIST*). [[Paper](https://arxiv.org/abs/2110.02797)][[PyTorch](https://github.com/phibenz/robustness_comparison_vit_mlp-mixer_cnn)]
* **ViTs-vs-CNNs**: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (*JHU + UC Santa Cruz*). [[Paper](https://arxiv.org/abs/2111.05464)][[PyTorch](https://github.com/ytongbai/ViTs-vs-CNNs)]
* **T-CNN**: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.05795)]
* **Transformer-Attack**: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (*Xi'an Jiaotong*). [[Paper](https://arxiv.org/abs/2103.15670)]
* **?**: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (*University of Rennes*). [[Paper](https://arxiv.org/abs/2106.03734)]
* **?**: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (*ANU*). [[Paper](https://arxiv.org/abs/2106.04169)][[PyTorch](https://github.com/Muzammal-Naseer/Improving-Adversarial-Transferability-of-Vision-Transformers)]
* **?**: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (*University of Pittsburgh*). [[Paper](https://arxiv.org/abs/2106.13122)]
* **Token-Attack**: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (*New York University*). [[Paper](https://arxiv.org/abs/2110.04337)]
* **?**: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2111.10493)]
* **?**: "Vision Transformers are Robust Learners", AAAI, 2022 (*PyImageSearch + IBM*). [[Paper](https://arxiv.org/abs/2105.07581)][[Tensorflow](https://github.com/sayakpaul/robustness-vit)]
* **PNA**: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (*Fudan + Maryland*). [[Paper](https://arxiv.org/abs/2109.04176)][[PyTorch](https://github.com/zhipeng-wei/PNA-PatchOut)]
* **MIA-Former**: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (*Rice University*). [[Paper](https://arxiv.org/abs/2112.11542)]
* **Patch-Fool**: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (*Rice University*). [[Paper](https://arxiv.org/abs/2203.08392)][[PyTorch](https://github.com/RICE-EIC/Patch-Fool)]
* **Generalization-Enhanced-ViT**: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (*Beihang University + NTU, Singapore*). [[Paper](https://arxiv.org/abs/2106.07617)]
* **ECViT**: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2203.08519)]
* **Attention-Fool**: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (*Bosch*). [[Paper](https://arxiv.org/abs/2203.13639)]
* **Memory-Token**: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2203.15243)]
* **APRIL**: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2112.14087)]
* **Smooth-ViT**: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (*MIT*). [[Paper](https://arxiv.org/abs/2110.07719)][[PyTorch](https://github.com/MadryLab/smoothed-vit)]
* **RVT**: "Towards Robust Vision Transformer", CVPR, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2105.07926)][[PyTorch](https://github.com/alibaba/easyrobust)]
* **Pyramid**: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2111.15121)]
* **VARS**: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (*Berkeley + Microsoft*). [[Paper](https://arxiv.org/abs/2204.10962)][[PyTorch](https://github.com/bfshi/VARS)]
* **FAN**: "Understanding The Robustness in Vision Transformers", ICML, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2204.12451)][[PyTorch](https://github.com/NVlabs/FAN)]
* **CFA**: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (*The University of Tokyo*). [[Paper](https://arxiv.org/abs/2206.13951)][[PyTorch](https://github.com/kojima-takeshi188/CFA)]
* **?**: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (*University of Exeter, UK*). [[Paper](https://arxiv.org/abs/2208.00906)][[PyTorch](https://github.com/TrustAI/ODE4RobustViT)]
* **?**: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2207.11347)]
* **AGAT**: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2207.10498)]
* **?**: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2111.10659)]
* **ViP**: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (*UC Santa Cruz*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/7173_ECCV_2022_paper.php)][[PyTorch](https://github.com/UCSC-VLAA/vit_cert)]
* **?**: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2210.07540)][[PyTorch](https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers)]
* **PAR**: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2112.03492)]
* **RobustViT**: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (*Tel-Aviv*). [[Paper](https://arxiv.org/abs/2206.01161)][[PyTorch](https://github.com/hila-chefer/RobustViT)]
* **?**: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2110.07858)]
* **NVD**: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (*Boston*). [[Paper](https://openreview.net/forum?id=Aisi2oEq1sc)]
* **?**: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (*UW-Madison*). [[Paper](https://arxiv.org/abs/2203.09125)]
* **MA**: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.12204)]
* **?**: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (*Fudan + Microsoft*). [[Paper](https://arxiv.org/abs/2204.12143)]
* **?**: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (*Tokyo Metropolitan University*). [[Paper](https://arxiv.org/abs/2205.12041)]
* **FedWAvg**: "Federated Adversarial Training with Transformers", arXiv, 2022 (*Institute of Electronics and Digital Technologies (IETR), France*). [[Paper](https://arxiv.org/abs/2206.02131)]
* **Backdoor-Transformer**: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (*Maryland + UC Davis*). [[Paper](https://arxiv.org/abs/2206.08477)][[Code (in construction)](https://github.com/UCDvision/backdoor_transformer)]
* **?**: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2206.12381)]
* **?**: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (*Tokyo Metropolitan University*). [[Paper](https://arxiv.org/abs/2207.05366)]
* **?**: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2208.09602)]
* **CLIPping Privacy**: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2209.07341)]
* **?**: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (*EPFL*). [[Paper](https://arxiv.org/abs/2209.07399)]
* **?**: "Attacking Compressed Vision Transformers", arXiv, 2022 (*NYU*). [[Paper](https://arxiv.org/abs/2209.13785)]
* **C-AVP**: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (*Michigan State*). [[Paper](https://arxiv.org/abs/2210.06284)]
* **?**: "Curved Representation Space of Vision Transformers", arXiv, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2210.05742)]
* **RKDE**: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2210.05794)]
* **MRAP**: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (*Arizona State University*). [[Paper](https://arxiv.org/abs/2210.07663)]
* **model-soup**: "Revisiting adapters with adversarial training", ICLR, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2210.04886)]
* **?**: "Budgeted Training for Vision Transformer", ICLR, 2023 (*Tsinghua*). [[Paper](https://openreview.net/forum?id=sVzBN-DlJRi)]
* **RobustCNN**: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (*UC Santa Cruz + JHU*). [[Paper](https://arxiv.org/abs/2206.03452)][[PyTorch](https://github.com/UCSC-VLAA/RobustCNN)]
* **DMAE**: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2210.06983)][[PyTorch](https://github.com/quanlin-wu/dmae)]
* **TGR**: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.15754)][[PyTorch](https://github.com/jpzhang1810/TGR)]
* **TrojViT**: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (*Indiana University Bloomington*). [[Paper](https://arxiv.org/abs/2208.13049)]
* **RSPC**: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (*MPI*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Guo_Improving_Robustness_of_Vision_Transformers_by_Reducing_Sensitivity_To_Patch_CVPR_2023_paper.html)]
* **TORA-ViT**: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (*The University of Sydney*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Li_Trade-Off_Between_Robustness_and_Accuracy_of_Vision_Transformers_CVPR_2023_paper.html)]
* **BadViT**: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Yuan_You_Are_Catching_My_Attention_Are_Vision_Transformers_Bad_Learners_CVPR_2023_paper.html)]
* **?**: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (*University of Pittsburgh*). [[Paper](https://openreview.net/forum?id=GR4c6Onxfw)]
* **RobustMAE**: "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2308.10315)][[PyTorch (in construction)](https://github.com/shikiw/RobustMAE)]
* **?**: "Efficiently Robustify Pre-trained Models", ICCV, 2023 (*IIT Roorkee, India*). [[Paper](https://arxiv.org/abs/2309.07499)]
* **?**: "Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients", ICCV, 2023 (*Tsinghua*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Ma_Transferable_Adversarial_Attack_for_Both_Vision_Transformers_and_Convolutional_Networks_ICCV_2023_paper.html)]
* **CleanCLIP**: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", ICCV, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2303.03323)][[PyTorch](https://github.com/nishadsinghi/CleanCLIP)]
* **QBBA**: "Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks", BMVC, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2309.06438)]
* **RBFormer**: "RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias", BMVC, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2309.13245)]
* **PreLayerNorm**: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (*POSTECH*). [[Paper](https://arxiv.org/abs/2111.08413)]
* **CertViT**: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (*INRIA*). [[Paper](https://arxiv.org/abs/2302.10287)][[PyTorch](https://github.com/sagarverma/transformer-lipschitz)]
* **RoCLIP**: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2303.06854)]
* **DeepMIM**: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2303.08817)][[Code (in construction)](https://github.com/OliverRensu/DeepMIM)]
* **TAP-ADL**: "Robustifying Token Attention for Vision Transformers", ICCV, 2023 (*MPI*). [[Paper](https://arxiv.org/abs/2303.11126)][[PyTorch](https://github.com/guoyongcs/TAPADL)]
* **EWA**: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.06093)]
* **SlowFormer**: "SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers", arXiv, 2023 (*UC Davis*). [[Paper](https://arxiv.org/abs/2310.02544)][[PyTorch](https://github.com/UCDvision/SlowFormer)]
* **DTM**: "Masked Image Modeling via Dynamic Token Morphing", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2401.00254)]
* **SWARM**: "Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers", CVPR, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2405.10612)][[Code (in construction)](https://github.com/20000yshust/SWARM)]
* **?**: "Safety of Multimodal Large Language Models on Images and Text", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2402.00357)]
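
A recurring baseline in the robustness papers above is evaluating a ViT under small adversarial perturbations. The sketch below implements a single-step FGSM attack in PyTorch as a generic point of reference; the model stub, epsilon, and toy data are assumptions for illustration, and the snippet does not reproduce the protocol of any particular paper in this list.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=8 / 255):
    """One-step FGSM: perturb inputs along the sign of the input gradient.

    model: any differentiable classifier (e.g., a ViT); assumed here.
    images: (B, C, H, W) in [0, 1]; labels: (B,) class indices.
    """
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Move each pixel by +/- epsilon in the direction that increases the loss
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0, 1).detach()

if __name__ == "__main__":
    # Toy stand-in classifier so the sketch runs end-to-end
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(4, 3, 32, 32)
    y = torch.randint(0, 10, (4,))
    x_adv = fgsm_attack(model, x, y)
    print((x_adv - x).abs().max().item())  # roughly equal to epsilon
```
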
#### Model Compression + Transformer
* **ViT-quant**: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2106.14156)]
* **VTP**: "Visual Transformer Pruning", arXiv, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2104.08500)]
* **MD-ViT**: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (*Princeton*). [[Paper](https://arxiv.org/abs/2201.00043)]
* **FQ-ViT**: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2111.13824)][[PyTorch](https://github.com/linyang-zhh/FQ-ViT)]
* **UVC**: "Unified Visual Transformer Compression", ICLR, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2203.08243)][[PyTorch](https://github.com/VITA-Group/UVC)]
* **MiniViT**: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2204.07154)][[PyTorch](https://github.com/microsoft/Cream/tree/main/MiniViT)]
* **Auto-ViT-Acc**: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2208.05163)]
* **APQ-ViT**: "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2303.14341)]
* **SPViT**: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2112.13890)][[PyTorch](https://github.com/PeiyanFlying/SPViT)]
* **PSAQ-ViT**: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2203.02250)][[PyTorch](https://github.com/zkkli/PSAQ-ViT)]
* **PTQ4ViT**: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2111.12293)]
* **EAPruning**: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2210.00181)]
* **Q-ViT**: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2210.06707)][[PyTorch](https://github.com/YanjingLi0202/Q-ViT)]
* **SAViT**: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 (*Hikvision*). [[Paper](https://openreview.net/forum?id=w5DacXWzQ-Q)]
* **VTC-LFC**: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 (*Alibaba*). [[Paper](https://openreview.net/forum?id=HuiLIB6EaOk)][[PyTorch](https://github.com/Daner-Wang/VTC-LFC)]
* **Q-ViT**: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2201.07703)]
* **VAQF**: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2201.06618)]
* **VTP**: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (*UCLA*). [[Paper](https://arxiv.org/abs/2203.13444)]
* **SiDT**: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (*UC Irvine*). [[Paper](https://arxiv.org/abs/2204.07722)]
* **PSAQ-ViT-V2**: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2209.05687)][[PyTorch](https://github.com/zkkli/PSAQ-ViT)]
* **AS**: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2209.13802)]
* **SaiT**: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2210.05832)]
* **oViT**: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (*IST Austria*). [[Paper](https://arxiv.org/abs/2210.09223)]
* **CPT-V**: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2211.09643)]
* **TPS**: "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2304.10716)][[PyTorch](https://github.com/megvii-research/TPS-CVPR2023)]
* **GPUSQ-ViT**: "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2305.10727)]
* **X-Pruner**: "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 (*James Cook University, Australia*). [[Paper](https://arxiv.org/abs/2303.04935)][[PyTorch (in construction)](https://github.com/vickyyu90/XPruner)]
* **NoisyQuant**: "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2211.16056)]
* **NViT**: "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2110.04869)]
* **BinaryViT**: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2306.16678)][[PyTorch](https://github.com/phuoc-hoan-le/binaryvit)]
* **OFQ**: "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2302.02210)][[PyTorch](https://github.com/nbasyl/OFQ)]
* **UPop**: "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2301.13741)][[PyTorch](https://github.com/sdc17/UPop)]
* **COMCAT**: "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 (*Rutgers*). [[Paper](https://arxiv.org/abs/2305.17235)][[PyTorch](https://github.com/jinqixiao/ComCAT)]
* **Evol-Q**: "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2308.10814)][[Code (in construction)](https://github.com/enyac-group/evol-q)]
* **BiViT**: "BiViT: Extremely Compressed Binary Vision Transformer", ICCV, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2211.07091)]
* **I-ViT**: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2207.01405)][[PyTorch](https://github.com/zkkli/I-ViT)]
* **RepQ-ViT**: "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2212.08254)][[PyTorch](https://github.com/zkkli/RepQ-ViT)]
* **LLM-FP4**: "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", EMNLP, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2310.16836)][[Code (in construction)](https://github.com/nbasyl/LLM-FP4)]
* **Q-HyViT**: "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 (*Electronics and Telecommunications Research Institute (ETRI), Korea*). [[Paper](https://arxiv.org/abs/2303.12557)]
* **Bi-ViT**: "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2305.12354)]
* **BinaryViT**: "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2305.14730)]
* **Zero-TP**: "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 (*Princeton*). [[Paper](https://arxiv.org/abs/2305.17328)]
* **?**: "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2306.12929)]
* **VVTQ**: "Variation-aware Vision Transformer Quantization", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2307.00331)][[PyTorch](https://github.com/HuangOwen/VVTQ)]
* **DIMAP**: "Data-independent Module-aware Pruning for Hierarchical Vision Transformers", ICLR, 2024 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2404.13648)][[Code (in construction)](https://github.com/he-y/Data-independent-Module-Aware-Pruning)]
* **MADTP**: "MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer", CVPR, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2403.02991)][[Code (in construction)](https://github.com/double125/MADTP)]
* **DC-ViT**: "Dense Vision Transformer Compression with Few Samples", CVPR, 2024 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2403.18708)]
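
Several of the compression entries above (e.g., PTQ4ViT, FQ-ViT, RepQ-ViT) start from the same primitive: uniform quantization of a layer's weights after training. The sketch below shows symmetric, per-tensor fake quantization of a single weight tensor in PyTorch; the 8-bit setting and the max-based scale are illustrative assumptions, not the calibration scheme of any listed method.

```python
import torch

def quantize_symmetric(weight, num_bits=8):
    """Fake-quantize a tensor with symmetric, per-tensor uniform quantization.

    Returns the dequantized float tensor plus the integer codes and the scale,
    mimicking what post-training quantization applies to each ViT linear layer.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g., 127 for 8-bit
    scale = weight.abs().max() / qmax         # per-tensor scale (illustrative)
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q * scale, q.to(torch.int8), scale

if __name__ == "__main__":
    w = torch.randn(768, 768)                 # a ViT-Base-sized linear weight
    w_dq, w_int8, s = quantize_symmetric(w)
    print("max abs error:", (w - w_dq).abs().max().item())
```
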

[[Back to Overview](#overview)]

### Attention-Free
#### MLP-Series
* **RepMLP**: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2105.01883)][[PyTorch](https://github.com/DingXiaoH/RepMLP)]
* **EAMLP**: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2105.02358)]
* **Forward-Only**: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (*Oxford*). [[Paper](https://arxiv.org/abs/2105.02723)][[PyTorch](https://github.com/lukemelas/do-you-even-need-attention)]
* **ResMLP**: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2105.03404)]
* **?**: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2105.15078)]
* **ViP**: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2106.12368)][[PyTorch](https://github.com/Andrew-Qibin/VisionPermutator)]
* **CCS**: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.14882)]
* **S2-MLPv2**: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2108.01072)]
* **RaftMLP**: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (*Rikkyo University, Japan*). [[Paper](https://arxiv.org/abs/2108.04384)][[PyTorch](https://github.com/okojoalg/raft-mlp)]
* **Hire-MLP**: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (*Huawei*). [[Paper](https://arxiv.org/abs/2108.13341)]
* **Sparse-MLP**: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (*NUS*). [[Paper](https://arxiv.org/abs/2109.02008)]
* **ConvMLP**: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (*University of Oregon*). [[Paper](https://arxiv.org/abs/2109.04454)][[PyTorch](https://github.com/SHI-Labs/Convolutional-MLPs)]
* **sMLP**: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2109.05422)]
* **MLP-Mixer**: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2105.01601)][[Tensorflow](https://github.com/google-research/vision_transformer)][[PyTorch-1 (lucidrains)](https://github.com/lucidrains/mlp-mixer-pytorch)][[PyTorch-2 (rishikksh20)](https://github.com/rishikksh20/MLP-Mixer-pytorch)]
* **gMLP**: "Pay Attention to MLPs", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2105.08050)][[PyTorch (antonyvigouret)](https://github.com/antonyvigouret/Pay-Attention-to-MLPs)]
* **S2-MLP**: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.07477)]
* **CycleMLP**: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (*HKU*). [[Paper](https://arxiv.org/abs/2107.10224)][[PyTorch](https://github.com/ShoufaChen/CycleMLP)]
* **AS-MLP**: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (*ShanghaiTech University*). [[Paper](https://arxiv.org/abs/2107.08391)][[PyTorch](https://github.com/svip-lab/AS-MLP)]
* **Wave-MLP**: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2111.12294)][[PyTorch](https://github.com/huawei-noah/CV-Backbones/tree/master/wavemlp_pytorch)]
* **DynaMixer**: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2201.12083)][[PyTorch](https://github.com/ziyuwwang/DynaMixer)]
* **STD**: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (*Huawei*). [[Paper](https://proceedings.mlr.press/v162/li22c.html)]
* **AMixer**: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (*Tsinghua University*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/4464_ECCV_2022_paper.php)]
* **MS-MLP**: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2202.06510)]
* **ActiveMLP**: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.06108)]
* **MDMLP**: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (*Jiangsu University*). [[Paper](https://arxiv.org/abs/2205.14477)][[PyTorch](https://github.com/Amoza-Theodore/MDMLP)]
* **PosMLP**: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2207.07284)][[PyTorch](https://github.com/Zhicaiwww/PosMLP)]
* **SplitMixer**: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (*Quintic AI, California*). [[Paper](https://arxiv.org/abs/2207.10255)][[PyTorch](https://github.com/aliborji/splitmixer)]
* **gSwin**: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (*PKSHATechnology, Japan*). [[Paper](https://arxiv.org/abs/2208.11718)]
* **?**: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (*Berkeley*). [[Paper](https://arxiv.org/abs/2209.06383)]
* **AFFNet**: "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2307.14008)]
* **Strip-MLP**: "Strip-MLP: Efficient Token Interaction for Vision MLP", ICCV, 2023 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2307.11458)][[PyTorch](https://github.com/Med-Process/Strip_MLP)]
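
The MLP-series backbones above replace self-attention with MLPs applied alternately across tokens and channels; MLP-Mixer is the canonical formulation. Below is a minimal sketch of one Mixer-style block in PyTorch; the hidden sizes and patch count are illustrative defaults, and the code is a simplified reading of the idea rather than the reference implementation.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer-style block: token-mixing MLP, then channel-mixing MLP."""

    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):                      # x: (B, num_patches, dim)
        # Token mixing operates across the patch axis
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing operates across the feature axis
        x = x + self.channel_mlp(self.norm2(x))
        return x

if __name__ == "__main__":
    block = MixerBlock(num_patches=196, dim=768)
    print(block(torch.randn(2, 196, 768)).shape)  # torch.Size([2, 196, 768])
```
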
#### Other Attention-Free
* **DWNet**: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (*Nankai University*). [[Paper](https://arxiv.org/abs/2106.04263)][[PyTorch](https://github.com/Atten4Vis/DemystifyLocalViT)]
* **PoolFormer**: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2111.11418)][[PyTorch](https://github.com/sail-sg/poolformer)]
* **ConvNext**: "A ConvNet for the 2020s", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2201.03545)][[PyTorch](https://github.com/facebookresearch/ConvNeXt)]
* **RepLKNet**: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2203.06717)][[MegEngine](https://github.com/MegEngine/RepLKNet)][[PyTorch](https://github.com/DingXiaoH/RepLKNet-pytorch)]
* **FocalNet**: "Focal Modulation Networks", NeurIPS, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.11926)][[PyTorch](https://github.com/microsoft/FocalNet)]
* **HorNet**: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2207.14284)][[PyTorch](https://github.com/raoyongming/HorNet)][[Website](https://hornet.ivg-research.xyz/)]
* **S4ND**: "S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces", NeurIPS, 2022 (*Stanford*). [[Paper](https://arxiv.org/abs/2210.06583)]
* **Sequencer**: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (*Rikkyo University, Japan*). [[Paper](https://arxiv.org/abs/2205.01972)]
* **MogaNet**: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (*Westlake University, China*). [[Paper](https://arxiv.org/abs/2211.03295)]
* **Conv2Former**: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2211.11943)]
* **CoC**: "Image as Set of Points", ICLR, 2023 (*Northeastern*). [[Paper](https://arxiv.org/abs/2303.01494)][[PyTorch](https://github.com/ma-xu/Context-Cluster)]
* **SLaK**: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2207.03620)][[PyTorch](https://github.com/VITA-Group/SLaK)]
* **ConvNeXt-V2**: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.00808)][[PyTorch](https://github.com/facebookresearch/ConvNeXt-V2)]
* **SPANet**: "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 (*Korea Institute of Science and Technology*). [[Paper](https://arxiv.org/abs/2308.11568)][[Code (in construction)](https://github.com/DoranLyong/SPANet-official)][[Website](https://doranlyong.github.io/projects/spanet/)]
* **DFFormer**: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (*Rikkyo University, Japan*). [[Paper](https://arxiv.org/abs/2303.03932)][[Code (in construction)](https://github.com/okojoalg/dfformer)]
* **?**: "ConvNets Match Vision Transformers at Scale", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2310.16764)]
* **VMamba**: "VMamba: Visual State Space Model", arXiv, 2024 (*CAS*). [[Paper](https://arxiv.org/abs/2401.10166)][[PyTorch](https://github.com/MzeroMiko/VMamba)]
* **Vim**: "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model", arXiv, 2024 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2401.09417)][[PyTorch](https://github.com/hustvl/Vim)]
* **VRWKV**: "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.02308)][[PyTorch](https://github.com/OpenGVLab/Vision-RWKV)]
* **LocalMamba**: "LocalMamba: Visual State Space Model with Windowed Selective Scan", arXiv, 2024 (*University of Sydney*). [[Paper](https://arxiv.org/abs/2403.09338)][[PyTorch](https://github.com/hunto/LocalMamba)]
* **SiMBA**: "SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series", arXiv, 2024 (*Microsoft*). [[Paper](https://arxiv.org/abs/2403.15360)][[PyTorch](https://github.com/badripatro/Simba)]
* **PlainMamba**: "PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition", arXiv, 2024 (*University of Edinburgh, Scotland*). [[Paper](https://arxiv.org/abs/2403.17695)][[PyTorch](https://github.com/ChenhongyiYang/PlainMamba)]
* **EfficientVMamba**: "EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba", arXiv, 2024 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2403.09977)][[PyTorch](https://github.com/TerryPei/EfficientVMamba)]
* **RDNet**: "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs", arXiv, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2403.19588)]
* **MambaOut**: "MambaOut: Do We Really Need Mamba for Vision?", arXiv, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2405.07992)][[PyTorch](https://github.com/yuweihao/MambaOut)]
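
Most of the other attention-free backbones above keep the MetaFormer block layout and only swap the token mixer (pooling, large-kernel convolution, state-space models, etc.). The sketch below follows the PoolFormer-style pooling mixer as the simplest instance; the pool size and feature-map shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """PoolFormer-style token mixer: average pooling with the identity subtracted."""

    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):          # x: (B, C, H, W) feature map of patch tokens
        # Subtracting x makes the mixer act purely as a residual branch
        return self.pool(x) - x

if __name__ == "__main__":
    mixer = PoolingTokenMixer()
    print(mixer(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```
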

[[Back to Overview](#overview)]

### Analysis for Transformer
* **Attention-CNN**: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (*EPFL*). [[Paper](https://openreview.net/forum?id=HJlnC1rKPB)][[PyTorch](https://github.com/epfml/attention-cnn)][[Website](https://epfml.github.io/attention-cnn/)]
* **Transformer-Explainability**: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2012.09838)][[PyTorch](https://github.com/hila-chefer/Transformer-Explainability)]
* **?**: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (*Princeton*). [[Paper](https://arxiv.org/abs/2105.07197)]
* **?**: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (*HKU*). [[Paper](https://arxiv.org/abs/2108.05305)]
* **?**: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2108.08810)]
* **?**: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2105.10497)][[PyTorch](https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers)]
* **FoveaTer**: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (*UCSB*). [[Paper](https://arxiv.org/abs/2105.14173)]
* **?**: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.04263)]
* **?**: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.07998)]
* **?**: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (*Horizon Robotic*). [[Paper](https://arxiv.org/abs/2107.02174)]
* **?**: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (*Temple University*). [[Paper](https://openaccess.thecvf.com/content/WACV2022/html/Black_Visualizing_Paired_Image_Similarity_in_Transformer_Networks_WACV_2022_paper.html)][[PyTorch](https://github.com/vidarlab/xformer-paired-viz)]
* **FDSL**: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (*AIST*). [[Paper](https://arxiv.org/abs/2103.13023)][[PyTorch](https://github.com/nakashima-kodai/FractalDB-Pretrained-ViT-PyTorch)][[Website](https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/)]
* **AlterNet**: "How Do Vision Transformers Work?", ICLR, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2202.06709)][[PyTorch](https://github.com/xxxnell/how-do-vits-work)]
* **?**: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2106.01548)][[Tensorflow](https://github.com/google-research/vision_transformer)]
* **?**: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (*Stanford*). [[Paper](https://arxiv.org/abs/2205.08078)]
* **?**: "Three things everyone should know about Vision Transformers", ECCV, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2203.09795)]
* **?**: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 (*Princeton*). [[Paper](https://arxiv.org/abs/2210.09221)]
* **AWD-ViT**: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2203.05922)]
* **?**: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (*Quintic AI, CA*). [[Paper](https://arxiv.org/abs/2203.11678)][[Code](https://github.com/aliborji/hybrid_images)]
* **MJP**: "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2205.12551)][[PyTorch](https://github.com/yhlleo/MJP)]
* **?**: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (*University of Electronic Science and Technology of China*). [[Paper](https://arxiv.org/abs/2206.11073)]
* **?**: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (*University of Groningen, The Netherlands*). [[Paper](https://arxiv.org/abs/2208.04693)]
* **?**: "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 (*Technion Israel Institute Of Technology*). [[Paper](https://arxiv.org/abs/2208.08191)]
* **ProtoPFormer**: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2208.10431)][[PyTorch](https://github.com/zju-vipa/ProtoPFormer)]
* **ICLIP**: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2209.07046)][[Code (in construction)](https://github.com/xmed-lab/ICLIP)]
* **?**: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2210.06313)]
* **?**: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (*Monash University*). [[Paper](https://arxiv.org/abs/2210.07646)][[PyTorch](https://github.com/byM1902/ViT_visualization)]
* **ViT-CX**: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2211.03064)]
* **?**: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2211.08543)]
* **IAV**: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2211.15428)]
* **ViT-Shapley**: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 (*UW*). [[Paper](https://arxiv.org/abs/2206.05282)][[PyTorch](https://github.com/suinleelab/vit-shapley)]
* **ImageNet-X**: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 (*Meta*). [[Paper](https://openreview.net/forum?id=HXz7Vcm3VgM)]
* **?**: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 (*Rensselaer Polytechnic Institute, NY*). [[Paper](https://openreview.net/forum?id=jClGv3Qjhb)]
* **?**: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2305.00729)][[PyTorch (in construction)](https://github.com/naver-ai/cl-vs-mim)]
* **?**: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2210.01936)]
* **CLIP-Dissect**: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 (*UCSD*). [[Paper](https://arxiv.org/abs/2204.10965)]
* **?**: "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2306.04898)]
* **?**: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 (*Maryland*). [[Paper](https://arxiv.org/abs/2212.03862)][[PyTorch](https://github.com/mwalmer-umd/vit_analysis)][[Website](https://www.cs.umd.edu/~sakshams/vit_analysis/)]
* **?**: "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2301.07836)]
* **?**: "On Data Scaling in Masked Image Modeling", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.04664)][[PyTorch](https://github.com/microsoft/SimMIM)]
* **?**: "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2205.13543)]
* **Vision-DiffMask**: "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2304.06391)][[PyTorch](https://github.com/AngelosNal/Vision-DiffMask)]
* **?**: "A Multidimensional Analysis of Social Biases in Vision Transformers", ICCV, 2023 (*University of Mannheim, Germany*). [[Paper](https://arxiv.org/abs/2308.01948)][[PyTorch](https://github.com/jannik-brinkmann/social-biases-in-vision-transformers)]
* **?**: "Analyzing Vision Transformers for Image Classification in Class Embedding Space", NeurIPS, 2023 (*Goethe University Frankfurt, Germany*). [[Paper](https://arxiv.org/abs/2310.18969)]
* **BoB**: "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks", NeurIPS, 2023 (*NYU*). [[Paper](https://arxiv.org/abs/2310.19909)][[PyTorch](https://github.com/hsouri/Battle-of-the-Backbones)]
* **ViT-CoT**: "Are Vision Transformers More Data Hungry Than Newborn Visual Systems?", NeurIPS, 2023 (*Indiana University Bloomington, Indiana*). [[Paper](https://arxiv.org/abs/2312.02843)]
* **AtMan**: "AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation", NeurIPS, 2023 (*Aleph Alpha, Germany*). [[Paper](https://arxiv.org/abs/2301.08110)][[PyTorch](https://github.com/Aleph-Alpha/AtMan)]
* **AttentionViz**: "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 (*Harvard*). [[Paper](https://arxiv.org/abs/2305.03210)][[Website](http://attentionviz.com/)]
* **?**: "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 (*POSTECH*). [[Paper](https://arxiv.org/abs/2305.04722)]
* **?**: "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 (*Maryland*). [[Paper](https://arxiv.org/abs/2306.07470)]
* **ViT-ReciproCAM**: "ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer", arXiv, 2023 (*Intel*). [[Paper](https://arxiv.org/abs/2310.02588)]
* **Eureka-moment**: "Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems", arXiv, 2023 (*Bosch*). [[Paper](https://arxiv.org/abs/2310.12956)]
* **INTR**: "A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis", arXiv, 2023 (*OSU*). [[Paper](https://arxiv.org/abs/2311.04157)][[PyTorch](https://github.com/Imageomics/INTR)]
* **?**: "Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention", AAAI, 2024 (*Korea Institute of Science and Technology (KIST)*). [[Paper](https://arxiv.org/abs/2402.04563)][[PyTorch](https://github.com/LeemSaebom/Attention-Guided-CAM-Visual-Explanations-of-Vision-Transformer-Guided-by-Self-Attention)]
* **RelatiViT**: "Can Transformers Capture Spatial Relations between Objects?", ICLR, 2024 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2403.00729)][[Code (in construction)](https://github.com/AlvinWen428/spatial-relation)][[Website](https://sites.google.com/view/spatial-relation)]
* **TokenTM**: "Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer", CVPR, 2024 (*Illinois Institute of Technology*). [[Paper](https://arxiv.org/abs/2403.14552)]
* **SaCo**: "On the Faithfulness of Vision Transformer Explanations", CVPR, 2024 (*Illinois Institute of Technology*). [[Paper](https://arxiv.org/abs/2404.01415)]
* **?**: "A Decade's Battle on Dataset Bias: Are We There Yet?", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2403.08632)][[Code (in construction)](https://github.com/liuzhuang13/bias)]
* **LeGrad**: "LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity", arXiv, 2024 (*University of Bonn, Germany*). [[Paper](https://arxiv.org/abs/2404.03214)][[PyTorch](https://github.com/WalBouss/LeGrad)]
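
Many of the analysis and explainability papers above start from per-layer attention maps. A common baseline for aggregating them is attention rollout, which multiplies head-averaged attention matrices (with the residual connection folded in) across layers. The sketch below assumes you have already collected the per-layer attention tensors from a ViT; it is a generic baseline, not the method of any single paper listed here.

```python
import torch

def attention_rollout(attentions):
    """Attention rollout: propagate head-averaged attention across layers.

    attentions: list of (B, heads, N, N) tensors, one per transformer layer
    (assumed to be collected from a ViT with attention outputs enabled).
    Returns a (B, N, N) map of how strongly each output token traces back to each input token.
    """
    rollout = None
    for attn in attentions:
        a = attn.mean(dim=1)                            # average over heads -> (B, N, N)
        a = a + torch.eye(a.size(-1), device=a.device)  # fold in the residual connection
        a = a / a.sum(dim=-1, keepdim=True)             # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout

if __name__ == "__main__":
    # Toy attention maps: 12 layers, 12 heads, 196 patches + 1 class token
    layers = [torch.rand(1, 12, 197, 197).softmax(dim=-1) for _ in range(12)]
    print(attention_rollout(layers).shape)  # torch.Size([1, 197, 197])
```
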

[[Back to Overview](#overview)]

## Detection
### Object Detection
* General:
* **detrex**: "detrex: Benchmarking Detection Transformers", arXiv, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2306.07265)][[PyTorch](https://github.com/IDEA-Research/detrex)]
* CNN-based backbone:
* **DETR**: "End-to-End Object Detection with Transformers", ECCV, 2020 (*Facebook*). [[Paper](https://arxiv.org/abs/2005.12872)][[PyTorch](https://github.com/facebookresearch/detr)]
* **Deformable DETR**: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2010.04159)][[PyTorch](https://github.com/fundamentalvision/Deformable-DETR)]
* **UP-DETR**: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2011.09094)][[PyTorch](https://github.com/dddzg/up-detr)]
* **SMCA**: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (*CUHK*). [[Paper](https://arxiv.org/abs/2108.02404)][[PyTorch](https://github.com/gaopengcuhk/SMCA-DETR)]
* **Conditional-DETR**: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2108.06152)]
* **PnP-DETR**: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (*Yitu*). [[Paper](https://arxiv.org/abs/2109.07036)][[Code (in construction)](https://github.com/twangnh/pnp-detr)]
* **TSP**: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (*CMU*). [[Paper](https://arxiv.org/abs/2011.10881)]
* **Dynamic-DETR**: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (*Microsoft*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Dai_Dynamic_DETR_End-to-End_Object_Detection_With_Dynamic_Attention_ICCV_2021_paper.html)]
* **ViT-YOLO**: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (*Xidian University*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021W/VisDrone/html/Zhang_ViT-YOLOTransformer-Based_YOLO_for_Object_Detection_ICCVW_2021_paper.html)]
* **ACT**: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (*Peking + CUHK*). [[Paper](https://arxiv.org/abs/2011.09315)][[PyTorch](https://github.com/gaopengcuhk/SMCA-DETR/)]
* **DIL-ViT**: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (*Monash University Malaysia*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0675.pdf)]
* **Efficient-DETR**: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2104.01318)]
* **CA-FPN**: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2105.09464)]
* **DETReg**: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (*Tel-Aviv + Berkeley*). [[Paper](https://arxiv.org/abs/2106.04550)][[Website](https://www.amirbar.net/detreg/)]
* **GQPos**: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2108.09691)]
* **Anchor-DETR**: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2109.07107)][[PyTorch](https://github.com/megvii-research/AnchorDETR)]
* **Sparse-DETR**: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (*Kakao*). [[Paper](https://arxiv.org/abs/2111.14330)][[PyTorch](https://github.com/kakaobrain/sparse-detr)]
* **DAB-DETR**: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2201.12329)][[PyTorch](https://github.com/SlongLiu/DAB-DETR)]
* **DN-DETR**: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (*International Digital Economy Academy (IDEA), China*). [[Paper](https://arxiv.org/abs/2203.01305)][[PyTorch](https://github.com/FengLi-ust/DN-DETR)]
* **SAM-DETR**: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2203.06883)][[PyTorch](https://github.com/ZhangGongjie/SAM-DETR)]
* **AdaMixer**: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2203.16507)][[Code (in construction)](https://github.com/MCG-NJU/AdaMixer)]
* **DESTR**: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (*Oregon State*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/He_DESTR_Object_Detection_With_Split_Transformer_CVPR_2022_paper.html)]
* **REGO**: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2112.04632)][[PyTorch](https://github.com/zhechen/Deformable-DETR-REGO)]
* **?**: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (*Ant Group*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Hong_Training_Object_Detectors_From_Scratch_An_Empirical_Study_in_the_CVPR_2022_paper.html)]
* **DE-DETR**: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2203.09507)][[PyTorch](https://github.com/encounter1997/DE-DETRs)]
* **DFFT**: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2206.06829)]
* **Cornerformer**: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (*Huawei*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/4286_ECCV_2022_paper.php)]
* **?**: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (*Microsoft*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/8094_ECCV_2022_paper.php)][[Code (in construction)](https://github.com/SwinTransformer/Simple-21K-Detection)]
* **Obj2Seq**: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2209.13948)][[PyTorch](https://github.com/CASIA-IVA-Lab/Obj2Seq)]
* **KA**: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2203.03187)]
* **TCC**: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2207.06603)]
* **Conditional-DETR-V2**: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2207.08914)]
* **SAM-DETR++**: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2207.14172)][[PyTorch](https://github.com/ZhangGongjie/SAM-DETR)]
* **ComplETR**: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2209.05654)]
* **Pair-DETR**: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2210.16476)]
* **Group-DETR-v2**: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2211.03594)]
* **KD-DETR**: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2211.08071)]
* **D3ETR**: "D3ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2211.09768)]
* **Teach-DETR**: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2211.11953)][[Code (in construction)](https://github.com/LeonHLJ/Teach-DETR)]
* **DETA**: "NMS Strikes Back", arXiv, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2212.06137)][[PyTorch](https://github.com/jozhang97/DETA)]
* **ViT-Adapter**: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2205.08534)][[PyTorch](https://github.com/czczup/ViT-Adapter)]
* **DDQ**: "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.12776)][[PyTorch](https://github.com/jshilong/DDQ)]
* **SiameseDETR**: "Siamese DETR", CVPR, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2303.18144)][[PyTorch](https://github.com/Zx55/SiameseDETR)]
* **SAP-DETR**: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.02006)]
* **Q-DETR**: "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2304.00253)][[Code (in construction)](https://github.com/SteveTsui/Q-DETR)]
* **Lite-DETR**: "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.07335)][[PyTorch](https://github.com/IDEA-Research/Lite-DETR)]
* **H-DETR**: "DETRs with Hybrid Matching", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.13080)][[PyTorch](https://github.com/HDETR)]
* **MaskDINO**: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2206.02777)][[PyTorch](https://github.com/IDEACVR/MaskDINO)]
* **IMFA**: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2208.11356)][[Code (in construction)](https://github.com/ZhangGongjie/IMFA)]
* **SQR**: "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2212.07593)][[PyTorch](https://github.com/Fangyi-Chen/SQR)]
* **DQ-Det**: "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2307.12239)]
* **SpeedDETR**: "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 (*Northeastern University*). [[Paper](https://openreview.net/forum?id=5VdcSxrlTK)]
* **AlignDet**: "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2307.11077)][[PyTorch](https://github.com/liming-ai/AlignDet)][[Website](https://liming-ai.github.io/AlignDet/)]
* **Focus-DETR**: "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2307.12612)][[PyTorch](https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR)][[MindSpore](https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR)]
* **Plain-DETR**: "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.01904)][[Code (in construction)](https://github.com/impiga/Plain-DETR)]
* **ASAG**: "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2308.09242)][[PyTorch](https://github.com/iSEE-Laboratory/ASAG)]
* **MIMDet**: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", ICCV, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2204.02964)][[PyTorch](https://github.com/hustvl/MIMDet)]
* **Stable-DINO**: "Detection Transformer with Stable Matching", ICCV, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2304.04742)][[Code (in construction)](https://github.com/IDEA-Research/Stable-DINO)]
* **imTED**: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2205.09613)][[PyTorch](https://github.com/LiewFeng/imTED)]
* **Group-DETR**: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", ICCV, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2207.13085)][[Code (in construction)](https://github.com/Atten4Vis/GroupDETR)]
* **Co-DETR**: "DETRs with Collaborative Hybrid Assignments Training", ICCV, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2211.12860)][[PyTorch](https://github.com/Sense-X/Co-DETR)]
* **DETRDistill**: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2211.10156)]
* **Decoupled-DETR**: "Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection", ICCV, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2310.15955)]
* **StageInteractor**: "StageInteractor: Query-based Object Detector with Cross-stage Interaction", ICCV, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2304.04978)]
* **Rank-DETR**: "Rank-DETR for High Quality Object Detection", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2310.08854)][[PyTorch](https://github.com/LeapLabTHU/Rank-DETR)]
* **Cal-DETR**: "Cal-DETR: Calibrated Detection Transformer", NeurIPS, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2311.03570)][[PyTorch](https://github.com/akhtarvision/cal-detr)]
* **KS-DETR**: "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 (*Toyota Technological Institute*). [[Paper](https://arxiv.org/abs/2302.11208)][[PyTorch](https://github.com/edocanonymous/KS-DETR)]
* **FeatAug-DETR**: "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.01503)][[Code (in construction)](https://github.com/rongyaofang/FeatAug-DETR)]
* **RT-DETR**: "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2304.08069)]
* **Align-DETR**: "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2304.07527)][[PyTorch](https://github.com/FelixCaae/AlignDETR)]
* **Box-DETR**: "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2307.08353)][[PyTorch (in construction)](https://github.com/tiny-smart/box-detr)]
* **RefineBox**: "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2307.11828)][[Code (in construction)](https://github.com/YiqunChen1999/RefineBox)]
* **?**: "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 (*Toronto*). [[Paper](https://arxiv.org/abs/2308.01300)]
* **Gen2Det**: "Gen2Det: Generate to Detect", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.04566)]
* **ViT-CoMer**: "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions", CVPR, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2403.07392)][[PyTorch](https://github.com/Traffic-X/ViT-CoMer)]
* **Salience-DETR**: "Salience-DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement", CVPR, 2024 (*Xi'an Jiaotong University*). [[Paper](https://arxiv.org/abs/2403.16131)][[PyTorch](https://github.com/xiuqhou/Salience-DETR)]
* **MS-DETR**: "MS-DETR: Efficient DETR Training with Mixed Supervision", arXiv, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2401.03989)][[Code (in construction)](https://github.com/Atten4Vis/MS-DETR)]
* Transformer-based backbone:
* **ViT-FRCNN**: "Toward Transformer-Based Object Detection", arXiv, 2020 (*Pinterest*). [[Paper](https://arxiv.org/abs/2012.09958)]
* **WB-DETR**: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (*CAS*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_WB-DETR_Transformer-Based_Detector_Without_Backbone_ICCV_2021_paper.html)]
* **YOLOS**: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (*Horizon Robotics*). [[Paper](https://arxiv.org/abs/2106.00666)][[PyTorch](https://github.com/hustvl/YOLOS)]
* **?**: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2111.11429)]
* **ViDT**: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (*NAVER*). [[Paper](https://arxiv.org/abs/2110.03921)][[PyTorch](https://github.com/naver-ai/vidt)]
* **FP-DETR**: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (*USTC*). [[Paper](https://openreview.net/forum?id=yjMQuLLcGWK)]
* **DETR++**: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2206.02977)]
* **ViTDet**: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2203.16527)]
* **UViT**: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2112.09747)]
* **CFDT**: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 (*Huawei*). [[Paper](https://openreview.net/forum?id=iuW96ssPQX)]
* **D2ETR**: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2203.00860)]
* **DINO**: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 (*IDEA, China*). [[Paper](https://arxiv.org/abs/2203.03605)][[PyTorch](https://github.com/IDEACVR/DINO)]
* **SimPLR**: "SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation", arXiv, 2023 (*UvA*). [[Paper](https://arxiv.org/abs/2310.05920)]
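
Despite the many variants above, the DETR family shares a common interface: an image passes through a backbone and a transformer encoder-decoder, and a fixed set of object queries is decoded into class logits and normalized boxes, typically without NMS. Below is a minimal inference sketch using the original DETR's published torch.hub entry point (from the facebookresearch/detr repository linked above); the image path is a placeholder and the pretrained weights are downloaded on first use.

```python
import torch
from PIL import Image
import torchvision.transforms as T

# Original DETR (ResNet-50 backbone) via the repository's torch.hub entry point.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = transform(image).unsqueeze(0)            # [1, 3, H, W]

with torch.no_grad():
    outputs = model(inputs)

# 100 object queries -> class logits [1, 100, 92] and normalized (cx, cy, w, h) boxes [1, 100, 4].
probs = outputs["pred_logits"].softmax(-1)[0, :, :-1]  # drop the trailing "no object" class
keep = probs.max(-1).values > 0.9                      # simple confidence threshold
print(outputs["pred_boxes"][0, keep])
```

Most of the later detectors in this list keep this input/output contract and instead change how the queries are initialized, matched, or supervised.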

[[Back to Overview](#overview)]

### 3D Object Detection
* **AST-GRU**: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (*Baidu*). [[Paper](https://arxiv.org/abs/2004.01389)][[Code (in construction)](https://github.com/yinjunbo/3DVID)]
* **Pointformer**: "3D Object Detection with Pointformer", arXiv, 2020 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2012.11409)]
* **CT3D**: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2108.10723)][[Code (in construction)](https://github.com/hlsheng1/CT3D)]
* **Group-Free-3D**: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (*Microsoft*). [[Paper](https://arxiv.org/abs/2104.00678)][[PyTorch](https://github.com/zeliu98/Group-Free-3D)]
* **VoTr**: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (*CUHK + NUS*). [[Paper](https://arxiv.org/abs/2109.02497)]
* **3DETR**: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2109.08141)][[PyTorch](https://github.com/facebookresearch/3detr)][[Website](https://facebookresearch.github.io/3detr/)]
* **DETR3D**: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (*MIT*). [[Paper](https://arxiv.org/abs/2110.06922)]
* **M3DETR**: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (*University of Maryland*). [[Paper](https://arxiv.org/abs/2104.11896)][[PyTorch](https://github.com/rayguan97/M3DETR)]
* **MonoDTR**: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (*NTU*). [[Paper](https://arxiv.org/abs/2203.10981)][[Code (in construction)](https://github.com/kuanchihhuang/MonoDTR)]
* **VoxSeT**: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2203.10314)][[PyTorch](https://github.com/skyhehe123/VoxSeT)]
* **TransFusion**: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2203.11496)][[PyTorch](https://github.com/XuyangBai/TransFusion)]
* **CAT-Det**: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2204.00325)]
* **TokenFusion**: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2204.08721)]
* **SST**: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2112.06375)][[PyTorch](https://github.com/TuSimple/SST)]
* **LIFT**: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_LIFT_Learning_4D_LiDAR_Image_Fusion_Transformer_for_3D_Object_CVPR_2022_paper.html)]
* **BoxeR**: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2111.13087)][[PyTorch](https://github.com/kienduynguyen/BoxeR)]
* **BrT**: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2210.01391)]
* **VISTA**: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2203.09704)][[PyTorch](https://github.com/Gorilla-Lab-SCUT/VISTA)]
* **STRL**: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (*Bosch*). [[Paper](https://drive.google.com/file/d/1_2RedCoqCH4cM6J-TOy18nevVd9RTr8c/view)]
* **MTrans**: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (*HKU*). [[Paper](https://arxiv.org/abs/2207.09805)][[PyTorch](https://github.com/Cliu2/MTrans)]
* **CenterFormer**: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (*TuSimple*). [[Paper](https://arxiv.org/abs/2209.05588)][[Code (in construction)](https://github.com/TuSimple/centerformer)]
* **BUTD-DETR**: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (*CMU*). [[Paper](https://arxiv.org/abs/2112.08879)][[PyTorch](https://github.com/nickgkan/butd_detr)][[Website](https://butd-detr.github.io/)]
* **SpatialDETR**: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (*Mercedes-Benz*). [[Paper](https://markus-enzweiler.de/downloads/publications/ECCV2022-spatial_detr.pdf)][[PyTorch](https://github.com/cgtuebingen/SpatialDETR)]
* **CramNet**: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (*Waymo*). [[Paper](https://arxiv.org/abs/2210.09267)]
* **SWFormer**: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (*Waymo*). [[Paper](https://arxiv.org/abs/2210.07372)]
* **EMMF-Det**: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (*Hikvision*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/6955_ECCV_2022_paper.php)]
* **UVTR**: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2206.00630)][[PyTorch](https://github.com/dvlab-research/UVTR)]
* **MsSVT**: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 (*Beijing Institute of Technology*). [[Paper](https://openreview.net/forum?id=hOVEBHpHrMu)][[PyTorch](https://github.com/dscdyc/MsSVT)]
* **DeepInteraction**: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2208.11112)][[PyTorch](https://github.com/fudan-zvg/DeepInteraction)]
* **PETR**: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (*Megvii*). [[Paper](https://arxiv.org/abs/2203.05625)]
* **Graph-DETR3D**: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2204.11582)]
* **PolarFormer**: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2206.15398)][[Code (in construction)](https://github.com/fudan-zvg/PolarFormer)]
* **AST-GRU**: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2207.12659)]
* **SEFormer**: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2209.01745)]
* **CRAFT**: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2209.06535)]
* **CrossDTR**: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (*NTU*). [[Paper](https://arxiv.org/abs/2209.13507)][[Code (in construction)](https://github.com/sty61010/CrossDTR)]
* **Focal-PETR**: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2212.05505)]
* **Li3DeTr**: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 (*University of Coimbra, Portugal*). [[Paper](https://arxiv.org/abs/2210.15365)]
* **PiMAE**: "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2303.08129)][[PyTorch](https://github.com/BLVLab/PiMAE)]
* **OcTr**: "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2303.12621)]
* **MonoATT**: "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2303.13018)]
* **PVT-SSD**: "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2305.06621)][[Code (in construction)](https://github.com/Nightmare-n/PVT-SSD)]
* **ConQueR**: "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2212.07289)][[PyTorch](https://github.com/poodarchu/EFG)][[Website](https://benjin.me/projects/2022_conquer/)]
* **FrustumFormer**: "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2301.04467)][[PyTorch (in construction)](https://github.com/Robertwyq/Frustum)]
* **DSVT**: "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2301.06051)][[PyTorch](https://github.com/Haiyang-W/DSVT)]
* **AShapeFormer**: "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 (*Hunan University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Li_AShapeFormer_Semantics-Guided_Object-Level_Active_Shape_Encoding_for_3D_Object_Detection_CVPR_2023_paper.html)][[Code (in construction)](https://github.com/ZechuanLi/AShapeFormer)]
* **MV-JAR**: "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.13510)][[Code (in construction)](https://github.com/SmartBot-PJLab/MV-JAR)]
* **FocalFormer3D**: "FocalFormer3D: Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2308.04556)][[PyTorch](https://github.com/NVlabs/FocalFormer3D)]
* **3DPPE**: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", ICCV, 2023 (*Houmo AI, China*). [[Paper](https://arxiv.org/abs/2211.14710)][[PyTorch](https://github.com/drilistbox/3DPPE)]
* **PARQ**: "Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection", ICCV, 2023 (*Northeastern*). [[Paper](https://arxiv.org/abs/2310.01401)][[PyTorch](https://github.com/ymingxie/parq)][[Website](https://ymingxie.github.io/parq/)]
* **CMT**: "Cross Modal Transformer: Towards Fast and Robust 3D Object Detection", ICCV, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2301.01283)][[PyTorch](https://github.com/junjie18/CMT)]
* **MonoDETR**: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", ICCV, 2023 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2203.13310)][[PyTorch](https://github.com/ZrrSkywalker/MonoDETR)]
* **DTH**: "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", ICCV, 2023 (*Cruise*). [[Paper](https://arxiv.org/abs/2303.05078)]
* **PETRv2**: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", ICCV, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2206.01256)][[PyTorch](https://github.com/megvii-research/PETR)]
* **MV2D**: "Object as Query: Lifting any 2D Object Detector to 3D Detection", ICCV, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2301.02364)]
* **?**: "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2308.04054)]
* **Uni3DETR**: "Uni3DETR: Unified 3D Detection Transformer", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2310.05699)][[PyTorch](https://github.com/zhenyuw16/Uni3DETR)]
* **Diffusion-SS3D**: "Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection", NeurIPS, 2023 (*NYCU*). [[Paper](https://arxiv.org/abs/2312.02966)][[PyTorch](https://github.com/luluho1208/Diffusion-SS3D)]
* **STEMD**: "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2307.00347)][[Code (in construction)](https://github.com/Eaphan/STEMD)]
* **V-DETR**: "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.04409)][[Code (in construction)](https://github.com/yichaoshen-MS/V-DETR)]
* **3DiffTection**: "3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features", arXiv, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2311.04391)][[Code (in construction)](https://github.com/nv-tlabs/3DiffTection)][[Website](https://research.nvidia.com/labs/toronto-ai/3difftection/)]
* **PTT**: "PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection", arXiv, 2023 (*UC Merced*). [[Paper](https://arxiv.org/abs/2312.08371)][[Code (in construction)](https://github.com/kuanchihhuang/PTT)]
* **Point-DETR3D**: "Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection", AAAI, 2024 (*USTC*). [[Paper](https://arxiv.org/abs/2403.15317)]
* **MixSup**: "MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection", ICLR, 2024 (*CAS*). [[Paper](https://arxiv.org/abs/2401.16305)][[PyTorch](https://github.com/BraveGroup/PointSAM-for-MixSup)]
* **QAF2D**: "Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors", CVPR, 2024 (*Nullmax, China*). [[Paper](https://arxiv.org/abs/2403.06093)]
* **ScatterFormer**: "ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention", arXiv, 2024 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2401.00912)][[Code (in construction)](https://github.com/skyhehe123/ScatterFormer)]
* **MsSVT++**: "MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection", arXiv, 2024 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2401.11718)][[PyTorch](https://github.com/dscdyc/MsSVT)]
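
Many of the query-based 3D detectors above (e.g., 3DETR, DETR3D, PETR, CMT) inherit DETR's set-prediction training recipe: a fixed set of learned queries is matched one-to-one to the ground-truth boxes with the Hungarian algorithm before the loss is computed, and unmatched queries are supervised as "no object". The toy sketch below shows only that matching step, assuming SciPy is available; the random cost matrix stands in for the real classification + box-regression costs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 6, 3
# In a real DETR-style detector the cost combines class probability and box
# distances (e.g., L1 + generalized IoU); random values stand in here.
cost = np.random.rand(num_queries, num_gt)

query_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one assignment minimizing total cost
for q, g in zip(query_idx, gt_idx):
    print(f"query {q} -> ground-truth box {g} (cost {cost[q, g]:.3f})")
# The remaining (unmatched) queries are trained to predict the "no object" class.
```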

[[Back to Overview](#overview)]

### Multi-Modal Detection
* **OVR-CNN**: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (*Snap*). [[Paper](https://arxiv.org/abs/2011.10678)][[PyTorch](https://github.com/alirezazareian/ovr-cnn)]
* **MDETR**: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (*NYU*). [[Paper](https://arxiv.org/abs/2104.12763)][[PyTorch](https://github.com/ashkamath/mdetr)][[Website](https://ashkamath.github.io/mdetr_page/)]
* **FETNet**: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (*Tsinghua*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/1400.pdf)]
* **MEDUSA**: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (*Google*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0568.pdf)][[PyTorch](https://github.com/songhwanjun/MEDUSA)]
* **StrucTexT**: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2108.02923)]
* **MAVL**: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2111.11430)][[PyTorch](https://github.com/mmaaz60/mvits_for_class_agnostic_od)]
* **OWL-ViT**: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2205.06230)][[JAX](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)][[Hugging Face](https://huggingface.co/docs/transformers/model_doc/owlvit)]
* **X-DETR**: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2204.05626)]
* **simCrossTrans**: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (*The City University of New York*). [[Paper](https://arxiv.org/abs/2203.10456)][[PyTorch](https://github.com/liketheflower/simCrossTrans)]
* **?**: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (*USC*). [[Paper](https://arxiv.org/abs/2206.09592)]
* **YONOD**: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (*CUNY*). [[Paper](https://arxiv.org/abs/2207.01071)][[PyTorch](https://github.com/liketheflower/YONOD)]
* **OmDet**: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (*Binjiang Institute of Zhejiang University*). [[Paper](https://arxiv.org/abs/2209.05946)]
* **ContFormer**: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2210.02953)]
* **DQ-DETR**: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 (*International Digital Economy Academy (IDEA)*). [[Paper](https://arxiv.org/abs/2211.15516)][[Code (in construction)](https://github.com/IDEA-Research/DQ-DETR)]
* **F-VLM**: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2209.15639)][[Website](https://sites.google.com/view/f-vlm)]
* **OV-3DET**: "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2304.00788)][[PyTorch](https://github.com/lyhdet/OV-3DET)]
* **Detection-Hub**: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 (*Fudan + Microsoft*). [[Paper](https://arxiv.org/abs/2206.03484)]
* **OmniLabel**: "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", ICCV, 2023 (*NEC*). [[Paper](https://arxiv.org/abs/2304.11463)][[GitHub](https://github.com/samschulter/omnilabeltools)][[Website](https://www.omnilabel.org/)]
* **MM-OVOD**: "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2306.05493?s=31)][[Code (in construction)](https://github.com/prannaykaul/mm-ovod)][[Website](https://www.robots.ox.ac.uk/~vgg/research/mm-ovod/)]
* **CoDA**: "CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection", NeurIPS, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2310.02960)][[PyTorch](https://github.com/yangcaoai/CoDA_NeurIPS2023)][[Website](https://yangcaoai.github.io/publications/CoDA.html)]
* **ContextDET**: "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2305.18279)][[Code (in construction)](https://github.com/yuhangzang/ContextDET)][[Website](https://www.mmlab-ntu.com/project/contextdet/index.html)]
* **Object2Scene**: "Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection", arXiv, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2309.09456)]
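
A recurring pattern among the multi-modal and open-vocabulary detectors above is querying the detector with free-form text instead of a fixed label set. Below is a minimal sketch of that usage with OWL-ViT through its Hugging Face interface (linked in the OWL-ViT entry above); the image path and text prompts are placeholders, while the checkpoint name and post-processing call follow the transformers documentation example for OWL-ViT.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")    # placeholder image path
texts = [["a photo of a cat", "a photo of a dog"]]  # one list of text queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the normalized predicted boxes to the original image size and filter by score.
target_sizes = torch.tensor([image.size[::-1]])     # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][int(label)], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```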

[[Back to Overview](#overview)]

### HOI Detection
* **HOI-Transformer**: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (*Megvii*). [[Paper](https://arxiv.org/abs/2103.04503)][[PyTorch](https://github.com/bbepoch/HoiTransformer)]
* **HOTR**: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (*Kakao + Korea University*). [[Paper](https://arxiv.org/abs/2104.13682)][[PyTorch](https://github.com/kakaobrain/HOTR)]
* **MSTR**: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (*Kakao*). [[Paper](https://arxiv.org/abs/2203.14709)]
* **SSRT**: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2204.00746)]
* **CPC**: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2204.04836)][[PyTorch (in construction)](https://github.com/mlvlab/CPChoi)]
* **DisTR**: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2204.09290)]
* **STIP**: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2206.06291)][[PyTorch](https://github.com/zyong812/STIP)]
* **DOQ**: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (*South China University of Technology*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Qu_Distillation_Using_Oracle_Queries_for_Transformer-Based_Human-Object_Interaction_Detection_CVPR_2022_paper.html)]
* **UPT**: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (*Australian Centre for Robotic Vision*). [[Paper](https://arxiv.org/abs/2112.01838)][[PyTorch](https://github.com/fredzzhang/upt)][[Website](https://fredzzhang.com/unary-pairwise-transformers/)]
* **CATN**: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2204.04911)]
* **GEN-VLKT**: "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2203.13954)][[PyTorch](https://github.com/YueLiao/gen-vlkt)]
* **HQM**: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2207.05293)][[PyTorch](https://github.com/MuchHair/HQM)]
* **Iwin**: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2203.10537)]
* **RLIP**: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2209.01814)][[PyTorch](https://github.com/JacobYuan7/RLIP)]
* **TUTOR**: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2206.01908)]
* **?**: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2210.05668)][[PyTorch](https://github.com/Yang-Li-2000/Understanding-Embodied-Reference-with-Touch-Line-Transformer)]
* **?**: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 (*KU Leuven*). [[Paper](https://openreview.net/forum?id=resApVNcqSB)]
* **HOICLIP**: "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2303.15786)][[Code (in construction)](https://github.com/Artanic30/HOICLIP)]
* **ViPLO**: "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 (*mAy-I, Korea*). [[Paper](https://arxiv.org/abs/2304.08114)][[PyTorch](https://github.com/Jeeseung-Park/ViPLO)]
* **OpenCat**: "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 (*Renmin University of China*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Zheng_Open-Category_Human-Object_Interaction_Pre-Training_via_Language_Modeling_Framework_CVPR_2023_paper.html)]
* **CQL**: "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 (*Megvii*). [[Paper](https://arxiv.org/abs/2303.14005)][[Code (in construction)](https://github.com/charles-xie/CQL)]
* **RmLR**: "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 (*Southeast University, China*). [[Paper](https://arxiv.org/abs/2307.13529)]
* **PViC**: "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.06202)][[PyTorch](https://github.com/fredzzhang/pvic)]
* **AGER**: "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2308.08370)][[Code (in construction)](https://github.com/six6607/AGER)]
* **RLIPv2**: "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.09351)][[PyTorch](https://github.com/JacobYuan7/RLIPv2)]
* **EgoPCA**: "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2309.02423)][[Website](https://mvig-rhos.com/ego_pca)]
* **UniHOI**: "Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models", NeurIPS, 2023 (*Southeast University*). [[Paper](https://arxiv.org/abs/2311.03799)][[Code (in construction)](https://github.com/Caoyichao/UniHOI)]
* **LogicHOI**: "Neural-Logic Human-Object Interaction Detection", NeurIPS, 2023 (*University of Technology Sydney*). [[Paper](https://arxiv.org/abs/2311.09817)][[Code (in construction)](https://github.com/weijianan1/LogicHOI)]
* **?**: "Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels", arXiv, 2023 (*KU Leuven*). [[Paper](https://arxiv.org/abs/2309.05069)]
* **DP-HOI**: "Disentangled Pre-training for Human-Object Interaction Detection", CVPR, 2024 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2404.01725)][[Code (in construction)](https://github.com/xingaoli/DP-HOI)]
* **HOI-Ref**: "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision", arXiv, 2024 (*University of Bristol, UK*). [[Paper](https://arxiv.org/abs/2404.09933)][[PyTorch](https://github.com/Sid2697/HOI-Ref)][[Website](https://sid2697.github.io/hoi-ref/)]

[[Back to Overview](#overview)]

### Salient Object Detection
* **VST**: "Visual Saliency Transformer", ICCV, 2021 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2104.12099)]
* **?**: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2112.13528)]
* **SwinNet**: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (*Anhui University*). [[Paper](https://arxiv.org/abs/2204.05585)][[Code](https://github.com/liuzywen/SwinNet)]
* **SOD-Transformer**: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2104.10127)]
* **GLSTR**: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2108.02759)]
* **TriTransNet**: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (*Anhui University*). [[Paper](https://arxiv.org/abs/2108.03990)]
* **AbiU-Net**: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (*Nankai University*). [[Paper](https://arxiv.org/abs/2108.07851)]
* **TranSalNet**: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (*Cardiff University, UK*). [[Paper](https://arxiv.org/abs/2110.03593)]
* **DFTR**: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2203.06429)]
* **GroupTransNet**: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (*Nankai University*). [[Paper](https://arxiv.org/abs/2203.10785)]
* **SelfReformer**: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2205.11283)]
* **DTMINet**: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2206.03105)]
* **MCNet**: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2207.03558)][[PyTorch](https://github.com/jxr326/SwinMCNet)]
* **SiaTrans**: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (*Shandong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2207.04224)]
* **PSFormer**: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 (*Nanjing University of Aeronautics and Astronautics*). [[Paper](https://arxiv.org/abs/2210.15933)]
* **RMFormer**: "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2308.03826)]

[[Back to Overview](#overview)]

### Other Detection Tasks
* X-supervised:
* **LOST**: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (*Valeo.ai*). [[Paper](https://arxiv.org/abs/2109.14279)][[PyTorch](https://github.com/valeoai/LOST)]
* **Omni-DETR**: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2203.16089)][[PyTorch](https://github.com/amazon-research/omni-detr)]
* **TokenCut**: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (*Univ. Grenoble Alpes, France*). [[Paper](https://arxiv.org/abs/2202.11539)][[PyTorch](https://github.com/YangtaoWANG95/TokenCut)][[Website](https://www.m-psi.fr/Papers/TokenCut2022/)]
* **WS-DETR**: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2207.05205)]
* **TRT**: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2208.01838)][[PyTorch](https://github.com/su-hui-zz/ReAttentionTransformer)]
* **TokenCut**: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (*Univ. Grenoble Alpes, France*). [[Paper](https://arxiv.org/abs/2209.00383)][[PyTorch](https://github.com/YangtaoWANG95/TokenCut)][[Website](https://www.m-psi.fr/Papers/TokenCut2022/)]
* **Semi-DETR**: "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2307.08095)][[Paddle (in construction)](https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/semi_det/semi_detr)][[PyTorch (JCZ404)](https://github.com/JCZ404/Semi-DETR)]
* **MoTok**: "Object Discovery from Motion-Guided Tokens", CVPR, 2023 (*Toyota*). [[Paper](https://arxiv.org/abs/2303.15555)][[PyTorch](https://github.com/zpbao/MoTok/)][[Website](https://zpbao.github.io/projects/MoTok/)]
* **CutLER**: "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.11320)][[PyTorch](https://github.com/facebookresearch/CutLER)][[Website](http://people.eecs.berkeley.edu/~xdwang/projects/CutLER/)]
* **ISA-TS**: "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2302.04973)]
* **MOST**: "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.05387)][[PyTorch](https://github.com/rssaketh/MOST/)][[Website](https://rssaketh.github.io/most)]
* **GenPromp**: "Generative Prompt Model for Weakly Supervised Object Localization", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2307.09756)][[PyTorch](https://github.com/callsys/GenPromp)]
* **SAT**: "Spatial-Aware Token for Weakly Supervised Object Localization", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2303.10438)][[PyTorch](https://github.com/wpy1999/SAT)]
* **ALWOD**: "ALWOD: Active Learning for Weakly-Supervised Object Detection", ICCV, 2023 (*Rutgers*). [[Paper](https://arxiv.org/abs/2309.07914)][[Code (in construction)](https://github.com/seqam-lab/ALWOD)]
* **HASSOD**: "HASSOD: Hierarchical Adaptive Self-Supervised Object Detection", NeurIPS, 2023 (*UIUC*). [[Paper](https://github.com/Shengcao-Cao/HASSOD)][[PyTorch](https://github.com/Shengcao-Cao/HASSOD)][[Website](https://hassod-neurips23.github.io/)]
* **SeqCo-DETR**: "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2303.08481)]
* **R-MAE**: "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2306.05411)]
* **SimDETR**: "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2307.15697)]
* **U2Seg**: "Unsupervised Universal Image Segmentation", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2312.17243)][[PyTorch](https://github.com/u2seg/U2Seg)]
* **CuVLER**: "CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers", CVPR, 2024 (*Technion - Israel Institute of Technology*). [[Paper](https://arxiv.org/abs/2403.07700)][[PyTorch](https://github.com/shahaf-arica/CuVLER)]
* **Sparse-Semi-DETR**: "Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection", CVPR, 2024 (*DFKI, Germany*). [[Paper](https://arxiv.org/abs/2404.01819)]
* X-Shot Object Detection:
* **AIT**: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (*Academia Sinica*). [[Paper](https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Adaptive_Image_Transformer_for_One-Shot_Object_Detection_CVPR_2021_paper.html)]
* **Meta-DETR**: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (*NTU Singapore*). [[Paper](https://arxiv.org/abs/2103.11731)][[PyTorch](https://github.com/ZhangGongjie/Meta-DETR)]
* **CAT**: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2104.14984)]
* **FCT**: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (*Columbia*). [[Paper](https://arxiv.org/abs/2203.15021)]
* **SaFT**: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2203.09093)]
* **TENET**: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (*ANU*). [[Paper](https://arxiv.org/abs/2210.16897)][[PyTorch](https://github.com/ZS123-lang/TENET)]
* **Meta-DETR**: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2208.00219)]
* **Incremental-DETR**: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (*NUS*). [[Paper](https://arxiv.org/abs/2205.04042)]
* **FS-DETR**: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", ICCV, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2210.04845)]
* **Meta-ZSDETR**: "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.09540)]
* **?**: "Revisiting Few-Shot Object Detection with Vision-Language Models", arXiv, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2312.14494)]
* Open-World/Vocabulary:
* **OW-DETR**: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (*IIAI*). [[Paper](https://arxiv.org/abs/2112.01513)][[PyTorch](https://github.com/akshitac8/OW-DETR)]
* **DetPro**: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2203.14940)][[PyTorch](https://github.com/dyabel/detpro)]
* **RegionCLIP**: "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2112.09106)][[PyTorch](https://github.com/microsoft/RegionCLIP)]
* **PromptDet**: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2203.16513)][[PyTorch](https://github.com/fcjian/PromptDet)][[Website](https://fcjian.github.io/promptdet/)]
* **OV-DETR**: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2203.11876)]
* **VL-PLM**: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (*Rutgers University*). [[Paper](https://arxiv.org/abs/2207.08954)][[PyTorch](https://github.com/xiaofeng94/VL-PLM)][[Website](https://www.nec-labs.com/~mas/VL-PLM/)]
* **DetCLIP**: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2209.09407)]
* **WWbL**: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 (*Tel-Aviv*). [[Paper](https://arxiv.org/abs/2206.09358)][[PyTorch](https://github.com/talshaharabany/what-is-where-by-looking)][[Demo](https://replicate.com/talshaharabany/what-is-where-by-looking)]
* **P3OVD**: "P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2211.00849)]
* **Open-World-DETR**: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 (*NUS*). [[Paper](https://arxiv.org/abs/2212.02969)]
* **BARON**: "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2302.13996)][[PyTorch](https://github.com/wusize/ovdet)]
* **CapDet**: "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2303.02489)]
* **CORA**: "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.13076)][[PyTorch](https://github.com/tgxs002/CORA)]
* **UniDetector**: "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2303.11749)][[PyTorch](https://github.com/zhenyuw16/UniDetector)]
* **DetCLIPv2**: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2304.04514)]
* **RO-ViT**: "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2305.07011)]
* **CAT**: "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 (*Northeast University, China*). [[Paper](https://arxiv.org/abs/2301.01970)][[PyTorch](https://github.com/xiaomabufei/CAT)]
* **CondHead**: "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 (*Sichuan University*). [[Paper](https://arxiv.org/abs/2212.12130)]
* **OADP**: "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2303.05892)][[PyTorch](https://github.com/LutingWang/OADP)]
* **OVAD**: "Open-vocabulary Attribute Detection", CVPR, 2023 (*University of Freiburg, Germany*). [[Paper](https://arxiv.org/abs/2211.12914)][[Website](https://ovad-benchmark.github.io/)]
* **OvarNet**: "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 (*Xiaohongshu*). [[Paper](https://arxiv.org/abs/2301.09506)][[Website](https://kyanchen.github.io/OvarNet/)][[PyTorch](https://github.com/KyanChen/OvarNet)]
* **ALLOW**: "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Ma_Annealing-Based_Label-Transfer_Learning_for_Open_World_Object_Detection_CVPR_2023_paper.html)][[PyTorch](https://github.com/DIG-Beihang/ALLOW)]
* **PROB**: "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2212.01424)][[PyTorch](https://github.com/orrzohar/PROB)][[Website](https://orrzohar.github.io/projects/prob/)]
* **RandBox**: "Random Boxes Are Open-world Object Detectors", ICCV, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2307.08249)][[PyTorch](https://github.com/scuwyh2000/RandBox)]
* **Cascade-DETR**: "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 (*ETHZ + HKUST*). [[Paper](https://arxiv.org/abs/2307.11035)][[PyTorch](https://github.com/SysCV/cascade-detr)]
* **EdaDet**: "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2309.01151)][[Website](https://chengshiest.github.io/edadet/)]
* **V3Det**: "V3Det: Vast Vocabulary Visual Detection Dataset", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2304.03752)][[GitHub](https://github.com/V3Det/V3Det)][[Website](https://v3det.openxlab.org.cn/)]
* **CoDet**: "CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection", NeurIPS, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2310.16667)][[PyTorch](https://github.com/CVMI-Lab/CoDet)]
* **DAMEX**: "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", NeurIPS, 2023 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2311.04894)][[Code (in construction)](https://github.com/jinga-lala/DAMEX)]
* **OWL-ST**: "Scaling Open-Vocabulary Object Detection", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2306.09683)]
* **MQ-Det**: "Multi-modal Queried Object Detection in the Wild", NeurIPS, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2305.18980)][[PyTorch](https://github.com/YifanXu74/MQ-Det)]
* **Grounding-DINO**: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.05499)]
* **GridCLIP**: "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 (*Queen Mary University of London*). [[Paper](https://arxiv.org/abs/2303.09252)]
* **?**: "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2303.13518)]
* **PCL**: "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 (*Kakao*). [[Paper](https://arxiv.org/abs/2303.13040)]
* **Prompt-OVD**: "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2303.14386)]
* **LOWA**: "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 (*Mineral, California*). [[Paper](https://arxiv.org/abs/2305.20047)]
* **SGDN**: "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2307.03339)]
* **SAS-Det**: "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 (*NEC*). [[Paper](https://arxiv.org/abs/2308.06412)]
* **DE-ViT**: "Detect Every Thing with Few Examples", arXiv, 2023 (*Rutgers*). [[Paper](https://arxiv.org/abs/2309.12969)][[PyTorch](https://github.com/mlzxy/devit)]
* **CLIPSelf**: "CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2310.01403)][[PyTorch](https://github.com/wusize/CLIPSelf)]
* **DST-Det**: "DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2310.01393)][[Code (in construction)](https://github.com/xushilin1/dst-det)]
* **DITO**: "Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2310.00161)]
* **RegionSpot**: "Recognize Any Regions", arXiv, 2023 (*University Of Surrey, England*). [[Paper](https://arxiv.org/abs/2311.01373)][[Code (in construction)](https://github.com/Surrey-UPLab/Recognize-Any-Regions)]
* **DECOLA**: "Language-conditioned Detection Transformer", arXiv, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2311.17902)][[PyTorch](https://github.com/janghyuncho/DECOLA)]
* **PLAC**: "Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection", arXiv, 2023 (*Kakao*). [[Paper](https://arxiv.org/abs/2312.02103)]
* **FOMO**: "Open World Object Detection in the Era of Foundation Models", arXiv, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2312.05745)][[Website](https://orrzohar.github.io/projects/fomo/)]
* **LP-OVOD**: "LP-OVOD: Open-Vocabulary Object Detection by Linear Probing", WACV, 2024 (*VinAI, Vietnam*). [[Paper](https://arxiv.org/abs/2310.17109)]
* **ProxyDet**: "ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection", WACV, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2312.07266)]
* **WSOVOD**: "Weakly Supervised Open-Vocabulary Object Detection", AAAI, 2024 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2312.12437)][[Code (in construction)](https://github.com/HunterJ-Lin/WSOVOD)]
* **CLIM**: "CLIM: Contrastive Language-Image Mosaic for Region Representation", AAAI, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2312.11376)][[PyTorch](https://github.com/wusize/CLIM)]
* **SS-OWFormer**: "Semi-supervised Open-World Object Detection", AAAI, 2024 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2402.16013)][[PyTorch](https://github.com/sahalshajim/SS-OWFormer)]
* **DVDet**: "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors", ICLR, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2402.04630)]
* **GenerateU**: "Generative Region-Language Pretraining for Open-Ended Object Detection", CVPR, 2024 (*Monash University*). [[Paper](https://arxiv.org/abs/2403.10191)][[PyTorch](https://github.com/FoundationVision/GenerateU)]
* **DetCLIPv3**: "DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection", CVPR, 2024 (*Huawei*). [[Paper](https://arxiv.org/abs/2404.09216)]
* **RALF**: "Retrieval-Augmented Open-Vocabulary Object Detection", CVPR, 2024 (*Korea University*). [[Paper](https://arxiv.org/abs/2404.05687)][[Code (in construction)](https://github.com/mlvlab/RALF)]
* **SHiNe**: "SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection", CVPR, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2405.10053)]
* **MM-Grounding-DINO**: "An Open and Comprehensive Pipeline for Unified Object Grounding and Detection", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2401.02361)][[PyTorch](https://github.com/open-mmlab/mmdetection/tree/main/configs/grounding_dino)]
* **YOLO-World**: "YOLO-World: Real-Time Open-Vocabulary Object Detection", arXiv, 2024 (*Tencent*). [[Paper](https://arxiv.org/abs/2401.17270)][[Code (in construction)](https://github.com/AILab-CVC/YOLO-World)]
* **T-Rex2**: "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy", arXiv, 2024 (*IDEA*). [[Paper](https://arxiv.org/abs/2403.14610)][[PyTorch](https://github.com/IDEA-Research/T-Rex)][[Website](https://deepdataspace.com/home)]
* **Grounding-DINO-1.5**: "Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection", arXiv, 2024 (*IDEA*). [[Paper](https://arxiv.org/abs/2405.10300)][[Code](https://github.com/IDEA-Research/Grounding-DINO-1.5-API)]
* Pedestrian Detection:
* **PED**: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2012.06785)][[PyTorch](https://github.com/Hatmm/PED-DETR-for-Pedestrian-Detection)]
* **?**: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 (*ICL*). [[Paper](https://openreview.net/forum?id=eow_ZGaw24j)]
* **Pedestron**: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (*IIAI*). [[Paper](https://arxiv.org/abs/2201.03176)][[PyTorch](https://github.com/hasanirtiza/Pedestron)]
* **VLPD**: "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 (*University of Science and Technology Beijing*). [[Paper](https://arxiv.org/abs/2304.03135)][[PyTorch](https://github.com/lmy98129/VLPD)]
* Lane Detection:
* **LSTR**: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (*Xi'an Jiaotong*). [[Paper](https://arxiv.org/abs/2011.04233)][[PyTorch](https://github.com/liuruijin17/LSTR)]
* **LETR**: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (*UCSD*). [[Paper](https://arxiv.org/abs/2101.01909)][[PyTorch](https://github.com/mlpc-ucsd/LETR)]
* **Laneformer**: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (*Huawei*). [[Paper](https://arxiv.org/abs/2203.09830)]
* **TLC**: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (*Peking University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Tong_Transformer_Based_Line_Segment_Classifier_With_Image_Context_for_Real-Time_CVPR_2022_paper.html)]
* **PersFormer**: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (*Shanghai AI Laboratory*). [[Paper](https://arxiv.org/abs/2203.11089)][[PyTorch](https://github.com/OpenPerceptionX/OpenLane)]
* **MHVA**: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (*Beihang University*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/3918_ECCV_2022_paper.php)]
* **PriorLane**: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (*Zhejiang Lab*). [[Paper](https://arxiv.org/abs/2209.06994)][[PyTorch](https://github.com/vincentqqb/priorlane)]
* **CurveFormer**: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (*NullMax, China*). [[Paper](https://arxiv.org/abs/2209.07989)]
* **LATR**: "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.04583)][[PyTorch](https://github.com/JMoonr/LATR)]
* **O2SFormer**: "End to End Lane detection with One-to-Several Transformer", arXiv, 2023 (*Southeast University, China*). [[Paper](https://arxiv.org/abs/2305.00675)][[PyTorch](https://github.com/zkyseu/O2SFormer)]
* **Lane2Seq**: "Lane2Seq: Towards Unified Lane Detection via Sequence Generation", CVPR, 2024 (*Southeast University, China*). [[Paper](https://arxiv.org/abs/2402.17172)]
* Object Localization:
* **TS-CAM**: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2103.14862)]
* **LCTR**: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2112.05291)]
* **ViTOL**: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (*Mercedes-Benz*). [[Paper](https://arxiv.org/abs/2204.06772)][[PyTorch](https://github.com/Saurav-31/ViTOL)]
* **SCM**: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.10447)][[PyTorch](https://github.com/164140757/SCM)]
* **CaFT**: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2201.00475)]
* **CoW**: "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 (*Columbia*). [[Paper](https://arxiv.org/abs/2203.10421)][[PyTorch](https://github.com/columbia-ai-robotics/cow)][[Website](https://cow.cs.columbia.edu/)]
* **ESC**: "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 (*UCSC*). [[Paper](https://arxiv.org/abs/2301.13166)]
* Relation Detection:
* **PST**: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2105.02170)]
* **PST**: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2105.02170)]
* **TROI**: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (*NUS, Singapore*). [[Paper](https://arxiv.org/abs/2106.03162)]
* **RelTransformer**: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (*KAUST*). [[Paper](https://arxiv.org/abs/2104.11934)][[PyTorch](https://github.com/Vision-CAIR/RelTransformer)]
* **VReBERT**: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (*ANU*). [[Paper](https://arxiv.org/abs/2206.09111)]
* **UniVRD**: "Unified Visual Relationship Detection with Vision and Language Models", ICCV, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2303.08998)][[Code (in construction)](https://github.com/google-research/scenic/tree/main/scenic/projects/univrd)]
* **RECODE**: "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", NeurIPS, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2305.12476)]
* **SG-ViT**: "Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection", arXiv, 2024 (*DeepMind*). [[Paper](https://arxiv.org/abs/2403.14270)]
* Anomaly Detection:
* **VT-ADL**: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (*University of Udine, Italy*). [[Paper](https://arxiv.org/abs/2104.10036)]
* **InTra**: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (*Fujitsu*). [[Paper](https://arxiv.org/abs/2104.13897)]
* **AnoViT**: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2203.10808)]
* **WinCLIP**: "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2303.14814)]
* **M3DM**: "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2303.00601)][[PyTorch](https://github.com/nomewang/M3DM)]
* Cross-Domain:
* **SSTN**: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (*Gwangju Institute of Science and Technology*). [[Paper](https://arxiv.org/abs/2103.03150)]
* **MTTrans**: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (*Beihang University*). [[Paper](https://arxiv.org/abs/2205.01643)]
* **OAA-OTA**: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.14195)]
* **SSTA**: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (*University of Electronic Science and Technology of China*). [[Paper](https://arxiv.org/abs/2206.00222)]
* **DETR-GA**: "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2304.07082)]
* **DA-DETR**: "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2103.17084)]
* **?**: "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 (*EPFL*). [[Paper](https://arxiv.org/abs/2301.05499)][[PyTorch](https://github.com/vidit09/domaingen)]
* **PM-DETR**: "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2307.00313)]
* Co-Salient Object Detection:
* **CoSformer**: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2104.14729)]
* Oriented Object Detection:
* **O2DETR**: "Oriented Object Detection with Transformer", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.03146)]
* **AO2-DETR**: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2205.12785)]
* **ARS-DETR**: "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2303.04989)][[PyTorch](https://github.com/httle/ARS-DETR)]
* **RHINO**: "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 (*SI Analytics*). [[Paper](https://arxiv.org/abs/2305.07598)]
* Multiview Detection:
* **MVDeTr**: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (*ANU*). [[Paper](https://arxiv.org/abs/2108.05888)]
* Polygon Detection:
* **?**: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (*Delft University of Technology, Netherlands*). [[Paper](https://arxiv.org/abs/2108.07533)]
* Drone-view:
* **TPH**: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (*Beihang University*). [[Paper](https://arxiv.org/abs/2108.11539)]
* **TransVisDrone**: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (*UCF*). [[Paper](https://arxiv.org/abs/2210.08423)][[Code (in construction)](https://github.com/tusharsangam/TransVisDrone)]
* Infrared:
* **?**: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (*Chongqing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2109.14379)]
* **MiPa**: "MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection", arXiv, 2024 (*ETS Montreal*). [[Paper](https://arxiv.org/abs/2404.18849)][[Code (in construction)](https://github.com/heitorrapela/MiPa)]
* Text Detection:
* **SwinTextSpotter**: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2203.10209)][[PyTorch](https://github.com/mxin262/SwinTextSpotter)]
* **TESTR**: "Text Spotting Transformers", CVPR, 2022 (*UCSD*). [[Paper](https://arxiv.org/abs/2204.01918)][[PyTorch](https://github.com/mlpc-ucsd/TESTR)]
* **TTS**: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2202.05508)]
* **oCLIP**: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2203.03911)]
* **TransDETR**: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2203.10539)][[PyTorch](https://github.com/weijiawu/TransDETR)]
* **?**: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (*University of Waterloo, Canada*). [[Paper](https://arxiv.org/abs/2202.11221)]
* **?**: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (*University of Science and Technology Beijing*). [[Paper](https://arxiv.org/abs/2205.05320)][[Code (in construction)](https://github.com/GXYM/TextBPN-Plus-Plus)]
* **DPTNet**: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2208.09878)]
* **ATTR**: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 (*Fudan*). [[Paper](https://arxiv.org/abs/2211.13984)]
* **DPText-DETR**: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2207.04491)][[PyTorch](https://github.com/ymy-k/DPText-DETR)]
* **TCM**: "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2302.14338)][[PyTorch](https://github.com/wenwenyu/TCM)]
* **DeepSolo**: "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2211.10772)][[PyTorch](https://github.com/ViTAE-Transformer/DeepSolo)]
* **ESTextSpotter**: "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2308.10147)][[PyTorch](https://github.com/mxin262/ESTextSpotter)]
* **PBFormer**: "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2308.15004)]
* **DeepSolo++**: "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2305.19957)][[PyTorch](https://github.com/ViTAE-Transformer/DeepSolo)]
* **FastTCM**: "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2308.10408)][[PyTorch](https://github.com/wenwenyu/TCM)]
* **SRFormer**: "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2308.10531)]
* **TGA**: "Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis", CVPR, 2024 (*Microsoft*). [[Paper](https://arxiv.org/abs/2405.07481)]
* **SwinTextSpotter-v2**: "SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting", arXiv, 2024 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2401.07641)]
* Change Detection:
* **ChangeFormer**: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (*JHU*). [[Paper](https://arxiv.org/abs/2201.01293)][[PyTorch](https://github.com/wgcban/ChangeFormer)]
* **IDET**: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (*Civil Aviation University of China*). [[Paper](https://arxiv.org/abs/2207.09240)]
* Edge Detection:
* **EDTER**: "EDTER: Edge Detection with Transformer", CVPR, 2022 (*Beijing Jiaotong University*). [[Paper](https://arxiv.org/abs/2203.08566)][[Code (in construction)](https://github.com/MengyangPu/EDTER)]
* **HEAT**: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (*Simon Fraser*). [[Paper](https://arxiv.org/abs/2111.15143)][[PyTorch](https://github.com/woodfrog/heat)][[Website](https://heat-structured-reconstruction.github.io/)]
* Person Search:
* **COAT**: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (*Kitware*). [[Paper](https://arxiv.org/abs/2203.09642)][[PyTorch](https://github.com/Kitware/COAT)]
* **PSTR**: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2204.03340)][[PyTorch](https://github.com/JialeCao001/PSTR)]
* Manipulation Detection:
* **ObjectFormer**: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2203.14681)]
* Mirror Detection:
* **SATNet**: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2207.06332)][[PyTorch](https://github.com/tyhuang0428/SATNet)]
* Shadow Detection:
* **SCOTCH-SODA**: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 (*University of Cambridge*). [[Paper](https://arxiv.org/abs/2211.06885)]
* Keypoint Detection:
* **SalViT**: "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 (*ANU*). [[Paper](https://arxiv.org/abs/2304.03140)]
* Continual Learning:
* **CL-DETR**: "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 (*MPI*). [[Paper](https://arxiv.org/abs/2304.03110)]
* Visual Query Detection/Localization:
* **CocoFormer**: "Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2211.10528)][[PyTorch](https://github.com/facebookresearch/vq2d_cvpr)]
* **VQLoC**: "Single-Stage Visual Query Localization in Egocentric Videos", NeurIPS, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2306.09324)][[PyTorch](https://github.com/hwjiang1510/VQLoC)][[Website](https://hwjiang1510.github.io/VQLoC/)]
* Task-Driven Object Detection:
* **CoTDet**: "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2309.01093)]
* Diffusion:
* **DiffusionEngine**: "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2309.03893)][[PyTorch](https://github.com/bytedance/DiffusionEngine)][[Website](https://mettyz.github.io/DiffusionEngine/)]
* **TADP**: "Text-image Alignment for Diffusion-based Perception", arXiv, 2023 (*CalTech*). [[Paper](https://arxiv.org/abs/2310.00031)][[Website](https://www.vision.caltech.edu/tadp/)]
* **InstaGen**: "InstaGen: Enhancing Object Detection by Training on Synthetic Dataset", arXiv, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2402.05937)][[Code (in construction)](https://github.com/fcjian/InstaGen)][[Website](https://fcjian.github.io/InstaGen/)]

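Most of the open-vocabulary detectors listed in this section (e.g., RegionCLIP, OWL-ST, Grounding-DINO, YOLO-World) expose the same interface at inference time: free-form text prompts in, scored boxes out. Below is a minimal sketch of that interface, assuming the Hugging Face `transformers` zero-shot-object-detection pipeline with a public OWL-ViT-style checkpoint; the model name, image path, prompts, and threshold are illustrative assumptions and are not tied to any specific paper above.

```python
# Minimal open-vocabulary detection sketch: prompt with text, get scored boxes.
# Assumes: pip install transformers torch pillow; checkpoint name and inputs are illustrative.
from PIL import Image
from transformers import pipeline

detector = pipeline(
    task="zero-shot-object-detection",
    model="google/owlvit-base-patch32",  # any OWL-ViT-style checkpoint works here
)

image = Image.open("street.jpg")  # placeholder input image
prompts = ["a pedestrian", "a traffic light", "a bicycle"]  # free-form, open-vocabulary queries

for det in detector(image, candidate_labels=prompts, threshold=0.1):
    box = det["box"]  # dict with pixel coords: xmin, ymin, xmax, ymax
    print(det["label"], round(det["score"], 2),
          (box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
```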
[[Back to Overview](#overview)]

## Segmentation
### Semantic Segmentation
* **SETR**: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (*Tencent*). [[Paper](https://arxiv.org/abs/2012.15840)][[PyTorch](https://github.com/fudan-zvg/SETR)][[Website](https://fudan-zvg.github.io/SETR/)]
* **TrSeg**: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (*Korea University*). [[Paper](https://www.sciencedirect.com/science/article/abs/pii/S016786552100163X)][[PyTorch](https://github.com/youngsjjn/TrSeg)]
* **CWT**: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2108.03032)][[PyTorch](https://github.com/zhiheLu/CWT-for-FSS)]
* **Segmenter**: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (*INRIA*). [[Paper](https://arxiv.org/abs/2105.05633)][[PyTorch](https://github.com/rstrudel/segmenter)]
* **UN-EPT**: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2107.14209)][[PyTorch](https://github.com/amazon-research/unified-ept)]
* **FTN**: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.04108)]
* **SegFormer**: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2105.15203)][[PyTorch](https://github.com/NVlabs/SegFormer)]
* **MaskFormer**: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 (*UIUC + Facebook*). [[Paper](https://arxiv.org/abs/2107.06278)][[Website](https://bowenc0221.github.io/maskformer/)]
* **OffRoadTranSeg**: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (*IISER, India*). [[Paper](https://arxiv.org/abs/2106.13963)]
* **TRFS**: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (*ETHZ*). [[Paper](https://arxiv.org/abs/2108.02266)]
* **Flying-Guide-Dog**: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (*KIT, Germany*). [[Paper](https://arxiv.org/abs/2108.07007)][[Code (in construction)](https://github.com/EckoTan0804/flying-guide-dog)]
* **VSPW**: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (*Xiaomi*). [[Paper](https://arxiv.org/abs/2109.01316)]
* **SDTP**: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (*?*). [[Paper](https://arxiv.org/abs/2109.08963)]
* **TopFormer**: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2204.05525)][[PyTorch](https://github.com/hustvl/TopFormer)]
* **HRViT**: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2111.01236)][[PyTorch](https://github.com/facebookresearch/HRViT)]
* **GReaT**: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2209.09545)]
* **SegDeformer**: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (*Shanghai Jiao Tong + Huawei*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/383_ECCV_2022_paper.php)][[PyTorch](https://github.com/lygsbw/segdeformer)]
* **PAUMER**: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 (*Idiap, Switzerland*). [[Paper](https://bmvc2022.mpi-inf.mpg.de/737/)]
* **SegViT**: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2210.05844)][[PyTorch](https://github.com/zbwxp/SegVit)]
* **RTFormer**: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2210.07124)][[Paddle](https://github.com/PaddlePaddle/PaddleSeg)]
* **SegNeXt**: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2209.08575)]
* **Lawin**: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2201.01615)][[PyTorch](https://github.com/yan-hao-tian/lawin)]
* **PFT**: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (*CUHK + SenseTime*). [[Paper](https://arxiv.org/abs/2201.04019)]
* **DFlatFormer**: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (*OPPO*). [[Paper](https://arxiv.org/abs/2201.09139)]
* **FeSeFormer**: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2203.14124)]
* **StructToken**: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2203.12612)]
* **HILA**: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (*University of Toronto*). [[Paper](https://arxiv.org/abs/2207.02126)][[Website](https://www.cs.toronto.edu/~garyleung/hila/)][[PyTorch](https://github.com/fidler-lab/hila)]
* **HLG**: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2207.09339)][[PyTorch](https://github.com/fudan-zvg/SETR)]
* **SSformer**: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (*Nanjing University of Aeronautics and Astronautics*). [[Paper](https://arxiv.org/abs/2208.02034)][[PyTorch](https://github.com/shiwt03/SSformer)]
* **NamedMask**: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2209.11228)][[PyTorch](https://github.com/NoelShin/namedmask)][[Website](https://www.robots.ox.ac.uk/~vgg/research/namedmask/)]
* **IncepFormer**: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 (*Nanjing University of Aeronautics and Astronautics*). [[Paper](https://arxiv.org/abs/2212.03035)][[PyTorch](https://github.com/shendu0321/IncepFormer)]
* **SeaFormer**: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.13156)]
* **PPL**: "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 (*Yonsei*). [[Paper](https://arxiv.org/abs/2304.00779)]
* **AFF**: "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2304.12406)]
* **CTS**: "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 (*Eindhoven University of Technology, Netherlands*). [[Paper](https://arxiv.org/abs/2306.02095)][[PyTorch](https://github.com/tue-mps/cts-segmenter)][[Website](https://tue-mps.github.io/CTS/)]
* **TSG**: "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2205.07056)]
* **FASeg**: "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2204.01244)][[PyTorch](https://github.com/ziplab/FASeg)]
* **HFD-BSD**: "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2307.12574)]
* **DToP**: "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", ICCV, 2023 (*South China University of Technology + The University of Adelaide*). [[Paper](https://arxiv.org/abs/2308.01045)]
* **FreeMask**: "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models", NeurIPS, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2310.15160)][[PyTorch](https://github.com/LiheYoung/FreeMask)]
* **AiluRus**: "AiluRus: A Scalable ViT Framework for Dense Prediction", NeurIPS, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2311.01197)][[Code (in construction)](https://github.com/caddyless/AiluRus)]
* **SegViTv2**: "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2306.06289)][[PyTorch](https://github.com/zbwxp/SegVit)]
* **DoViT**: "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.01944)]
* **CFT**: "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2308.05581)]
* **ICPC**: "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2308.07078)]
* **Superpixel-Association**: "Superpixel Transformers for Efficient Semantic Segmentation", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2309.16889)]
* **PlainSeg**: "Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2310.12755)][[PyTorch](https://github.com/ydhongHIT/PlainSeg)]
* **SCTNet**: "SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation", AAAI, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2312.17071)][[Code (in construction)](https://github.com/xzz777/SCTNet)]
* **?**: "Region-Based Representations Revisited", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2402.02352)]

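At inference time, most of the transformer segmenters above (SETR, Segmenter, SegFormer, MaskFormer, ...) follow the same recipe: encode the image, predict class logits at reduced resolution, then upsample to the input size. A minimal sketch of that recipe, assuming the Hugging Face `transformers` port of SegFormer; the checkpoint name and image path are illustrative assumptions.

```python
# Minimal semantic-segmentation inference sketch with a SegFormer-style model.
# Assumes: pip install transformers torch pillow; checkpoint name and image path are illustrative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"  # assumed public checkpoint
processor = AutoImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt).eval()

image = Image.open("scene.jpg")  # placeholder input
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits             # (1, num_classes, H/4, W/4)

logits = torch.nn.functional.interpolate(        # upsample back to the input resolution
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
pred = logits.argmax(dim=1)[0]                   # (H, W) class index per pixel
print(pred.shape, pred.unique())
```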
[[Back to Overview](#overview)]

### Depth Estimation
* **DPT**: "Vision Transformers for Dense Prediction", ICCV, 2021 (*Intel*). [[Paper](https://arxiv.org/abs/2103.13413)][[PyTorch](https://github.com/intel-isl/DPT)]
* **TransDepth**: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (*Harbin Institute of Technology + University of Trento*). [[Paper](https://arxiv.org/abs/2103.12091)][[PyTorch](https://github.com/ygjwd12345/TransDepth)]
* **ASTransformer**: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (*USTC*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0244.pdf)][[PyTorch](https://github.com/WJ-Chang-42/ASTransformer)]
* **MT-SfMLearner**: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAPP, 2022 (*NavInfo Europe, Netherlands*). [[Paper](https://arxiv.org/abs/2202.03131)]
* **DepthFormer**: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (*Toyota*). [[Paper](https://arxiv.org/abs/2204.07616)]
* **GuideFormer**: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (*Agency for Defense Development, Korea*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Rho_GuideFormer_Transformers_for_Image_Guided_Depth_Completion_CVPR_2022_paper.html)]
* **SparseFormer**: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2206.04557)]
* **DEST**: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2204.13791)]
* **MonoViT**: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (*University of Bologna, Italy*). [[Paper](https://arxiv.org/abs/2208.03543)][[PyTorch](https://github.com/zxcqlf/MonoViT)]
* **Spike-Transformer**: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (*Peking University*). [[Paper]()][[PyTorch](https://github.com/Leozhangjiyuan/MDE-SpikingCamera)]
* **?**: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 (*IIT Madras*). [[Paper](https://arxiv.org/abs/2211.11066)]
* **GLPanoDepth**: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2202.02796)]
* **DepthFormer**: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2203.14211)][[PyTorch](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox)]
* **BinsFormer**: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.00987)][[PyTorch](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox)]
* **SideRT**: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2204.13892)]
* **MonoFormer**: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (*DGIST, Korea*). [[Paper](https://arxiv.org/abs/2205.11083)]
* **Depthformer**: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (*Indian Institute of Technology Delhi*). [[Paper](https://arxiv.org/abs/2207.04535)]
* **TODE-Trans**: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (*USTC*). [[Paper](https://arxiv.org/abs/2209.08455)][[Code (in construction)](https://github.com/yuchendoudou/TODE)]
* **ObjCAViT**: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 (*ICL*). [[Paper](https://arxiv.org/abs/2211.17232)]
* **ROIFormer**: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 (*OPPO*). [[Paper](https://arxiv.org/abs/2212.05729)]
* **TST**: "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2306.05682)]
* **CompletionFormer**: "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 (*University of Bologna, Italy*). [[Paper](https://arxiv.org/abs/2304.13030)][[PyTorch](https://github.com/youmi-zym/CompletionFormer)][[Website](https://youmi-zym.github.io/projects/CompletionFormer/)]
* **Lite-Mono**: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 (*University of Twente, Netherlands*). [[Paper](https://arxiv.org/abs/2211.13202)][[PyTorch](https://github.com/noahzn/Lite-Mono)]
* **EGformer**: "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", ICCV, 2023 (*SNU*). [[Paper](https://arxiv.org/abs/2304.07803)]
* **ZeroDepth**: "Towards Zero-Shot Scale-Aware Monocular Depth Estimation", ICCV, 2023 (*Toyota*). [[Paper](https://arxiv.org/abs/2306.17253)][[PyTorch](https://github.com/tri-ml/vidar)][[Website](https://sites.google.com/view/tri-zerodepth)]
* **Win-Win**: "Win-Win: Training High-Resolution Vision Transformers from Two Windows", arXiv, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2310.00632)]
* **?**: "Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation", WACV, 2024 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2311.01034)]
* **DeCoTR**: "DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions", CVPR, 2024 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2403.12202)]
* **Depth-Anything**: "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data", arXiv, 2024 (*TikTok*). [[Paper](https://arxiv.org/abs/2401.10891)][[PyTorch](https://github.com/LiheYoung/Depth-Anything)][[Website](https://depth-anything.github.io/)]

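The monocular depth models above (DPT, the DepthFormer variants, Depth-Anything, ...) take a single image in and produce a dense depth map out. A minimal inference sketch, assuming the Hugging Face `transformers` depth-estimation pipeline with the public DPT checkpoint; the model name and image path are illustrative assumptions.

```python
# Minimal monocular depth-estimation sketch with a DPT-style model.
# Assumes: pip install transformers torch pillow; checkpoint name and image path are illustrative.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(task="depth-estimation", model="Intel/dpt-large")

image = Image.open("room.jpg")            # placeholder input
result = depth_estimator(image)

depth_map = result["depth"]               # PIL image of per-pixel relative depth
depth_tensor = result["predicted_depth"]  # raw torch tensor before resizing
depth_map.save("room_depth.png")
print(depth_tensor.shape)
```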
[[Back to Overview](#overview)]

### Object Segmentation
* **SOTR**: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (*China Agricultural University*). [[Paper](https://arxiv.org/abs/2108.06747)][[PyTorch](https://github.com/easton-cau/SOTR)]
* **Trans4Trans**: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2107.03172)][[Code (in construction)](https://github.com/jamycheung/Trans4Trans)]
* **Trans2Seg**: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (*HKU + SenseTime*). [[Paper](https://arxiv.org/abs/2101.08461)][[PyTorch](https://github.com/xieenze/Trans2Seg)]
* **SOIT**: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (*Hikvision*). [[Paper](https://arxiv.org/abs/2112.11037)][[PyTorch](https://github.com/hikvision-research/opera)]
* **CAST**: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (*Berkeley*). [[Paper](https://arxiv.org/abs/2210.00314)]
* **?**: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 (*Aalto University, Finland*). [[Paper](https://arxiv.org/abs/2210.14139)]
* **MSMFormer**: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 (*UT Dallas*). [[Paper](https://arxiv.org/abs/2211.11679)][[PyTorch](https://github.com/YoungSean/UnseenObjectsWithMeanShift)]

[[Back to Overview](#overview)]

### Other Segmentation Tasks
* Any-X/Every-X:
* **SAM**: "Segment Anything", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.02643)][[PyTorch](https://github.com/facebookresearch/segment-anything)][[Website](https://segment-anything.com/)]
* **SEEM**: "Segment Everything Everywhere All at Once", NeurIPS, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2304.06718)][[PyTorch](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)]
* **HQ-SAM**: "Segment Anything in High Quality", NeurIPS, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2306.01567)][[PyTorch](https://github.com/SysCV/SAM-HQ)]
* **?**: "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 (*UCSB*). [[Paper](https://arxiv.org/abs/2305.06422)]
* **?**: "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2305.08196)]
* **SAD**: "SAD: Segment Any RGBD", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2305.14207)][[PyTorch](https://github.com/Jun-CEN/SegmentAnyRGBD)]
* **?**: "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 (*Kyung Hee University, Korea*). [[Paper](https://arxiv.org/abs/2306.06211)]
* **?**: "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2306.07713)]
* **FastSAM**: "Fast Segment Anything", arXiv, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2306.12156)][[PyTorch](https://github.com/CASIA-IVA-Lab/FastSAM)]
* **MobileSAM**: "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2306.14289)][[PyTorch](https://github.com/ChaoningZhang/MobileSAM)]
* **Semantic-SAM**: "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2307.04767)][[Code (in construction)](https://github.com/UX-Decoder/Semantic-SAM)]
* **Follow-Anything**: "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 (*MIT*). [[Paper](https://arxiv.org/abs/2308.05737)]
* **DINOv**: "Visual In-Context Prompting", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2311.13601)][[Code (in construction)](https://github.com/UX-Decoder/DINOv)]
* **Stable-SAM**: "Stable Segment Anything Model", arXiv, 2023 (*Kuaishou*). [[Paper](https://arxiv.org/abs/2311.15776)][[Code (in construction)](https://github.com/fanq15/Stable-SAM)]
* **EfficientSAM**: "EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.00863)]
* **EdgeSAM**: "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2312.06660)][[PyTorch](https://github.com/chongzhou96/EdgeSAM)][[Website](https://mmlab-ntu.github.io/project/edgesam/)]
* **RepViT-SAM**: "RepViT-SAM: Towards Real-Time Segmenting Anything", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.05760)][[PyTorch](https://github.com/THU-MIG/RepViT)]
* **SlimSAM**: "0.1% Data Makes Segment Anything Slim", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2312.05284)][[PyTorch](https://github.com/czg1225/SlimSAM)]
* **FIND**: "Interfacing Foundation Models' Embeddings", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2312.07532)][[PyTorch (in construction)](https://github.com/UX-Decoder/FIND)][[Website](https://x-decoder-vl.github.io/)]
* **SqueezeSAM**: "SqueezeSAM: User-friendly mobile interactive segmentation", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2312.06736)]
* **TAP**: "Tokenize Anything via Prompting", arXiv, 2023 (*BAAI*). [[Paper](https://arxiv.org/abs/2312.09128)][[PyTorch](https://github.com/baaivision/tokenize-anything)]
* **MobileSAMv2**: "MobileSAMv2: Faster Segment Anything to Everything", arXiv, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2312.09579)][[PyTorch](https://github.com/ChaoningZhang/MobileSAM)]
* **TinySAM**: "TinySAM: Pushing the Envelope for Efficient Segment Anything Model", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2312.13789)][[PyTorch](https://github.com/xinghaochen/TinySAM)]
* **Conv-LoRA**: "Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model", ICLR, 2024 (*Amazon*). [[Paper](https://arxiv.org/abs/2401.17868)][[PyTorch](https://github.com/autogluon/autogluon)]
* **PerSAM**: "Personalize Segment Anything Model with One Shot", ICLR, 2024 (*CUHK*). [[Paper](https://arxiv.org/abs/2305.03048)][[PyTorch](https://github.com/ZrrSkywalker/Personalize-SAM)]
* **VRP-SAM**: "VRP-SAM: SAM with Visual Reference Prompt", CVPR, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2402.17726)]
* **UAD**: "Unsegment Anything by Simulating Deformation", CVPR, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2404.02585)][[PyTorch](https://github.com/jiahaolu97/anything-unsegmentable)]
* **ASAM**: "ASAM: Boosting Segment Anything Model with Adversarial Tuning", CVPR, 2024 (*vivo*). [[Paper](https://arxiv.org/abs/2405.00256)][[PyTorch](https://github.com/luckybird1994/ASAM)][[Website](https://asam2024.github.io/)]
* **PTQ4SAM**: "PTQ4SAM: Post-Training Quantization for Segment Anything", CVPR, 2024 (*Beihang*). [[Paper](https://arxiv.org/abs/2405.03144)][[PyTorch](https://github.com/chengtao-lv/PTQ4SAM)]
* **BA-SAM**: "BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model", arXiv, 2024 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2401.02317)]
* **OV-SAM**: "Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2401.02955)][[PyTorch](https://github.com/HarborYuan/ovsam)][[Website](https://www.mmlab-ntu.com/project/ovsam/)]
* **SSPrompt**: "Learning to Prompt Segment Anything Models", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2401.04651)]
* **RAP-SAM**: "RAP-SAM: Towards Real-Time All-Purpose Segment Anything", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2401.10228)][[PyTorch](https://github.com/xushilin1/RAP-SAM/)][[Website](https://xushilin1.github.io/rap_sam/)]
* **PA-SAM**: "PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation", arXiv, 2024 (*OPPO*). [[Paper](https://arxiv.org/abs/2401.13051)][[PyTorch](https://github.com/xzz2/pa-sam)]
* **Grounded-SAM**: "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks", arXiv, 2024 (*IDEA*). [[Paper](https://arxiv.org/abs/2401.14159)][[PyTorch](https://github.com/IDEA-Research/Grounded-Segment-Anything)]
* **EfficientViT-SAM**: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2402.05008)][[PyTorch](https://github.com/mit-han-lab/efficientvit)]
* **DeiSAM**: "DeiSAM: Segment Anything with Deictic Prompting", arXiv, 2024 (*TU Darmstadt, Germany*). [[Paper](https://arxiv.org/abs/2402.14123)]
* **CAT-SAM**: "CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segment Anything Model", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2402.03631)][[PyTorch (in construction)](https://github.com/weihao1115/cat-sam)][[Website](https://xiaoaoran.github.io/projects/CAT-SAM)]
* **BLO-SAM**: "BLO-SAM: Bi-level Optimization Based Overfitting-Preventing Finetuning of SAM", arXiv, 2024 (*UCSD*). [[Paper](https://arxiv.org/abs/2402.16338)][[PyTorch](https://github.com/importZL/BLO-SAM)]
* **P2SAM**: "Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation", arXiv, 2024 (*UMich*). [[Paper](https://arxiv.org/abs/2403.05433)]
* **RA**: "Practical Region-level Attack against Segment Anything Models", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2404.08255)]
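
The SAM-style models above share a promptable interface: embed the image once, then decode masks from point or box prompts; the lightweight variants (MobileSAM, EfficientSAM, EdgeSAM, ...) largely mirror it, and the same package's `SamAutomaticMaskGenerator` produces the prompt-free "segment everything" masks. A minimal sketch with the original `segment_anything` package; the checkpoint path, image path, and click coordinates are placeholders.

```python
# Minimal promptable-segmentation sketch with the segment_anything package.
# Assumes: pip install segment-anything opencv-python torch; paths and coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # HxWx3 RGB array
predictor.set_image(image)                        # one-time image embedding

point = np.array([[500, 375]])                    # (x, y) click prompt, placeholder coordinates
label = np.array([1])                             # 1 = foreground click, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
print(masks.shape, scores)                        # (3, H, W) candidate masks + predicted IoUs
```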
* Vision-Language:
* **LSeg**: "Language-driven Semantic Segmentation", ICLR, 2022 (*Cornell*). [[Paper](https://arxiv.org/abs/2201.03546)][[PyTorch](https://github.com/isl-org/lang-seg)]
* **ZegFormer**: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2112.07910)][[PyTorch](https://github.com/dingjiansw101/ZegFormer)]
* **CLIPSeg**: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (*University of Göttingen, Germany*). [[Paper](https://arxiv.org/abs/2112.10003)][[PyTorch](https://github.com/timojl/clipseg)]
* **DenseCLIP**: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2112.01518)][[PyTorch](https://github.com/raoyongming/DenseCLIP)][[Website](https://denseclip.ivg-research.xyz/)]
* **GroupViT**: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2202.11094)][[Website](https://jerryxu.net/GroupViT/)][[PyTorch](https://github.com/NVlabs/GroupViT)]
* **MaskCLIP**: "Extract Free Dense Labels from CLIP", ECCV, 2022 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2112.01071)][[PyTorch](https://github.com/chongzhou96/MaskCLIP)][[Website](https://www.mmlab-ntu.com/project/maskclip/)]
* **ViewCo**: "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2302.10307)][[Code (in construction)](https://github.com/pzhren/ViewCo)]
* **LMSeg**: "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2302.13495)]
* **VL-Fields**: "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 (*University of Edinburgh, UK*). [[Paper](https://arxiv.org/abs/2305.12427)][[Website](https://tsagkas.github.io/vl-fields/)]
* **X-Decoder**: "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2212.11270)][[PyTorch](https://github.com/microsoft/X-Decoder)][[Website](https://x-decoder-vl.github.io/)]
* **IFSeg**: "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2303.14396)][[PyTorch](https://github.com/alinlab/ifseg)]
* **SAZS**: "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2304.08491)][[PyTorch](https://github.com/Liuxinyv/SAZS)]
* **CLIP-S4**: "CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 (*Bosch*). [[Paper](https://arxiv.org/abs/2305.01040)]
* **D2Zero**: "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2305.13173)][[Code (in construction)](https://github.com/heshuting555/D2Zero)][[Website](https://henghuiding.github.io/D2Zero/)]
* **PADing**: "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2306.11087)][[PyTorch](https://github.com/heshuting555/PADing)][[Website](https://henghuiding.github.io/PADing/)]
* **LD-ZNet**: "LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2303.12343)][[PyTorch](https://github.com/koutilya-pnvr/LD-ZNet)][[Website](https://koutilya-pnvr.github.io/LD-ZNet/)]
* **MAFT**: "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation", NeurIPS, 2023 (*Picsart*). [[Paper](https://arxiv.org/abs/2310.00240)][[PyTorch](https://github.com/jiaosiyu1999/MAFT)]
* **PGSeg**: "Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation", NeurIPS, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2310.19001)][[PyTorch](https://github.com/Ferenas/PGSeg)]
* **MESS**: "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (*IBM*). [[Paper](https://arxiv.org/abs/2306.15521)][[PyTorch](https://github.com/blumenstiel/MESS)][[Website](https://blumenstiel.github.io/mess-benchmark/)]
* **ZegOT**: "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2301.12171)]
* **SimCon**: "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2302.03432)]
* **DiffusionSeg**: "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2303.09813)]
* **ASCG**: "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2304.01114)]
* **ClsCLIP**: "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 (*Eastern Institute for Advanced Study, China*). [[Paper](https://arxiv.org/abs/2304.06212)]
* **CLIPTeacher**: "CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation", arXiv, 2023 (*Nagoya University*). [[Paper](https://arxiv.org/abs/2310.02296)]
* **SAM-CLIP**: "SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding", arXiv, 2023 (*Apple*). [[Paper](https://arxiv.org/abs/2310.15308)]
* **GEM**: "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", arXiv, 2023 (*University of Bonn, Germany*). [[Paper](https://arxiv.org/abs/2312.00878)][[PyTorch](https://github.com/WalBouss/GEM)]
* **CaR**: "CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.07661)][[Code (in construction)](https://github.com/kevin-ssy/CLIP_as_RNN)][[Website](https://torrvision.com/clip_as_rnn/)]
* **SPT**: "Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation", AAAI, 2024 (*Beijing University of Posts and Telecommunications*). [[Paper](https://arxiv.org/abs/2312.12754)][[PyTorch (in construction)](https://github.com/clearxu/SPT)]
* **FMbSeg**: "Annotation Free Semantic Segmentation with Vision Foundation Models", arXiv, 2024 (*Toyota*). [[Paper](https://arxiv.org/abs/2403.09307)]
* Open-World/Vocabulary (see the sketch after this sub-list):
* **ViL-Seg**: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.08455)]
* **OVSS**: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2112.14757)][[PyTorch](https://github.com/MendelXu/zsseg.baseline)]
* **OpenSeg**: "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2112.12143)]
* **Fusioner**: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://arxiv.org/abs/2210.15138)][[Website](https://yyh-rain-song.github.io/Fusioner_webpage/)]
* **OVSeg**: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2210.04150)][[PyTorch](https://github.com/facebookresearch/ov-seg)][[Website](https://jeff-liangf.github.io/projects/ovseg/)]
* **ZegCLIP**: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 (*The University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2212.03588)][[PyTorch](https://github.com/ZiqinZhou66/ZegCLIP)]
* **TCL**: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 (*Kakao*). [[Paper](https://arxiv.org/abs/2212.00785)][[PyTorch](https://github.com/kakaobrain/tcl)]
* **ODISE**: "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2303.04803)][[PyTorch](https://github.com/NVlabs/ODISE)][[Website](https://jerryxu.net/ODISE/)]
* **Mask-free-OVIS**: "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 (*Salesforce*). [[Paper](https://arxiv.org/abs/2303.16891)][[PyTorch (in construction)](https://github.com/Vibashan/Maskfree-OVIS)]
* **FreeSeg**: "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.17225)]
* **SAN**: "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2302.12242)][[PyTorch](https://github.com/MendelXu/SAN)]
* **OVSegmentor**: "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2301.09121)][[PyTorch](https://github.com/Jazzcharles/OVSegmentor/)][[Website](https://jazzcharles.github.io/OVSegmentor/)]
* **PACL**: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2212.04994)]
* **MaskCLIP**: "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 (*UCSD*). [[Paper](https://arxiv.org/abs/2208.08984)][[Website](https://maskclip.github.io/)]
* **SegCLIP**: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2211.14813)][[PyTorch](https://github.com/ArrowLuo/SegCLIP)]
* **SWORD**: "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2308.04206)]
* **Grounded-Diffusion**: "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2301.05221)][[PyTorch](https://github.com/Lipurple/Grounded-Diffusion)][[Website](https://lipurple.github.io/Grounded_Diffusion/)]
* **SegPrompt**: "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.06531)][[PyTorch](https://github.com/aim-uofa/SegPrompt)]
* **CGG**: "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 (*SenseTime*). [[Paper](https://arxiv.org/abs/2301.00805)][[PyTorch](https://github.com/jzwu48033552/betrayed-by-captions)][[Website](https://www.mmlab-ntu.com/project/betrayed_caption/index.html)]
* **OpenSeeD**: "A Simple Framework for Open-Vocabulary Segmentation and Detection", ICCV, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.08131)][[PyTorch](https://github.com/IDEA-Research/OpenSeeD)]
* **OPSNet**: "Open-vocabulary Panoptic Segmentation with Embedding Modulation", ICCV, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2303.11324)]
* **GKC**: "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", ICCV, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.09181)]
* **ZeroSeg**: "Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only", ICCV, 2023 (*Meta*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Chen_Exploring_Open-Vocabulary_Semantic_Segmentation_from_CLIP_Vision_Encoder_Distillation_Only_ICCV_2023_paper.html)]
* **MasQCLIP**: "MasQCLIP for Open-Vocabulary Universal Image Segmentation", ICCV, 2023 (*UCSD*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Xu_MasQCLIP_for_Open-Vocabulary_Universal_Image_Segmentation_ICCV_2023_paper.html)][[PyTorch](https://github.com/mlpc-ucsd/MasQCLIP)][[Website](https://masqclip.github.io/)]
* **VLPart**: "Going Denser with Open-Vocabulary Part Segmentation", ICCV, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2305.11173)][[PyTorch](https://github.com/facebookresearch/VLPart)]
* **DeOP**: "Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network", ICCV, 2023 (*Meituan*). [[Paper](https://arxiv.org/abs/2304.01198)][[PyTorch](https://github.com/CongHan0808/DeOP)]
* **MixReorg**: "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2308.04829)]
* **OV-PARTS**: "OV-PARTS: Towards Open-Vocabulary Part Segmentation", NeurIPS (Datasets and Benchmarks), 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2310.05107)][[PyTorch](https://github.com/OpenRobotLab/OV_PARTS)]
* **HIPIE**: "Hierarchical Open-vocabulary Universal Image Segmentation", NeurIPS, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2307.00764)][[PyTorch](https://github.com/berkeley-hipie/HIPIE)][[Website](http://people.eecs.berkeley.edu/~xdwang/projects/HIPIE/)]
* **?**: "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", NeurIPS, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2309.00096)]
* **FC-CLIP**: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", NeurIPS, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2308.02487)][[PyTorch](https://github.com/bytedance/fc-clip)]
* **WLSegNet**: "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 (*IIT, New Delhi*). [[Paper](https://arxiv.org/abs/2302.14163)]
* **CAT-Seg**: "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*Korea University*). [[Paper](https://arxiv.org/abs/2303.11797)][[PyTorch](https://github.com/KU-CVLAB/CAT-Seg)][[Website](https://ku-cvlab.github.io/CAT-Seg/)]
* **MVP-SEG**: "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*Xiaohongshu, China*). [[Paper](https://arxiv.org/abs/2304.06957)]
* **TagCLIP**: "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2304.07547)]
* **OVDiff**: "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2306.09316)][[Website](https://www.robots.ox.ac.uk/~vgg/research/ovdiff/)]
* **UOVN**: "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 (*Monash University*). [[Paper](https://arxiv.org/abs/2307.08238)]
* **CLIP-DIY**: "CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free", arXiv, 2023 (*Warsaw University of Technology, Poland*). [[Paper](https://arxiv.org/abs/2309.14289)]
* **Entity**: "Rethinking Evaluation Metrics of Open-Vocabulary Segmentation", arXiv, 2023 (*Harbin Engineering University*). [[Paper](https://arxiv.org/abs/2311.03352)][[PyTorch](https://github.com/qqlu/Entity)]
* **OSM**: "Towards Open-Ended Visual Recognition with Large Language Model", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2311.08400)][[PyTorch](https://github.com/bytedance/OmniScient-Model)]
* **SED**: "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*Tianjin*). [[Paper](https://arxiv.org/abs/2311.15537)][[PyTorch (in construction)](https://github.com/xb534/SED)]
* **PnP-OVSS**: "Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2311.17095)]
* **SCLIP**: "SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference", arXiv, 2023 (*JHU*). [[Paper](https://arxiv.org/abs/2312.01597)]
* **GranSAM**: "Towards Granularity-adjusted Pixel-level Semantic Annotation", arXiv, 2023 (*UC Riverside*). [[Paper](https://arxiv.org/abs/2312.02420)]
* **Sambor**: "Boosting Segment Anything Model Towards Open-Vocabulary Learning", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2312.03628)][[Code (in construction)](https://github.com/ucas-vg/Sambor)]
* **SCAN**: "Open-Vocabulary Segmentation with Semantic-Assisted Calibration", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.04089)][[Code (in construction)](https://github.com/workforai/SCAN)]
* **Self-Seg**: "Self-Guided Open-Vocabulary Semantic Segmentation", arXiv, 2023 (*UvA*). [[Paper](https://arxiv.org/abs/2312.04539)]
* **OpenSD**: "OpenSD: Unified Open-Vocabulary Segmentation and Detection", arXiv, 2023 (*OPPO*). [[Paper](https://arxiv.org/abs/2312.06703)]
* **CLIP-DINOiser**: "CLIP-DINOiser: Teaching CLIP a few DINO tricks", arXiv, 2023 (*Warsaw University of Technology, Poland*). [[Paper](https://arxiv.org/abs/2312.12359)][[PyTorch](https://github.com/wysoczanska/clip_dinoiser)]
* **TagAlign**: "TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification", arXiv, 2023 (*Ant Group*). [[Paper](https://arxiv.org/abs/2312.14149)][[PyTorch](https://github.com/Qinying-Liu/TagAlign)][[Website](https://qinying-liu.github.io/Tag-Align/)]
* **OVFoodSeg**: "OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation", CVPR, 2024 (*Singapore Management University (SMU)*). [[Paper](https://arxiv.org/abs/2404.01409)]
* **FreeDA**: "Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation", CVPR, 2024 (*University of Modena and Reggio Emilia (UniMoRe), Italy*). [[Paper](https://arxiv.org/abs/2404.06542)][[Website](https://aimagelab.github.io/freeda/)]
* **S-Seg**: "Exploring Simple Open-Vocabulary Semantic Segmentation", arXiv, 2024 (*Oxford*). [[Paper](https://arxiv.org/abs/2401.12217)][[Code (in construction)](https://github.com/zlai0/S-Seg)]
* **PosSAM**: "PosSAM: Panoptic Open-vocabulary Segment Anything", arXiv, 2024 (*Qualcomm*). [[Paper](https://arxiv.org/abs/2403.09620)][[Code (in construction)](https://github.com/Vibashan/PosSAM)][[Website](https://vibashan.github.io/possam-web/)]
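
Many of the open-vocabulary methods above share a two-stage recipe: class-agnostic mask proposals are scored against text embeddings of arbitrary category names. The snippet below is a minimal, self-contained sketch of that recipe in plain PyTorch; random tensors stand in for the pretrained CLIP image/text encoders and the mask-proposal network, and all names and shapes are illustrative placeholders, not any paper's API.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; real methods use CLIP encoders and a mask-proposal decoder.
num_masks, embed_dim, H, W = 10, 512, 64, 64
class_names = ["cat", "grass", "sky"]            # arbitrary vocabulary at test time

pixel_feats = F.normalize(torch.randn(embed_dim, H, W), dim=0)                # CLIP-like dense features
text_embeds = F.normalize(torch.randn(len(class_names), embed_dim), dim=-1)   # encoded class names
proposals = (torch.rand(num_masks, H, W) > 0.5).float()                       # binary mask proposals

# Mask pooling: average the dense features inside each proposal, then
# classify every proposal by cosine similarity to the text embeddings.
area = proposals.flatten(1).sum(-1).clamp(min=1.0)                            # (N,)
mask_embeds = torch.einsum("nhw,dhw->nd", proposals, pixel_feats) / area[:, None]
mask_embeds = F.normalize(mask_embeds, dim=-1)

logits = 100.0 * mask_embeds @ text_embeds.t()                                # (N, C), CLIP-style scale
labels = logits.argmax(dim=-1)
print([class_names[i] for i in labels.tolist()])
```
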
* LLM-based:
* **LISA**: "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.00692)][[PyTorch](https://github.com/dvlab-research/LISA)]
* **PixelLM**: "PixelLM: Pixel Reasoning with Large Multimodal Model", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2312.02228)][[Code (in construction)](https://github.com/MaverickRen/PixelLM)][[Website](https://pixellm.github.io/)]
* **PixelLLM**: "Pixel Aligned Language Models", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.09237)][[Website](https://jerryxu.net/PixelLLM/)]
* **GSVA**: "GSVA: Generalized Segmentation via Multimodal Large Language Models", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2312.10103)]
* **LISA++**: "An Improved Baseline for Reasoning Segmentation with Large Language Model", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2312.17240)]
* **GROUNDHOG**: "GROUNDHOG: Grounding Large Language Models to Holistic Segmentation", CVPR, 2024 (*Amazon*). [[Paper](https://arxiv.org/abs/2402.16846)][[Website](https://groundhog-mllm.github.io/)]
* **PSALM**: "PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model", arXiv, 2024 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2403.14598)][[PyTorch](https://github.com/zamling/PSALM)]
* **LLaVASeg**: "Empowering Segmentation Ability to Multi-modal Large Language Models", arXiv, 2024 (*vivo*). [[Paper](https://arxiv.org/abs/2403.14141)]
* **LaSagnA**: "LaSagnA: Language-based Segmentation Assistant for Complex Queries", arXiv, 2024 (*Meituan*). [[Paper](https://arxiv.org/abs/2404.08506)][[PyTorch](https://github.com/congvvc/LaSagnA)]
* Universal Segmentation (see the sketch after this sub-list):
* **K-Net**: "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2106.14855)][[PyTorch](https://github.com/ZwwWayne/K-Net/)]
* **Mask2Former**: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2112.01527)][[PyTorch](https://github.com/facebookresearch/Mask2Former)][[Website](https://bowenc0221.github.io/mask2former/)]
* **MP-Former**: "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 (*IDEA*). [[Paper](https://arxiv.org/abs/2303.07336)][[Code (in construction)](https://github.com/IDEA-Research/MP-Former)]
* **OneFormer**: "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 (*Oregon*). [[Paper](https://arxiv.org/abs/2211.06220)][[PyTorch](https://github.com/SHI-Labs/OneFormer)][[Website](https://praeclarumjj3.github.io/oneformer/)]
* **UNINEXT**: "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2303.06674)][[PyTorch](https://github.com/MasterBin-IIAU/UNINEXT)]
* **ClustSeg**: "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 (*Rochester Institute of Technology*). [[Paper](https://arxiv.org/abs/2305.02187)]
* **DaTaSeg**: "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.01736)]
* **DFormer**: "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2306.03437)][[Code (in construction)](https://github.com/cp3wan/DFormer)]
* **?**: "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 (*OMRON SINIC X, Japan*). [[Paper](https://arxiv.org/abs/2307.02862)]
* **Mask2Anomaly**: "Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation", arXiv, 2023 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2309.04573)]
* **SegGen**: "SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis", arXiv, 2023 (*Adobe*). [[Paper](https://arxiv.org/abs/2311.03355)][[Code (in construction)](https://github.com/prismformore/seggen)][[Website](https://seggenerator.github.io/)]
* **PolyMaX**: "PolyMaX: General Dense Prediction with Mask Transformer", WACV, 2024 (*Google*). [[Paper](https://arxiv.org/abs/2311.05770)][[Tensorflow](https://github.com/google-research/deeplab2)]
* **PEM**: "PEM: Prototype-based Efficient MaskFormer for Image Segmentation", CVPR, 2024 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2402.19422)][[Code (in construction)](https://github.com/NiccoloCavagnero/PEM)]
* **OMG-Seg**: "OMG-Seg: Is One Model Good Enough For All Segmentation?", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2401.10229)][[PyTorch](https://github.com/lxtGH/OMG-Seg)][[Website](https://lxtgh.github.io/project/omg_seg/)]
* **Uni-OVSeg**: "Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision", arXiv, 2024 (*University of Sydney*). [[Paper](https://arxiv.org/abs/2402.08960)][[PyTorch (in construction)](https://github.com/DerrickWang005/Uni-OVSeg.pytorch)]
* **PRO-SCALE**: "Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation", arXiv, 2024 (*NEC*). [[Paper](https://arxiv.org/abs/2404.14657)]
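
Most of the universal segmentation models above (K-Net, Mask2Former, OneFormer, kMaX-DeepLab, ...) share the same mask-classification output format: a fixed set of learned queries, each predicting a class distribution and a mask. The sketch below shows, in plain PyTorch, how such outputs are typically decoded into semantic and panoptic-style predictions; random tensors replace a trained model, and the confidence handling is illustrative rather than any specific paper's post-processing.

```python
import torch

# Toy mask-classification outputs: Q queries, C classes plus a "no object" slot.
Q, C, H, W = 100, 19, 128, 128
class_logits = torch.randn(Q, C + 1)                 # last column = "no object"
mask_logits = torch.randn(Q, H, W)

# Semantic decoding: marginalize the per-query masks over their class distributions.
class_probs = class_logits.softmax(dim=-1)[:, :-1]   # drop "no object", (Q, C)
mask_probs = mask_logits.sigmoid()                   # (Q, H, W)
semseg = torch.einsum("qc,qhw->chw", class_probs, mask_probs)
semantic_pred = semseg.argmax(dim=0)                 # (H, W)

# Panoptic-style decoding: keep confident queries and assign each pixel to the
# highest-scoring surviving query (median stands in for a fixed threshold here).
scores, labels = class_probs.max(dim=-1)             # (Q,), (Q,)
keep = scores >= scores.median()
pixel_owner = (mask_probs[keep] * scores[keep, None, None]).argmax(dim=0)   # (H, W)
print(semantic_pred.shape, pixel_owner.shape)
```
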
* Multi-Modal:
* **UCTNet**: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (*Lehigh University, Pennsylvania*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/7082_ECCV_2022_paper.php)]
* **CMX**: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2203.04838)][[PyTorch](https://github.com/huaaaliu/RGBX_Semantic_Segmentation)]
* **DeLiVER**: "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2303.01480)][[PyTorch](https://github.com/jamycheung/DELIVER)][[Website](https://jamycheung.github.io/DELIVER.html)]
* **DFormer**: "DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation", arXiv, 2023 (*Nankai University*). [[Paper](https://arxiv.org/abs/2309.09668)][[PyTorch](https://github.com/VCIP-RGBD/DFormer)]
* **Sigma**: "Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation", arXiv, 2024 (*CMU*). [[Paper](https://arxiv.org/abs/2404.04256)][[PyTorch](https://github.com/zifuwan/Sigma)]
* Panoptic Segmentation:
* **MaX-DeepLab**: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2012.00759)][[PyTorch (conradry)](https://github.com/conradry/max-deeplab)]
* **SIAin**: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (*SI Analytics, South Korea*). [[Paper](https://arxiv.org/abs/2110.04009)]
* **VPS-Transformer**: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (*Technical University of Cluj-Napoca, Romania*). [[Paper](https://openaccess.thecvf.com/content/WACV2022/html/Petrovai_Time-Space_Transformers_for_Video_Panoptic_Segmentation_WACV_2022_paper.html)]
* **CMT-DeepLab**: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2206.08948)]
* **Panoptic-SegFormer**: "Panoptic SegFormer", CVPR, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2109.03814)][[PyTorch](https://github.com/zhiqi-li/Panoptic-SegFormer)]
* **kMaX-DeepLab**: "k-means Mask Transformer", ECCV, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2207.04044)][[Tensorflow](https://github.com/google-research/deeplab2)]
* **Panoptic-PartFormer**: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (*Peking*). [[Paper](https://arxiv.org/abs/2204.04655)][[PyTorch](https://github.com/lxtGH/Panoptic-PartFormer)]
* **CoMFormer**: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 (*Sorbonne Université, France*). [[Paper](https://arxiv.org/abs/2211.13999)]
* **YOSO**: "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2303.14651)][[PyTorch](https://github.com/hujiecpp/YOSO)]
* **Pix2Seq-D**: "A Generalist Framework for Panoptic Segmentation of Images and Videos", ICCV, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2210.06366)][[Tensorflow2](https://github.com/google-research/pix2seq)]
* **DeepDPS**: "Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning", ICCV, 2023 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2307.14786)][[Code (in construction)](https://github.com/jwh97nn/DeepDPS)]
* **ReMaX**: "ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation", NeurIPS, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.17319)][[Tensorflow2](https://github.com/google-research/deeplab2)]
* **PanopticPartFormer++**: "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2204.04655)][[PyTorch](https://github.com/lxtGH/Panoptic-PartFormer)]
* **MaXTron**: "MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation", arXiv, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2311.18537)]
* **ECLIPSE**: "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning", CVPR, 2024 (*NAVER*). [[Paper](https://arxiv.org/abs/2403.20126)][[Code (in construction)](https://github.com/clovaai/ECLIPSE)]
* Instance Segmentation:
* **ISTR**: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2105.00637)][[PyTorch](https://github.com/hujiecpp/ISTR)]
* **Mask-Transfiner**: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2111.13673)][[PyTorch](https://github.com/SysCV/transfiner)][[Website](https://www.vis.xyz/pub/transfiner/)]
* **BoundaryFormer**: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (*UCSD*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Lazarow_Instance_Segmentation_With_Mask-Supervised_Polygonal_Boundary_Transformers_CVPR_2022_paper.html)]
* **PPT**: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2206.10845)]
* **TOIST**: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2210.10775)][[PyTorch](https://github.com/AIR-DISCOVER/TOIST)]
* **MAL**: "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2301.03992)][[PyTorch](https://github.com/NVlabs/mask-auto-labeler)]
* **FastInst**: "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2303.08594)][[PyTorch](https://github.com/junjiehe96/FastInst)]
* **SP**: "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 (*Northwestern Polytechnical University, China*). [[Paper](https://arxiv.org/abs/2302.01171)]
* **X-Paste**: "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2212.03863)][[PyTorch](https://github.com/yoctta/XPaste)]
* **DynaMITe**: "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", ICCV, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2304.06668)][[PyTorch](https://github.com/sabarim/dynamite/)][[Website](https://sabarim.github.io/dynamite/)]
* **Mask-Frozen-DETR**: "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2308.03747)]
* Optical Flow:
* **CRAFT**: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (*A\*STAR, Singapore*). [[Paper](https://arxiv.org/abs/2203.16896)][[PyTorch](https://github.com/askerlee/craft)]
* **KPA-Flow**: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (*Megvii*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Luo_Learning_Optical_Flow_With_Kernel_Patch_Attention_CVPR_2022_paper.html)][[PyTorch (in construction)](https://github.com/megvii-research/KPAFlow)]
* **GMFlowNet**: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (*Rutgers*). [[Paper](https://arxiv.org/abs/2203.11335)][[PyTorch](https://github.com/xiaofeng94/GMFlowNet)]
* **FlowFormer**: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2203.16194)][[Website](https://drinkingcoder.github.io/publication/flowformer/)]
* **TransFlow**: "TransFlow: Transformer as Flow Learner", CVPR, 2023 (*Rochester Institute of Technology*). [[Paper](https://arxiv.org/abs/2304.11523)]
* **FlowFormer++**: "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2303.01237)]
* **FlowFormer**: "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2306.05442)]
* Panoramic Semantic Segmentation:
* **Trans4PASS**: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2203.01452)][[PyTorch](https://github.com/jamycheung/Trans4PASS)]
* **SGAT4PASS**: "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2306.03403)][[Code (in construction)](https://github.com/TencentARC/SGAT4PASS)]
* X-Shot (see the sketch after this sub-list):
* **CyCTR**: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (*University of Technology Sydney*). [[Paper](https://arxiv.org/abs/2106.02320)]
* **CATrans**: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2204.12817)]
* **VAT**: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2207.10866)][[PyTorch](https://github.com/Seokju-Cho/Volumetric-Aggregation-Transformer)][[Website](https://seokju-cho.github.io/VAT/)]
* **DCAMA**: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2207.08549)]
* **AAFormer**: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (*USTC*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/1397_ECCV_2022_paper.php)]
* **IPMT**: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2210.06780)][[PyTorch](https://github.com/LIUYUANWEI98/IPMT)]
* **TAFT**: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2202.06498)]
* **MSANet**: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (*AiV Research Group, Korea*). [[Paper](https://arxiv.org/abs/2206.09667)][[PyTorch](https://github.com/AIVResearch/MSANet)]
* **MuHS**: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 (*Zhejiang University*). [[Paper](https://openreview.net/forum?id=CGuvK3U09LH)]
* **VTM**: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2303.14969)][[PyTorch](https://github.com/GitGyun/visual_token_matching)]
* **SegGPT**: "SegGPT: Segmenting Everything In Context", ICCV, 2023 (*BAAI*). [[Paper](https://arxiv.org/abs/2304.03284)][[PyTorch](https://github.com/baaivision/Painter)]
* **AMFormer**: "Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation", NeurIPS, 2023 (*ISTC*). [[Paper](https://arxiv.org/abs/2311.17626)][[Code (in construction)](https://github.com/Wyxdm/AMNet)]
* **RefT**: "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2301.01156)][[Code (in construction)](https://github.com/hanyue1648/RefT)]
* **?**: "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2307.02003)]
* **SPINO**: "Few-Shot Panoptic Segmentation With Foundation Models", arXiv, 2023 (*University of Freiburg, Germany*). [[Paper](https://arxiv.org/abs/2309.10726)][[Website](http://spino.cs.uni-freiburg.de/)]
* **?**: "Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach", CVPR, 2024 (*UBC*). [[Paper](https://arxiv.org/abs/2404.11732)]
* **RefLDM-Seg**: "Explore In-Context Segmentation via Latent Diffusion Models", arXiv, 2024 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2403.09616)][[Code (in construction)](https://github.com/wang-chaoyang/RefLDMSeg)][[Website](https://wang-chaoyang.github.io/project/refldmseg/)]
* **Chameleon**: "Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild", arXiv, 2024 (*KAIST*). [[Paper](https://arxiv.org/abs/2404.18459)]
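
A large fraction of the few-shot methods above build on the same support-to-query matching idea, whether through prototypes or dense cross-attention between support and query features. Below is a minimal prototype-matching sketch in plain PyTorch; random tensors stand in for backbone features, and the 0.5 threshold is purely illustrative, so it does not reproduce any specific paper's method.

```python
import torch
import torch.nn.functional as F

D, H, W = 256, 60, 60
support_feat = torch.randn(D, H, W)                  # backbone features of the support image
support_mask = (torch.rand(H, W) > 0.5).float()      # its annotated binary mask
query_feat = torch.randn(D, H, W)                    # backbone features of the query image

# Masked average pooling over the support foreground -> one class prototype.
prototype = (support_feat * support_mask).sum(dim=(1, 2)) / support_mask.sum().clamp(min=1.0)

# Cosine similarity between the prototype and every query location,
# thresholded into a foreground prediction for the novel class.
sim = F.cosine_similarity(query_feat, prototype[:, None, None].expand_as(query_feat), dim=0)
pred_mask = (sim > 0.5).float()                      # (H, W)
print(sim.shape, pred_mask.mean().item())
```
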
* X-Supervised (see the sketch after this sub-list):
* **MCTformer**: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (*The University of Western Australia*). [[Paper](https://arxiv.org/abs/2203.02891)][[Code (in construction)](https://github.com/xulianuwa/MCTformer)]
* **AFA**: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2203.02664)][[PyTorch](https://github.com/rulixiang/afa)]
* **HSG**: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (*Berkeley*). [[Paper](https://arxiv.org/abs/2204.11432)][[PyTorch](https://github.com/twke18/HSG)]
* **CLIMS**: "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 (*Shenzhen University*). [[Paper](https://arxiv.org/abs/2203.02668)][[PyTorch](https://github.com/CVI-SZU/CLIMS)]
* **?**: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (*Université Paris-Saclay, France*). [[Paper](https://arxiv.org/abs/2205.15173)]
* **SegSwap**: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (*École des Ponts ParisTech*). [[Paper](https://arxiv.org/abs/2110.15904)][[PyTorch](https://github.com/XiSHEN0220/SegSwap)][[Website](http://imagine.enpc.fr/~shenx/SegSwap/)]
* **ViT-PCM**: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (*Sapienza University, Italy*). [[Paper](https://arxiv.org/abs/2210.17400)][[Tensorflow](https://github.com/deepplants/ViT-PCM)]
* **TransFGU**: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2112.01515)][[PyTorch](https://github.com/damo-cv/TransFGU)]
* **TransCAM**: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*University of Toronto*). [[Paper](https://arxiv.org/abs/2203.07239)][[PyTorch](https://github.com/liruiwen/TransCAM)]
* **WegFormer**: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*Tongji University, China*). [[Paper](https://arxiv.org/abs/2203.08421)]
* **MaskDistill**: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (*KU Leuven*). [[Paper](https://arxiv.org/abs/2206.06363)][[PyTorch](https://github.com/wvangansbeke/MaskDistill)]
* **eX-ViT**: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*La Trobe University, Australia*). [[Paper](https://arxiv.org/abs/2207.05358)]
* **TCC**: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2209.02178)]
* **SemFormer**: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (*Shenzhen University*). [[Paper](https://arxiv.org/abs/2210.14618)][[PyTorch](https://github.com/JLChen-C/SemFormer)]
* **CLIP-ES**: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2212.09506)][[PyTorch](https://github.com/linyq2117/CLIP-ES)]
* **ToCo**: "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 (*JD*). [[Paper](https://arxiv.org/abs/2303.01267)][[PyTorch](https://github.com/rulixiang/ToCo)]
* **DPF**: "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2303.16890)][[PyTorch](https://github.com/cxx226/DPF)]
* **SemiCVT**: "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 (*Zhejiang University*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Huang_SemiCVT_Semi-Supervised_Convolutional_Vision_Transformer_for_Semantic_Segmentation_CVPR_2023_paper.html)]
* **AttentionShift**: "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 (*CAS*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Liao_AttentionShift_Iteratively_Estimated_Part-Based_Attention_Map_for_Pointly_Supervised_Instance_CVPR_2023_paper.html)]
* **MMCST**: "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 (*The University of Western Australia*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Xu_Learning_Multi-Modal_Class-Specific_Tokens_for_Weakly_Supervised_Dense_Object_Localization_CVPR_2023_paper.html)]
* **SimSeg**: "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 (*ByteDance*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Yi_A_Simple_Framework_for_Text-Supervised_Semantic_Segmentation_CVPR_2023_paper.html)][[Code (in construction)](https://github.com/muyangyi/SimSeg)]
* **SIM**: "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2303.08578)][[PyTorch (in construction)](https://github.com/lslrh/SIM)]
* **Point2Mask**: "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.01779)][[PyTorch](https://github.com/LiWentomng/Point2Mask)]
* **BoxSnake**: "BoxSnake: Polygonal Instance Segmentation with Box Supervision", ICCV, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2303.11630)]
* **QA-CLIMS**: "Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation", ACMMM, 2023 (*Shenzhen University*). [[Paper](https://arxiv.org/abs/2401.09883)][[Code (in construction)](https://github.com/CVI-SZU/QA-CLIMS)]
* **CoCu**: "Bridging Semantic Gaps for Language-Supervised Semantic Segmentation", NeurIPS, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2309.13505)][[PyTorch](https://github.com/xing0047/CoCu)]
* **APro**: "Label-efficient Segmentation via Affinity Propagation", NeurIPS, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2310.10533)][[PyTorch](https://github.com/CircleRadon/APro)][[Website](https://liwentomng.github.io/apro/)]
* **PaintSeg**: "PaintSeg: Training-free Segmentation via Painting", NeurIPS, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2305.19406)]
* **SmooSeg**: "SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation", NeurIPS, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2310.17874)][[PyTorch](https://github.com/mc-lan/SmooSeg)]
* **VLOSS**: "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2303.06547)]
* **MECPformer**: "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Tongji University*). [[Paper](https://arxiv.org/abs/2303.10689)][[Code (in construction)](https://github.com/ChunmengLiu1/MECPformer)]
* **WeakTr**: "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2304.01184)][[PyTorch](https://github.com/hustvl/WeakTr)]
* **SAM-WSSS**: "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 (*ANU*). [[Paper](https://arxiv.org/abs/2305.01586)]
* **?**: "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang University + Nankai University*). [[Paper](https://arxiv.org/abs/2305.01275)]
* **AReAM**: "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2305.03112)]
* **SEPL**: "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*OSU*). [[Paper](https://arxiv.org/abs/2305.05803)][[Code (in construction)](https://github.com/cskyl/SAM_WSSS)]
* **MIMIC**: "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 (*UW*). [[Paper](https://arxiv.org/abs/2306.15128)][[PyTorch](https://github.com/RAIVNLab/MIMIC)]
* **POLE**: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 (*ETS Montreal, Canada*). [[Paper](https://arxiv.org/abs/2307.00097)][[PyTorch](https://github.com/rB080/WSS_POLE)]
* **GD**: "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2308.02668)]
* **MCTformer+**: "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*The University of Western Australia*). [[Paper](https://arxiv.org/abs/2308.03005)][[PyTorch](https://github.com/xulianuwa/MCTformer)]
* **MMC**: "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2308.11448)]
* **CRATE**: "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2308.16271)][[PyTorch](https://github.com/Ma-Lab-Berkeley/CRATE)]
* **?**: "Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models", arXiv, 2023 (*Singapore Management University*). [[Paper](https://arxiv.org/abs/2310.13026)]
* **MCC**: "Masked Collaborative Contrast for Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang Lab, China*). [[Paper](https://arxiv.org/abs/2305.08491)][[PyTorch](https://github.com/fwu11/MCC)]
* **CRATE**: "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2311.13110)][[PyTorch](https://github.com/Ma-Lab-Berkeley/CRATE)][[Website](https://ma-lab-berkeley.github.io/CRATE/)]
* **SAMS**: "Foundation Model Assisted Weakly Supervised Semantic Segmentation", arXiv, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2312.03585)]
* **SemiVL**: "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2311.16241)][[PyTorch](https://github.com/google-research/semivl)]
* **Self-reinforcement**: "Progressive Uncertain Feature Self-reinforcement for Weakly Supervised Semantic Segmentation", AAAI, 2024 (*Zhejiang Lab*). [[Paper](https://arxiv.org/abs/2312.08916)][[PyTorch](https://github.com/Jessie459/feature-self-reinforcement)]
* **FeatUp**: "FeatUp: A Model-Agnostic Framework for Features at Any Resolution", ICLR, 2024 (*MIT*). [[Paper](https://arxiv.org/abs/2403.10516)]
* **Zip-Your-CLIP**: "The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models", ICLR, 2024 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2404.11957)][[PyTorch](https://github.com/ChengShiest/Zip-Your-CLIP)]
* **SeCo**: "Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation", CVPR, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2402.18467)][[Code (in construction)](https://github.com/zwyang6/SeCo)]
* **AllSpark**: "AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation", CVPR, 2024 (*HKUST*). [[Paper](https://arxiv.org/abs/2403.01818)][[PyTorch](https://github.com/xmed-lab/AllSpark)]
* **CPAL**: "Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation", CVPR, 2024 (*Monash University*). [[Paper](https://arxiv.org/abs/2403.07630)][[Code (in construction)](https://github.com/Barrett-python/CPAL)]
* **DuPL**: "DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation", CVPR, 2024 (*Shanghai University*). [[Paper](https://arxiv.org/abs/2403.11184)][[PyTorch](https://github.com/Wu0409/DuPL)]
* **CoDe**: "Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation", CVPR, 2024 (*NTU*). [[Paper](https://arxiv.org/abs/2404.04231)][[Code (in construction)](https://github.com/072jiajia/image-text-co-decomposition)]
* **SemPLeS**: "SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2401.11791)]
* **WeakSAM**: "WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition", arXiv, 2024 (*Huazhong University of Science & Technology (HUST)*). [[Paper](https://arxiv.org/abs/2402.14812)][[PyTorch](https://github.com/hustvl/WeakSAM)]
* **CoSA**: "Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation", arXiv, 2024 (*Lancaster University, UK*). [[Paper](https://arxiv.org/abs/2402.17891)][[Code (in construction)](https://github.com/youshyee/CoSA)]
* **CoBra**: "CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation", arXiv, 2024 (*Yonsei*). [[Paper](https://arxiv.org/abs/2403.08801)]
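
Most of the weakly supervised entries above (MCTformer, AFA, TransCAM, ToCo, WeakTr, ...) start from the same ingredient: class activation maps derived from a transformer classifier trained with image-level labels only, later refined into pseudo masks. The sketch below illustrates that first step in plain PyTorch; random tensors replace a trained ViT and its linear classifier, and the 14×14 grid, 0.4 threshold, and the ignore value of 255 are assumptions for illustration, not any paper's settings.

```python
import torch
import torch.nn.functional as F

grid, dim, num_classes = 14, 384, 20
patch_tokens = torch.randn(grid * grid, dim)        # ViT patch tokens for one image
cls_weight = torch.randn(num_classes, dim)          # weights of the image-level classifier
image_labels = torch.tensor([3, 7])                 # classes known present (image-level labels)

# Class activation maps: score every patch against every class, reshape to a grid,
# rectify and normalize per class, then upsample to image resolution.
cams = (patch_tokens @ cls_weight.t()).t().reshape(num_classes, grid, grid)
cams = F.relu(cams)
cams = cams / cams.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
cams = F.interpolate(cams[None], size=(224, 224), mode="bilinear", align_corners=False)[0]

# Keep only the classes given by the image-level labels and threshold the maps
# into a pseudo semantic mask (255 marks unassigned/ignored pixels here).
fg = cams[image_labels]                             # (num_labels, 224, 224)
pseudo = torch.where(fg.amax(dim=0) > 0.4,
                     image_labels[fg.argmax(dim=0)],
                     torch.full((), 255, dtype=torch.long))
print(pseudo.shape, pseudo.unique())
```
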
* Cross-Domain:
* **DAFormer**: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2111.14887)][[PyTorch](https://github.com/lhoyer/DAFormer)]
* **HGFormer**: "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2305.13031)][[Code (in construction)](https://github.com/dingjiansw101/HGFormer)]
* **UniDAformer**: "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2206.15083)]
* **MIC**: "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2212.01322)][[PyTorch](https://github.com/lhoyer/MIC)]
* **CDAC**: "CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation", ICCV, 2023 (*Boston*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Wang_CDAC_Cross-domain_Attention_Consistency_in_Transformer_for_Domain_Adaptive_Semantic_ICCV_2023_paper.html)][[PyTorch](https://github.com/wangkaihong/CDAC)]
* **EDAPS**: "EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation", ICCV, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2304.14291)][[PyTorch](https://github.com/susaha/edaps)]
* **PTDiffSeg**: "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2307.02138)][[Code (in construction)](https://github.com/ETHRuiGong/PTDiffSeg)]
* **Rein**: "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation", arXiv, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2312.04265)]
* Continual Learning:
* **TISS**: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2211.10253)]
* **Incrementer**: "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 (*University of Electronic Science and Technology of China*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Shang_Incrementer_Transformer_for_Class-Incremental_Semantic_Segmentation_With_Knowledge_Distillation_Focusing_CVPR_2023_paper.html)]
* Crack Detection:
* **CrackFormer**: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (*Nanjing University of Science and Technology*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_CrackFormer_Transformer_Network_for_Fine-Grained_Crack_Detection_ICCV_2021_paper.html)]
* Camouflaged/Concealed Object:
* **UGTR**: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (*Group42, Abu Dhabi*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Yang_Uncertainty-Guided_Transformer_Reasoning_for_Camouflaged_Object_Detection_ICCV_2021_paper.html)][[PyTorch](https://github.com/fanyang587/UGTR)]
* **COD**: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (*Anhui University, China*). [[Paper](https://arxiv.org/abs/2205.10579)][[Code (in construction)](https://github.com/liuzywen/COD)]
* **OSFormer**: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2207.02255)][[PyTorch](https://github.com/PJLallen/OSFormer)]
* **FSPNet**: "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 (*Sichuan Changhong Electric, China*). [[Paper](https://arxiv.org/abs/2303.14816)][[PyTorch](https://github.com/ZhouHuang23/FSPNet)][[Website](https://tzxiang.github.io/project/COD-FSPNet/index.html)]
* **MFG**: "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", NeurIPS, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2305.11003)][[Code (in construction)](https://github.com/ChunmingHe/WS-SAM)]
* Background Separation:
* **TransBlast**: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (*University of British Columbia*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021W/RSLCV/html/Osman_TransBlast_Self-Supervised_Learning_Using_Augmented_Subspace_With_Transformer_for_BackgroundForeground_ICCVW_2021_paper.html)]
* Scene Understanding:
* **BANet**: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2106.12413)]
* **Cerberus-Transformer**: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2111.12608)][[PyTorch](https://github.com/OPEN-AIR-SUN/Cerberus)]
* **IRISformer**: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (*UCSD*). [[Paper](https://arxiv.org/abs/2206.08423)][[Code (in construction)](https://github.com/ViLab-UCSD/IRISformer)]
* 3D Segmentation:
* **Stratified-Transformer**: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2203.14508)][[PyTorch](https://github.com/dvlab-research/Stratified-Transformer)]
* **CodedVTR**: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2203.09887)]
* **M2F3D**: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (*RWTH Aachen University, Germany*). [[Paper](https://jonasschult.github.io/Mask3D/assets/workshop_paper.pdf)][[Website](https://jonasschult.github.io/Mask3D/)]
* **3DSeg**: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 (*The University of Tokyo*). [[Paper](https://openreview.net/forum?id=4dZeBJ83oxk)]
* **Analogical-Network**: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 (*CMU*). [[Paper](https://openreview.net/forum?id=SRIQZTh0IK)]
* **VoxFormer**: "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2302.12251)][[PyTorch](https://github.com/NVlabs/VoxFormer)]
* **GrowSP**: "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 (*The Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2305.16404)][[PyTorch](https://github.com/vLAR-group/GrowSP)]
* **RangeViT**: "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 (*Valeo.ai, France*). [[Paper](https://arxiv.org/abs/2301.10222)][[Code (in construction)](https://github.com/valeoai/rangevit)]
* **MeshFormer**: "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 (*University of Macau*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Wong_Heat_Diffusion_Based_Multi-Scale_and_Geometric_Structure-Aware_Transformer_for_Mesh_CVPR_2023_paper.html)]
* **MSeg3D**: "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2303.08600)][[PyTorch](https://github.com/jialeli1/lidarseg3d)]
* **SGVF-SVFE**: "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2307.10782)]
* **SVQNet**: "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2308.13323)]
* **MAF-Transformer**: "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2309.01692)][[PyTorch](https://github.com/dvlab-research/Mask-Attention-Free-Transformer)]
* **UniSeg**: "UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2309.05573)][[PyTorch](https://github.com/PJLab-ADG/PCSeg)]
* **MIT**: "2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision", ICCV, 2023 (*NTU*). [[Paper](https://openaccess.thecvf.com/content/ICCV2023/html/Yang_2D-3D_Interlaced_Transformer_for_Point_Cloud_Segmentation_with_Scene-Level_Supervision_ICCV_2023_paper.html)]
* **CVSformer**: "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", ICCV, 2023 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2307.07938)]
* **SPT**: "Efficient 3D Semantic Segmentation with Superpoint Transformer", ICCV, 2023 (*Univ Gustave Eiffel, France*). [[Paper](https://arxiv.org/abs/2306.08045)][[PyTorch](https://github.com/drprojects/superpoint_transformer)]
* **SATR**: "SATR: Zero-Shot Semantic Segmentation of 3D Shapes", ICCV, 2023 (*KAUST*). [[Paper](https://arxiv.org/abs/2304.04909)][[PyTorch](https://github.com/Samir55/SATR)][[Website](https://samir55.github.io/SATR/)]
* **3D-OWIS**: "3D Indoor Instance Segmentation in an Open-World", NeurIPS, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2309.14338)]
* **SA3D**: "Segment Anything in 3D with NeRFs", NeurIPS, 2023 (*SJTU*). [[Paper](https://arxiv.org/abs/2304.12308)][[PyTorch](https://github.com/Jumpat/SegmentAnythingin3D)][[Website](https://jumpat.github.io/SA3D/)]
* **Contrastive-Lift**: "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", NeurIPS, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2306.04633)][[PyTorch](https://github.com/yashbhalgat/Contrastive-Lift)][[Website](https://www.robots.ox.ac.uk/~vgg/research/contrastive-lift/)]
* **P3Former**: "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.13509)][[Code (in construction)](https://github.com/SmartBot-PJLab/P3Former)]
* **UnScene3D**: "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 (*TUM*). [[Paper](https://arxiv.org/abs/2303.14541)][[Website](https://rozdavid.github.io/unscene3d)]
* **CNS**: "Towards Label-free Scene Understanding by Vision Foundation Models", NeurIPS, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2306.03899)][[Code (in construction)](https://github.com/runnanchen/Label-Free-Scene-Understanding)]
* **DCTNet**: "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 (*University of Waterloo, Canada*). [[Paper](https://arxiv.org/abs/2306.08073)]
* **Symphonies**: "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 (*Horizon Robotics*). [[Paper](https://arxiv.org/abs/2306.15670)][[PyTorch](https://github.com/hustvl/Symphonies)]
* **TFS3D**: "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.12961)][[PyTorch](https://github.com/yangyangyang127/TFS3D)]
* **CIP-WPIS**: "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 (*Australian National University*). [[Paper](https://arxiv.org/abs/2309.00828)]
* **?**: "SAM-guided Unsupervised Domain Adaptation for 3D Segmentation", arXiv, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2310.08820)]
* **CSF**: "Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation", arXiv, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2311.01989)]
* **?**: "Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2311.14665)]
* **OneFormer3D**: "OneFormer3D: One Transformer for Unified Point Cloud Segmentation", arXiv, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2311.14405)]
* **SAGA**: "Segment Any 3D Gaussians", arXiv, 2023 (*SJTU*). [[Paper](https://arxiv.org/abs/2312.00860)][[Code (in construction)](https://github.com/Jumpat/SegAnyGAussians)][[Website](https://jumpat.github.io/SAGA/)]
* **SANeRF-HQ**: "SANeRF-HQ: Segment Anything for NeRF in High Quality", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2312.01531)][[Code (in construction)](https://github.com/lyclyc52/SANeRF-HQ)][[Website](https://lyclyc52.github.io/SANeRF-HQ/)]
* **SAM-Graph**: "SAM-guided Graph Cut for 3D Instance Segmentation", arXiv, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2312.08372)][[Code (in construction)](https://github.com/zju3dv/SAM-Graph)][[Website](https://zju3dv.github.io/sam_graph/)]
* **SAI3D**: "SAI3D: Segment Any Instance in 3D Scenes", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2312.11557)]
* **COSeg**: "Rethinking Few-shot 3D Point Cloud Semantic Segmentation", CVPR, 2024 (*ETHZ*). [[Paper](https://arxiv.org/abs/2403.00592)][[Code (in construction)](https://github.com/ZhaochongAn/COSeg)]
* **CSC**: "Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception", CVPR, 2024 (*East China Normal University*). [[Paper](https://arxiv.org/abs/2405.07201)][[Code (in construction)](https://github.com/chenhaomingbob/CSC)]
* Multi-Task:
* **InvPT**: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (*HKUST*). [[Paper](https://arxiv.org/abs/2203.07997)][[PyTorch](https://github.com/prismformore/Multi-Task-Transformer)]
* **MTFormer**: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (*CUHK*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/1353_ECCV_2022_paper.php)]
* **MQTransformer**: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2205.14354)]
* **DeMT**: "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2301.03461)][[PyTorch](https://github.com/yangyangxu0/DeMT)]
* **TaskPrompter**: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 (*HKUST*). [[Paper](https://openreview.net/forum?id=-CwPopPJda)][[PyTorch (in construction)](https://github.com/prismformore/Multi-Task-Transformer)]
* **AiT**: "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token", ICCV, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2301.02229)][[PyTorch](https://github.com/SwinTransformer/AiT)]
* **InvPT++**: "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2306.04842)]
* **DeMTG**: "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2308.05721)][[PyTorch](https://github.com/yangyangxu0/DeMTG)]
* **SRT**: "Sub-token ViT Embedding via Stochastic Resonance Transformers", arXiv, 2023 (*UCLA*). [[Paper](https://arxiv.org/abs/2310.03967)]
* **MLoRE**: "Multi-Task Dense Prediction via Mixture of Low-Rank Experts", CVPR, 2024 (*vivo*). [[Paper](https://arxiv.org/abs/2403.17749)]
* **ODIN**: "ODIN: A Single Model for 2D and 3D Perception", arXiv, 2024 (*CMU*). [[Paper](https://arxiv.org/abs/2401.02416)][[Code (in construction)](https://github.com/ayushjain1144/odin)][[Website](https://odin-seg.github.io/)]
* **LiFT**: "LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors", arXiv, 2024 (*Maryland*). [[Paper](https://arxiv.org/abs/2403.14625)]
* Forecasting:
* **DiffAttn**: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (*UIUC*). [[Paper](https://arxiv.org/abs/2204.07157)][[Code (in construction)](https://github.com/cgraber/psf-diffattn)]
* LiDAR:
* **HelixNet**: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (*CNRS, France*). [[Paper](https://arxiv.org/abs/2206.08194)][[Website](https://romainloiseau.fr/helixnet/)][[PyTorch](https://github.com/romainloiseau/Helix4D)]
* **Gaussian-Radar-Transformer**: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 (*University of Bonn, Germany*). [[Paper](https://arxiv.org/abs/2212.03690)]
* **MOST**: "Lidar Panoptic Segmentation and Tracking without Bells and Whistles", IROS, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2310.12464)][[PyTorch](https://github.com/abhinavagarwalla/most-lps)]
* **4D-Former**: "4D-Former: Multimodal 4D Panoptic Segmentation", CoRL, 2023 (*Waabi, Canada*). [[Paper](https://arxiv.org/abs/2311.01520)][[Website](https://waabi.ai/4d-former/)]
* **MASK4D**: "MASK4D: Mask Transformer for 4D Panoptic Segmentation", arXiv, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2309.16133)]
* Co-Segmentation:
* **ReCo**: "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2206.07045)][[PyTorch](https://github.com/NoelShin/reco)][[Website](https://www.robots.ox.ac.uk/~vgg/research/reco/)]
* **DINO-ViT-feature**: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (*Weizmann Institute of Science, Israel*). [[Paper](https://arxiv.org/abs/2112.05814)][[PyTorch](https://github.com/ShirAmir/dino-vit-features)][[Website](https://dino-vit-features.github.io/)]
* **LCCo**: "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2308.11506)]
* Top-Down Semantic Segmentation:
* **Trans4Map**: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2207.06205)]
* Surface Normal:
* **Normal-Transformer**: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 (*University of Technology Sydney*). [[Paper](https://arxiv.org/abs/2211.10580)]
* Applications:
* **FloodTransformer**: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (*BITS Pilani, India*). [[Paper](https://arxiv.org/abs/2210.04218)]
* Diffusion:
* **VPD**: "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2303.02153)][[PyTorch](https://github.com/wl-zhao/VPD)][[Website](https://vpd.ivg-research.xyz/)]
* **Dataset-Diffusion**: "Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation", NeurIPS, 2023 (*VinAI, Vietnam*). [[Paper](https://arxiv.org/abs/2309.14303)][[PyTorch](https://github.com/VinAIResearch/Dataset-Diffusion)][[Website](https://dataset-diffusion.github.io/)]
* **SegRefiner**: "SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process", NeurIPS, 2023 (*ByteDance*). [[Paper](https://arxiv.org/abs/2312.12425)][[PyTorch](https://github.com/MengyuWang826/SegRefiner)]
* **DatasetDM**: "DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models", NeurIPS, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.06160)][[PyTorch](https://github.com/showlab/DatasetDM)][[Website](https://weijiawu.github.io/DatasetDM_page/)]
* **DiffSeg**: "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2308.12469)]
* **DiffSegmenter**: "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 (*Beihang University*). [[Paper](https://arxiv.org/abs/2309.02773)]
* **?**: "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2309.04109)]
* **LDMSeg**: "A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting", arXiv, 2024 (*Segments.ai, Belgium*). [[Paper](https://arxiv.org/abs/2401.10227)][[PyTorch](https://github.com/segments-ai/latent-diffusion-segmentation)]
* Low-Level Structure Segmentation:
* **EVP**: "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023. (*Tencent*). [[Paper](https://arxiv.org/abs/2303.10883)][[PyTorch](https://github.com/NiFangBaAGe/Explict-Visual-Prompt)]
* **EVP**: "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2305.18476)][[PyTorch](https://github.com/NiFangBaAGe/Explict-Visual-Prompt)]
* **EmerDiff**: "EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2401.11739)][[Website](https://kmcode1.github.io/Projects/EmerDiff/)]
* Zero-Guidance Segmentation:
* **zero-guide-seg**: "Zero-guidance Segmentation Using Zero Segment Labels", arXiv, 2023 (*VISTEC, Thailand*). [[Paper](https://arxiv.org/abs/2303.13396)][[Website](https://zero-guide-seg.github.io/)]
* Part Segmentation:
* **OPS**: "Towards Open-World Segmentation of Parts", CVPR, 2023 (*Adobe*). [[Paper](https://arxiv.org/abs/2305.16804)][[PyTorch](https://github.com/tydpan/OpenPartSeg)]
* **PartDistillation**: "PartDistillation: Learning Parts from Instance Segmentation", CVPR, 2023 (*Meta*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Cho_PartDistillation_Learning_Parts_From_Instance_Segmentation_CVPR_2023_paper.html)]
* Entity Segmentation:
* **AIMS**: "AIMS: All-Inclusive Multi-Level Segmentation", NeurIPS, 2023 (*UC Merced*). [[Paper](https://arxiv.org/abs/2305.17768)][[PyTorch](https://github.com/dvlab-research/Entity)]
* **SOHES**: "SOHES: Self-supervised Open-world Hierarchical Entity Segmentation", ICLR, 2024 (*Adobe*). [[Paper](https://arxiv.org/abs/2404.12386)][[Website](https://sohes.github.io/)]
* Evaluation:
* **?**: "Robustness Analysis on Foundational Segmentation Models", arXiv, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2306.09278)][[PyTorch](https://github.com/DeepLearningRobustnessStudies/SegmetationRobustness)]
* Interactive Segmentation:
* **InterFormer**: "InterFormer: Real-time Interactive Image Segmentation", ICCV, 2023 (*Xiamen University*). [[Paper](https://arxiv.org/abs/2304.02942)][[PyTorch](https://github.com/YouHuang67/InterFormer)]
* **SimpleClick**: "SimpleClick: Interactive Image Segmentation with Simple Vision Transformers", ICCV, 2023 (*UNC*). [[Paper](https://arxiv.org/abs/2210.11006)][[PyTorch](https://github.com/uncbiag/SimpleClick)]
* **iCMFormer**: "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 (*University of Twente, Netherlands*). [[Paper](https://arxiv.org/abs/2307.02280)][[Code (in construction)](https://github.com/lik1996/iCMFormer)]
* **MFP**: "MFP: Making Full Use of Probability Maps for Interactive Image Segmentation", CVPR, 2024 (*Korea University*). [[Paper](https://arxiv.org/abs/2404.18448)][[Code (in construction)](https://github.com/cwlee00/MFP)]
* **GraCo**: "GraCo: Granularity-Controllable Interactive Segmentation", CVPR, 2024 (*Peking*). [[Paper](https://arxiv.org/abs/2405.00587)][[Website](https://zhao-yian.github.io/GraCo/)]
* Amodal Segmentation:
* **AISFormer**: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (*University of Arkansas*). [[Paper](https://arxiv.org/abs/2210.06323)][[PyTorch](https://github.com/UARK-AICV/AISFormer)]
* **C2F-Seg**: "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.16825)][[Code (in construction)](https://github.com/JianxGao/C2F-Seg)][[Website](https://jianxgao.github.io/C2F-Seg/)]
* **EoRaS**: "Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2309.13248)][[Code (in construction)](https://github.com/kfan21/EoRaS)]
* **MP3D-Amodal**: "Amodal Ground Truth and Completion in the Wild", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2312.17247)][[Website (in construction)](https://www.robots.ox.ac.uk/~vgg/research/amodal/)]
* Anomaly Segmentation:
* **Mask2Anomaly**: "Unmasking Anomalies in Road-Scene Segmentation", ICCV, 2023 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2307.13316)][[PyTorch](https://github.com/shyam671/Mask2Anomaly-Unmasking-Anomalies-in-Road-Scene-Segmentation)]
* In-Context Segmentation:
* **SEGIC**: "SegIC: Unleashing the Emergent Correspondence for In-Context Segmentation", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2311.14671)][[Code (in construction)](https://github.com/MengLcool/SEGIC)]

[[Back to Overview](#overview)]

## Video (High-level)
### Action Recognition
* RGB mainly:
* **Action Transformer**: "Video Action Transformer Network", CVPR, 2019 (*DeepMind*). [[Paper](https://arxiv.org/abs/1812.02707)][[Code (ppriyank)](https://github.com/ppriyank/Video-Action-Transformer-Network-Pytorch-)]
* **ViViT-Ensemble**: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2106.05058)]
* **TimeSformer**: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2102.05095)][[PyTorch (lucidrains)](https://github.com/lucidrains/TimeSformer-pytorch)]
* **MViT**: "Multiscale Vision Transformers", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2104.11227)][[PyTorch](https://github.com/facebookresearch/SlowFast)]
* **VidTr**: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2104.11746)][[PyTorch](https://github.com/amazon-research/gluonmm)]
* **ViViT**: "ViViT: A Video Vision Transformer", ICCV, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2103.15691)][[PyTorch (rishikksh20)](https://github.com/rishikksh20/ViViT-pytorch)]
* **VTN**: "Video Transformer Network", ICCVW, 2021 (*Theator*). [[Paper](https://arxiv.org/abs/2102.00719)][[PyTorch](https://github.com/bomri/SlowFast/tree/master/projects/vtn)]
* **TokShift**: "Token Shift Transformer for Video Classification", ACMMM, 2021 (*CUHK*). [[Paper](https://arxiv.org/abs/2108.02432)][[PyTorch](https://github.com/VideoNetworks/TokShift-Transformer)]
* **Motionformer**: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.05392)][[PyTorch](https://github.com/facebookresearch/Motionformer)][[Website](https://facebookresearch.github.io/Motionformer/)]
* **X-ViT**: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (*Samsung*). [[Paper](https://arxiv.org/abs/2106.05968)][[PyTorch](https://github.com/1adrianb/video-transformers)]
* **SCT**: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (*Kuaishou*). [[Paper](https://arxiv.org/abs/2108.11575)]
* **RSANet**: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (*POSTECH*). [[Paper](https://arxiv.org/abs/2111.01673)][[PyTorch](https://github.com/KimManjin/RSA)][[Website](http://cvlab.postech.ac.kr/research/RSA/)]
* **STAM**: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2103.13915)][[Code](https://github.com/Alibaba-MIIL/STAM)]
* **GAT**: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (*Samsung*). [[Paper](https://arxiv.org/abs/2103.10043)]
* **TokenLearner**: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2106.11297)]
* **VLF**: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (*The University of Sheffield*). [[Paper](https://arxiv.org/abs/2107.00451)]
* **UniFormer**: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (*CAS + SenseTime*). [[Paper](https://arxiv.org/abs/2201.04676)][[PyTorch](https://github.com/Sense-X/UniFormer)]
* **Video-Swin**: "Video Swin Transformer", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2106.13230)][[PyTorch](https://github.com/SwinTransformer/Video-Swin-Transformer)]
* **DirecFormer**: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (*University of Arkansas*). [[Paper](https://arxiv.org/abs/2203.10233)][[Code (in construction)](https://github.com/uark-cviu/DirecFormer)]
* **DVT**: "Deformable Video Transformer", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2203.16795)]
* **MeMViT**: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2201.08383)]
* **MLP-3D**: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2206.06292)][[PyTorch (in construction)](https://github.com/ZhaofanQiu/MLP-3D)]
* **RViT**: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (*TCL Corporate Research, HK*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Yang_Recurring_the_Transformer_for_Video_Action_Recognition_CVPR_2022_paper.html)]
* **SIFA**: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2206.06931)][[PyTorch](https://github.com/FuchenUSTC/SIFA)]
* **MViTv2**: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2112.01526)][[PyTorch](https://github.com/facebookresearch/mvit)]
* **MTV**: "Multiview Transformers for Video Recognition", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2201.04288)][[Tensorflow](https://github.com/google-research/scenic/tree/main/scenic/projects/mtv)]
* **ORViT**: "Object-Region Video Transformers", CVPR, 2022 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2110.06915)][[Website](https://roeiherz.github.io/ORViT/)]
* **TIME**: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (*KAIST*). [[Paper](https://arxiv.org/abs/2207.09067)][[PyTorch](https://github.com/alinlab/temporal-selfsupervision)]
* **TPS**: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2207.13259)][[PyTorch](https://github.com/MartinXM/TPS)]
* **DualFormer**: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (*Sea AI Lab*). [[Paper](https://arxiv.org/abs/2112.04674)][[PyTorch](https://github.com/sail-sg/dualformer)]
* **STTS**: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (*Fudan University*). [[Paper](https://arxiv.org/abs/2111.11591)][[PyTorch](https://github.com/wangjk666/STTS)]
* **Turbo**: "Turbo Training with Token Dropout", BMVC, 2022 (*Oxford*). [[Paper](https://arxiv.org/abs/2210.04889)]
* **MultiTrain**: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2209.12362)][[Code (in construction)](https://github.com/JunweiLiang/MultiTrain)]
* **SViT**: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 (*Tel Aviv*). [[Paper](https://arxiv.org/abs/2206.06346)][[Website](https://eladb3.github.io/SViT/)]
* **ST-Adapter**: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2206.13559)][[Code (in construction)](https://github.com/linziyi96/st-adapter)]
* **ATA**: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2210.00132)]
* **AIA**: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2204.09303)][[PyTorch](https://github.com/haoyanbin918/Attention-in-Attention)]
* **MSCA**: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (*Nagoya Institute of Technology*). [[Paper](https://arxiv.org/abs/2204.00452)]
* **VAST**: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2208.11108)]
* **Video-MobileFormer**: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.12257)]
* **MAM2**: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2210.05234)]
* **?**: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (*SenseTime*). [[Paper](https://arxiv.org/abs/2210.08164)]
* **STAN**: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (*The University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2208.01753)]
* **PatchBlender**: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 (*Mila*). [[Paper](https://arxiv.org/abs/2211.14449)]
* **DualPath**: "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2303.09857)][[PyTorch (in construction)](https://github.com/park-jungin/DualPath)]
* **S-ViT**: "Streaming Video Model", CVPR, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2303.17228)][[Code (in construction)](https://github.com/yuzhms/Streaming-Video-Model)]
* **TubeViT**: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.03229)]
* **AdaMAE**: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 (*JHU*). [[Paper](https://arxiv.org/abs/2211.09120)][[PyTorch](https://github.com/wgcban/adamae)]
* **ObjectViViT**: "How can objects help action recognition?", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.11726)]
* **SMViT**: "Simple MViT: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (*Meta*). [[Paper](https://omidpoursaeed.github.io/publication/smvit/)]
* **Hiera**: "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2306.00989)][[PyTorch](https://github.com/facebookresearch/hiera)]
* **Video-FocalNet**: "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2307.06947)][[PyTorch](https://github.com/TalalWasim/Video-FocalNets)][[Website](https://talalwasim.github.io/Video-FocalNets/)]
* **ATM**: "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2307.08908)][[Code (in construction)](https://github.com/whwu95/ATM)]
* **STA**: "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2308.04549)]
* **Helping-Hands**: "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2308.07918)][[PyTorch](https://github.com/Chuhanxx/helping_hand_for_egocentric_videos)]
* **SUM-L**: "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 (*University of Delaware*). [[Paper](https://arxiv.org/abs/2308.11489)][[Code (in construction)](https://github.com/wqtwjt1996/SUM-L)]
* **BEAR**: "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", ICCV, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2303.13505)][[GitHub](https://github.com/AndongDeng/BEAR)]
* **UniFormerV2**: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", ICCV, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2211.09552)][[PyTorch](https://github.com/OpenGVLab/UniFormerV2)]
* **CAST**: "CAST: Cross-Attention in Space and Time for Video Action Recognition", NeurIPS, 2023 (*Kyung Hee University*). [[Paper](https://arxiv.org/abs/2311.18825)][[PyTorch](https://github.com/KHU-VLL/CAST)][[Website](https://jong980812.github.io/CAST.github.io/)]
* **PPMA**: "Learning Human Action Recognition Representations Without Real Humans", NeurIPS (Datasets and Benchmarks), 2023 (*IBM*). [[Paper](https://arxiv.org/abs/2311.06231)][[PyTorch](https://github.com/howardzh01/PPMA)]
* **SVT**: "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.00325)]
* **PLAR**: "Prompt Learning for Action Recognition", arXiv, 2023 (*Maryland*). [[Paper](https://arxiv.org/abs/2305.12437)]
* **SFA-ViViT**: "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2306.04822)]
* **TAdaConv**: "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2308.05787)][[PyTorch](https://github.com/alibaba-mmai-research/TAdaConv)]
* **ZeroI2V**: "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2310.01324)]
* **MV-Former**: "Multi-entity Video Transformers for Fine-Grained Video Representation Learning", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2311.10873)][[PyTorch](https://github.com/facebookresearch/video_rep_learning)]
* **GeoDeformer**: "GeoDeformer: Geometric Deformable Transformer for Action Recognition", arXiv, 2023 (*HKUST*). [[Paper](https://arxiv.org/abs/2311.17975)]
* **Early-ViT**: "Early Action Recognition with Action Prototypes", arXiv, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2312.06598)]
* **MCA**: "Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition", ICLR, 2024 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2403.09506)][[PyTorch](https://github.com/BeSpontaneous/MCA-pytorch)]
* **StructViT**: "Learning Correlation Structures for Vision Transformers", CVPR, 2024 (*POSTECH*). [[Paper](https://arxiv.org/abs/2404.03924)]
* **VideoMamba**: "VideoMamba: State Space Model for Efficient Video Understanding", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.06977)][[PyTorch](https://github.com/OpenGVLab/VideoMamba)]
* **Video-Mamba-Suite**: "Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.09626)][[PyTorch](https://github.com/OpenGVLab/video-mamba-suite)]
* Depth:
* **Trear**: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (*Tianjin University*). [[Paper](https://ieeexplore.ieee.org/document/9312201)]
* Pose/Skeleton:
* **ST-TR**: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (*Polytechnic University of Milan*). [[Paper](https://arxiv.org/abs/2012.06399)]
* **AcT**: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2107.00606)][[Code (in construction)](https://github.com/FedericoAngelini/MPOSE2021_Dataset)]
* **STAR**: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (*UCLA*). [[Paper](https://arxiv.org/abs/2107.07089)]
* **GCsT**: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2109.02860)]
* **GL-Transformer**: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (*Seoul National University*). [[Paper](https://arxiv.org/abs/2207.06101)][[PyTorch](https://github.com/Boeun-Kim/GL-Transformer)]
* **?**: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (*University of Delaware*). [[Paper](https://arxiv.org/abs/2208.01161)]
* **FG-STFormer**: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (*Zhengzhou University*). [[Paper](https://arxiv.org/abs/2210.02693)]
* **STTFormer**: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (*Xidian University*). [[Paper](https://arxiv.org/abs/2201.02849)][[Code (in construction)](https://github.com/heleiqiu/STTFormer)]
* **ProFormer**: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2202.11423)][[PyTorch](https://github.com/KPeng9510/ProFormer)]
* **?**: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2206.15002)]
* **HyperSA**: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 (*University of Mannheim, Germany*). [[Paper](https://arxiv.org/abs/2211.09590)]
* **STAR-Transformer**: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (*Keimyung University, Korea*). [[Paper](https://arxiv.org/abs/2210.07503)]
* **STMT**: "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 (*CMU*). [[Paper](https://arxiv.org/abs/2303.18177)][[Code (in construction)](https://github.com/zgzxy001/STMT)]
* **SkeletonMAE**: "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2307.08476)][[Code (in construction)](https://github.com/HongYan1123/SkeletonMAE)]
* **MAMP**: "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 (*USTC*). [[Paper](https://arxiv.org/abs/2308.07092)][[PyTorch](https://github.com/maoyunyao/MAMP)]
* **LAC**: "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 (*INRIA*). [[Paper](https://arxiv.org/abs/2308.14500)][[Website](https://walker1126.github.io/LAC/)]
* **SkeleTR**: "SkeleTR: Towards Skeleton-based Action Recognition in the Wild", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2309.11445)]
* **PCM3**: "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2308.03975)][[Website](https://jhang2020.github.io/Projects/PCM3/PCM3.html)]
* **PoseAwareVT**: "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2306.09331)][[PyTorch](https://github.com/dominickrei/PoseAwareVT)]
* **HandFormer**: "On the Utility of 3D Hand Poses for Action Recognition", arXiv, 2024 (*NUS*). [[Paper](https://arxiv.org/abs/2403.09805)][[Code (in construction)](https://github.com/s-shamil/HandFormer)][[Website](https://s-shamil.github.io/HandFormer/)]
* **SkateFormer**: "SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition", arXiv, 2024 (*KAIST*). [[Paper](https://arxiv.org/abs/2403.09508)][[Code (in construction)](https://github.com/KAIST-VICLab/SkateFormer)][[Website](https://jeonghyeokdo.github.io/SkateFormer_site/)]
* Multi-modal:
* **MBT**: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (*Google*). [[Paper](https://arxiv.org/abs/2107.00135)]
* **MM-ViT**: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (*OPPO*). [[Paper](https://arxiv.org/abs/2108.09322)]
* **MMT-NCRC**: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (*UCF*). [[Paper](https://arxiv.org/abs/2204.04564)][[Code (in construction)](https://github.com/Momilijaz96/MMT_for_NCRC)]
* **M&M**: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2206.09852)]
* **VT-CE**: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2208.01897)]
* **Hi-TRS**: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (*Rutgers*). [[Paper](https://arxiv.org/abs/2207.09644)][[PyTorch](https://github.com/yuxiaochen1103/Hi-TRS)]
* **MVFT**: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (*Alibaba*). [[Paper](https://arxiv.org/abs/2202.12949)]
* **MOV**: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2207.07646)]
* **3Mformer**: "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 (*ANU*). [[Paper](https://arxiv.org/abs/2303.14474)]
* **UMT**: "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2305.01233)]
* **?**: "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 (*KU Leuven*). [[Paper](https://arxiv.org/abs/2307.07483)]
* **MotionBERT**: "MotionBERT: Unified Pretraining for Human Motion Analysis", ICCV, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2210.06551)][[PyTorch](https://github.com/Walter0807/MotionBERT)][[Website](https://motionbert.github.io/)]
* **TIM**: "TIM: A Time Interval Machine for Audio-Visual Action Recognition", CVPR, 2024 (*University of Bristol + Oxford*). [[Paper](https://arxiv.org/abs/2404.05559)][[PyTorch](https://github.com/JacobChalk/TIM)][[Website](https://jacobchalk.github.io/TIM-Project/)]
* Group Activity:
* **GroupFormer**: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (*SenseTime*). [[Paper](https://arxiv.org/abs/2108.12630)]
* **?**: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (*Hitachi*). [[Paper](https://arxiv.org/abs/2207.05254)]
* **GAFL**: "Learning Group Activity Features Through Person Attribute Prediction", CVPR, 2024 (*Toyota Technological Institute, Japan*). [[Paper](https://arxiv.org/abs/2403.02753)]
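
Many of the video backbones listed above (e.g., TimeSformer, ViViT, and X-ViT) factorize attention over time and space instead of attending over all space-time tokens jointly. The PyTorch sketch below illustrates that divided space-time attention pattern; the block layout, dimensions, and hyper-parameters are illustrative assumptions, not any paper's reference implementation.

```python
# Minimal sketch of divided (factorized) space-time self-attention.
# All shapes and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention across frames, then spatial attention within each frame."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) tokens from a patch-embedded video clip
        b, t, p, d = x.shape

        # Temporal attention: each spatial location attends over its T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        nt = self.norm1(xt)
        xt = xt + self.temporal_attn(nt, nt, nt)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends over its P patches.
        xs = x.reshape(b * t, p, d)
        ns = self.norm2(xs)
        xs = xs + self.spatial_attn(ns, ns, ns)[0]
        x = xs.reshape(b, t, p, d)

        # Standard feed-forward block applied to every token.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    clip_tokens = torch.randn(2, 8, 196, 256)  # 2 clips, 8 frames, 14x14 patches
    print(DividedSpaceTimeBlock()(clip_tokens).shape)  # torch.Size([2, 8, 196, 256])
```

Compared with joint space-time attention over all T×P tokens, this factorization reduces the attention cost from O((TP)^2) to O(T^2·P + P^2·T) per block, which is the main reason it appears in so many of the entries above.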

[[Back to Overview](#overview)]

### Action Detection/Localization
* **OadTR**: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (*Huazhong University of Science and Technology*). [[Paper](https://arxiv.org/abs/2106.11149)][[PyTorch](https://github.com/wangxiang1230/OadTR)]
* **RTD-Net**: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2102.01894)][[PyTorch](https://github.com/MCG-NJU/RTD-Action)]
* **FS-TAL**: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2110.10552)][[PyTorch](https://github.com/sauradip/fewshotQAT)]
* **LSTR**: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (*Amazon*). [[Paper](https://arxiv.org/abs/2107.03377)][[PyTorch](https://github.com/amazon-research/long-short-term-transformer)][[Website](https://xumingze0308.github.io/projects/lstr/)]
* **ATAG**: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2103.16024)]
* **TAPG-Transformer**: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2105.12043)]
* **TadTR**: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2106.10271)][[Code (in construction)](https://github.com/xlliu7/TadTR)]
* **Vidpress-Soccer**: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (*Baidu*). [[Paper](https://arxiv.org/abs/2106.14447)][[GitHub](https://github.com/baidu-research/vidpress-sports)]
* **MS-TCT**: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (*INRIA*). [[Paper](https://arxiv.org/abs/2112.03902)][[PyTorch](https://github.com/dairui01/MS-TCT)]
* **UGPT**: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (*Rensselaer Polytechnic Institute, NY*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Guo_Uncertainty-Guided_Probabilistic_Transformer_for_Complex_Action_Recognition_CVPR_2022_paper.html)]
* **TubeR**: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2104.00969)]
* **DDM-Net**: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2112.04771)][[PyTorch](https://github.com/MCG-NJU/DDM)]
* **?**: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2207.03038)][[PyTorch](https://github.com/GX77/Dual-Stream-Transformer-for-Generic-Event-Boundary-Captioning)]
* **?**: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (*Renmin University of China*). [[Paper](https://arxiv.org/abs/2208.05375)]
* **EAMAT**: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (*Beijing Institute of Technology*). [[Paper](https://arxiv.org/abs/2205.05854)][[Code (in construction)](https://github.com/shuoyang129/EAMAT)]
* **STPT**: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2207.10448)]
* **TeSTra**: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (*UT Austin*). [[Paper](https://arxiv.org/abs/2209.09236)][[PyTorch](https://github.com/zhaoyue-zephyrus/TeSTra)]
* **TALLFormer**: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (*UNC*). [[Paper](https://arxiv.org/abs/2204.01680)][[PyTorch](https://github.com/klauscc/TALLFormer)]
* **?**: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (*Rensselaer Polytechnic Institute, NY*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/2138_ECCV_2022_paper.php)]
* **ActionFormer**: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (*UW-Madison*). [[Paper](https://arxiv.org/abs/2202.07925)][[PyTorch](https://github.com/happyharrycn/actionformer_release)]
* **ActionFormer**: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 (*UW-Madison*). [[Paper](https://arxiv.org/abs/2211.09074)][[Pytorch](https://github.com/happyharrycn/actionformer_release)]
* **CoOadTR**: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (*Aarhus University, Denmark*). [[Paper](https://arxiv.org/abs/2201.06268)][[PyTorch](https://github.com/LukasHedegaard/continual-transformers)]
* **Temporal-Perceiver**: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2203.00307)]
* **LocATe**: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (*Stanford*). [[Paper](https://arxiv.org/abs/2203.10719)]
* **HTNet**: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2207.09662)]
* **AdaPerFormer**: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (*Tianjin University*). [[Paper](https://arxiv.org/abs/2208.11908)]
* **CWC-Trans**: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (*Meituan*). [[Paper](https://arxiv.org/abs/2208.14209)]
* **HIT**: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (*NTHU*). [[Paper](https://arxiv.org/abs/2210.12686)][[PyTorch](https://github.com/joslefaure/HIT)]
* **LART**: "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.01199)][[Website](http://people.eecs.berkeley.edu/~jathushan/LART/)]
* **TranS4mer**: "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 (*Comcast*). [[Paper](https://arxiv.org/abs/2212.14427)]
* **TTM**: "Token Turing Machines", CVPR, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2211.09119)][[JAX](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing)]
* **?**: "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 (*NAVER*). [[Paper](https://arxiv.org/abs/2303.17285)]
* **Self-DETR**: "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 (*Sungkyunkwan University, Korea*). [[Paper](https://arxiv.org/abs/2308.10570)]
* **UnLoc**: "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2308.11062)][[JAX](https://github.com/google-research/scenic)]
* **EVAD**: "Efficient Video Action Detection with Token Dropout and Context Refinement", ICCV, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2304.08451)][[PyTorch](https://github.com/MCG-NJU/EVAD)]
* **MS-DETR**: "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2305.18969)][[PyTorch](https://github.com/K-Nick/MS-DETR)]
* **STAR**: "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.12160)]
* **DiffTAD**: "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2303.14863)][[PyTorch (in construction)](https://github.com/sauradip/DiffusionTAD)]
* **MNA-ZBD**: "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 (*Renmin University of China*). [[Paper](https://arxiv.org/abs/2307.10567)]
* **PAT**: "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2308.05051)]
* **ViT-TAD**: "Adapting Short-Term Transformers for Action Detection in Untrimmed Videos", arXiv, 2023 (*Nanjing University (NJU)*). [[Paper](https://arxiv.org/abs/2312.01897)]
* **Cafe**: "Towards More Practical Group Activity Detection: A New Benchmark and Model", arXiv, 2023 (*POSTECH*). [[Paper](https://arxiv.org/abs/2312.02878)][[PyTorch](https://github.com/dk-kim/CAFE_codebase)][[Website](https://dk-kim.github.io/CAFE/)]
* **?**: "Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization", arXiv, 2023 (*Queen Mary, UK*). [[Paper](https://arxiv.org/abs/2312.17686)]
* **SMAST**: "A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection", TPAMI, 2024 (*University of Virginia*). [[Paper](https://arxiv.org/abs/2405.08204)]
* **OV-STAD**: "Open-Vocabulary Spatio-Temporal Action Detection", arXiv, 2024 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2405.10832)]
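
Several of the online detection methods above (e.g., OadTR, LSTR, and TeSTra) classify the ongoing action from a bounded memory of past frame features. The PyTorch sketch below shows that streaming pattern in its simplest form; the buffer size, feature dimension, class count, and single cross-attention read are illustrative assumptions, not any paper's actual architecture.

```python
# Minimal sketch of memory-based online action detection: a learnable query
# attends over a sliding window of past frame features at every time step.
from collections import deque

import torch
import torch.nn as nn


class OnlineActionDetector(nn.Module):
    """Classifies the ongoing action at each step from a bounded memory of frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 21, memory_size: int = 64):
        super().__init__()
        self.memory = deque(maxlen=memory_size)            # sliding buffer of past frame features
        self.query = nn.Parameter(torch.zeros(1, 1, dim))  # learnable "current action" query
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    @torch.no_grad()
    def step(self, frame_feature: torch.Tensor) -> torch.Tensor:
        # frame_feature: (dim,) feature of the newest frame from any frozen backbone
        self.memory.append(frame_feature)
        mem = torch.stack(list(self.memory)).unsqueeze(0)  # (1, T<=memory_size, dim)
        attended, _ = self.read(self.query, mem, mem)      # the query reads the memory
        return self.classifier(attended.squeeze(1))        # (1, num_classes) per-frame logits


if __name__ == "__main__":
    detector = OnlineActionDetector()
    for _ in range(10):                                    # simulate a feature stream
        logits = detector.step(torch.randn(256))
    print(logits.shape)                                    # torch.Size([1, 21])
```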

[[Back to Overview](#overview)]

### Action Prediction/Anticipation
* **AVT**: "Anticipative Video Transformer", ICCV, 2021 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.02036)][[PyTorch](https://github.com/facebookresearch/AVT)][[Website](https://facebookresearch.github.io/AVT/)]
* **TTPP**: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 (*CAS*). [[Paper](https://arxiv.org/abs/2003.03530)]
* **HORST**: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2104.08665)][[PyTorch](https://github.com/CorcovadoMing/HORST)]
* **?**: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2107.08579)]
* **FUTR**: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (*POSTECH*). [[Paper](https://arxiv.org/abs/2205.14022)]
* **VPTR**: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (*Polytechnique Montreal, Canada*). [[Paper](https://arxiv.org/abs/2203.15836)][[PyTorch](https://github.com/XiYe20/VPTR)]
* **Earthformer**: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2207.05833)]
* **InAViT**: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2211.14154)]
* **VPTR**: "Video Prediction by Efficient Transformers", IVC, 2022 (*Polytechnique Montreal, Canada*). [[Paper](https://arxiv.org/abs/2212.06026)][[Pytorch](https://github.com/XiYe20/VPTR)]
* **AFFT**: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2210.12649)][[Code (in construction)](https://github.com/zeyun-zhong/AFFT)]
* **GliTr**: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 (*McGill University, Canada*). [[Paper](https://arxiv.org/abs/2210.13605)]
* **RAFTformer**: "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 (*Honda*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Girase_Latency_Matters_Real-Time_Action_Forecasting_Transformer_CVPR_2023_paper.html)]
* **AdamsFormer**: "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 (*Honda*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Chi_AdamsFormer_for_Spatial_Action_Localization_in_the_Future_CVPR_2023_paper.html)]
* **TemPr**: "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 (*University of Bristol*). [[Paper](https://arxiv.org/abs/2204.13340)][[PyTorch](https://github.com/alexandrosstergiou/progressive-action-prediction)][[Website](https://alexandrosstergiou.github.io/project_pages/TemPr/index.html)]
* **MAT**: "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2308.07893)][[PyTorch](https://github.com/Echo0125/Memory-and-Anticipation-Transformer)]
* **SwinLSTM**: "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 (*Hainan University*). [[Paper](https://arxiv.org/abs/2308.09891)][[PyTorch](https://github.com/SongTang-x/SwinLSTM)]
* **MVP**: "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 (*Boston*). [[Paper](https://arxiv.org/abs/2307.12854)]
* **DiffAnt**: "DiffAnt: Diffusion Models for Action Anticipation", arXiv, 2023 (*Karlsruhe Institute of Technology (KIT), Germany*). [[Paper](https://arxiv.org/abs/2311.15991)]
* **LALM**: "LALM: Long-Term Action Anticipation with Language Models", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2311.17944)]
* **?**: "Learning from One Continuous Video Stream", arXiv, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2312.00598)]
* **ObjectPrompt**: "Object-centric Video Representation for Long-term Action Anticipation", WACV, 2024 (*Honda*). [[Paper](https://arxiv.org/abs/2311.00180)][[Code (in construction)](https://github.com/brown-palm/ObjectPrompt)]

[[Back to Overview](#overview)]

### Video Object Segmentation
* **GC**: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (*Tencent*). [[Paper](https://arxiv.org/abs/2001.11243)]
* **SSTVOS**: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (*Modiface*). [[Paper](https://arxiv.org/abs/2101.08833)][[Code (in construction)](https://github.com/dukebw/SSTVOS)]
* **JOINT**: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (*University of Science and Technology of China*). [[Paper](https://arxiv.org/abs/2108.03679)][[PyTorch](https://github.com/maoyunyao/JOINT)]
* **AOT**: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2106.02638)][[PyTorch (yoxu515)](https://github.com/yoxu515/aot-benchmark)][[Code (in construction)](https://github.com/z-x-yang/AOT)]
* **TransVOS**: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2106.00588)]
* **SITVOS**: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (*JD*). [[Paper](https://arxiv.org/abs/2112.13983)]
* **HODOR**: "Differentiable Soft-Masked Attention", CVPRW, 2022 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2206.00182)]
* **BATMAN**: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.01159)]
* **DeAOT**: "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2210.09782)][[PyTorch](https://github.com/z-x-yang/AOT)]
* **AOT**: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2203.11442)][[PyTorch](https://github.com/yoxu515/aot-benchmark)]
* **MED-VT**: "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 (*York University*). [[Paper](https://arxiv.org/abs/2304.05930)][[Website](https://rkyuca.github.io/medvt/)]
* **?**: "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 (*Shanghai Jiao Tong University (SJTU)*). [[Paper](https://arxiv.org/abs/2304.06211)]
* **Isomer**: "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 (*Dalian University of Technology*). [[Paper](https://arxiv.org/abs/2308.06693)][[PyTorch](https://github.com/DLUT-yyc/Isomer)]
* **SimVOS**: "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2308.09903)]
* **MITS**: "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2308.13266)][[PyTorch](https://github.com/yoxu515/MITS)]
* **VIPMT**: "Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation", ICCV, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2309.11160)][[Code (in construction)](https://github.com/nankepan/VIPMT)]
* **MOSE**: "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes", ICCV, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2302.01872)][[GitHub](https://github.com/henghuiding/MOSE-api)][[Website](https://henghuiding.github.io/MOSE/)]
* **LVOS**: "LVOS: A Benchmark for Long-term Video Object Segmentation", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2211.10181)][[GitHub](https://github.com/LingyiHongfd/LVOS)][[Website](https://lingyihongfd.github.io/lvos.github.io/)]
* **JointFormer**: "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2308.13505)]
* **PanoVOS**: "PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2309.12303)][[Code (in construction)](https://github.com/shilinyan99/PanoVOS)][[Website](https://shilinyan99.github.io/PanoVOS/index_pano.html)]
* **Cutie**: "Putting the Object Back into Video Object Segmentation", arXiv, 2023 (*UIUC*). [[Paper](https://arxiv.org/abs/2310.12982)][[PyTorch](https://github.com/hkchengrex/Cutie)][[Website](https://hkchengrex.com/Cutie/)]
* **M3T**: "M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking", arXiv, 2023 (*UBC*). [[Paper](https://arxiv.org/abs/2312.08514)]
* **?**: "Appearance-based Refinement for Object-Centric Motion Segmentation", arXiv, 2023 (*Oxford*). [[Paper](https://arxiv.org/abs/2312.11463)]
* **DATTT**: "Depth-aware Test-Time Training for Zero-shot Video Object Segmentation", CVPR, 2024 (*University of Macau*). [[Paper](https://arxiv.org/abs/2403.04258)][[PyTorch](https://github.com/NiFangBaAGe/DATTT)][[Website](https://nifangbaage.github.io/DATTT/)]
* **LLE-VOS**: "Event-assisted Low-Light Video Object Segmentation", CVPR, 2024 (*USTC*). [[Paper](https://arxiv.org/abs/2404.01945)]
* **Point-VOS**: "Point-VOS: Pointing Up Video Object Segmentation", arXiv, 2024 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2402.05917)][[Website](https://pointvos.github.io/)]
* **MAVOS**: "Efficient Video Object Segmentation via Modulated Cross-Attention Memory", arXiv, 2024 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2403.17937)][[Code (in construction)](https://github.com/Amshaker/MAVOS)]
* **STMA**: "Spatial-Temporal Multi-level Association for Video Object Segmentation", arXiv, 2024 (*Harbin Institute of Technology*). [[Paper](https://arxiv.org/abs/2404.06265)]
* **Flow-SAM**: "Moving Object Segmentation: All You Need Is SAM (and Flow)", arXiv, 2024 (*Oxford*). [[Paper](https://arxiv.org/abs/2404.12389)][[Website](https://www.robots.ox.ac.uk/~vgg/research/flowsam/)]
* **LVOSv2**: "LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation", arXiv, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2404.19326)][[GitHub](https://github.com/LingyiHongfd/LVOS)][[Website](https://lingyihongfd.github.io/lvos.github.io/)]
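
Most attention-based VOS methods above (e.g., the AOT/DeAOT and Cutie families) propagate masks by letting current-frame pixels read from a memory of past frames via attention. A minimal PyTorch sketch of such a memory read follows; the key/value dimensions and the single-head dot-product formulation are simplifying assumptions.

```python
# Minimal sketch of an attention-based memory read for mask propagation:
# current-frame pixels query a memory of past-frame keys/values.
import torch
import torch.nn.functional as F


def memory_read(query_key: torch.Tensor, memory_key: torch.Tensor, memory_value: torch.Tensor) -> torch.Tensor:
    """query_key: (HW, Ck); memory_key: (T*HW, Ck); memory_value: (T*HW, Cv)."""
    affinity = query_key @ memory_key.t() / query_key.shape[-1] ** 0.5  # (HW, T*HW) similarity
    weights = F.softmax(affinity, dim=-1)                               # soft space-time correspondence
    return weights @ memory_value                                       # (HW, Cv) propagated mask features


if __name__ == "__main__":
    hw, ck, cv, t = 24 * 24, 64, 128, 4
    readout = memory_read(torch.randn(hw, ck), torch.randn(t * hw, ck), torch.randn(t * hw, cv))
    print(readout.shape)  # torch.Size([576, 128]) -> decoded into the current-frame mask
```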

[[Back to Overview](#overview)]

### Video Instance Segmentation
* **VisTR**: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (*Meituan*). [[Paper](https://arxiv.org/abs/2011.14503)][[PyTorch](https://github.com/Epiphqny/VisTR)]
* **IFC**: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2106.03299)][[PyTorch](https://github.com/sukjunhwang/IFC)]
* **Deformable-VisTR**: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (*University at Buffalo*). [[Paper](https://arxiv.org/abs/2203.06318)][[Code (in construction)](https://github.com/skrya/DefVIS)]
* **TeViT**: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2204.08412)][[PyTorch](https://github.com/hustvl/TeViT)]
* **GMP-VIS**: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (*Shandong University*). [[Paper](https://paperswithcode.com/paper/a-graph-matching-perspective-with)]
* **VMT**: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (*ETHZ*). [[Paper](https://arxiv.org/abs/2207.14012)][[GitHub](https://github.com/SysCV/vmt)][[Website](https://www.vis.xyz/pub/vmt/)]
* **SeqFormer**: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (*ByteDance*). [[Paper](https://arxiv.org/abs/2112.08275)][[PyTorch](https://github.com/wjf5203/SeqFormer)]
* **MS-STS**: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2203.13253)][[PyTorch](https://github.com/OmkarThawakar/MSSTS-VIS)]
* **MinVIS**: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2208.02245)][[PyTorch](https://github.com/NVlabs/MinVIS)]
* **VITA**: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 (*Yonsei University*). [[Paper](https://arxiv.org/abs/2206.04403)][[PyTorch](https://github.com/sukjunhwang/VITA)]
* **IFR**: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2206.07011)]
* **DeVIS**: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2207.11103)][[PyTorch](https://github.com/acaelles97/DeVIS)]
* **InstanceFormer**: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (*Ludwig Maximilian University of Munich*). [[Paper](https://arxiv.org/abs/2208.10547)][[Code (in construction)](https://github.com/rajatkoner08/InstanceFormer)]
* **MaskFreeVIS**: "Mask-Free Video Instance Segmentation", CVPR, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2303.15904)][[PyTorch](https://github.com/SysCV/MaskFreeVis)]
* **MDQE**: "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 (*Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2303.14395)][[PyTorch](https://github.com/MinghanLi/MDQE_CVPR2023)]
* **GenVIS**: "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 (*Yonsei*). [[Paper](https://arxiv.org/abs/2211.08834)][[PyTorch](https://github.com/miranheo/GenVIS)]
* **CTVIS**: "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2307.12616)][[PyTorch](https://github.com/KainingYing/CTVIS)]
* **TCOVIS**: "TCOVIS: Temporally Consistent Online Video Instance Segmentation", ICCV, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2309.11857)][[Code (in construction)](https://github.com/jun-long-li/TCOVIS)]
* **DVIS**: "DVIS: Decoupled Video Instance Segmentation Framework", ICCV, 2023 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2306.03413)][[PyTorch](https://github.com/zhang-tao-whu/DVIS)]
* **TMT-VIS**: "TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation", NeurIPS, 2023 (*HKU*). [[Paper](https://arxiv.org/abs/2312.06630)][[Code (in construction)](https://github.com/rkzheng99/TMT-VIS)]
* **BoxVIS**: "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 (*Hong Kong Polytechnic University*). [[Paper](https://arxiv.org/abs/2303.14618)][[Code (in construction)](https://github.com/MinghanLi/BoxVIS)]
* **OW-VISFormer**: "Video Instance Segmentation in an Open-World", arXiv, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2304.01200)][[Code (in construction)](https://github.com/OmkarThawakar/OWVISFormer)]
* **GRAtt-VIS**: "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 (*LMU Munich*). [[Paper](https://arxiv.org/abs/2305.17096)][[Code (in construction)](https://github.com/Tanveer81/GRAttVIS)]
* **RefineVIS**: "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2306.04774)]
* **VideoCutLER**: "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2308.14710)][[PyTorch](https://github.com/facebookresearch/CutLER/tree/main/videocutler)]
* **NOVIS**: "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 (*TUM*). [[Paper](https://arxiv.org/abs/2308.15266)]
* **VISAGE**: "VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement", arXiv, 2023 (*Yonsei*). [[Paper](https://arxiv.org/abs/2312.04885)][[Code (in construction)](https://github.com/KimHanjung/VISAGE)]
* **OW-VISCap**: "OW-VISCap: Open-World Video Instance Segmentation and Captioning", arXiv, 2024 (*UIUC*). [[Paper](https://arxiv.org/abs/2404.03657)][[Website](https://anwesachoudhuri.github.io/OpenWorldVISCap/)]
* **PointVIS**: "What is Point Supervision Worth in Video Instance Segmentation?", arXiv, 2024 (*NVIDIA*). [[Paper](https://arxiv.org/abs/2404.01990)]

[[Back to Overview](#overview)]

### Other Video Tasks
* Action Segmentation
* **ASFormer**: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (*Peking University*). [[Paper](https://arxiv.org/abs/2110.08568)][[PyTorch](https://github.com/ChinaYi/ASFormer)]
* **Bridge-Prompt**: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (*Tsinghua University*). [[Paper](https://arxiv.org/abs/2203.14104)][[PyTorch](https://github.com/ttlmh/Bridge-Prompt)]
* **SC-Transformer++**: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2206.12634)][[Code (in construction)](https://github.com/lufficc/SC-Transformer)]
* **UVAST**: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (*Bosch*). [[Paper](https://arxiv.org/abs/2209.00638)][[PyTorch](https://github.com/boschresearch/UVAST)]
* **?**: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (*TUM*). [[Paper](https://arxiv.org/abs/2201.05675)]
* **CETNet**: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (*Shijiazhuang Tiedao University*). [[Paper](https://arxiv.org/abs/2205.09445)]
* **EUT**: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2205.13425)]
* **SC-Transformer**: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (*CAS*). [[Paper](https://arxiv.org/abs/2206.02985)]
* **DXFormer**: "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 (*Northeastern University*). [[Paper](https://arxiv.org/abs/2305.11365)][[Website (in construction)](https://www.sail-nu.com/dxformer)]
* **LTContext**: "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 (*University of Bonn*). [[Paper](https://arxiv.org/abs/2308.11358)][[PyTorch](https://github.com/LTContext/LTContext)]
* **DiffAct**: "Diffusion Action Segmentation", ICCV, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2303.17959)][[PyTorch](https://github.com/Finspire13/DiffAct)]
* **TST**: "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 (*Shanghai Tech*). [[Paper](https://arxiv.org/abs/2302.13074)]
* Video X Segmentation:
* **STT**: "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 (*Shanghai Jiao Tong*). [[Paper](https://dl.acm.org/doi/abs/10.1145/3474085.3475409)]
* **CFFM**: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (*ETH Zurich*). [[Paper](https://arxiv.org/abs/2204.03330)][[PyTorch](https://github.com/GuoleiSun/VSS-CFFM)]
* **TF-DL**: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (*Google*). [[Paper](https://arxiv.org/abs/2205.15361)]
* **Video-K-Net**: "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 (*Peking University*). [[Paper](https://arxiv.org/abs/2204.04656)][[PyTorch](https://github.com/lxtGH/Video-K-Net)]
* **MRCFA**: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (*ETH Zurich*). [[Paper](https://arxiv.org/pdf/2207.10436)][[PyTorch](https://github.com/GuoleiSun/VSS-MRCFA)]
* **PolyphonicFormer**: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2112.02582)][[Code (in construction)](https://github.com/HarborYuan/PolyphonicFormer)]
* **?**: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (*Technical University of Cluj-Napoca, Romania*). [[Paper](https://arxiv.org/abs/2210.03546)]
* **CAROQ**: "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 (*UIUC*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Choudhuri_Context-Aware_Relative_Object_Queries_To_Unify_Video_Instance_and_Panoptic_CVPR_2023_paper.html)][[PyTorch](https://github.com/AnwesaChoudhuri/CAROQ)][[Website](https://anwesachoudhuri.github.io/ContextAwareRelativeObjectQueries/)]
* **TarViS**: "TarViS: A Unified Approach for Target-based Video Segmentation", CVPR, 2023 (*RWTH Aachen University, Germany*). [[Paper](https://arxiv.org/abs/2301.02657)][[PyTorch](https://github.com/Ali2500/TarViS)]
* **MEGA**: "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2308.11185)]
* **DEVA**: "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 (*UIUC*). [[Paper](https://arxiv.org/abs/2309.03903)][[PyTorch](https://github.com/hkchengrex/Tracking-Anything-with-DEVA)][[Website](https://hkchengrex.com/Tracking-Anything-with-DEVA/)]
* **Tube-Link**: "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 (*NTU, Singapore*). [[Paper](https://arxiv.org/abs/2303.12782)][[PyTorch](https://github.com/lxtGH/Tube-Link)]
* **THE-Mask**: "Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation", BMVC, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2309.08020)][[Code (in construction)](https://github.com/ZhaochongAn/THE-Mask)]
* **MPVSS**: "Mask Propagation for Efficient Video Semantic Segmentation", NeurIPS, 2023 (*Monash University, Australia*). [[Paper](https://arxiv.org/abs/2310.18954)][[Code (in construction)](https://github.com/ziplab/MPVSS)]
* **Video-kMaX**: "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.04694)]
* **SAM-PT**: "Segment Anything Meets Point Tracking", arXiv, 2023 (*ETHZ*). [[Paper](https://arxiv.org/abs/2307.01197)][[Code (in construction)](https://github.com/SysCV/sam-pt)]
* **TTT-MAE**: "Test-Time Training on Video Streams", arXiv, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2307.05014)][[Website](https://video-ttt.github.io/)]
* **UniVS**: "UniVS: Unified and Universal Video Segmentation with Prompts as Queries", CVPR, 2024 (*OPPO*). [[Paper](https://arxiv.org/abs/2402.18115)][[PyTorch](https://github.com/MinghanLi/UniVS)][[Website](https://sites.google.com/view/unified-video-seg-univs)]
* **DVIS++**: "DVIS++: Improved Decoupled Framework for Universal Video Segmentation", arXiv, 2024 (*Wuhan University*). [[Paper](https://arxiv.org/abs/2312.13305)][[PyTorch](https://github.com/zhang-tao-whu/DVIS_Plus)]
* **SAM-PD**: "SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising", arXiv, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2403.04194)][[PyTorch (in construction)](https://github.com/infZhou/SAM-PD)]
* **OneVOS**: "OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework", arXiv, 2024 (*Fudan*). [[Paper](https://arxiv.org/abs/2403.08682)]
* Video Object Detection:
* **TransVOD**: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (*Shanghai Jiao Tong + SenseTime*). [[Paper](https://arxiv.org/abs/2105.10920)][[Code (in construction)](https://github.com/SJTU-LuHe/TransVOD)]
* **MODETR**: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (*Valeo, Egypt*). [[Paper](https://arxiv.org/abs/2106.11422)]
* **ST-MTL**: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (*Valeo, Egypt*). [[Paper](https://arxiv.org/abs/2106.11401)]
* **ST-DETR**: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (*Valeo, Egypt*). [[Paper](https://arxiv.org/abs/2107.05887)]
* **PTSEFormer**: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://arxiv.org/abs/2209.02242)][[PyTorch](https://github.com/Hon-Wong/PTSEFormer)]
* **TransVOD**: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (*Shanghai Jiao Tong + SenseTime*). [[Paper](https://arxiv.org/abs/2201.05047)]
* **?**: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (*Zenseact, Sweden*). [[Paper](https://arxiv.org/abs/2204.10321)]
* **ClipVID**: "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 (*University of Adelaide, Australia*). [[Paper](https://arxiv.org/abs/2308.07737)][[Code (in construction)](https://github.com/bladewaltz1/clipvid)]
* **OCL**: "Unsupervised Open-Vocabulary Object Localization in Videos", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2309.09858)]
* **CETR**: "Context Enhanced Transformer for Single Image Object Detection", AAAI, 2024 (*Korea University*). [[Paper](https://arxiv.org/abs/2312.14492)][[Code (in construction)](https://github.com/KU-CVLAB/CETR)][[Website](https://ku-cvlab.github.io/CETR/)]
* Dense Video Tasks (Detection + Segmentation):
* **TDViT**: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (*Queen's University Belfast, UK*). [[Paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/5559_ECCV_2022_paper.php)][[Code (in construction)](https://github.com/guanxiongsun/TDViT)]
* **FAQ**: "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 (*UCF*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Cui_Feature_Aggregated_Queries_for_Transformer-Based_Video_Object_Detectors_CVPR_2023_paper.html)][[PyTorch](https://github.com/YimingCuiCuiCui/FAQ)]
* **Video-OWL-ViT**: "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2308.11093)]
* Video Retrieval:
* **SVRTN**: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (*Alibaba*). [[Paper](https://arxiv.org/abs/2104.07993)]
* Video Hashing:
* **BTH**: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (*Tsinghua*). [[Paper](https://openaccess.thecvf.com/content/CVPR2021/html/Li_Self-Supervised_Video_Hashing_via_Bidirectional_Transformers_CVPR_2021_paper.html)][[PyTorch](https://github.com/Lily1994/BTH)]
* Video-Language:
* **ActionCLIP**: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2109.08472)][[PyTorch](https://github.com/sallymmx/ActionCLIP)]
* **?**: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (*Shanghai Jiao Tong + Oxford*). [[Paper](https://arxiv.org/abs/2112.04478)][[PyTorch](https://github.com/ju-chen/Efficient-Prompt)][[Website](https://ju-chen.github.io/efficient-prompt/)]
* **X-CLIP**: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2208.02816)][[PyTorch](https://github.com/microsoft/VideoX/tree/master/X-CLIP)]
* **EVL**: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2208.03550)][[PyTorch (in construction)](https://github.com/OpenGVLab/efficient-video-recognition)]
* **STALE**: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2207.08184)][[Code (in construction)](https://github.com/sauradip/STALE)]
* **?**: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 (*Beijing Laboratory of Intelligent Information Technology*). [[Paper](https://arxiv.org/abs/2211.12030)]
* **VLG**: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2212.01638)]
* **InternVideo**: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2212.03191)][[Code (in construction)](https://github.com/OpenGVLab/InternVideo)][[Website](https://opengvlab.shlab.org.cn/home)]
* **PromptonomyViT**: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 (*Tel Aviv + IBM*). [[Paper](https://arxiv.org/abs/2212.04821)]
* **MUPPET**: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2211.14905)][[Code (in construction)](https://github.com/sauradip/MUPPET)]
* **MovieCLIP**: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (*USC*). [[Paper](https://arxiv.org/abs/2210.11065)][[Website](https://sail.usc.edu/~mica/MovieCLIP/)]
* **TranZAD**: "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 (*UC Riverside*). [[Paper](https://openaccess.thecvf.com/content/WACV2023/html/Nag_Semantics_Guided_Contrastive_Learning_of_Transformers_for_Zero-Shot_Temporal_Activity_WACV_2023_paper.html)]
* **Text4Vis**: "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 (*Baidu*). [[Paper](https://arxiv.org/abs/2207.01297)][[PyTorch](https://github.com/whwu95/Text4Vis)]
* **AIM**: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2302.03024)][[PyTorch](https://github.com/taoyang1122/adapt-image-models)][[Website](https://adapt-image-models.github.io/)]
* **ViFi-CLIP**: "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2212.03640)][[PyTorch](https://github.com/muzairkhattak/ViFi-CLIP)]
* **LaViLa**: "Learning Video Representations from Large Language Models", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2212.04501)][[PyTorch](https://github.com/facebookresearch/LaViLa)][[Website](https://facebookresearch.github.io/LaViLa/)]
* **TVP**: "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 (*Intel*). [[Paper](https://arxiv.org/abs/2303.04995)]
* **Vita-CLIP**: "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 (*MBZUAI*). [[Paper](https://arxiv.org/abs/2304.03307)][[PyTorch](https://github.com/TalalWasim/Vita-CLIP)]
* **STAN**: "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2301.11116)][[PyTorch](https://github.com/farewellthree/STAN)]
* **CBP-VLP**: "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2212.09335)]
* **BIKE**: "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 (*The University of Sydney*). [[Paper](https://arxiv.org/abs/2301.00182)][[PyTorch](https://github.com/whwu95/BIKE)]
* **HierVL**: "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2301.02311)][[PyTorch](https://github.com/facebookresearch/HierVL)]
* **?**: "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2301.02074)][[PyTorch](https://github.com/bpiyush/TestOfTime)][[Website](https://bpiyush.github.io/testoftime-website/index.html)]
* **Open-VCLIP**: "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2302.00624)][[PyTorch](https://github.com/wengzejia1/Open-VCLIP)]
* **ILA**: "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2304.10465)][[PyTorch](https://github.com/Francis-Rings/ILA)]
* **OV2Seg**: "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 (*University of Amsterdam*). [[Paper](https://arxiv.org/abs/2304.01715)][[PyTorch](https://github.com/haochenheheda/LVVIS)]
* **DiST**: "Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2309.07911)][[PyTorch](https://github.com/alibaba-mmai-research/DiST)]
* **GAP**: "Generative Action Description Prompts for Skeleton-based Action Recognition", ICCV, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2208.05318)][[PyTorch](https://github.com/MartinXM/GAP)]
* **MAXI**: "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", ICCV, 2023 (*Graz University of Technology, Austria*). [[Paper](https://arxiv.org/abs/2303.08914)][[PyTorch](https://github.com/wlin-at/MAXI)]
* **?**: "Language as the Medium: Multimodal Video Classification through text only", ICCVW, 2023 (*Unitary, UK*). [[Paper](https://arxiv.org/abs/2309.10783)]
* **MAP**: "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 (*Tencent*). [[Paper](https://arxiv.org/abs/2308.04828)]
* **OTI**: "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 (*CAS*). [[Paper](https://arxiv.org/abs/2308.06897)][[Code (in construction)](https://github.com/sweetorangezhuyan/mm2023_oti)]
* **Symbol-LLM**: "Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning", NeurIPS, 2023 (*Shanghai Jiao Tong University (SJTU)*). [[Paper](https://arxiv.org/abs/2311.17365)][[Code (in construction)](https://github.com/enlighten0707/Symbol-LLM)][[Website](https://mvig-rhos.com/symbol_llm)]
* **OAP-AOP**: "Opening the Vocabulary of Egocentric Actions", NeurIPS, 2023 (*NUS*). [[Paper](https://arxiv.org/abs/2308.11488)][[PyTorch (in construction)](https://github.com/dibschat/openvocab-egoAR)][[Website](https://dibschat.github.io/openvocab-egoAR/)]
* **CLIP-FSAR**: "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2303.02982)][[PyTorch](https://github.com/alibaba-mmai-research/CLIP-FSAR)]
* **?**: "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 (*Shanghai Jiao Tong*). [[Paper](https://arxiv.org/abs/2303.11732)]
* **VicTR**: "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2304.02560)]
* **OpenVIS**: "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2305.16835)]
* **ALGO**: "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 (*Oklahoma State University*). [[Paper](https://arxiv.org/abs/2305.16602)]
* **?**: "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2212.10596)]
* **MSQNet**: "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 (*University of Surrey, UK*). [[Paper](https://arxiv.org/abs/2307.10763)][[Code (in construction)](https://github.com/mondalanindya/MSQNet)]
* **AVION**: "Training a Large Video Model on a Single Machine in a Day", arXiv, 2023 (*UT Austin*). [[Paper](https://arxiv.org/abs/2309.16669)][[PyTorch](https://github.com/zhaoyue-zephyrus/AVION)]
* **Open-VCLIP**: "Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2310.05010)][[PyTorch](https://github.com/wengzejia1/Open-VCLIP)]
* **Videoprompter**: "Videoprompter: an ensemble of foundational models for zero-shot video understanding", arXiv, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2310.15324)]
* **MM-VID**: "MM-VID: Advancing Video Understanding with GPT-4V(vision)", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2310.19773)][[Website](https://multimodal-vid.github.io/)]
* **Chat-UniVi**: "Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding", arXiv, 2023 (*Peking*). [[Paper](https://arxiv.org/abs/2311.08046)]
* **Side4Video**: "Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning", arXiv, 2023 (*Tsinghua*). [[Paper](https://arxiv.org/abs/2311.15769)][[Code (in construction)](https://github.com/HJYao00/Side4Video)]
* **ALT**: "Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition", arXiv, 2023 (*Huawei*). [[Paper](https://arxiv.org/abs/2311.15619)]
* **MM-Narrator**: "MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning", arXiv, 2023 (*Microsoft*). [[Paper](https://arxiv.org/abs/2311.17435)][[Website](https://mm-narrator.github.io/)]
* **Spacewalk-18**: "Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains", arXiv, 2023 (*Brown*). [[Paper](https://arxiv.org/abs/2311.18773)][[Website](https://brown-palm.github.io/Spacewalk-18/)]
* **OST**: "OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition", arXiv, 2023 (*Hunan University (HNU)*). [[Paper](https://arxiv.org/abs/2312.00096)][[Code (in construction)](https://github.com/tomchen-ctj/OST)][[Website](https://tomchen-ctj.github.io/OST/)]
* **AP-CLIP**: "Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition", arXiv, 2023 (*Xi'an Jiaotong*). [[Paper](https://arxiv.org/abs/2312.02226)]
* **EZ-CLIP**: "EZ-CLIP: Efficient Zeroshot Video Action Recognition", arXiv, 2023 (*Østfold University College, Norway*). [[Paper](https://arxiv.org/abs/2312.08010)][[PyTorch (in construction)](https://github.com/Shahzadnit/EZ-CLIP)]
* **M2-CLIP**: "M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition", AAAI, 2024 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2401.11649)]
* **FROSTER**: "FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition", ICLR, 2024 (*Baidu*). [[Paper](https://arxiv.org/abs/2402.03241)][[PyTorch](https://github.com/Visual-AI/FROSTER)][[Website](https://visual-ai.github.io/froster/)]
* **LaIAR**: "Language Model Guided Interpretable Video Action Reasoning", CVPR, 2024 (*Xidian University*). [[Paper](https://arxiv.org/abs/2404.01591)][[Code (in construction)](https://github.com/NingWang2049/LaIAR)]
* **BriVIS**: "Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation", arXiv, 2024 (*Peking*). [[Paper](https://arxiv.org/abs/2401.09732)][[PyTorch (in construction)](https://github.com/sennnnn/OpenVIS)]
* **ActionHub**: "ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition", arXiv, 2024 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2401.11654)]
* **ZERO**: "Zero Shot Open-ended Video Inference", arXiv, 2024 (*A\*STAR*). [[Paper](https://arxiv.org/abs/2401.12471)]
* **SATA**: "Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition", arXiv, 2024 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2403.01560)][[Code (in construction)](https://github.com/KunyuLin/XOV-Action/)]
* **CLIP-VIS**: "CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation", arXiv, 2024 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2403.12455)][[PyTorch](https://github.com/zwq456/CLIP-VIS)]
* X-supervised Learning:
* **LSTCL**: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (*Facebook*). [[Paper](https://arxiv.org/abs/2106.09212)]
* **SVT**: "Self-supervised Video Transformer", CVPR, 2022 (*Stony Brook*). [[Paper](https://arxiv.org/abs/2112.01514)][[PyTorch](https://github.com/kahnchana/svt)][[Website](https://kahnchana.github.io/svt/)]
* **BEVT**: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2112.01529)][[PyTorch](https://github.com/xyzforever/BEVT)]
* **SCVRL**: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (*Amazon*). [[Paper](https://arxiv.org/abs/2205.11710)]
* **VIMPAC**: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (*UNC*). [[Paper](https://arxiv.org/abs/2106.11250)][[PyTorch](https://github.com/airsplay/vimpac)]
* **?**: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2207.12795)]
* **VideoMAE**: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 (*Tencent*). [[Paper](https://arxiv.org/abs/2203.12602)][[PyTorch](https://github.com/MCG-NJU/VideoMAE)]
* **MAE-ST**: "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 (*Meta*). [[Paper](https://arxiv.org/abs/2205.09113)][[PyTorch](https://github.com/facebookresearch/mae_st)]
* **?**: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (*Georgia Tech*). [[Paper](https://arxiv.org/abs/2209.07474)]
* **MaskViT**: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2206.11894)][[Code (in construction)](https://github.com/agrimgupta92/maskvit)][[Website](https://maskedvit.github.io/)]
* **WeakSVR**: "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2303.12370)][[PyTorch](https://github.com/svip-lab/WeakSVR)]
* **VideoMAE-V2**: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2303.16727)][[PyTorch](https://github.com/OpenGVLab/VideoMAEv2)]
* **SVFormer**: "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2211.13222)][[PyTorch](https://github.com/ChenHsing/SVFormer)]
* **OmniMAE**: "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2206.08356)][[PyTorch](https://github.com/facebookresearch/omnivore)]
* **MVD**: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 (*Fudan University*). [[Paper](https://arxiv.org/abs/2212.04500)][[PyTorch](https://github.com/ruiwang2021/mvd)]
* **MME**: "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2210.06096)][[PyTorch](https://github.com/XinyuSun/MME)]
* **MGMAE**: "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 (*Shanghai AI Lab*). [[Paper](https://arxiv.org/abs/2308.10794)]
* **MGM**: "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 (*Amazon*). [[Paper](https://arxiv.org/abs/2308.12962)]
* **TimeT**: "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", ICCV, 2023 (*UvA*). [[Paper](https://arxiv.org/abs/2308.11796)][[PyTorch](https://github.com/SMSD75/Timetuning)]
* **LSS**: "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", NeurIPS, 2023 (*Stony Brook*). [[Paper](https://arxiv.org/abs/2307.10922)]
* **VITO**: "Self-supervised video pretraining yields human-aligned visual representations", NeurIPS, 2023 (*DeepMind*). [[Paper](https://arxiv.org/abs/2210.06433)]
* **SiamMAE**: "Siamese Masked Autoencoders", NeurIPS, 2023 (*Stanford*). [[Paper](https://arxiv.org/abs/2305.14344)][[Website](https://siam-mae-video.github.io/)]
* **ViC-MAE**: "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 (*Rice University*). [[Paper](https://arxiv.org/abs/2303.12001)]
* **LSTA**: "Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation", arXiv, 2023 (*Hangzhou Dianzi University*). [[Paper](https://arxiv.org/abs/2309.11707)]
* **DoRA**: "Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video", arXiv, 2023 (*INRIA*). [[Paper](https://arxiv.org/abs/2310.08584)]
* **AMD**: "Asymmetric Masked Distillation for Pre-Training Small Foundation Models", arXiv, 2023 (*Nanjing University*). [[Paper](https://arxiv.org/abs/2311.03149)]
* **SSL-UVOS**: "Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation", arXiv, 2023 (*CUHK*). [[Paper](https://arxiv.org/abs/2311.17893)]
* **NMS**: "No More Shortcuts: Realizing the Potential of Temporal Self-Supervision", AAAI, 2024 (*Adobe*). [[Paper](https://arxiv.org/abs/2312.13008)][[Website](https://daveishan.github.io/nms-webpage/)]
* **VideoMAC**: "VideoMAC: Video Masked Autoencoders Meet ConvNets", CVPR, 2024 (*Nanjing University of Science and Technology*). [[Paper](https://arxiv.org/abs/2402.19082)][[PyTorch](https://github.com/NUST-Machine-Intelligence-Laboratory/VideoMAC)]
* **GPM**: "Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention", arXiv, 2024 (*HKUST*). [[Paper](https://arxiv.org/abs/2401.13937)]
* **MV2MAE**: "MV2MAE: Multi-View Video Masked Autoencoders", arXiv, 2024 (*Amazon*). [[Paper](https://arxiv.org/abs/2401.15900)]
* **V-JEPA**: "Revisiting Feature Prediction for Learning Visual Representations from Video", arXiv, 2024 (*Meta*). [[Paper](https://arxiv.org/abs/2404.08471)][[PyTorch](https://github.com/facebookresearch/jepa)][[Website](https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/)]
* Transfer Learning/Adaptation:
* **APT**: "Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling", FG, 2024 (*JHU*). [[Paper](https://arxiv.org/abs/2403.06978)][[PyTorch](https://github.com/wgcban/apt)]
* X-shot:
* **ResT**: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (*Microsoft*). [[Paper](https://arxiv.org/abs/2205.01657)]
* **ViSET**: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (*University of South Florida*). [[Paper](https://arxiv.org/abs/2203.05156)]
* **REST**: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (*Samsung*). [[Paper](https://arxiv.org/abs/2209.15000)]
* **MoLo**: "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2304.00946)][[Code (in construction)](https://github.com/alibaba-mmai-research/MoLo)]
* **MA-CLIP**: "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 (*Zhejiang*). [[Paper](https://arxiv.org/abs/2308.01532)]
* **SA-CT**: "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 (*Fudan*). [[Paper](https://arxiv.org/abs/2308.07119)]
* **CapFSAR**: "Few-shot Action Recognition with Captioning Foundation Models", arXiv, 2023 (*Alibaba*). [[Paper](https://arxiv.org/abs/2310.10125)]
* Multi-Task:
* **EgoPack**: "A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives", CVPR, 2024 (*Politecnico di Torino, Italy*). [[Paper](https://arxiv.org/abs/2403.03037)][[PyTorch (in construction)](https://github.com/sapeirone/EgoPack)][[Website](https://sapeirone.github.io/EgoPack/)]
* Anomaly Detection:
* **CT-D2GAN**: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (*NEC*). [[Paper](https://arxiv.org/abs/2107.13720)]
* **ADTR**: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", ICONIP, 2022 (*Shanghai Jiao Tong University*). [[Paper](https://arxiv.org/abs/2209.01816)]
* **SSMCTB**: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (*UCF*). [[Paper](https://arxiv.org/abs/2209.12148)][[Code (in construction)](https://github.com/ristea/ssmctb)]
* **?**: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (*Korea University*). [[Paper](https://arxiv.org/abs/2206.08568)]
* **CLIP-TSA**: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", ICIP, 2023 (*University of Arkansas*). [[Paper](https://arxiv.org/abs/2212.05136)]
* **?**: "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 (*Konica Minolta, Japan*). [[Paper](https://arxiv.org/abs/2303.15167)]
* **TPWNG**: "Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection", CVPR, 2024 (*Xidian University*). [[Paper](https://arxiv.org/abs/2404.08531)]
* Relation Detection:
* **VidVRD**: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2108.08669)][[PyTorch](https://github.com/Dawn-LX/VidVRD-tracklets)]
* **VRDFormer**: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (*Renmin University of China*). [[Paper](https://openaccess.thecvf.com/content/CVPR2022/html/Zheng_VRDFormer_End-to-End_Video_Visual_Relation_Detection_With_Transformers_CVPR_2022_paper.html)][[Code (in construction)](https://github.com/zhengsipeng/VRDFormer_VRD)]
* **VidSGG-BIG**: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2112.04222)][[PyTorch](https://github.com/Dawn-LX/VidSGG-BIG)]
* **RePro**: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 (*Zhejiang University*). [[Paper](https://arxiv.org/abs/2302.00268)][[PyTorch (in construction)](https://github.com/Dawn-LX/OpenVoc-VidVRD)]
* Saliency Prediction:
* **STSANet**: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (*Shanghai University*). [[Paper](https://arxiv.org/abs/2108.10696)]
* **UFO**: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (*South China University of Technology*). [[Paper](https://arxiv.org/abs/2203.04708)][[PyTorch](https://github.com/suyukun666/UFO)]
* **DMT**: "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2305.00514)][[PyTorch](https://github.com/dragonlee258079/DMT)]
* **CASP-Net**: "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective", CVPR, 2023 (*Northwestern Polytechnical University*). [[Paper](https://arxiv.org/abs/2303.06357)]
* Video Inpainting Detection:
* **FAST**: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (*Tsinghua University*). [[Paper](https://openaccess.thecvf.com/content/ICCV2021/html/Yu_Frequency-Aware_Spatiotemporal_Transformers_for_Video_Inpainting_Detection_ICCV_2021_paper.html)]
* Driver Activity:
* **TransDARC**: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (*Karlsruhe Institute of Technology, Germany*). [[Paper](https://arxiv.org/abs/2203.00927)]
* **?**: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (*Jericho High School, NY*). [[Paper](https://arxiv.org/abs/2207.12148)]
* **ViT-DD**: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (*Purdue*). [[Paper](https://arxiv.org/abs/2209.09178)][[PyTorch (in construction)](https://github.com/PurdueDigitalTwin/ViT-DD)]
* Video Alignment:
* **DGWT**: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (*University of New South Wales, Australia*). [[Paper](https://www.bmvc2021-virtualconference.com/assets/papers/0993.pdf)]
* Sport-related:
* **Skating-Mixer**: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (*Southern University of Science and Technology*). [[Paper](https://arxiv.org/abs/2203.03990)]
* Action Counting:
* **TransRAC**: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (*ShanghaiTech*). [[Paper](https://arxiv.org/abs/2204.01018)][[PyTorch](https://github.com/SvipRepetitionCounting/TransRAC)][[Website](https://svip-lab.github.io/dataset/RepCount_dataset.html)]
* **PoseRAC**: "PoseRAC: Pose Saliency Transformer for Repetitive Action Counting", arXiv, 2023 (*Peking University*). [[Paper](https://arxiv.org/abs/2303.08450)][[PyTorch](https://github.com/MiracleDance/PoseRAC)]
* Action Quality Assessment:
* **?**: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (*Baidu*). [[Paper](https://arxiv.org/abs/2207.09270)]
* **?**: "Action Quality Assessment using Transformers", arXiv, 2022 (*USC*). [[Paper](https://arxiv.org/abs/2207.12318)]
* Human Interaction:
* **IGFormer**: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (*The University of Melbourne*). [[Paper](https://arxiv.org/abs/2207.12100)]
* Cross-Domain:
* **UDAVT**: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (*University of Trento*). [[Paper](https://arxiv.org/abs/2207.12842)][[Code (in construction)](https://github.com/vturrisi/UDAVT)]
* **AutoLabel**: "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 (*University of Trento*). [[Paper](https://arxiv.org/abs/2304.01110)][[PyTorch](https://github.com/gzaraunitn/autolabel)]
* **DALL-V**: "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 (*University of Trento*). [[Paper](https://arxiv.org/abs/2308.09139)][[PyTorch](https://github.com/giaczara/dallv)]
* Multi-Camera Editing:
* **TC-Transformer**: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (*CUHK*). [[Paper](https://arxiv.org/abs/2210.08737)]
* Instructional/Procedural Video:
* **ProcedureVRL**: "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2303.17839)]
* **Paprika**: "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 (*Salesforce*). [[Paper](https://arxiv.org/abs/2303.18230)][[PyTorch](https://github.com/salesforce/paprika)]
* **StepFormer**: "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 (*Samsung*). [[Paper](https://arxiv.org/abs/2304.13265)]
* **E3P**: "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 (*Sun Yat-sen University*). [[Paper](https://arxiv.org/abs/2308.08885)]
* **VLaMP**: "Pretrained Language Models as Visual Planners for Human Assistance", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2304.09179)]
* **VINA**: "Learning to Ground Instructional Articles in Videos through Narrations", ICCV, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2306.03802)][[Website](https://eval.ai/web/challenges/challenge-page/2082/overview)]
* **PREGO**: "PREGO: online mistake detection in PRocedural EGOcentric videos", CVPR, 2024 (*Sapienza University of Rome, Italy*). [[Paper](https://arxiv.org/abs/2404.01933)][[Code (in construction)](https://github.com/aleflabo/PREGO)]
* Continual Learning:
* **PIVOT**: "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 (*KAUST*). [[Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Villa_PIVOT_Prompting_for_Video_Continual_Learning_CVPR_2023_paper.html)]
* 3D:
* **MaST-Pre**: "Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos", ICCV, 2023 (*CloudWalk, China*). [[Paper](https://arxiv.org/abs/2308.09245)][[PyTorch](https://github.com/JohnsonSign/MaST-Pre)]
* **EPIC-Fields**: "EPIC Fields: Marrying 3D Geometry and Video Understanding", NeurIPS, 2023 (*Oxford + Bristol*). [[Paper](https://arxiv.org/abs/2306.08731)][[Website](https://epic-kitchens.github.io/epic-fields/)]
* Audio-Video:
* **AVGN**: "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 (*KAIST*). [[Paper](https://arxiv.org/abs/2308.09322)]
* Event Camera:
* **EventTransAct**: "EventTransAct: A video transformer-based framework for Event-camera based action recognition", IROS, 2023 (*UCF*). [[Paper](https://arxiv.org/abs/2308.13711)][[PyTorch](https://github.com/tristandb8/EventTransAct)][[Website](https://tristandb8.github.io/EventTransAct_webpage/)]
* Long Video:
* **EgoSchema**: "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding", NeurIPS, 2023 (*Berkeley*). [[Paper](https://arxiv.org/abs/2308.09126)][[PyTorch](https://github.com/egoschema/EgoSchema)][[Website](https://egoschema.github.io/)]
* **KTS**: "Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding", arXiv, 2023 (*Meta*). [[Paper](https://arxiv.org/abs/2309.11569)]
* **TCR**: "Text-Conditioned Resampler For Long Form Video Understanding", arXiv, 2023 (*Google*). [[Paper](https://arxiv.org/abs/2312.11897)]
* **MC-ViT**: "Memory Consolidation Enables Long-Context Video Understanding", arXiv, 2024 (*DeepMind*). [[Paper](https://arxiv.org/abs/2402.05861)]
* **VideoAgent**: "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", arXiv, 2024 (*Stanford*). [[Paper](https://arxiv.org/abs/2403.10517)]
* Video Story:
* **YouTube-News-Timeline**: "Video Timeline Modeling For News Story Understanding", NeurIPS (Datasets and Benchmarks), 2023 (*Google*). [[Paper](https://arxiv.org/abs/2309.13446)][[GitHub](https://github.com/google-research/google-research/tree/master/video_timeline_modeling)]
* Analysis:
* **VTCD**: "Understanding Video Transformers via Universal Concept Discovery", arXiv, 2024 (*Toyota*). [[Paper](https://arxiv.org/abs/2401.10831)][[Website](https://yorkucvil.github.io/VTCD/)]

[[Back to Overview](#overview)]

---

## References
* Online Resources:
* [Papers with Code](https://paperswithcode.com/methods/category/vision-transformer)
* [Transformer tutorial (Lucas Beyer)](http://lucasb.eyer.be/transformer)
* [CS25: Transformers United (Course @ Stanford)](https://web.stanford.edu/class/cs25/)
* [The Annotated Transformer (Blog)](http://nlp.seas.harvard.edu/annotated-transformer/)
* [3D Vision with Transformers (GitHub)](https://github.com/lahoud/3d-vision-transformers)
* [Networks Beyond Attention (GitHub)](https://github.com/FocalNet/Networks-Beyond-Attention)
* [Practical Introduction to Transformers (GitHub)](https://github.com/IbrahimSobh/Transformers)
* [Awesome Transformer Architecture Search (GitHub)](https://github.com/automl/awesome-transformer-search)
* [Transformer-in-Vision (GitHub)](https://github.com/DirtyHarryLYL/Transformer-in-Vision)
* [Awesome Visual-Transformer (GitHub)](https://github.com/dk-liang/Awesome-Visual-Transformer)
* [Awesome Transformer for Vision Resources List (GitHub)](https://github.com/lijiaman/awesome-transformer-for-vision)
* [Transformer-in-Computer-Vision (GitHub)](https://github.com/Yangzhangcst/Transformer-in-Computer-Vision)
* [Transformer Tutorial in ICASSP 2022](https://transformer-tutorial.github.io/icassp2022/)