https://github.com/lxtGH/Awesome-Segmentation-With-Transformer

[Arxiv-04-2023] Transformer-Based Visual Segmentation: A Survey

         

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](https://github.com/lxtGH/Awesome-Segmentation-With-Transformer/pulls)






  
# Transformer-Based Visual Segmentation: A Survey

T-PAMI, 2024

Xiangtai Li (Project Lead) · Henghui Ding · Haobo Yuan · Wenwei Zhang · Guangliang Cheng · Jiangmiao Pang · Kai Chen · Ziwei Liu · Chen Change Loy




This repo records, tracks, and benchmarks recent transformer-based visual segmentation methods, as a supplement to our [survey](https://arxiv.org/abs/2304.09854).

If you find any work missing or have any suggestions (papers, implementations, or other resources), feel free to open a [pull request](https://github.com/lxtGH/Awesome-Segmentation-With-Transformer/pulls). We will add the missing papers to this repo ASAP.

### 🔥News

[-] Accepted by T-PAMI-2024.

[-] Added several CVPR-24 works in this direction. 2024-03. You are welcome to add your CVPR works to our repo!

[-] The third version is on arXiv: [survey](https://arxiv.org/abs/2304.09854). More benchmarks and methods are included! 2023-12.

[-] The second draft is on arXiv. 2023-06.

### 🔥Highlight!!

[1] Previous transformer surveys divide methods by task and setting. In contrast, we revisit and group existing transformer-based methods from a **technical perspective**.

[2] We survey the methods in two parts: one covers the mainstream tasks built on a DETR-like meta-architecture, and the other covers related directions, organized by task.

[3] We re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.

[4] We also include query-based detection transformers, because object queries unify the segmentation and detection tasks.

## Introduction

We present the first detailed survey of transformer-based segmentation.

![Survey pipeline](./figs/survey_pipeline.jpg)

## Summary of Contents

- [Methods: A Survey](#methods-a-survey)
    - [Meta-Architecture](#meta-architecture)
    - [Strong Representation](#strong-representation)
    - [Interaction Design in Decoder](#interaction-design-in-decoder)
    - [Optimizing Object Query](#optimizing-object-query)
    - [Using Query For Association](#using-query-for-association)
    - [Conditional Query Generation](#conditional-query-generation)
- [Related Domains and Beyond](#related-domains-and-beyond)
    - [Point Cloud Segmentation](#point-cloud-segmentation)
    - [Tuning Foundation Models](#tuning-foundation-models)
    - [Domain-aware Segmentation](#domain-aware-segmentation)
    - [Label and Model Efficient Segmentation](#label-and-model-efficient-segmentation)
    - [Class Agnostic Segmentation and Tracking](#class-agnostic-segmentation-and-tracking)
    - [Medical Image Segmentation](#medical-image-segmentation)

## Methods: A Survey

### Meta-Architecture

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2020 | ECCV | DETR | [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) | [Code](https://github.com/facebookresearch/detr) |
| 2021 | ICLR | Deformable DETR | [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) | [Code](https://github.com/fundamentalvision/Deformable-DETR) |
| 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | NeurIPS | MaskFormer | [MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation](http://arxiv.org/abs/2107.06278) | [Code](https://github.com/facebookresearch/MaskFormer) |
| 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net) |
| 2023 | CVPR | Lite-DETR | [Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR](https://arxiv.org/pdf/2303.07335) | [Code](https://github.com/IDEA-Research/Lite-DETR) |

### Strong Representation

#### Better ViTs Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2021 | CVPR | SETR | [Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers](https://arxiv.org/abs/2012.15840) | [Code](https://github.com/fudan-zvg/SETR) |
| 2021 | ICCV | MViT-V1 | [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) | [Code](https://github.com/facebookresearch/mvit) |
| 2022 | CVPR | MViT-V2 | [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526) | [Code](https://github.com/facebookresearch/mvit) |
| 2021 | NeurIPS | XCiT | [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681) | [Code](https://github.com/facebookresearch/xcit) |
| 2021 | ICCV | Pyramid ViT | [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122) | [Code](https://github.com/whai362/PVT) |
| 2021 | ICCV | CrossViT | [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899) | [Code](https://github.com/IBM/CrossViT) |
| 2021 | ICCV | CoaT | [Co-Scale Conv-Attentional Image Transformers](https://arxiv.org/abs/2104.06399) | [Code](https://github.com/mlpc-ucsd/CoaT) |
| 2022 | CVPR | MPViT | [MPViT: Multi-Path Vision Transformer for Dense Prediction](https://arxiv.org/abs/2112.11010) | [Code](https://github.com/youngwanLEE/MPViT) |
| 2022 | NeurIPS | SegViT | [SegViT: Semantic Segmentation with Plain Vision Transformers](https://arxiv.org/abs/2210.05844) | [Code](https://github.com/zbwxp/SegVit) |
| 2022 | arXiv | RSSeg | [Representation Separation for Semantic Segmentation with Vision Transformers](https://arxiv.org/abs/2212.13764) | N/A |

#### Hybrid CNNs/Transformers/MLPs

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2021 | ICCV | Swin | [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) | [Code](https://github.com/microsoft/Swin-Transformer) |
| 2022 | CVPR | Swin-v2 | [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) | [Code](https://github.com/microsoft/Swin-Transformer) |
| 2021 | NeurIPS | SegFormer | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | [Code](http://github.com/NVlabs/SegFormer) |
| 2022 | CVPR | CMT | [CMT: Convolutional Neural Networks Meet Vision Transformers](https://arxiv.org/abs/2107.06263) | [Code](https://github.com/FlyEgle/CMT-pytorch) |
| 2021 | NeurIPS | Twins | [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](https://arxiv.org/abs/2104.13840) | [Code](https://github.com/Meituan-AutoML/Twins) |
| 2021 | ICCV | CvT | [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) | [Code](https://github.com/microsoft/CvT) |
| 2021 | NeurIPS | ViTAE | [ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias](https://arxiv.org/abs/2106.03348) | [Code](https://github.com/ViTAE-Transformer/ViTAE-Transformer) |
| 2022 | CVPR | ConvNeXt | [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) | [Code](https://github.com/facebookresearch/ConvNeXt) |
| 2022 | NeurIPS | SegNeXt | [SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation](https://github.com/visual-attention-network/segnext) | [Code](https://github.com/visual-attention-network/segnext) |
| 2022 | CVPR | PoolFormer | [PoolFormer: MetaFormer Is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) | [Code](https://github.com/sail-sg/poolformer) |
| 2023 | ICLR | STM | [Demystify Transformers & Convolutions in Modern Image Deep Networks](https://arxiv.org/abs/2211.05781) | [Code](https://github.com/OpenGVLab/STM-Evaluation) |

#### Self-Supervised Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2021 | ICCV | MoCo v3 | [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057) | [Code](https://github.com/facebookresearch/moco-v3) |
| 2022 | ICLR | BEiT | [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) | [Code](https://github.com/microsoft/unilm/tree/master/beit) |
| 2022 | CVPR | MaskFeat | [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133) | [Code](https://github.com/facebookresearch/SlowFast) |
| 2022 | CVPR | MAE | [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) | [Code](https://github.com/facebookresearch/mae) |
| 2022 | NeurIPS | ConvMAE | [MCMAE: Masked Convolution Meets Masked Autoencoders](https://arxiv.org/abs/2303.05475) | [Code](https://github.com/Alpha-VL/ConvMAE) |
| 2023 | ICLR | SparK | [SparK: the first successful BERT/MAE-style pretraining on any convolutional networks](https://github.com/keyu-tian/SparK) | [Code](https://github.com/keyu-tian/SparK) |
| 2023 | CVPR | FLIP | [Scaling Language-Image Pre-training via Masking](https://arxiv.org/abs/2212.00794) | [Code](https://github.com/facebookresearch/flip) |
| 2023 | arXiv | ConvNeXt V2 | [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) | [Code](https://github.com/facebookresearch/ConvNeXt-V2) |

### Interaction Design in Decoder

#### Improved Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2021 | CVPR | Sparse R-CNN | [Sparse R-CNN: End-to-End Object Detection with Learnable Proposals](https://arxiv.org/abs/2011.12450) | [Code](https://github.com/PeizeSun/SparseR-CNN) |
| 2022 | CVPR | AdaMixer | [AdaMixer: A Fast-Converging Query-Based Object Detector](https://arxiv.org/abs/2203.16507) | [Code](https://github.com/MCG-NJU/AdaMixer) |
| 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net/) |
| 2022 | CVPR | Mask2Former | [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) | [Code](https://github.com/facebookresearch/Mask2Former) |
| 2022 | ECCV | kMaX-DeepLab | [k-means Mask Transformer](https://arxiv.org/abs/2207.04044) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | ICCV | QueryInst | [Instances as Queries](https://arxiv.org/abs/2105.01928) | [Code](https://github.com/hustvl/QueryInst) |
| 2021 | arXiv | ISTR | [ISTR: End-to-End Instance Segmentation via Transformers](https://arxiv.org/abs/2105.00637) | [Code](https://github.com/hujiecpp/ISTR) |
| 2021 | NeurIPS | SOLQ | [SOLQ: Segmenting Objects by Learning Queries](https://arxiv.org/abs/2106.02351) | [Code](https://github.com/megvii-research/SOLQ) |
| 2022 | CVPR | Panoptic SegFormer | [Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers](https://arxiv.org/abs/2109.03814) | [Code](https://github.com/zhiqi-li/Panoptic-SegFormer) |
| 2022 | CVPR | CMT-DeepLab | [CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation](https://arxiv.org/abs/2206.08948) | N/A |
| 2022 | CVPR | SparseInst | [Sparse Instance Activation for Real-Time Instance Segmentation](https://arxiv.org/abs/2203.12827) | [Code](https://github.com/hustvl/SparseInst) |
| 2022 | CVPR | SAM-DETR | [Accelerating DETR Convergence via Semantic-Aligned Matching](https://arxiv.org/abs/2203.06883) | [Code](https://github.com/ZhangGongjie/SAM-DETR) |
| 2021 | ICCV | SMCA-DETR | [Fast Convergence of DETR with Spatially Modulated Co-Attention](https://arxiv.org/abs/2101.07448) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
| 2021 | BMVC | ACT-DETR | [End-to-End Object Detection with Adaptive Clustering Transformer](https://www.bmvc2021-virtualconference.com/assets/papers/0709.pdf) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
| 2021 | ICCV | Dynamic DETR | [Dynamic DETR: End-to-End Object Detection with Dynamic Attention](https://ieeexplore.ieee.org/document/9709981) | N/A |
| 2022 | ICLR | Sparse DETR | [Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity](https://arxiv.org/abs/2111.14330) | [Code](https://github.com/kakaobrain/sparse-detr) |
| 2023 | CVPR | FastInst | [FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation](https://arxiv.org/abs/2303.08594) | [Code](https://github.com/junjiehe96/FastInst) |

#### Spatial-Temporal Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2021 | CVPR | VisTR | [VisTR: End-to-End Video Instance Segmentation with Transformers](https://arxiv.org/abs/2011.14503) | [Code](https://github.com/Epiphqny/VisTR) |
| 2021 | NeurIPS | IFC | [Video Instance Segmentation using Inter-Frame Communication Transformers](https://arxiv.org/abs/2106.03299) | [Code](https://github.com/sukjunhwang/IFC) |
| 2022 | CVPR | SlotVPS | [Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation](https://arxiv.org/abs/2112.08949) | N/A |
| 2022 | CVPR | TubeFormer-DeepLab | [TubeFormer-DeepLab: Video Mask Transformer](https://arxiv.org/abs/2205.15361) | N/A |
| 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
| 2022 | CVPR | TeViT | [Temporally Efficient Vision Transformer for Video Instance Segmentation](https://arxiv.org/abs/2204.08412) | [Code](https://github.com/hustvl/TeViT) |
| 2022 | ECCV | SeqFormer | [SeqFormer: Sequential Transformer for Video Instance Segmentation](https://arxiv.org/abs/2112.08275) | [Code](https://github.com/wjf5203/SeqFormer) |
| 2022 | arXiv | Mask2Former-VIS | [Mask2Former for Video Instance Segmentation](https://arxiv.org/abs/2112.10764) | [Code](https://github.com/facebookresearch/Mask2Former) |
| 2022 | PAMI | TransVOD | [TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers](https://arxiv.org/abs/2201.05047) | [Code](https://github.com/SJTU-LuHe/TransVOD) |
| 2022 | NeurIPS | VITA | [VITA: Video Instance Segmentation via Object Token Association](https://arxiv.org/abs/2206.04403) | [Code](https://github.com/sukjunhwang/VITA) |

### Optimizing Object Query

#### Adding Position Information into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2021 | ICCV | Conditional-DETR | [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
| 2022 | arXiv | Conditional-DETR-v2 | [Conditional DETR V2: Efficient Detection Transformer with Box Queries](https://arxiv.org/abs/2207.08914) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
| 2022 | AAAI | Anchor DETR | [Anchor DETR: Query Design for Transformer-Based Detector](https://arxiv.org/abs/2109.07107) | [Code](https://github.com/megvii-model/AnchorDETR) |
| 2022 | ICLR | DAB-DETR | [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://arxiv.org/abs/2201.12329) | [Code](https://github.com/SlongLiu/DAB-DETR) |
| 2021 | arXiv | Efficient DETR | [Efficient DETR: Improving End-to-End Object Detector with Dense Prior](https://arxiv.org/abs/2104.01318) | N/A |

#### Adding Extra Supervision into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2022 | ECCV | DE-DETR | [Towards Data-Efficient Detection Transformers](https://arxiv.org/abs/2203.09507) | [Code](https://github.com/encounter1997/DE-DETRs) |
| 2022 | CVPR | DN-DETR | [DN-DETR: Accelerate DETR Training by Introducing Query DeNoising](https://arxiv.org/abs/2203.01305) | [Code](https://github.com/IDEA-opensource/DN-DETR) |
| 2023 | ICLR | DINO | [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605) | [Code](https://github.com/IDEA-Research/DINO) |
| 2023 | CVPR | MP-Former | [MP-Former: Mask-Piloted Transformer for Image Segmentation](https://arxiv.org/abs/2303.07336) | [Code](https://github.com/IDEA-Research/MP-Former) |
| 2023 | CVPR | Mask-DINO | [Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation](https://arxiv.org/abs/2206.02777) | [Code](https://github.com/IDEACVR/MaskDINO) |
| 2022 | NeurIPS | N/A | [Learning Equivariant Segmentation with Instance-Unique Querying](https://arxiv.org/abs/2210.00911) | [Code](https://github.com/JamesLiang819/Instance_Unique_Querying) |
| 2023 | CVPR | H-DETR | [DETRs with Hybrid Matching](https://arxiv.org/abs/2207.13080) | [Code](https://github.com/HDETR) |
| 2023 | ICCV | Group-DETR | [Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment](https://arxiv.org/abs/2207.13085) | N/A |
| 2023 | ICCV | Co-DETR | [DETRs with Collaborative Hybrid Assignments Training](https://arxiv.org/abs/2211.12860) | [Code](https://github.com/Sense-X/Co-DETR) |

### Using Query For Association

#### Query as Instance Association

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2022 | CVPR | TrackFormer | [TrackFormer: Multi-Object Tracking with Transformer](https://arxiv.org/abs/2101.02702) | [Code](https://github.com/timmeinhardt/trackformer) |
| 2021 | arXiv | TransTrack | [TransTrack: Multiple Object Tracking with Transformer](https://arxiv.org/abs/2012.15460) | [Code](https://github.com/PeizeSun/TransTrack) |
| 2022 | ECCV | MOTR | [MOTR: End-to-End Multiple-Object Tracking with TRansformer](https://arxiv.org/abs/2105.03247) | [Code](https://github.com/megvii-research/MOTR) |
| 2022 | NeurIPS | MinVIS | [MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training](https://arxiv.org/abs/2208.02245) | [Code](https://github.com/NVlabs/MinVIS) |
| 2022 | ECCV | IDOL | [In Defense of Online Models for Video Instance Segmentation](https://arxiv.org/abs/2207.10661) | [Code](https://github.com/wjf5203/VNext) |
| 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
| 2023 | CVPR | GenVIS | [A Generalized Framework for Video Instance Segmentation](https://arxiv.org/abs/2211.08834) | [Code](https://github.com/miranheo/GenVIS) |
| 2023 | ICCV | Tube-Link | [Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/lxtGH/Tube-Link) |
| 2023 | ICCV | CTVIS | [CTVIS: Consistent Training for Online Video Instance Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/KainingYing/CTVIS) |
| 2023 | CVPR-W | Video-kMaX | [Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation](https://arxiv.org/abs/2304.04694) | N/A |

#### Query as Linking Multi-Tasks

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------:|-------------|--------------|
| 2022 | ECCV | Panoptic-PartFormer | [Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation](https://arxiv.org/abs/2204.04655) | [Code](https://github.com/lxtGH/Panoptic-PartFormer) |
| 2022 | ECCV | PolyphonicFormer | [PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation](https://arxiv.org/abs/2112.02582) | [Code](https://github.com/HarborYuan/PolyphonicFormer) |
| 2022 | CVPR | PanopticDepth | [PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation](https://arxiv.org/abs/2206.00468) | [Code](https://github.com/NaiyuGao/PanopticDepth) |
| 2022 | ECCV | Fashionformer | [Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition](https://arxiv.org/abs/2204.04654) | [Code](https://github.com/xushilin1/FashionFormer) |
| 2022 | ECCV | InvPT | [InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding](https://arxiv.org/abs/2203.07997) | [Code](https://github.com/prismformore/InvPT) |
| 2023 | CVPR | UNINEXT | [Universal Instance Perception as Object Discovery and Retrieval](https://arxiv.org/abs/2303.06674) | [Code](https://github.com/MasterBin-IIAU/UNINEXT) |
| 2024 | CVPR | GLEE | [GLEE: General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158) | [Code](https://glee-vision.github.io/) |
| 2024 | CVPR | UniVS | [UniVS: Unified and Universal Video Segmentation with Prompts as Queries](https://arxiv.org/abs/2402.18115) | [Code](https://github.com/MinghanLi/UniVS) |
| 2024 | CVPR | OMG-Seg | [OMG-Seg: Is One Model Good Enough For All Segmentation?](https://arxiv.org/abs/2401.10229) | [Code](https://github.com/lxtGH/OMG-Seg) |

### Conditional Query Generation

#### Conditional Query Fusion on Language Features

| Year | Venue |    Acronym     | Paper Title                                                                                                              | Code/Project                                                       |

|:----:|:-----:|:--------------:|--------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|

| 2021 | ICCV  |      VLT       | [Vision-Language Transformer and Query Generation for Referring Segmentation](https://arxiv.org/abs/2108.05565)          | [Code](https://github.com/henghuiding/Vision-Language-Transformer) |

| 2022 | CVPR  |      LAVT      | [Lavt: Language-aware vision transformer for referring image segmentation](https://arxiv.org/abs/2112.02244)             | [Code](https://github.com/yz93/LAVT-RIS)                           |

| 2022 | CVPR  |     Restr      | [Restr:Convolution-free referring image segmentation using transformers](https://arxiv.org/abs/2203.16768)               | N/A                                                                |

| 2022 | CVPR  |      Cris      | [Cris: Clip-driven referring image segmentation](https://arxiv.org/abs/2111.15174)                                       | [Code](https://github.com/DerrickWang005/CRIS.pytorch)             |

| 2022 | CVPR  |      MTTR      | [End-to-End Referring Video Object Segmentation with Multimodal Transformers](https://arxiv.org/abs/2111.14821)          | [Code](https://github.com/mttr2021/MTTR)                           |

| 2022 | CVPR  |      LBDT      | [Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation](https://arxiv.org/abs/2206.03789) | [Code](https://github.com/dzh19990407/LBDT)                        |

| 2022 | CVPR  |  ReferFormer   | [Language as Queries for Referring Video Object Segmentation](https://arxiv.org/abs/2201.00487)                          | [Code](https://github.com/wjn922/ReferFormer)                      |

| 2024 | CVPR  | MaskGrounding  | [Mask Grounding for Referring Image Segmentation](https://arxiv.org/abs/2312.12198)                           | [Code](https://yxchng.github.io/projects/mask-grounding/)                      |

#### Conditional Query Fusion on Cross Image Features

| Year |  Venue  |     Acronym     | Paper Title                                                                                                                                                     | Code/Project                                     |

|:----:|:-------:|:---------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|

| 2021 | NeurIPS |      CyCTR      | [Few-Shot Segmentation via Cycle-Consistent Transformer](https://arxiv.org/abs/2106.02320)                                                                      | [Code](https://github.com/GengDavid/CyCTR)       |

| 2022 |  CVPR   |   MatteFormer   | [MatteFormer: Transformer-Based Image Matting via Prior-Tokens](https://arxiv.org/abs/2203.15662)                                                               | [Code](https://github.com/webtoon/matteformer)   |

| 2022 |  ECCV   |   Segdeformer   | [A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880617.pdf) | [Code](https://github.com/lygsbw/segdeformer)    |

| 2022 |  arXiv  |   StructToken   | [StructToken: Rethinking Semantic Segmentation with Structural Prior](https://arxiv.org/abs/2203.12612)                                                         | N/A                                              |

| 2022 | NeurIPS |    MM-Former    | [Mask Matching Transformer for Few-Shot Segmentation](https://arxiv.org/abs/2301.01208)                                                                         | [Code](https://github.com/jiaosiyu1999/mmformer) |

| 2022 |  ECCV   |    AAFormer     | [Adaptive Agent Transformer for Few-shot Segmentation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890035.pdf)                                  | N/A                                              |

| 2023 |  arXiv  | ReferenceTwice  | [Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation](https://arxiv.org/abs/2301.01156)                                           | [Code](https://github.com/hanyue1648/RefT)       |

### Tuning Foundation Models

#### Vision Adapter

| Year | Venue |   Acronym   | Paper Title                                                                                                  | Code/Project                                                     |

|:----:|:-----:|:-----------:|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|

| 2022 | CVPR  |   CoCoOp    | [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557)                   | [Code](https://github.com/KaiyangZhou/CoOp)                      |

| 2022 | ECCV  | Tip-Adapter | [Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification](https://arxiv.org/abs/2111.03930)  | [Code](https://github.com/gaopengcuhk/Tip-Adapter)               |

| 2022 | ECCV  |     EVL     | [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/abs/2208.03550)                          | [Code](https://github.com/OpenGVLab/efficient-video-recognition) |

| 2023 | ICLR  | ViT-Adapter | [Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534)                         | [Code](https://github.com/czczup/ViT-Adapter)                    |

| 2022 | CVPR  |  DenseCLIP  | [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/abs/2112.01518) | [Code](https://github.com/raoyongming/DenseCLIP)                 |

| 2022 | CVPR  |   CLIPSeg   | [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003)                          | [Code](https://eckerlab.org/code/clipseg)                        |

| 2023 | CVPR  |  OneFormer  | [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220)          | [Code](https://github.com/SHI-Labs/OneFormer)                    |

#### Open Vocabulary Learning

| Year | Venue |  Acronym  | Paper Title                                                                                                                                | Code/Project                                                                                     |

|:----:|:-----:|:---------:|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|

| 2021 | CVPR  |  OVR-CNN  | [Open-Vocabulary Object Detection Using Captions](https://arxiv.org/abs/2011.10678)                                                        | [Code](https://github.com/alirezazareian/ovr-cnn)                                                |

| 2022 | ICLR  |   ViLD    | [Open-vocabulary Object Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921)                        | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)    |

| 2022 | ECCV  |   Detic   | [Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605)                                        | [Code](https://github.com/facebookresearch/Detic)                                                |

| 2022 | ECCV  |  OV-DETR  | [Open-Vocabulary DETR with Conditional Matching](https://arxiv.org/abs/2203.11876)                                                         | [Code](https://github.com/yuhangzang/OV-DETR)                                                    |

| 2023 | ICLR  |   F-VLM   | [F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models](https://arxiv.org/abs/2209.15639)                         | [Code](https://sites.google.com/view/f-vlm/home)                                                 |

| 2022 | ECCV  |   MViT    | [Class-agnostic Object Detection with Multi-modal Transformer](https://arxiv.org/abs/2111.11430)                                           | [Code](https://github.com/mmaaz60/mvits_for_class_agnostic_od)                                   |

| 2022 | ECCV  |  OpenSeg  | [Scaling Open-Vocabulary Image Segmentation with Image-Level Labels](https://arxiv.org/abs/2112.12143)                                     | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/openseg) |

| 2022 | ICLR  |   LSeg    | [Language-driven Semantic Segmentation](https://arxiv.org/abs/2201.03546)                                                                  | [Code](https://github.com/isl-org/lang-seg)                                                      |

| 2022 | ECCV  |  SimSeg   | [A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model](https://arxiv.org/abs/2112.14757)     | [Code](https://github.com/MendelXu/zsseg.baseline)                                               |

| 2022 | ECCV  | MaskCLIP  | [Extract Free Dense Labels from CLIP](https://arxiv.org/abs/2112.01071)                                                                    | [Code](https://github.com/chongzhou96/MaskCLIP)                                                  |

| 2021 | ICCV  |    UVO    | [Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation](https://arxiv.org/abs/2104.04691)                             | [Project](https://sites.google.com/view/unidentified-video-object)                               |

| 2023 | arXiv |    CGG    | [Betrayed-by-Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation](https://arxiv.org/abs/2301.00805) | [Code](https://github.com/jzwu48033552/betrayed-by-captions)                                     |

| 2022 | TPAMI |    ES     | [Open-World Entity Segmentation](https://arxiv.org/abs/2107.14228)                                                                         | [Code](https://github.com/dvlab-research/Entity/)                                                |

| 2022 | CVPR  |  OW-DETR  | [OW-DETR: Open-world Detection Transformer](https://arxiv.org/abs/2112.01513)                                                              | [Code](https://github.com/akshitac8/OW-DETR)                                                     |

| 2023 | CVPR  |   PROB    | [PROB: Probabilistic Objectness for Open World Object Detection](https://arxiv.org/abs/2212.01424)                                         | [Code](https://github.com/orrzohar/PROB)                                                         |

### Related Domains and Beyond

#### Point Cloud Segmentation

| Year |  Venue  |        Acronym         | Paper Title                                                                                                               | Code/Project                                                     |

|:----:|:-------:|:----------------------:|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|

| 2021 |  ICCV   |   Point Transformer    | [Point Transformer](https://arxiv.org/abs/2012.09164)                                                                     | N/A                                                              |

| 2021 |   CVM   |          PCT           | [PCT: Point cloud transformer](https://arxiv.org/abs/2012.09688)                                                          | [Code](https://github.com/MenghaoGuo/PCT)                        |

| 2022 |  CVPR   | Stratified Transformer | [Stratified Transformer for 3D Point Cloud Segmentation](https://arxiv.org/abs/2203.14508)                                | [Code](https://github.com/dvlab-research/Stratified-Transformer) |

| 2022 |  CVPR   |       Point-BERT       | [Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling](https://arxiv.org/abs/2111.14819)       | [Code](https://github.com/lulutang0608/Point-BERT)               |

| 2022 |  ECCV   |       Point-MAE        | [Masked Autoencoders for Point Cloud Self-supervised Learning](https://arxiv.org/abs/2203.06604)                          | [Code](https://github.com/Pang-Yatian/Point-MAE)                 |

| 2022 | NeurIPS |       Point-M2AE       | [Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training](https://arxiv.org/abs/2205.14401) | [Code](https://github.com/ZrrSkywalker/Point-M2AE)               |

| 2023 |  ICRA   |         Mask3D         | [Mask3D for 3D Semantic Instance Segmentation](https://arxiv.org/abs/2210.03105)                                          | [Code](https://github.com/JonasSchult/Mask3D)                    |

| 2023 |  AAAI   |        SPFormer        | [Superpoint Transformer for 3D Scene Instance Segmentation](https://arxiv.org/abs/2211.15766)                             | [Code](https://github.com/sunjiahao1999/SPFormer)                |

| 2023 |  AAAI   |          PUPS          | [PUPS: Point Cloud Unified Panoptic Segmentation](https://arxiv.org/abs/2302.06185)                                       | N/A                                                              |

#### Domain-aware Segmentation

| Year | Venue  |    Acronym    | Paper Title                                                                                                                                                                                     | Code/Project                                            |

|:----:|:------:|:-------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|

| 2022 |  CVPR  |   DAFormer    | [DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2111.14887)                                                 | [Code](https://github.com/lhoyer/DAFormer)              |

| 2022 |  ECCV  |     HRDA      | [HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2204.13132)                                                                                   | [Code](https://github.com/lhoyer/HRDA)                  |

| 2023 |  CVPR  |      MIC      | [MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation](https://arxiv.org/abs/2212.01322)                                                                                        | [Code](https://github.com/lhoyer/MIC)                   |

| 2021 | ACM MM |      SFA      | [Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers](https://arxiv.org/abs/2107.12636)                                                                             | [Code](https://github.com/encounter1997/SFA)            |

| 2023 |  CVPR  |    DA-DETR    | [DA-DETR: Domain Adaptive Detection Transformer with Information Fusion](https://arxiv.org/abs/2103.17084)                                                                                      | N/A                                                     |

| 2022 |  ECCV  |    MTTrans    | [MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer](https://arxiv.org/abs/2205.01643)                                                                                        | [Code](https://github.com/Lafite-Yu/MTTrans-OpenSource) |

| 2022 | arXiv  | Sentence-Seg  | [The Devil is in the Labels: Semantic Segmentation from Sentences](https://arxiv.org/abs/2202.02002)                                                                                            | N/A                                                     |

| 2023 |  ICLR  |     LMSeg     | [LMSeg: Language-guided Multi-dataset Segmentation](https://arxiv.org/abs/2302.13495)                                                                                                           | N/A                                                     |

| 2022 |  CVPR  |    UniDet     | [Simple Multi-Dataset Detection](https://arxiv.org/abs/2102.13086)                                                                                                                              | [Code](https://github.com/xingyizhou/UniDet)            |

| 2023 |  CVPR  | Detection Hub | [Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding](https://arxiv.org/abs/2206.03484)                                                                | N/A                                                     |

| 2022 |  CVPR  |      WD2      | [Unifying Panoptic Segmentation for Autonomous Driving](https://openaccess.thecvf.com/content/CVPR2022/papers/Zendel_Unifying_Panoptic_Segmentation_for_Autonomous_Driving_CVPR_2022_paper.pdf) | [Data](https://github.com/ozendelait/wilddash_scripts)  |

| 2023 | arXiv  |    TarViS     | [TarViS: A Unified Approach for Target-based Video Segmentation](https://arxiv.org/abs/2301.02657)                                                                                              | [Code](https://github.com/Ali2500/TarViS)               |

#### Label and Model Efficient Segmentation

| Year |  Venue  |   Acronym   | Paper Title                                                                                                                                    | Code/Project                                       |

|:----:|:-------:|:-----------:|------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------|

| 2022 |  CVPR   |  MCTformer  | [Multi-class Token Transformer for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2203.02891)                                  | [Code](https://github.com/xulianuwa/MCTformer)     |

| 2020 |  CVPR   |     PCM     | [Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2004.04581)                | [Code](https://github.com/YudeWang/SEAM)           |

| 2022 |  ECCV   |   ViT-PCM   | [Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation](https://arxiv.org/abs/2210.17400) | [Code](https://github.com/deepplants/ViT-PCM)      |

| 2021 |  ICCV   |    DINO     | [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294)                                                 | [Code](https://github.com/facebookresearch/dino)   |

| 2021 |  BMVC   |    LOST     | [Localizing Objects with Self-Supervised Transformers and no Labels](https://arxiv.org/abs/2109.14279)                                         | [Code](https://github.com/valeoai/LOST)            |

| 2022 |  ICLR   |    STEGO    | [Unsupervised Semantic Segmentation by Distilling Feature Correspondences](https://arxiv.org/abs/2203.08414)                                   | [Code](https://github.com/mhamilton723/STEGO)      |

| 2022 | NeurIPS |    ReCo     | [ReCo: Retrieve and Co-segment for Zero-shot Transfer](https://arxiv.org/abs/2206.07045)                                                       | [Code](https://github.com/NoelShin/reco)           |

| 2022 |  arXiv  | MaskDistill | [Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation](https://arxiv.org/abs/2206.06363)                          | N/A                                                |

| 2022 |  CVPR   |  FreeSOLO   | [FreeSOLO: Learning to Segment Objects without Annotations](https://arxiv.org/abs/2202.12181)                                                  | [Code](https://github.com/NVlabs/FreeSOLO)         |

| 2023 |  CVPR   |   CutLER    | [Cut and Learn for Unsupervised Object Detection and Instance Segmentation](https://arxiv.org/abs/2301.11320)                                  | [Code](https://github.com/facebookresearch/CutLER) |

| 2022 |  CVPR   |  TokenCut   | [Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut](https://arxiv.org/abs/2202.11539)                        | [Code](https://github.com/YangtaoWANG95/TokenCut)  |

| 2022 |  ICLR   |  MobileViT  | [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178)                           | [Code](https://github.com/apple/ml-cvnets)         |

| 2023 |  arXiv  |     EMO     | [Rethinking Mobile Block for Efficient Neural Models](https://arxiv.org/abs/2301.01146)                                                        | [Code](https://github.com/zhangzjn/EMO)            |

| 2022 |  CVPR   |  TopFormer  | [TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2204.05525)                                      | [Code](https://github.com/hustvl/TopFormer)        |

| 2023 |  ICLR   |  SeaFormer  | [SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2301.13156)                             | [Code](https://github.com/fudan-zvg/SeaFormer)     |

#### Class Agnostic Segmentation and Tracking

| Year |  Venue  |   Acronym   | Paper Title                                                                                                                              | Code/Project                                   |

|:----:|:-------:|:-----------:|------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|

| 2022 |  CVPR   | Transfiner  | [Mask Transfiner for High-Quality Instance Segmentation](https://arxiv.org/abs/2111.13673)                                               | [Code](https://github.com/SysCV/transfiner)    |

| 2022 |  ECCV   |     VMT     | [Video Mask Transfiner for High-Quality Video Instance Segmentation](https://arxiv.org/abs/2207.14012)                                   | [Code](https://github.com/SysCV/vmt)           |

| 2022 |  arXiv  | SimpleClick | [SimpleClick: Interactive Image Segmentation with Simple Vision Transformers](https://arxiv.org/abs/2210.11006)                          | [Code](https://github.com/uncbiag/simpleclick) |

| 2023 |  ICLR   |  PatchDCT   | [PatchDCT: Patch Refinement for High Quality Instance Segmentation](https://arxiv.org/abs/2302.02693)                                    | [Code](https://github.com/olivia-w12/PatchDCT) |

| 2019 |  ICCV   |     STM     | [Video Object Segmentation using Space-Time Memory Networks](https://arxiv.org/abs/1904.00607)                                           | [Code](https://github.com/seoungwugoh/STM)     |

| 2021 | NeurIPS |     AOT     | [Associating Objects with Transformers for Video Object Segmentation](https://arxiv.org/abs/2106.02638)                                  | [Code](https://github.com/z-x-yang/AOT)        |

| 2021 | NeurIPS |    STCN     | [Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation](https://arxiv.org/abs/2106.05210) | [Code](https://github.com/hkchengrex/STCN)     |

| 2022 |  ECCV   |    XMem     | [XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model](https://arxiv.org/abs/2207.07115)                     | [Code](https://hkchengrex.github.io/XMem)      |

| 2022 |  CVPR   |    PCVOS    | [Per-Clip Video Object Segmentation](https://arxiv.org/abs/2208.01924)                                                                   | [Code](https://github.com/pkyong95/PCVOS)      |

| 2023 |  CVPR   |     N/A     | [Look Before You Match: Instance Understanding Matters in Video Object Segmentation](https://arxiv.org/abs/2212.06826)                   | N/A                                            |

#### Medical Image Segmentation

| Year |     Venue     |  Acronym  | Paper Title                                                                                                     | Code/Project                                                                    |

|:----:|:-------------:|:---------:|-----------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|

| 2020 |     BIBM     | CellDETR | [Attention-Based Transformers for Instance Segmentation of Cells in Microstructures](https://arxiv.org/abs/2011.09763) | [Code](https://github.com/ChristophReich1996/Cell-DETR)                                  |

| 2021 |     arXiv     | TransUNet | [TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation](https://arxiv.org/abs/2102.04306) | [Code](https://github.com/Beckschen/TransUNet)                                  |

| 2022 | ECCV Workshop | Swin-Unet | [Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation](https://arxiv.org/abs/2105.05537)        | [Code](https://github.com/HuCaoFighting/Swin-Unet)                              |

| 2021 |    MICCAI     | TransFuse | [TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation](https://arxiv.org/abs/2102.08005)      | [Code](https://github.com/Rayicer/TransFuse)                                    |

| 2022 |     WACV      |   UNETR   | [UNETR: Transformers for 3D Medical Image Segmentation](https://arxiv.org/abs/2103.10504)                       | [Code](https://github.com/Project-MONAI/research-contributions/tree/main/UNETR) |

## Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

```bibtex
@article{li2023transformer,
    author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
    title={Transformer-Based Visual Segmentation: A Survey},
    journal={T-PAMI},
    year={2024}
}
```

## Contact

```
[email protected] (main)
[email protected]
```

## Related Repos for Segmentation and Detection

Attention Model [Repo](https://github.com/cmhungsteve/Awesome-Transformer-Attention) by Min-Hung (Steve) Chen.

Detection Transformer [Repo](https://github.com/IDEA-Research/awesome-detection-transformer) by IDEA.

Open Vocabulary Learning [Repo](https://github.com/jianzongwu/Awesome-Open-Vocabulary) by PKU and NTU.