https://github.com/gaomingqi/awesome-video-object-segmentation

🔥 Latest advances in Video Object Segmentation (VOS) – papers, datasets, and projects.

List: awesome-video-object-segmentation

audio-visual-segmentation awesome-papers awesome-papers-for-video-object-segmentation referring-video-object-segmentation semi-supervised-video-object-segmentation video-matting video-object-segmentation video-reasoning-segmentation



Latest Advances in Video Object Segmentation (VOS). VOS works before 2022 can be found in our survey paper:

>Deep Learning for Video Object Segmentation: A Review / [paper](https://link.springer.com/content/pdf/10.1007/s10462-022-10176-7.pdf) / [project page](https://github.com/gaomingqi/VOS-Review)

BibTeX:

    @article{gao2023deep,
      title={Deep learning for video object segmentation: a review},
      author={Gao, Mingqi and Zheng, Feng and Yu, James JQ and Shan, Caifeng and Ding, Guiguang and Han, Jungong},
      journal={Artificial Intelligence Review},
      volume={56},
      number={1},
      pages={457--531},
      year={2023},
      publisher={Springer}
    }

---

:teddy_bear: We mark different VOS tasks with coloured squares:


:blue_square: `SVOS`: semi-supervised VOS

:orange_square: `RVOS`: referring VOS

:green_square: `UVOS`: unsupervised VOS

:red_square: `AVOS`: audio-visual VOS

:diamond_shape_with_a_dot_inside: `VMAT`: video matting

:white_large_square: `XVOS`: other types of VOS

:teddy_bear: Please feel free to send us pull requests to add VOS works.

---

Links for a quick jump: [ArXiv (within 6 months)](#arxiv), 🔥[ICLR 2026](#iclr26)🔥, [AAAI 2026](#aaai26), [NeurIPS 2025](#nips25), [ACM MM 2025](#mm25), [SIGGRAPH 2025](#sig25), [ICCV 2025](#iccv25), [CVPR 2025](#cvpr25), [ICLR 2025](#iclr25), [AAAI 2025](#aaai25), [Journals 2025](#j25), [Earlier ArXiv 2025](#a25), [NeurIPS 2024](#nips24), [ACMMM 2024](#acmmm24), [ECCV 2024](#eccv24), [CVPR 2024](#cvpr24), [AAAI 2024](#aaai24), [Journals 2024](#j24), [Earlier ArXiv 2024](#earxiv24), [EMNLP 2023](#emnlp23), [NeurIPS 2023](#nips23), [ACMMM 2023](#mm23), [ICCV 2023](#iccv23), [CVPR 2023](#cvpr23), [IJCAI 2023](#ijcai23), [AAAI 2023](#aaai23), [Journals 2023](#j23), [Earlier ArXiv 2023](#earxiv23), [NeurIPS 2022](#neurips22), [ECCV 2022](#eccv22), [CVPR 2022](#cvpr22), [AAAI 2022](#aaai22), [Journals 2022](#j22)

---
### 🏁 VOS Workshops and Challenges

No active challenges at present. Past events:

:blue_square: `SVOS` :orange_square: `RVOS`    -    [LSVOS @ICCV 2025](https://lsvos.github.io/)

---
### :floppy_disk: VOS Dataset

:blue_square: `SVOS`: [MOSEv2](https://www.codabench.org/competitions/10062/) (2025), [SA-V](https://ai.meta.com/datasets/segment-anything-video/) (2024), [LVOS](https://lingyihongfd.github.io/lvos.github.io/dataset.html) (2023), [MOSEv1](https://henghuiding.github.io/MOSE/) (2023), [VOST](https://www.vostdataset.org/) (2023), [VISOR](https://epic-kitchens.github.io/VISOR/) (2022), [YouTube-VOS](https://youtube-vos.org/) (2018/2019), [DAVIS](https://davischallenge.org/index.html) (2016/2017)

:orange_square: `RVOS`: [MeViSv2](https://henghuiding.com/MeViS/#dataset) (2025), [ReVOS](https://github.com/cilinyan/ReVOS-api) (2024), [MeViS](https://henghuiding.github.io/MeViS/) (2023), [Ref-YouTube-VOS](https://youtube-vos.org/dataset/rvos/) (2020), [Ref-DAVIS](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/video-segmentation/video-object-segmentation-with-language-referring-expressions) (2018), [J-HMDB-Sentences](https://kgavrilyuk.github.io/publication/actor_action/) (2018), [A2D-Sentences](https://kgavrilyuk.github.io/publication/actor_action/) (2018)

:green_square: `UVOS`: [DAVIS](https://davischallenge.org/index.html) (2016)

:red_square: `AVOS`: [AVSBench](https://opennlplab.github.io/AVSBench/) (2022)

:diamond_shape_with_a_dot_inside: `VMAT`: [VideoMatte240K](https://grail.cs.washington.edu/projects/background-matting-v2/#/datasets) (2021), [CRGNN](https://github.com/TiantianWang/VideoMatting-CRGNN) (2021)

---

### ArXiv (Last 6 months)

:orange_square: `RVOS` `Feb`    -    [paper](https://arxiv.org/pdf/2602.12173) / [code](https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext)    -    SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

:orange_square: `RVOS` `Feb`    -    [paper](https://arxiv.org/pdf/2602.04454) / [code](https://github.com/iSEE-Laboratory/Seg-ReSearch)    -    Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search (`Reasoning VOS via Outside Knowledge!`)

:red_square: `AVOS` `Feb`    -    [paper](https://arxiv.org/pdf/2602.03892) / [code](https://github.com/jasongief/MQA-RefAVS)    -    Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation

:orange_square: `RVOS` `Feb`    -    [paper](https://arxiv.org/pdf/2602.03595) / code    -    Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

:diamond_shape_with_a_dot_inside: `VMAT` `Jan`    -    [paper](https://arxiv.org/pdf/2601.14255) / [code](https://github.com/cvlab-kaist/VideoMaMa)    -    VideoMaMa: Mask-Guided Video Matting via Generative Prior

:blue_square: `SVOS` :orange_square: `RVOS` `Jan`    -    [paper](https://arxiv.org/pdf/2601.09699) / [code](https://github.com/FudanCVL/SAM3-DMS)    -    SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3

:blue_square: `SVOS` `Jan`    -    [paper](https://arxiv.org/pdf/2601.08831) / [project page](https://jayisaking.github.io/3AM-Page/)    -    3AM: 3egment Anything with Geometric Consistency in Videos

:diamond_shape_with_a_dot_inside: `VMAT` `Dec`    -    [paper](https://arxiv.org/pdf/2512.11782) / [project page](https://pq-yang.github.io/projects/MatAnyone2/)    -    MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

:red_square: `AVOS` `Dec`    -    [paper](https://arxiv.org/abs/2512.20117) / [project page](https://trilarflagz.github.io/DDAVS-page/)    -    DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

:orange_square: `RVOS` :red_square: `AVOS` `Dec` `TPAMI`    -    [paper](https://arxiv.org/abs/2512.10945) / [project page and dataset](https://henghuiding.com/MeViS/index.html)    -    MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

:blue_square: `SVOS` `Dec`    -    [paper](https://arxiv.org/abs/2512.08406) / [code](https://github.com/gaomingqi/sam-body4d)    -    SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos (`VOS-driven Human Mesh Recovery`)

:orange_square: `RVOS` `Dec`    -    [paper](https://arxiv.org/abs/2512.02835) / [code](https://github.com/Clementine24/ReVSeg)    -    ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

:blue_square: `SVOS` `Nov`    -    [paper](https://arxiv.org/abs/2511.16618) / [code](https://github.com/jinlab-imvr/SAM2S)    -    SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

:blue_square: `SVOS` `Nov`    -    [paper](https://arxiv.org/abs/2511.20886) / code    -    V2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

:orange_square: `RVOS` `Nov`    -    [paper](https://arxiv.org/abs/2511.21139) / code    -    ReVSeg: Referring Video Object Segmentation with Cross-Modality Proxy Queries

:red_square: `AVOS` `Oct`    -    [paper](https://arxiv.org/abs/2510.10051) / [code](https://github.com/SitongGong/CCFormer)    -    Complementary and Contrastive Learning for Audio-Visual Segmentation

:orange_square: `RVOS` `Oct`    -    [paper](https://arxiv.org/abs/2510.09274) / [code](https://github.com/Dmmm1997/MomentSeg)    -    MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

:orange_square: `RVOS` `Oct`    -    [paper](https://arxiv.org/abs/2510.08305) / code    -    LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation

:orange_square: `RVOS` `Oct`    -    [paper](https://arxiv.org/abs/2510.07319) / code    -    Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

:orange_square: `RVOS` `Oct`    -    [paper](https://arxiv.org/abs/2510.06139) / code    -    Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

:red_square: `AVOS` `Sep`    -    [paper](https://arxiv.org/abs/2509.22740) / [code](https://github.com/jinbae-s/ACVIS)    -    Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

:red_square: `AVOS` `Sep`    -    [paper](https://arxiv.org/abs/2509.18912) / code    -    Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation

:orange_square: `RVOS` `Sep`    -    [paper](https://arxiv.org/abs/2509.13722) / code    -    Mitigating Query Selection Bias in Referring Video Object Segmentation

:orange_square: `RVOS` `Sep`    -    [paper](https://arxiv.org/abs/2509.05751) / code    -    Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation

:orange_square: `RVOS` :blue_square: `SVOS` `Aug`    -    [paper](https://arxiv.org/abs/2508.21809) / [code](https://github.com/google-deepmind/vocap)    -    VoCap: Video Object Captioning and Segmentation from Any Prompt

:orange_square: `RVOS` `Aug`    -    [paper](https://arxiv.org/abs/2508.11955) / [code](https://github.com/Seung-Hun-Lee/SAMDWICH)    -    SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation

:orange_square: `RVOS` `Aug`    -    [paper](https://arxiv.org/abs/2508.13584) / [code](https://github.com/qianqiaoai/HCD)    -    Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

:orange_square: `RVOS` `Aug`    -    [paper](https://arxiv.org/abs/2508.11538) / [code](https://github.com/SitongGong/Veason-R1)    -    Reinforcing Video Reasoning Segmentation to Think Before It Segments

:red_square: `AVOS` `Aug`    -    [paper](https://arxiv.org/abs/2508.02149) / code    -    AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation

:green_square: `UVOS` `Jul`    -    [paper](https://arxiv.org/abs/2507.19790) / [code](https://github.com/suhwan-cho/DepthFlow)    -    DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation

:blue_square: `SVOS` `Jul`    -    [paper](https://arxiv.org/pdf/2507.18921) / code    -    HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback

:blue_square: `SVOS` `Jul`    -    [paper](https://arxiv.org/abs/2507.07603) / [code](https://github.com/LouisFinner/HiM2SAM)    -    HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking

:white_large_square: `XVOS` `Jul`    -    [paper](https://arxiv.org/abs/2507.07519) / [dataset](https://volumetric-repository.labs.b-com.com/#/muvod)    -    MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation

:diamond_shape_with_a_dot_inside: `VMAT` `Jul`    -    [paper](https://arxiv.org/abs/2507.04456) / code    -    BiVM: Accurate Binarized Neural Network for Efficient Video Matting

:diamond_shape_with_a_dot_inside: `VMAT` `Jun`    -    [paper](https://arxiv.org/abs/2506.10840) / code    -    Post-Training Quantization for Video Matting

:red_square: `AVOS` `Jun`    -    [paper](https://arxiv.org/abs/2506.11436) / code    -    TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

:orange_square: `RVOS` `Jun`    -    [paper](https://arxiv.org/pdf/2506.02356) / [project](https://cvlab-kaist.github.io/InterRVOS/)    -    InterRVOS: Interaction-aware Referring Video Object Segmentation

:red_square: `AVOS` `Jun`    -    [paper](https://arxiv.org/abs/2506.01015) / [code](https://github.com/yyliu01/AuralSAM2)    -    AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

---
### ICLR 2026
:blue_square: `SVOS` :orange_square: `RVOS`    -    [paper](https://scontent-lhr8-1.xx.fbcdn.net/v/t39.2365-6/586186898_724834017304937_2869787384130329011_n.pdf?_nc_cat=107&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=V0ySxB4YEecQ7kNvwE9bO7k&_nc_oc=Adn7bAjFdDo3pyjopE-tHSmOV5lgvaoxLNYmJtRbE9Op6gSQHJoEsg2ANithdVk2hm5mJvm_jjjSyhT0TiB9Go0h&_nc_zt=14&_nc_ht=scontent-lhr8-1.xx&_nc_gid=Oa338lN6JufBYl5-MfIdMg&oh=00_Afj4QvVDmMqGVUembmWPdxu9nWcksZ6Rjruxy28TYOm3PA&oe=6923F072) / [code](https://github.com/facebookresearch/sam3)    -    SAM 3: Segment Anything with Concepts

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2507.15852) / [code](https://github.com/OpenIXCLab/SeC) / [dataset](https://huggingface.co/datasets/OpenIXCLab/SeCVOS)    -    SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2602.08224) / [code](https://github.com/jingjing0419/Efficient-SAM2)    -    Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2510.19592) / [code](https://github.com/HYUNJS/DecAF)    -    Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/pdf/2510.06139) / [code](https://github.com/xmz111/FlowRVS)    -    Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

:diamond_shape_with_a_dot_inside: `VMAT`    -    [paper](https://openreview.net/pdf?id=6K08FPo2cf) / code    -    Matting Anything 2: Towards Video Matting for Anything

---

### AAAI 2026

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2508.04418) / [code](https://github.com/jasongief/TGS-Agent)    -    Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2511.13715) / [code](https://github.com/FudanCVL/SAAS)    -    Segment Anything Across Shots: A Method and Benchmark

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2511.19475) / code    -    Tracking and Segmenting Anything in Any Modality

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2511.16077) / [code](https://github.com/euyis1019/VideoSeg-R1)    -    VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

---

### NeurIPS 2025

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2509.18094) / [code](https://github.com/PolyU-ChenLab/UniPixel)    -    UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

:orange_square: `RVOS`    -    [paper](https://openreview.net/pdf?id=z9xyREqxzq) / code    -    Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

---

### ACM MM 2025

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2507.22465) / [code](https://github.com/ZhengxyFlow/HMHI-Net)    -    Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation

---

### SIGGRAPH 2025

:diamond_shape_with_a_dot_inside: `VMAT`    -    [paper](https://arxiv.org/abs/2508.07905) / [code](https://github.com/aim-uofa/GVM)    -    Generative Video Matting

---
### ICCV 2025

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2507.18944) / [code](https://github.com/jinlab-imvr/OASIS)    -    Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2410.16268) / [code](https://github.com/Mark12Ding/SAM2Long)    -    SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/ICCV2025/papers/Baek_EVOLVE_Event-Guided_Deformable_Feature_Transfer_and_Dual-Memory_Refinement_for_Low-Light_ICCV_2025_paper.pdf) / [code](https://github.com/whdgusdl48/EVOLVE)    -    EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/ICCV2025/papers/Rong_MPG-SAM_2_Adapting_SAM_2_with_Mask_Priors_and_Global_ICCV_2025_paper.pdf) / [code](https://github.com/rongfu-dsb/MPG-SAM2)    -    MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2507.19599) / [code](https://github.com/qirui-chen/RGA3-release)    -    Object-centric Video Question Answering with Visual Grounding and Referring (`Video LLM with applications on RVOS`)

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2501.14607) / [code](https://github.com/iSEE-Laboratory/ReferDINO)    -    ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2412.14006) / [code](https://github.com/congvvc/InstructSeg)    -    InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2507.22061) / [code](https://github.com/FudanCVL/MOVE)    -    MOVE: Motion-Guided Few-Shot Video Object Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2507.22886) / [code](https://github.com/FudanCVL/OmniAVS)    -    Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2507.20740) / code    -    Implicit Counterfactual Learning for Audio-Visual Segmentation

---
### CVPR 2025

:diamond_shape_with_a_dot_inside: `VMAT`    -    [paper](https://arxiv.org/abs/2501.14677) / [code](https://github.com/pq-yang/MatAnyone)    -    Stable Video Matting with Consistent Memory Propagation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2411.17646) / [code](https://github.com/ClaudiaCuttano/SAMWISE)    -    SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2501.08549) / [code](https://github.com/SitongGong/VRS-HQ)    -    The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2412.09754) / [code](https://github.com/Ali2500/ViCaS)    -    ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2411.09921) / [code](https://github.com/dengandong/GroundMoRe)    -    Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2504.07962) / [code](https://github.com/GLUS-video/GLUS)    -    GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Pan_Semantic_and_Sequential_Alignment_for_Referring_Video_Object_Segmentation_CVPR_2025_paper.pdf) / [code](https://github.com/tavarich/SSA)    -    Semantic and Sequential Alignment for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Fang_Decoupled_Motion_Expression_Video_Segmentation_CVPR_2025_paper.pdf) / code    -    Decoupled Motion Expression Video Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2411.17576) / [code](https://github.com/jovanavidenovic/DAM4SAM)    -    A Distractor-Aware Memory for Visual Object Tracking with SAM2

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2411.02818) / [code](https://github.com/uncbiag/LiVOS)    -    LiVOS: Light Video Object Segmentation with Gated Linear Matching

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2502.04144) / [project page](https://hd-epic.github.io/)    -    HD-EPIC: A Highly-Detailed Egocentric Video Dataset (`with long-term SVOS dataset`)

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2412.13803) / [project page](https://zixuan-chen.github.io/M-cube-VOS.github.io/)    -    M3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (`SVOS with phase transitions for embodied AI`)

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2506.01558) / [code](https://github.com/VoyageWang/SAM2LOVE)    -    SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2503.12840) / [code](https://github.com/YenanLiu/DDESeg)    -    Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2506.23623) / [code](https://github.com/spyflying/VCT_AVS)    -    Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2503.12847) / code    -    Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment

:red_square: `AVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Radman_TSAM_Temporal_SAM_Augmented_with_Multimodal_Prompts_for_Referring_Audio-Visual_CVPR_2025_paper.pdf) / [project](https://abdurad.github.io/TSAM/)    -    TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2412.04623) / [code](https://github.com/Kaihua-Chen/diffusion-vas)    -    Using Diffusion Priors for Video Amodal Segmentation (`segment both visible and invisible (e.g., occluded) video objects`)

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2506.01304) / [code](https://github.com/showlab/SAM-I2V)    -    SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2503.22268) / [code](https://github.com/nnanhuang/SegAnyMo)    -    Segment Any Motion in Videos

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2504.05468) / [code](https://github.com/thanosDelatolas/diff-zvos)    -    Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

---
### ICLR 2025

:blue_square: `SVOS`    -    [paper](https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/) / [code](https://github.com/facebookresearch/segment-anything-2)    -    SAM 2: Segment Anything in Images and Videos

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2410.18538) / [code](https://github.com/alimohammadiamirhossein/smite/)    -    SMITE: Segment Me In TimE

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2407.07760) / [code](https://github.com/yahooo-m/S3)    -    Learning Spatial-Semantic Features for Robust Video Object Segmentation

---
### AAAI 2025

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2412.01471) / [project page](https://cvlab-kaist.github.io/MUG-VOS/)    -    Multi-Granularity Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/32706) / code    -    Holistic Correction with Object Prototype for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/32626) / code    -    Beyond Pixel and Object: Part Feature as Reference for Few-Shot Video Object Segmentation

:red_square: `AVOS` :orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2408.15876) / [code](https://github.com/appletea233/AL-Ref-SAM2)    -    Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

---
### Journals 2025

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/document/11311151) / [code](https://github.com/zaplm/DC-SAM)    -    `TPAMI` DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

:orange_square: `RVOS` :red_square: `AVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/11184493) / [code](https://github.com/yongliu20/MRVS_SOC)    -    `TPAMI` Semantic-Assisted Object Clustering for Multi-Modal Referring Video Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2502.12975) / [code](https://github.com/danqu130/EvInsMOS)    -    `IJCV` Instance-Level Moving Object Segmentation from a Single Image with Events

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2501.07806) / [code](https://github.com/hy0523/MTNet)    -    `TNNLS` Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/10933555) / [code](https://github.com/yk-pku/Low-shot-VOS)    -    `TPAMI` Low-shot Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/10949703) / code    -    `TPAMI` JointFormer: A Unified Framework with Joint Modeling for Video Object Segmentation

---
### Earlier ArXiv 2025

:orange_square: `RVOS` `May`    -    [paper](https://arxiv.org/abs/2505.08581) / [code](https://github.com/jinlab-imvr/ReSurgSAM2)    -    ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking

:orange_square: `RVOS` `May`    -    [paper](https://arxiv.org/abs/2505.18561) / [code](https://github.com/DanielSHKao/ThinkVideo)    -    ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts

:orange_square: `RVOS` `May`    -    [paper](https://arxiv.org/abs/2505.12702) / [project page](https://isee-laboratory.github.io/Long-RVOS)    -    Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

:red_square: `AVOS` `May`    -    [paper](https://arxiv.org/abs/2505.01448) / code    -    OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

:blue_square: `SVOS` `May`    -    [paper](https://arxiv.org/abs/2505.00739) / code    -    MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

:blue_square: `SVOS` `Apr`    -    [paper](https://arxiv.org/abs/2504.16471) / code    -    RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

:green_square: `UVOS` `Apr`    -    [paper](https://arxiv.org/abs/2504.05904) / code    -    Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation

:orange_square: `RVOS` `Mar`    -    [paper](https://arxiv.org/abs/2503.21056) / code    -    Online Reasoning Video Segmentation with Just-in-Time Digital Twins

:diamond_shape_with_a_dot_inside: `VMAT` `Mar`    -    [paper](https://arxiv.org/abs/2503.10678) / [project page](https://bio.lehanyang.info/VRMDiff.github.io/)    -    VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

:diamond_shape_with_a_dot_inside: `VMAT` `Mar`    -    [paper](https://arxiv.org/abs/2503.01262) / code    -    Object-Aware Video Matting with Cross-Frame Guidance

:orange_square: `RVOS` `Mar`    -    [paper](https://arxiv.org/abs/2503.03492) / [code](https://github.com/suhwan-cho/FindTrack)    -    Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

:orange_square: `RVOS` `Jan`    -    [paper](https://arxiv.org/abs/2501.04001) / [code](https://github.com/magic-research/Sa2VA)    -    Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2502.09660) / code    -    Towards Fine-grained Interactive Segmentation in Images and Videos

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2502.00358) / code    -    Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2501.13667) / code    -    MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2501.07256) / code    -    EdgeTAM: On-Device Track Anything Model

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2501.07806) / [code](https://github.com/SitongGong/AVS-Mamba)    -    AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2501.04939) / [code](https://github.com/Choi58/MTCM)    -    Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation

---
### NeurIPS 2024

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2409.19603) / [code](https://github.com/showlab/VideoLISA)    -    One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2412.19806) / [code](https://github.com/SkyworkAI/Vitron)    -    VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing (`with applications in SVOS`)

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2501.12392) / code    -    Learning segmentation from point trajectories

---
### ACMMM 2024

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2409.19342) / [code](https://github.com/PinxueGuo/X-Prompt)    -    X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

---
### ECCV 2024

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2404.06265) / [code](https://github.com/yahooo-m/VOS-Solution)    -    Spatial-Temporal Multi-level Association for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2403.08682) / [code](https://github.com/L599wy/OneVOS)    -    OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2309.12303) / [code & dataset](https://github.com/shilinyan99/PanoVOS)    -    PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2407.11325) / [code](https://github.com/cilinyan/VISA)    -    VISA: Reasoning Video Object Segmentation via Large Language Model

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2403.12042) / [code](https://github.com/buxiangzhiren/VD-IT)    -    Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2407.07402) / [code](https://github.com/ut-vision/ActionVOS)    -    ActionVOS: Actions as Prompts for Video Object Segmentation

:orange_square: `RVOS` :red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2403.04924) / [code & dataset](https://github.com/lxa9867/r2bench)    -    R2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

:orange_square: `RVOS` :red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2407.10957) / [code](https://github.com/GeWu-Lab/Ref-AVS)    -    Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2407.11820) / [code](https://github.com/GeWu-Lab/Stepping-Stones)    -    Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2311.17893) / [code](https://github.com/shvdiwnkozbw/SSL-UVOS)    -    Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

---
### CVPR 2024

:diamond_shape_with_a_dot_inside: `VMAT`    -    [paper](https://arxiv.org/abs/2404.16035) / [code](https://github.com/hmchuong/MaGGIe)    -    MaGGIe: Masked Guided Gradual Human Instance Matting

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2402.05917) / [code](https://pointvos.github.io/)    -    Point-VOS: Pointing Up Video Object Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2310.00132) / [code](https://github.com/lxa9867/QSD)    -    Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2312.06462) / [code](https://github.com/yannqi/COMBO-AVS)    -    Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2304.02970) / code    -    A Closer Look at Audio-Visual Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2403.04258) / [code](https://github.com/NiFangBaAGe/DATTT)    -    Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2211.12036) / [code](https://github.com/Hydragon516/DPA)    -    Dual Prototype Attention for Unsupervised Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2303.08314) / code    -    Guided Slot Attention for Unsupervised Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2404.03645) / [code](https://github.com/heshuting555/DsHmp)    -    Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2306.08736) / [code](https://github.com/LinfengYuan1997/Losh)    -    LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2312.01623) / [code](https://github.com/workforai/UniLSeg)    -    Universal Segmentation at Arbitrary Granularity with Language Instruction

:blue_square: `SVOS` :orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2402.18115) / [code](https://github.com/MinghanLi/UniVS)    -    UniVS: Unified and Universal Video Segmentation with Prompts as Queries

:blue_square: `SVOS` :orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2312.09158) / [code](https://github.com/FoundationVision/GLEE)    -    General Object Foundation Model for Images and Videos at Scale

:blue_square: `SVOS` :green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2406.04221) / [code](https://github.com/siyuanliii/masa)    -    Matching Anything By Segmenting Anything

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2406.08476) / [code](https://github.com/Restricted-Memory/RMem)    -    RMem: Restricted Memory Banks Improve Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2404.01945) / [code](https://github.com/HebeiFast/EventLowLightVOS)    -    Event-assisted Low-Light Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2310.12982) / [code](https://github.com/hkchengrex/Cutie)    -    Putting the Object Back into Video Object Segmentation

---
### AAAI 2024
:orange_square: `RVOS` :red_square: `AVOS`    -    [paper](https://arxiv.org/pdf/2305.16318.pdf) / [code](https://github.com/OpenGVLab/MUTR)    -    Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/28295) / code    -    Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation

---
### Journals 2024
:orange_square: `RVOS`    -    [paper](https://ieeexplore.ieee.org/document/10694805) / [code](https://github.com/Yxxxb/LAVT-RS)    -    `TPAMI` Language-Aware Vision Transformer for Referring Segmentation

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/10713285) / [code](https://github.com/BIT-Vision/ECOS)    -    `TPAMI` Continuous-time Object Segmentation using High Temporal Resolution Event Camera

---
### Earlier ArXiv 2024

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2412.19761) / [project page](https://genprop.github.io/)    -    Generative Video Propagation (`with applications in SVOS`)

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2412.08161) / code    -    Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2412.04930) / [project page](https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html)    -    Video Decomposition Prior: A Methodology to Decompose Videos into Layers (`with applications in UVOS`)

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2412.01136) / [project page](https://cvlab-kaist.github.io/SOLA/)    -    Referring Video Object Segmentation via Language-aligned Track Selection

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2411.18977) / [code](https://github.com/motern88/Det-SAM2)    -    Det-SAM2: Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2411.19141) / code    -    On Moving Object Segmentation from Monocular Video with Transformers

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2411.19210) / code    -    Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2411.18933) / [code](https://github.com/yformer/EfficientTAM)    -    Efficient Track Anything

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2411.11922) / [code](https://github.com/yangchris11/samurai)    -    SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2410.23287) / [project page](https://miccooper9.github.io/projects/ReferEverything/)    -    ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2409.18653) / [code](https://github.com/zhoustan/SAM2-VCOS)    -    When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2409.14343) / code    -    Memory Matching is not Enough: Jointly Improving Memory Matching and Decoding for Video Object Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2408.01708) / [code](https://github.com/MarkXCloud/AVESFormer)    -    AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2408.00169) / [code](https://github.com/Vujas-Eteph/LazyXMem)    -    Strike the Balance: On-the-Fly Uncertainty based User Interactions for Long-Term Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2407.14500) / [code](https://github.com/rkzheng99/ViLLa)    -    ViLLa: Video Reasoning Segmentation with Large Language Model

:green_square: `UVOS`    -    [paper](https://arxiv.org/pdf/2407.11714) / code    -    Improving Unsupervised Video Object Segmentation via Fake Flow Generation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2406.02345) / code    -    Progressive Confident Masking Attention Network for Audio-Visual Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2406.06163) / code    -    Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2406.12834) / code    -    GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation

---
### EMNLP 2023

:orange_square: `RVOS`    -    [paper](https://aclanthology.org/2023.emnlp-main.140.pdf) / code    -    Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text (`spoken language as referring guidance`)

---
### NeurIPS 2023
:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2305.17011) / [code](https://github.com/RobertLuo1/NeurIPS2023_SOC)    -    SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://openreview.net/pdf?id=9QsdPQlWiE) / [code](https://github.com/ttt-matching-based-vos/ttt_matching_vos)    -    Test-time Training for Matching-based Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://openreview.net/pdf?id=jfsjKBDB1z) / [code](https://github.com/BGU-CS-VIL/Training-Free-VOS)    -    From ViT Features to Training-free Video Object Segmentation via Streaming-data Mixture Models

---
### ACM MM 2023
:green_square: `UVOS`    -    [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611804) / code    -    SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3612017) / code    -    Temporally Efficient Gabor Transformer for Unsupervised Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611827) / code    -    Exploring the Adversarial Robustness of Video Object Segmentation via One-shot Adversarial Attacks

:red_square: `AVOS`    -    [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611724) / [code](https://github.com/aspirinone/CATR.github.io)    -    CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

:red_square: `AVOS`    -    [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3612373) / code    -    Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

---

### ICCV 2023

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2309.11160) / [code](https://github.com/nankepan/VIPMT)    -    Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2308.11796) / [code](https://github.com/SMSD75/Timetuning)    -    Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations (`self-supervised learning for UVOS`)

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2308.06693) / [code](https://github.com/DLUT-yyc/Isomer)    -    Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Su_Unsupervised_Video_Object_Segmentation_with_Online_Adversarial_Self-Tuning_ICCV_2023_paper.pdf) / code    -    Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning

:green_square: `UVOS` :orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2309.03903) / [code](https://github.com/hkchengrex/Tracking-Anything-with-DEVA)    -    DEVA: Tracking Anything with Decoupled Video Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2309.03473) / [code](https://github.com/Toneyaya/TempCD)    -    Temporal Collection and Distribution for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2207.01203) / [code](https://github.com/lxa9867/R2VOS)    -    Robust Referring Video Object Segmentation with Cyclic Structural Consensus

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2307.13537) / [code](https://github.com/bo-miao/SgMg)    -    Spectrum-guided Multi-granularity Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2307.09356) / [code](https://github.com/wudongming97/OnlineRefer)    -    OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2309.02041) / [code](https://github.com/hengliusky/Few_shot_RVOS)    -    Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Han_HTML_Hybrid_Temporal-scale_Multimodal_Learning_Framework_for_Referring_Video_Object_ICCV_2023_paper.pdf) / code    -    HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2308.08544) / [code & dataset](https://henghuiding.github.io/MeViS/)    -    MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2308.13266) / [code](https://github.com/yoxu515/MITS)    -    Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2307.15958) / [code](https://github.com/max810/XMem2)    -    XMem++: Production-level Video Segmentation From Few Annotated Frames

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2308.09903) / code    -    Scalable Video Object Segmentation with Simplified Framework

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Sun_Alignment_Before_Aggregation_Trajectory_Memory_Retrieval_Network_for_Video_Object_ICCV_2023_paper.pdf) / code    -    Alignment Before Aggregation: Trajectory Memory Retrieval Network for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2304.03284.pdf) / [code](https://github.com/baaivision/Painter)    -    SegGPT: Segmenting Everything In Context

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2211.10181.pdf) / [code & dataset](https://lingyihongfd.github.io/lvos.github.io/)    -    LVOS: A Benchmark for Long-term Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2302.01872) / [code & dataset](https://github.com/henghuiding/MOSE-api)    -    MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

---

### CVPR 2023

:diamond_shape_with_a_dot_inside: `VMAT`    -    [paper](https://arxiv.org/abs/2304.06018) / [code](https://github.com/microsoft/AdaM)    -    Adaptive Human Matting for Dynamic Videos

:green_square: `UVOS`    -    [paper](https://arxiv.org/pdf/2304.05930.pdf) / [code](https://rkyuca.github.io/medvt/)    -    MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2304.06211.pdf) / [code](https://github.com/wenguanwang/VOS_Correspondence)    -    Boosting Video Object Segmentation via Space-time Correspondence Learning

:blue_square: `SVOS` :orange_square: `RVOS`    -    [paper](https://arxiv.org/pdf/2303.06674.pdf) / [code](https://github.com/MasterBin-IIAU/UNINEXT)    -    Universal Instance Perception as Object Discovery and Retrieval

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Athar_TarViS_A_Unified_Approach_for_Target-Based_Video_Segmentation_CVPR_2023_paper.pdf) / [code](https://github.com/Ali2500/TarViS)    -    TarViS: A Unified Approach for Target-Based Video Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2303.12078.pdf) / [code](https://github.com/yk-pku/Two-shot-Video-Object-Segmentation)    -    Two-shot Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2303.07815.pdf) / code    -    MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2212.06826.pdf) / code    -    Look Before You Match: Instance Understanding Matters in Video Object Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/pdf/2212.06200.pdf) / [code & dataset](https://www.vostdataset.org/)    -    Breaking the “Object” in Video Object Segmentation

---

### IJCAI 2023

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2309.09501) / code    -    Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2305.04470) / [code & dataset](https://github.com/yoxu515/VIPOSeg-Benchmark)    -    Video Object Segmentation in Panoptic Wild Scenes

---

### AAAI 2023

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2212.02112.pdf) / code    -    Learning to Learn Better for Video Object Segmentation

---

### Journals 2023

:green_square: `UVOS`    -    [paper](https://ieeexplore.ieee.org/document/10298026) / [code](https://github.com/ZSVOS/HGPU)    -    `TIP` Hierarchical Graph Pattern Understanding for Zero-Shot Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/10159996) / [code](https://github.com/xilin1991/CluterNet)    -    `TCSVT` Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/10105896) / [code](https://github.com/NUST-Machine-Intelligence-Laboratory/HCPN)    -    `TIP` Hierarchical Co-Attention Propagation Network for Zero-Shot Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/9932025) / code    -    `TPAMI` VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

:orange_square: `RVOS`    -    [paper](https://ieeexplore.ieee.org/abstract/document/10083244) / [code](https://github.com/leonnnop/Locater)    -    `TPAMI` Local-Global Context Aware Transformer for Language-Guided Video Segmentation

---

### Earlier ArXiv 2023

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2404.19326) / [code & dataset](https://lingyihongfd.github.io/lvos.github.io/)    -    LVOS (v2, with more data): A Benchmark for Large-scale Long-term Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2405.10610) / code    -    Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2405.14010) / code    -    One-shot Training for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2405.07031) / code    -    Global Motion Understanding in Large-Scale Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2405.08715) / code    -    DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2404.13505) / [code](https://github.com/NUST-Machine-Intelligence-Laboratory/HVC)    -    Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2404.12389) / [code](https://github.com/Jyxarthur/flowsam)    -    Moving Object Segmentation: All You Need Is SAM (and Flow)

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2403.19407) / code    -    Towards Temporally Consistent Referring Video Object Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2403.14203) / code    -    Unsupervised Audio-Visual Segmentation with Modality Alignment

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2403.17937) / [code](https://github.com/Amshaker/MAVOS)    -    Efficient Video Object Segmentation via Modulated Cross-Attention Memory

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2403.06130) / [code](https://github.com/PinxueGuo/ClickVOS)    -    ClickVOS: Click Video Object Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2402.02327) / code    -    Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2401.14168) / [code](https://github.com/scott-yjyang/Vivim)    -    Vivim: a Video Vision Mamba for Medical Video Object Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2401.13937) / code    -    Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2401.12480) / code    -    Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2312.17448) / [code](https://github.com/jiawen-zhu/TrackGPT)    -    Tracking with Human-Intent Reasoning

:blue_square: `SVOS` :orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2312.15715) / [code](https://github.com/FoundationVision/UniRef)    -    UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2312.11463) / code    -    Appearance-based Refinement for Object-Centric Motion Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2312.08514) / code    -    M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2311.18837) / [code](https://github.com/ChenHsing/VIDiff)    -    VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2311.07261) / [code](https://github.com/YRlin-12/Sketch-VOS-datasets)    -    Sketch-based Video Object Segmentation: Benchmark and Analysis

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2311.04414) / [code](https://eva-vos.compute.dtu.dk/)    -    Learning the What and How of Annotation in Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2310.03967) / code    -    Sub-token ViT Embedding via Stochastic Resonance Transformers (`with applications in SVOS`)

:green_square: `UVOS`    -    [paper](https://arxiv.org/abs/2309.14786) / [code](https://github.com/suhwan-cho/TMO)    -    Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

:red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2310.00132) / code    -    Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2308.13505) / code    -    Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

:orange_square: `RVOS` :red_square: `AVOS`    -    [paper](https://arxiv.org/abs/2308.04162) / [code](https://github.com/lab206/EPCFormer)    -    EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2308.02162) / [code](https://github.com/wangbo-zhao/WRVOS/)    -    Learning Referring Video Object Segmentation from Weak Annotation

:green_square: `UVOS`    -    [paper](https://arxiv.org/pdf/2305.12659.pdf) / code    -    UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2305.06558) / [code](https://github.com/z-x-yang/Segment-and-Track-Anything)    -    Segment and Track Anything

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2304.11968) / [code](https://github.com/gaomingqi/Track-Anything)    -    Track Anything: Segment Anything Meets Videos

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/pdf/2303.14384.pdf) / [code](https://github.com/mkg1204/RHMNet-for-SSVOS)    -    Reliability-Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://arxiv.org/abs/2307.13974) / [code](https://github.com/jiawen-zhu/HQTrack)    -    Tracking Anything in High Quality

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2307.00536) / code    -    Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation

:orange_square: `RVOS`    -    [paper](https://arxiv.org/abs/2307.00997) / [code](https://github.com/LancasterLi/RefSAM)    -    RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/abs/2307.01197) / [code](https://github.com/SysCV/sam-pt)    -    Segment Anything Meets Point Tracking

---

### NeurIPS 2022

:blue_square: `SVOS`    -    [paper](https://arxiv.org/pdf/2210.09782.pdf) / [code](https://github.com/z-x-yang/AOT)    -    Decoupling Features in Hierarchical Propagation for Video Object Segmentation

:white_large_square: `XVOS`    -    [paper](https://arxiv.org/pdf/2210.12733.pdf) / code    -    Self-supervised Amodal Video Object Segmentation

---

### ECCV 2022

:blue_square: `SVOS`    -    [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880633.pdf) / [code](https://github.com/hkchengrex/XMem)    -    XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

:blue_square: `SVOS`    -    [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890603.pdf) / code    -    BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890462.pdf) / [code](https://github.com/workforai/QDMN)    -    Learning Quality-aware Dynamic Memory for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136820434.pdf) / [code](https://github.com/suhwan-cho/TBD)    -    Tackling Background Distraction in Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890639.pdf) / [code](https://github.com/workforai/GSFM)    -    Global Spectral Filter Memory Network for Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136940584.pdf) / code    -    Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation

---

### CVPR 2022

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Botach_End-to-End_Referring_Video_Object_Segmentation_With_Multimodal_Transformers_CVPR_2022_paper.pdf) / [code](https://github.com/mttr2021/MTTR)    -    End-to-End Referring Video Object Segmentation With Multimodal Transformers

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Language_As_Queries_for_Referring_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/wjn922/ReferFormer)    -    Language As Queries for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Ding_Language-Bridged_Spatial-Temporal_Interaction_for_Referring_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/dzh19990407/LBDT)    -    Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Multi-Level_Representation_Learning_With_Semantic_Alignment_for_Referring_Video_Object_CVPR_2022_paper.pdf) / code    -    Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Recurrent_Dynamic_Embedding_for_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/Limingxing00/RDE-VOS-CVPR2022)    -    Recurrent Dynamic Embedding for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Xu_Accelerating_Video_Object_Segmentation_With_Compressed_Video_CVPR_2022_paper.pdf) / [code](https://github.com/kai422/CoVOS)    -    Accelerating Video Object Segmentation With Compressed Video

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Lin_SWEM_Towards_Real-Time_Video_Object_Segmentation_With_Sequential_Weighted_Expectation-Maximization_CVPR_2022_paper.pdf) / code    -    SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization

:blue_square: `SVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Park_Per-Clip_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/pkyong95/PCVOS)    -    Per-Clip Video Object Segmentation

:white_large_square: `XVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Pan_Wnet_Audio-Guided_Video_Object_Segmentation_via_Wavelet-Based_Cross-Modal_Denoising_Networks_CVPR_2022_paper.pdf) / [code](https://github.com/asudahkzj/Wnet)    -    Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

:white_large_square: `XVOS`    -    [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wei_YouMVOS_An_Actor-Centric_Multi-Shot_Video_Object_Segmentation_Dataset_CVPR_2022_paper.pdf) / [code & dataset](https://donglaiw.github.io/proj/youMVOS/)    -    YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset

---

### AAAI 2022

:blue_square: `SVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20009) / [code](https://github.com/LANMNG/SITVOS)    -    Siamese Network with Interactive Transformer for Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20200) / code    -    Reliable Propagation-Correction Modulation for Video Object Segmentation

:orange_square: `RVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20017) / code    -    You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation

:green_square: `UVOS`    -    [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20011) / code    -    Iteratively Selecting an Easy Reference Frame Makes Unsupervised Video Object Segmentation Easier

---

### Journals 2022

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/document/9745367) / code    -    `TPAMI` Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/document/9875116) / code    -    `TIP` From Pixels to Semantics: Self-Supervised Video Object Segmentation With Multiperspective Feature Mining

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/document/9904497) / code    -    `TIP` Delving Deeper Into Mask Utilization in Video Object Segmentation

:blue_square: `SVOS`    -    [paper](https://ieeexplore.ieee.org/document/9942927) / code    -    `TIP` Adaptive Online Mutual Learning Bi-Decoders for Video Object Segmentation

---

End of the list. :seedling:

VOS papers and datasets before 2022 can be found in our survey paper:

>Deep Learning for Video Object Segmentation: A Review / [paper](https://link.springer.com/content/pdf/10.1007/s10462-022-10176-7.pdf) / [project page](https://github.com/gaomingqi/VOS-Review)