https://github.com/gaomingqi/awesome-video-object-segmentation
🔥 Latest advances in Video Object Segmentation (VOS) – papers, datasets, and projects.
List: awesome-video-object-segmentation
- Host: GitHub
- URL: https://github.com/gaomingqi/awesome-video-object-segmentation
- Owner: gaomingqi
- Created: 2023-04-01T17:04:55.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2026-02-05T12:32:49.000Z (4 months ago)
- Last Synced: 2026-02-05T23:55:11.271Z (4 months ago)
- Topics: audio-visual-segmentation, awesome-papers, awesome-papers-for-video-object-segmentation, referring-video-object-segmentation, semi-supervised-video-object-segmentation, video-matting, video-object-segmentation, video-reasoning-segmentation
- Homepage:
- Size: 3.1 MB
- Stars: 460
- Watchers: 25
- Forks: 17
- Open Issues: 0
Metadata Files:
- Readme: README.md
Latest Advances in Video Object Segmentation (VOS). VOS works before 2022 can be found in our survey paper:
>Deep Learning for Video Object Segmentation: A Review / [paper](https://link.springer.com/content/pdf/10.1007/s10462-022-10176-7.pdf) / [project page](https://github.com/gaomingqi/VOS-Review)

BibTeX:

@article{gao2023deep,
title={Deep learning for video object segmentation: a review},
author={Gao, Mingqi and Zheng, Feng and Yu, James JQ and Shan, Caifeng and Ding, Guiguang and Han, Jungong},
journal={Artificial Intelligence Review},
volume={56},
number={1},
pages={457--531},
year={2023},
publisher={Springer}
}
---
:teddy_bear: We mark different VOS tasks with coloured squares:
:blue_square:SVOS
:orange_square:RVOS
:green_square:UVOS
:red_square:AVOS
:diamond_shape_with_a_dot_inside:VMAT
:white_large_square:XVOS (other types of VOS)
:teddy_bear: Please feel free to send us pull requests to add VOS works.
---
Links for a quick jump: [ArXiv (within 6 months)](#arxiv), 🔥[ICLR 2026](#iclr26)🔥, [AAAI 2026](#aaai26), [NeurIPS 2025](#nips25), [ACM MM 2025](#mm25), [SIGGRAPH 2025](#sig25), [ICCV 2025](#iccv25), [CVPR 2025](#cvpr25), [ICLR 2025](#iclr25), [AAAI 2025](#aaai25), [Journals 2025](#j25), [Earlier ArXiv 2025](#a25), [NeurIPS 2024](#nips24), [ACMMM 2024](#acmmm24), [ECCV 2024](#eccv24), [CVPR 2024](#cvpr24), [AAAI 2024](#aaai24), [Journals 2024](#j24), [Earlier ArXiv 2024](#earxiv24), [EMNLP 2023](#emnlp23), [NeurIPS 2023](#nips23), [ACMMM 2023](#mm23), [ICCV 2023](#iccv23), [CVPR 2023](#cvpr23), [IJCAI 2023](#ijcai23), [AAAI 2023](#aaai23), [Journals 2023](#j23), [Earlier ArXiv 2023](#earxiv23), [NeurIPS 2022](#neurips22), [ECCV 2022](#eccv22), [CVPR 2022](#cvpr22), [AAAI 2022](#aaai22), [Journals 2022](#j22)
---
### 🏁 VOS Workshops and Challenges
No active challenges at present. Past events:
:blue_square: `SVOS` :orange_square: `RVOS` - [LSVOS @ICCV 2025](https://lsvos.github.io/)
---
### :floppy_disk: VOS Dataset
:blue_square: `SVOS`: [MOSEv2](https://www.codabench.org/competitions/10062/) (2025), [SA-V](https://ai.meta.com/datasets/segment-anything-video/) (2024), [LVOS](https://lingyihongfd.github.io/lvos.github.io/dataset.html) (2023), [MOSEv1](https://henghuiding.github.io/MOSE/) (2023), [VOST](https://www.vostdataset.org/) (2023), [VISOR](https://epic-kitchens.github.io/VISOR/) (2022), [YouTube-VOS](https://youtube-vos.org/) (2018/2019), [DAVIS](https://davischallenge.org/index.html) (2016/2017)
:orange_square: `RVOS`: [MeViSv2](https://henghuiding.com/MeViS/#dataset) (2025), [ReVOS](https://github.com/cilinyan/ReVOS-api) (2024), [MeViS](https://henghuiding.github.io/MeViS/) (2023), [Ref-YouTube-VOS](https://youtube-vos.org/dataset/rvos/) (2020), [Ref-DAVIS](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/video-segmentation/video-object-segmentation-with-language-referring-expressions) (2018), [J-HMDB-Sentences](https://kgavrilyuk.github.io/publication/actor_action/) (2018), [A2D-Sentences](https://kgavrilyuk.github.io/publication/actor_action/) (2018)
:green_square: `UVOS`: [DAVIS](https://davischallenge.org/index.html) (2016)
:red_square: `AVOS`: [AVSBench](https://opennlplab.github.io/AVSBench/) (2022)
:diamond_shape_with_a_dot_inside: `VMAT`: [VideoMatte240K](https://grail.cs.washington.edu/projects/background-matting-v2/#/datasets) (2021), [CRGNN](https://github.com/TiantianWang/VideoMatting-CRGNN) (2021)
---
### ArXiv (Last 6 months)
:orange_square: `RVOS` `Feb` - [paper](https://arxiv.org/pdf/2602.12173) / [code](https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext) - SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation
:orange_square: `RVOS` `Feb` - [paper](https://arxiv.org/pdf/2602.04454) / [code](https://github.com/iSEE-Laboratory/Seg-ReSearch) - Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search (`Reasoning VOS via Outside Knowledge!`)
:red_square: `AVOS` `Feb` - [paper](https://arxiv.org/pdf/2602.03892) / [code](https://github.com/jasongief/MQA-RefAVS) - Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
:orange_square: `RVOS` `Feb` - [paper](https://arxiv.org/pdf/2602.03595) / code - Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
:diamond_shape_with_a_dot_inside: `VMAT` `Jan` - [paper](https://arxiv.org/pdf/2601.14255) / [code](https://github.com/cvlab-kaist/VideoMaMa) - VideoMaMa: Mask-Guided Video Matting via Generative Prior
:blue_square: `SVOS` :orange_square: `RVOS` `Jan` - [paper](https://arxiv.org/pdf/2601.09699) / [code](https://github.com/FudanCVL/SAM3-DMS) - SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
:blue_square: `SVOS` `Jan` - [paper](https://arxiv.org/pdf/2601.08831) / [project page](https://jayisaking.github.io/3AM-Page/) - 3AM: 3egment Anything with Geometric Consistency in Videos
:diamond_shape_with_a_dot_inside: `VMAT` `Dec` - [paper](https://arxiv.org/pdf/2512.11782) / [project page](https://pq-yang.github.io/projects/MatAnyone2/) - MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
:red_square: `AVOS` `Dec` - [paper](https://arxiv.org/abs/2512.20117) / [project page](https://trilarflagz.github.io/DDAVS-page/) - DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
:orange_square: `RVOS` :red_square: `AVOS` `Dec` `TPAMI` - [paper](https://arxiv.org/abs/2512.10945) / [project page and dataset](https://henghuiding.com/MeViS/index.html) - MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
:blue_square: `SVOS` `Dec` - [paper](https://arxiv.org/abs/2512.08406) / [code](https://github.com/gaomingqi/sam-body4d) - SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos (`VOS-driven Human Mesh Recovery`)
:orange_square: `RVOS` `Dec` - [paper](https://arxiv.org/abs/2512.02835) / [code](https://github.com/Clementine24/ReVSeg) - ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
:blue_square: `SVOS` `Nov` - [paper](https://arxiv.org/abs/2511.16618) / [code](https://github.com/jinlab-imvr/SAM2S) - SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
:blue_square: `SVOS` `Nov` - [paper](https://arxiv.org/abs/2511.20886) / code - V2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
:orange_square: `RVOS` `Nov` - [paper](https://arxiv.org/abs/2511.21139) / code - ReVSeg: Referring Video Object Segmentation with Cross-Modality Proxy Queries
:red_square: `AVOS` `Oct` - [paper](https://arxiv.org/abs/2510.10051) / [code](https://github.com/SitongGong/CCFormer) - Complementary and Contrastive Learning for Audio-Visual Segmentation
:orange_square: `RVOS` `Oct` - [paper](https://arxiv.org/abs/2510.09274) / [code](https://github.com/Dmmm1997/MomentSeg) - MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
:orange_square: `RVOS` `Oct` - [paper](https://arxiv.org/abs/2510.08305) / code - LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation
:orange_square: `RVOS` `Oct` - [paper](https://arxiv.org/abs/2510.07319) / code - Temporal Prompting Matters: Rethinking Referring Video Object Segmentation
:orange_square: `RVOS` `Oct` - [paper](https://arxiv.org/abs/2510.06139) / code - Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
:red_square: `AVOS` `Sep` - [paper](https://arxiv.org/abs/2509.22740) / [code](https://github.com/jinbae-s/ACVIS) - Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
:red_square: `AVOS` `Sep` - [paper](https://arxiv.org/abs/2509.18912) / code - Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation
:orange_square: `RVOS` `Sep` - [paper](https://arxiv.org/abs/2509.13722) / code - Mitigating Query Selection Bias in Referring Video Object Segmentation
:orange_square: `RVOS` `Sep` - [paper](https://arxiv.org/abs/2509.05751) / code - Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation
:orange_square: `RVOS` :blue_square: `SVOS` `Aug` - [paper](https://arxiv.org/abs/2508.21809) / [code](https://github.com/google-deepmind/vocap) - VoCap: Video Object Captioning and Segmentation from Any Prompt
:orange_square: `RVOS` `Aug` - [paper](https://arxiv.org/abs/2508.11955) / [code](https://github.com/Seung-Hun-Lee/SAMDWICH) - SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation
:orange_square: `RVOS` `Aug` - [paper](https://arxiv.org/abs/2508.13584) / [code](https://github.com/qianqiaoai/HCD) - Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
:orange_square: `RVOS` `Aug` - [paper](https://arxiv.org/abs/2508.11538) / [code](https://github.com/SitongGong/Veason-R1) - Reinforcing Video Reasoning Segmentation to Think Before It Segments
:red_square: `AVOS` `Aug` - [paper](https://arxiv.org/abs/2508.02149) / code - AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation
:green_square: `UVOS` `Jul` - [paper](https://arxiv.org/abs/2507.19790) / [code](https://github.com/suhwan-cho/DepthFlow) - DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation
:blue_square: `SVOS` `Jul` - [paper](https://arxiv.org/pdf/2507.18921) / code - HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback
:blue_square: `SVOS` `Jul` - [paper](https://arxiv.org/abs/2507.07603) / [code](https://github.com/LouisFinner/HiM2SAM) - HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking
:white_large_square: `XVOS` `Jul` - [paper](https://arxiv.org/abs/2507.07519) / [dataset](https://volumetric-repository.labs.b-com.com/#/muvod) - MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation
:diamond_shape_with_a_dot_inside: `VMAT` `Jul` - [paper](https://arxiv.org/abs/2507.04456) / code - BiVM: Accurate Binarized Neural Network for Efficient Video Matting
:diamond_shape_with_a_dot_inside: `VMAT` `Jun` - [paper](https://arxiv.org/abs/2506.10840) / code - Post-Training Quantization for Video Matting
:red_square: `AVOS` `Jun` - [paper](https://arxiv.org/abs/2506.11436) / code - TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
:orange_square: `RVOS` `Jun` - [paper](https://arxiv.org/pdf/2506.02356) / [project](https://cvlab-kaist.github.io/InterRVOS/) - InterRVOS: Interaction-aware Referring Video Object Segmentation
:red_square: `AVOS` `Jun` - [paper](https://arxiv.org/abs/2506.01015) / [code](https://github.com/yyliu01/AuralSAM2) - AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
---
### ICLR 2026
:blue_square: `SVOS` :orange_square: `RVOS` - [paper](https://scontent-lhr8-1.xx.fbcdn.net/v/t39.2365-6/586186898_724834017304937_2869787384130329011_n.pdf?_nc_cat=107&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=V0ySxB4YEecQ7kNvwE9bO7k&_nc_oc=Adn7bAjFdDo3pyjopE-tHSmOV5lgvaoxLNYmJtRbE9Op6gSQHJoEsg2ANithdVk2hm5mJvm_jjjSyhT0TiB9Go0h&_nc_zt=14&_nc_ht=scontent-lhr8-1.xx&_nc_gid=Oa338lN6JufBYl5-MfIdMg&oh=00_Afj4QvVDmMqGVUembmWPdxu9nWcksZ6Rjruxy28TYOm3PA&oe=6923F072) / [code](https://github.com/facebookresearch/sam3) - SAM 3: Segment Anything with Concepts
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2507.15852) / [code](https://github.com/OpenIXCLab/SeC) / [dataset](https://huggingface.co/datasets/OpenIXCLab/SeCVOS) - SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2602.08224) / [code](https://github.com/jingjing0419/Efficient-SAM2) - Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2510.19592) / [code](https://github.com/HYUNJS/DecAF) - Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/pdf/2510.06139) / [code](https://github.com/xmz111/FlowRVS) - Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
:diamond_shape_with_a_dot_inside: `VMAT` - [paper](https://openreview.net/pdf?id=6K08FPo2cf) / code - Matting Anything 2: Towards Video Matting for Anything
---
### AAAI 2026
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2508.04418) / [code](https://github.com/jasongief/TGS-Agent) - Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2511.13715) / [code](https://github.com/FudanCVL/SAAS) - Segment Anything Across Shots: A Method and Benchmark
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2511.19475) / code - Tracking and Segmenting Anything in Any Modality
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2511.16077) / [code](https://github.com/euyis1019/VideoSeg-R1) - VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
---
### NeurIPS 2025
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2509.18094) / [code](https://github.com/PolyU-ChenLab/UniPixel) - UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
:orange_square: `RVOS` - [paper](https://openreview.net/pdf?id=z9xyREqxzq) / code - Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention
---
### ACM MM 2025
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2507.22465) / [code](https://github.com/ZhengxyFlow/HMHI-Net) - Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
---
### SIGGRAPH 2025
:diamond_shape_with_a_dot_inside: `VMAT` - [paper](https://arxiv.org/abs/2508.07905) / [code](https://github.com/aim-uofa/GVM) - Generative Video Matting
---
### ICCV 2025
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2507.18944) / [code](https://github.com/jinlab-imvr/OASIS) - Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2410.16268) / [code](https://github.com/Mark12Ding/SAM2Long) - SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/ICCV2025/papers/Baek_EVOLVE_Event-Guided_Deformable_Feature_Transfer_and_Dual-Memory_Refinement_for_Low-Light_ICCV_2025_paper.pdf) / [code](https://github.com/whdgusdl48/EVOLVE) - EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/ICCV2025/papers/Rong_MPG-SAM_2_Adapting_SAM_2_with_Mask_Priors_and_Global_ICCV_2025_paper.pdf) / [code](https://github.com/rongfu-dsb/MPG-SAM2) - MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2507.19599) / [code](https://github.com/qirui-chen/RGA3-release) - Object-centric Video Question Answering with Visual Grounding and Referring (`Video LLM with applications on RVOS`)
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2501.14607) / [code](https://github.com/iSEE-Laboratory/ReferDINO) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2412.14006) / [code](https://github.com/congvvc/InstructSeg) - InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2507.22061) / [code](https://github.com/FudanCVL/MOVE) - MOVE: Motion-Guided Few-Shot Video Object Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2507.22886) / [code](https://github.com/FudanCVL/OmniAVS) - Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2507.20740) / code - Implicit Counterfactual Learning for Audio-Visual Segmentation
---
### CVPR 2025
:diamond_shape_with_a_dot_inside: `VMAT` - [paper](https://arxiv.org/abs/2501.14677) / [code](https://github.com/pq-yang/MatAnyone) - Stable Video Matting with Consistent Memory Propagation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2411.17646) / [code](https://github.com/ClaudiaCuttano/SAMWISE) - SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2501.08549) / [code](https://github.com/SitongGong/VRS-HQ) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2412.09754) / [code](https://github.com/Ali2500/ViCaS) - ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2411.09921) / [code](https://github.com/dengandong/GroundMoRe) - Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2504.07962) / [code](https://github.com/GLUS-video/GLUS) - GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Pan_Semantic_and_Sequential_Alignment_for_Referring_Video_Object_Segmentation_CVPR_2025_paper.pdf) / [code](https://github.com/tavarich/SSA) - Semantic and Sequential Alignment for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Fang_Decoupled_Motion_Expression_Video_Segmentation_CVPR_2025_paper.pdf) / code - Decoupled Motion Expression Video Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2411.17576) / [code](https://github.com/jovanavidenovic/DAM4SAM) - A Distractor-Aware Memory for Visual Object Tracking with SAM2
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2411.02818) / [code](https://github.com/uncbiag/LiVOS) - LiVOS: Light Video Object Segmentation with Gated Linear Matching
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2502.04144) / [project page](https://hd-epic.github.io/) - HD-EPIC: A Highly-Detailed Egocentric Video Dataset (`with long-term SVOS dataset`)
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2412.13803) / [project page](https://zixuan-chen.github.io/M-cube-VOS.github.io/) - M3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (`SVOS with phase transitions for embodied AI`)
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2506.01558) / [code](https://github.com/VoyageWang/SAM2LOVE) - SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2503.12840) / [code](https://github.com/YenanLiu/DDESeg) - Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2506.23623) / [code](https://github.com/spyflying/VCT_AVS) - Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2503.12847) / code - Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
:red_square: `AVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Radman_TSAM_Temporal_SAM_Augmented_with_Multimodal_Prompts_for_Referring_Audio-Visual_CVPR_2025_paper.pdf) / [project](https://abdurad.github.io/TSAM/) - TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2412.04623) / [code](https://github.com/Kaihua-Chen/diffusion-vas) - Using Diffusion Priors for Video Amodal Segmentation (`segment both visible and invisible (e.g., occluded) video objects`)
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2506.01304) / [code](https://github.com/showlab/SAM-I2V) - SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2503.22268) / [code](https://github.com/nnanhuang/SegAnyMo) - Segment Any Motion in Videos
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2504.05468) / [code](https://github.com/thanosDelatolas/diff-zvos) - Studying Image Diffusion Features for Zero-Shot Video Object Segmentation
---
### ICLR 2025
:blue_square: `SVOS` - [paper](https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/) / [code](https://github.com/facebookresearch/segment-anything-2) - SAM 2: Segment Anything in Images and Videos
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2410.18538) / [code](https://github.com/alimohammadiamirhossein/smite/) - SMITE: Segment Me In TimE
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2407.07760) / [code](https://github.com/yahooo-m/S3) - Learning Spatial-Semantic Features for Robust Video Object Segmentation
---
### AAAI 2025
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2412.01471) / [project page](https://cvlab-kaist.github.io/MUG-VOS/) - Multi-Granularity Video Object Segmentation
:blue_square: `SVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/32706) / code - Holistic Correction with Object Prototype for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/32626) / code - Beyond Pixel and Object: Part Feature as Reference for Few-Shot Video Object Segmentation
:red_square: `AVOS` :orange_square: `RVOS` - [paper](https://arxiv.org/abs/2408.15876) / [code](https://github.com/appletea233/AL-Ref-SAM2) - Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
---
### Journals 2025
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/document/11311151) / [code](https://github.com/zaplm/DC-SAM) - `TPAMI` DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
:orange_square: `RVOS` :red_square: `AVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/11184493) / [code](https://github.com/yongliu20/MRVS_SOC) - `TPAMI` Semantic-Assisted Object Clustering for Multi-Modal Referring Video Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2502.12975) / [code](https://github.com/danqu130/EvInsMOS) - `IJCV` Instance-Level Moving Object Segmentation from a Single Image with Events
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2501.07806) / [code](https://github.com/hy0523/MTNet) - `TNNLS` Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/10933555) / [code](https://github.com/yk-pku/Low-shot-VOS) - `TPAMI` Low-shot Video Object Segmentation
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/10949703) / code - `TPAMI` JointFormer: A Unified Framework with Joint Modeling for Video Object Segmentation
---
### Earlier ArXiv 2025
:orange_square: `RVOS` `May` - [paper](https://arxiv.org/abs/2505.08581) / [code](https://github.com/jinlab-imvr/ReSurgSAM2) - ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking
:orange_square: `RVOS` `May` - [paper](https://arxiv.org/abs/2505.18561) / [code](https://github.com/DanielSHKao/ThinkVideo) - ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts
:orange_square: `RVOS` `May` - [paper](https://arxiv.org/abs/2505.12702) / [code](https://isee-laboratory.github.io/Long-RVOS) - Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
:red_square: `AVOS` `May` - [paper](https://arxiv.org/abs/2505.01448) / code - OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
:blue_square: `SVOS` `May` - [paper](https://arxiv.org/abs/2505.00739) / code - MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection
:blue_square: `SVOS` `Apr` - [paper](https://arxiv.org/abs/2504.16471) / code - RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory
:green_square: `UVOS` `Apr` - [paper](https://arxiv.org/abs/2504.05904) / code - Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation
:orange_square: `RVOS` `Mar` - [paper](https://arxiv.org/abs/2503.21056) / code - Online Reasoning Video Segmentation with Just-in-Time Digital Twins
:diamond_shape_with_a_dot_inside: `VMAT` `Mar` - [paper](https://arxiv.org/abs/2503.10678) / [project page](https://bio.lehanyang.info/VRMDiff.github.io/) - VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
:diamond_shape_with_a_dot_inside: `VMAT` `Mar` - [paper](https://arxiv.org/abs/2503.01262) / code - Object-Aware Video Matting with Cross-Frame Guidance
:orange_square: `RVOS` `Mar` - [paper](https://arxiv.org/abs/2503.03492) / [code](https://github.com/suhwan-cho/FindTrack) - Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
:orange_square: `RVOS` `Jan` - [paper](https://arxiv.org/abs/2501.04001) / [code](https://github.com/magic-research/Sa2VA) - Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2502.09660) / code - Towards Fine-grained Interactive Segmentation in Images and Videos
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2502.00358) / code - Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2501.13667) / code - MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2501.07256) / code - EdgeTAM: On-Device Track Anything Model
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2501.07806) / [code](https://github.com/SitongGong/AVS-Mamba) - AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2501.04939) / [code](https://github.com/Choi58/MTCM) - Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation
---
### NeurIPS 2024
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2409.19603) / [code](https://github.com/showlab/VideoLISA) - One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2412.19806) / [code](https://github.com/SkyworkAI/Vitron) - VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing (`with applications in SVOS`)
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2501.12392) / code - Learning segmentation from point trajectories
---
### ACMMM 2024
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2409.19342) / [code](https://github.com/PinxueGuo/X-Prompt) - X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation
---
### ECCV 2024
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2404.06265) / [code](https://github.com/yahooo-m/VOS-Solution) - Spatial-Temporal Multi-level Association for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2403.08682) / [code](https://github.com/L599wy/OneVOS) - OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2309.12303) / [code & dataset](https://github.com/shilinyan99/PanoVOS) - PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2407.11325) / [code](https://github.com/cilinyan/VISA) - VISA: Reasoning Video Object Segmentation via Large Language Model
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2403.12042) / [code](https://github.com/buxiangzhiren/VD-IT) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2407.07402) / [code](https://github.com/ut-vision/ActionVOS) - ActionVOS: Actions as Prompts for Video Object Segmentation
:orange_square: `RVOS` :red_square: `AVOS` - [paper](https://arxiv.org/abs/2403.04924) / [code & dataset](https://github.com/lxa9867/r2bench) - R2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations
:orange_square: `RVOS` :red_square: `AVOS` - [paper](https://arxiv.org/abs/2407.10957) / [code](https://github.com/GeWu-Lab/Ref-AVS) - Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2407.11820) / [code](https://github.com/GeWu-Lab/Stepping-Stones) - Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2311.17893) / [code](https://github.com/shvdiwnkozbw/SSL-UVOS) - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
---
### CVPR 2024
:diamond_shape_with_a_dot_inside: `VMAT` - [paper](https://arxiv.org/abs/2404.16035) / [code](https://github.com/hmchuong/MaGGIe) - MaGGIe: Masked Guided Gradual Human Instance Matting
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2402.05917) / [code](https://pointvos.github.io/) - Point-VOS: Pointing Up Video Object Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2310.00132) / [code](https://github.com/lxa9867/QSD) - Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2312.06462) / [code](https://github.com/yannqi/COMBO-AVS) - Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2304.02970) / code - A Closer Look at Audio-Visual Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2403.04258) / [code](https://github.com/NiFangBaAGe/DATTT) - Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2211.12036) / [code](https://github.com/Hydragon516/DPA) - Dual Prototype Attention for Unsupervised Video Object Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2303.08314) / code - Guided Slot Attention for Unsupervised Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2404.03645) / [code](https://github.com/heshuting555/DsHmp) - Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2306.08736) / [code](https://github.com/LinfengYuan1997/Losh) - LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2312.01623) / [code](https://github.com/workforai/UniLSeg) - Universal Segmentation at Arbitrary Granularity with Language Instruction
:blue_square: `SVOS` :orange_square: `RVOS` - [paper](https://arxiv.org/abs/2402.18115) / [code](https://github.com/MinghanLi/UniVS) - UniVS: Unified and Universal Video Segmentation with Prompts as Queries
:blue_square: `SVOS` :orange_square: `RVOS` - [paper](https://arxiv.org/abs/2312.09158) / [code](https://github.com/FoundationVision/GLEE) - General Object Foundation Model for Images and Videos at Scale
:blue_square: `SVOS` :green_square: `UVOS` - [paper](https://arxiv.org/abs/2406.04221) / [code](https://github.com/siyuanliii/masa) - Matching Anything By Segmenting Anything
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2406.08476) / [code](https://github.com/Restricted-Memory/RMem) - RMem: Restricted Memory Banks Improve Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2404.01945) / [code](https://github.com/HebeiFast/EventLowLightVOS) - Event-assisted Low-Light Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2310.12982) / [code](https://github.com/hkchengrex/Cutie) - Putting the Object Back into Video Object Segmentation
---
### AAAI 2024
:orange_square: `RVOS` :red_square: `AVOS` - [paper](https://arxiv.org/pdf/2305.16318.pdf) / [code](https://github.com/OpenGVLab/MUTR) - Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
:green_square: `UVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/28295) / code - Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation
---
### Journals 2024
:orange_square: `RVOS` - [paper](https://ieeexplore.ieee.org/document/10694805) / [code](https://github.com/Yxxxb/LAVT-RS) - `TPAMI` Language-Aware Vision Transformer for Referring Segmentation
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/10713285) / [code](https://github.com/BIT-Vision/ECOS) - `TPAMI` Continuous-time Object Segmentation using High Temporal Resolution Event Camera
---
### Earlier Arxiv 2024
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2412.19761) / [project page](https://genprop.github.io/) - Generative Video Propagation (`with applications in SVOS`)
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2412.08161) / code - Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2412.04930) / [project page](https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html) - Video Decomposition Prior: A Methodology to Decompose Videos into Layers (`with applications in UVOS`)
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2412.01136) / [project page](https://cvlab-kaist.github.io/SOLA/) - Referring Video Object Segmentation via Language-aligned Track Selection
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2411.18977) / [code](https://github.com/motern88/Det-SAM2) - Det-SAM2: Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2411.19141) / code - On Moving Object Segmentation from Monocular Video with Transformers
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2411.19210) / code - Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2411.18933) / [code](https://github.com/yformer/EfficientTAM) - Efficient Track Anything
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2411.11922) / [code](https://github.com/yangchris11/samurai) - SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2410.23287) / [project page](https://miccooper9.github.io/projects/ReferEverything/) - ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2409.18653) / [code](https://github.com/zhoustan/SAM2-VCOS) - When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2409.14343) / code - Memory Matching is not Enough: Jointly Improving Memory Matching and Decoding for Video Object Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2408.01708) / [code](https://github.com/MarkXCloud/AVESFormer) - AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2408.00169) / [code](https://github.com/Vujas-Eteph/LazyXMem) - Strike the Balance: On-the-Fly Uncertainty based User Interactions for Long-Term Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2407.14500) / [code](https://github.com/rkzheng99/ViLLa) - ViLLa: Video Reasoning Segmentation with Large Language Model
:green_square: `UVOS` - [paper](https://arxiv.org/pdf/2407.11714) / code - Improving Unsupervised Video Object Segmentation via Fake Flow Generation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2406.02345) / code - Progressive Confident Masking Attention Network for Audio-Visual Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2406.06163) / code - Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2406.12834) / code - GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation
---
### EMNLP 2023
:orange_square: `RVOS` - [paper](https://aclanthology.org/2023.emnlp-main.140.pdf) / code - Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text (`Spoken language as referring guidance`)
---
### NeurIPS 2023
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2305.17011) / [code](https://github.com/RobertLuo1/NeurIPS2023_SOC) - SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
:blue_square: `SVOS` - [paper](https://openreview.net/pdf?id=9QsdPQlWiE) / [code](https://github.com/ttt-matching-based-vos/ttt_matching_vos) - Test-time Training for Matching-based Video Object Segmentation
:blue_square: `SVOS` - [paper](https://openreview.net/pdf?id=jfsjKBDB1z) / [code](https://github.com/BGU-CS-VIL/Training-Free-VOS) - From ViT Features to Training-free Video Object Segmentation via Streaming-data Mixture Models
---
### ACM MM 2023
:green_square: `UVOS` - [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611804) / code - SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation
:green_square: `UVOS` - [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3612017) / code - Temporally Efficient Gabor Transformer for Unsupervised Video Object Segmentation
:blue_square: `SVOS` - [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611827) / code - Exploring the Adversarial Robustness of Video Object Segmentation via One-shot Adversarial Attacks
:red_square: `AVOS` - [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611724) / [code](https://github.com/aspirinone/CATR.github.io) - CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
:red_square: `AVOS` - [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3612373) / code - Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
---
### ICCV 2023
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2309.11160) / [code](https://github.com/nankepan/VIPMT) - Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2308.11796) / [code](https://github.com/SMSD75/Timetuning) - Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations (`self-supervised learning for UVOS`)
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2308.06693) / [code](https://github.com/DLUT-yyc/Isomer) - Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation
:green_square: `UVOS` - [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Su_Unsupervised_Video_Object_Segmentation_with_Online_Adversarial_Self-Tuning_ICCV_2023_paper.pdf) / code - Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning
:green_square: `UVOS` :orange_square: `RVOS` - [paper](https://arxiv.org/abs/2309.03903) / [code](https://github.com/hkchengrex/Tracking-Anything-with-DEVA) - DEVA: Tracking Anything with Decoupled Video Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2309.03473) / [code](https://github.com/Toneyaya/TempCD) - Temporal Collection and Distribution for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2207.01203) / [code](https://github.com/lxa9867/R2VOS) - Robust Referring Video Object Segmentation with Cyclic Structural Consensus
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2307.13537) / [code](https://github.com/bo-miao/SgMg) - Spectrum-guided Multi-granularity Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2307.09356) / [code](https://github.com/wudongming97/OnlineRefer) - OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2309.02041) / [code](https://github.com/hengliusky/Few_shot_RVOS) - Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Han_HTML_Hybrid_Temporal-scale_Multimodal_Learning_Framework_for_Referring_Video_Object_ICCV_2023_paper.pdf) / code - HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2308.08544) / [code & dataset](https://henghuiding.github.io/MeViS/) - MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2308.13266) / [code](https://github.com/yoxu515/MITS) - Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2307.15958) / [code](https://github.com/max810/XMem2) - XMem++: Production-level Video Segmentation From Few Annotated Frames
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2308.09903) / code - Scalable Video Object Segmentation with Simplified Framework
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Sun_Alignment_Before_Aggregation_Trajectory_Memory_Retrieval_Network_for_Video_Object_ICCV_2023_paper.pdf) / code - Alignment Before Aggregation: Trajectory Memory Retrieval Network for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2304.03284.pdf) / [code](https://github.com/baaivision/Painter) - SegGPT: Segmenting Everything In Context
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2211.10181.pdf) / [code & dataset](https://lingyihongfd.github.io/lvos.github.io/) - LVOS: A Benchmark for Long-term Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2302.01872) / [code & dataset](https://github.com/henghuiding/MOSE-api) - MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
---
### CVPR 2023
:diamond_shape_with_a_dot_inside: `VMAT` - [paper](https://arxiv.org/abs/2304.06018) / [code](https://github.com/microsoft/AdaM) - Adaptive Human Matting for Dynamic Videos
:green_square: `UVOS` - [paper](https://arxiv.org/pdf/2304.05930.pdf) / [code](https://rkyuca.github.io/medvt/) - MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2304.06211.pdf) / [code](https://github.com/wenguanwang/VOS_Correspondence) - Boosting Video Object Segmentation via Space-time Correspondence Learning
:blue_square: `SVOS` :orange_square: `RVOS` - [paper](https://arxiv.org/pdf/2303.06674.pdf) / [code](https://github.com/MasterBin-IIAU/UNINEXT) - Universal Instance Perception as Object Discovery and Retrieval
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Athar_TarViS_A_Unified_Approach_for_Target-Based_Video_Segmentation_CVPR_2023_paper.pdf) / [code](https://github.com/Ali2500/TarViS) - TarViS: A Unified Approach for Target-Based Video Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2303.12078.pdf) / [code](https://github.com/yk-pku/Two-shot-Video-Object-Segmentation) - Two-shot Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2303.07815.pdf) / code - MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2212.06826.pdf) / code - Look Before You Match: Instance Understanding Matters in Video Object Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/pdf/2212.06200.pdf) / [code & dataset](https://www.vostdataset.org/) - Breaking the “Object” in Video Object Segmentation
---
### IJCAI 2023
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2309.09501) / code - Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2305.04470) / [code & dataset](https://github.com/yoxu515/VIPOSeg-Benchmark) - Video Object Segmentation in Panoptic Wild Scenes
---
### AAAI 2023
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2212.02112.pdf) / code - Learning to Learn Better for Video Object Segmentation
---
### Journals 2023
:green_square: `UVOS` - [paper](https://ieeexplore.ieee.org/document/10298026) / [code](https://github.com/ZSVOS/HGPU) - `TIP` Hierarchical Graph Pattern Understanding for Zero-Shot Video Object Segmentation
:green_square: `UVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/10159996) / [code](https://github.com/xilin1991/CluterNet) - `TCSVT` Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering
:green_square: `UVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/10105896) / [code](https://github.com/NUST-Machine-Intelligence-Laboratory/HCPN) - `TIP` Hierarchical Co-Attention Propagation Network for Zero-Shot Video Object Segmentation
:orange_square: `RVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/9932025) / code - `TPAMI` VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
:orange_square: `RVOS` - [paper](https://ieeexplore.ieee.org/abstract/document/10083244) / [code](https://github.com/leonnnop/Locater) - `TPAMI` Local-Global Context Aware Transformer for Language-Guided Video Segmentation
---
### Earlier Arxiv 2023
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2404.19326) / [code & dataset](https://lingyihongfd.github.io/lvos.github.io/) - LVOS (v2, with more data): A Benchmark for Large-scale Long-term Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2405.10610) / code - Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2405.14010) / code - One-shot Training for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2405.07031) / code - Global Motion Understanding in Large-Scale Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2405.08715) / code - DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2404.13505) / [code](https://github.com/NUST-Machine-Intelligence-Laboratory/HVC) - Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2404.12389) / [code](https://github.com/Jyxarthur/flowsam) - Moving Object Segmentation: All You Need Is SAM (and Flow)
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2403.19407) / code - Towards Temporally Consistent Referring Video Object Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2403.14203) / code - Unsupervised Audio-Visual Segmentation with Modality Alignment
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2403.17937) / [code](https://github.com/Amshaker/MAVOS) - Efficient Video Object Segmentation via Modulated Cross-Attention Memory
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2403.06130) / [code](https://github.com/PinxueGuo/ClickVOS) - ClickVOS: Click Video Object Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2402.02327) / code - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2401.14168) / [code](https://github.com/scott-yjyang/Vivim) - Vivim: a Video Vision Mamba for Medical Video Object Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2401.13937) / code - Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2401.12480) / code - Explore Synergistic Interaction Across Frames for Interactive Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2312.17448) / [code](https://github.com/jiawen-zhu/TrackGPT) - Tracking with Human-Intent Reasoning
:blue_square: `SVOS` :orange_square: `RVOS` - [paper](https://arxiv.org/abs/2312.15715) / [code](https://github.com/FoundationVision/UniRef) - UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
:green_square: `UVOS` `Dec` - [paper](https://arxiv.org/abs/2312.11463) / code - Appearance-based Refinement for Object-Centric Motion Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2312.08514) / code - M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2311.18837) / [code](https://github.com/ChenHsing/VIDiff) - VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2311.07261) / [code](https://github.com/YRlin-12/Sketch-VOS-datasets) - Sketch-based Video Object Segmentation: Benchmark and Analysis
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2311.04414) / [code](https://eva-vos.compute.dtu.dk/) - Learning the What and How of Annotation in Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2310.03967) / code - Sub-token ViT Embedding via Stochastic Resonance Transformers (`with applications in SVOS`)
:green_square: `UVOS` - [paper](https://arxiv.org/abs/2309.14786) / [code](https://github.com/suhwan-cho/TMO) - Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation
:red_square: `AVOS` - [paper](https://arxiv.org/abs/2310.00132) / code - Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2308.13505) / code - Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
:orange_square: `RVOS` :red_square: `AVOS` - [paper](https://arxiv.org/abs/2308.04162) / [code](https://github.com/lab206/EPCFormer) - EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2308.02162) / [code](https://github.com/wangbo-zhao/WRVOS/) - Learning Referring Video Object Segmentation from Weak Annotation
:green_square: `UVOS` - [paper](https://arxiv.org/pdf/2305.12659.pdf) / code - UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2305.06558) / [code](https://github.com/z-x-yang/Segment-and-Track-Anything) - Segment and Track Anything
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2304.11968) / [code](https://github.com/gaomingqi/Track-Anything) - Track Anything: Segment Anything Meets Videos
:white_large_square: `XVOS` - [paper](https://arxiv.org/pdf/2303.14384.pdf) / [code](https://github.com/mkg1204/RHMNet-for-SSVOS) - Reliability-Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation
:blue_square: `SVOS` - [paper](https://arxiv.org/abs/2307.13974) / [code](https://github.com/jiawen-zhu/HQTrack) - Tracking Anything in High Quality
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2307.00536) / code - Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation
:orange_square: `RVOS` - [paper](https://arxiv.org/abs/2307.00997) / [code](https://github.com/LancasterLi/RefSAM) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/abs/2307.01197) / [code](https://github.com/SysCV/sam-pt) - Segment Anything Meets Point Tracking
---
### NeurIPS 2022
:blue_square: `SVOS` - [paper](https://arxiv.org/pdf/2210.09782.pdf) / [code](https://github.com/z-x-yang/AOT) - Decoupling Features in Hierarchical Propagation for Video Object Segmentation
:white_large_square: `XVOS` - [paper](https://arxiv.org/pdf/2210.12733.pdf) / code - Self-supervised Amodal Video Object Segmentation
---
### ECCV 2022
:blue_square: `SVOS` - [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880633.pdf) / [code](https://github.com/hkchengrex/XMem) - XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
:blue_square: `SVOS` - [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890603.pdf) / code - BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890462.pdf) / [code](https://github.com/workforai/QDMN) - Learning Quality-aware Dynamic Memory for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136820434.pdf) / [code](https://github.com/suhwan-cho/TBD) - Tackling Background Distraction in Video Object Segmentation
:blue_square: `SVOS` - [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890639.pdf) / [code](https://github.com/workforai/GSFM) - Global Spectral Filter Memory Network for Video Object Segmentation
:green_square: `UVOS` - [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136940584.pdf) / code - Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation
---
### CVPR 2022
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Botach_End-to-End_Referring_Video_Object_Segmentation_With_Multimodal_Transformers_CVPR_2022_paper.pdf) / [code](https://github.com/mttr2021/MTTR) - End-to-End Referring Video Object Segmentation With Multimodal Transformers
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Language_As_Queries_for_Referring_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/wjn922/ReferFormer) - Language As Queries for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Ding_Language-Bridged_Spatial-Temporal_Interaction_for_Referring_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/dzh19990407/LBDT) - Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
:orange_square: `RVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Multi-Level_Representation_Learning_With_Semantic_Alignment_for_Referring_Video_Object_CVPR_2022_paper.pdf) / code - Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Recurrent_Dynamic_Embedding_for_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/Limingxing00/RDE-VOS-CVPR2022) - Recurrent Dynamic Embedding for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Xu_Accelerating_Video_Object_Segmentation_With_Compressed_Video_CVPR_2022_paper.pdf) / [code](https://github.com/kai422/CoVOS) - Accelerating Video Object Segmentation With Compressed Video
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Lin_SWEM_Towards_Real-Time_Video_Object_Segmentation_With_Sequential_Weighted_Expectation-Maximization_CVPR_2022_paper.pdf) / code - SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization
:blue_square: `SVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Park_Per-Clip_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/pkyong95/PCVOS) - Per-Clip Video Object Segmentation
:white_large_square: `XVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Pan_Wnet_Audio-Guided_Video_Object_Segmentation_via_Wavelet-Based_Cross-Modal_Denoising_Networks_CVPR_2022_paper.pdf) / [code](https://github.com/asudahkzj/Wnet) - Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks
:white_large_square: `XVOS` - [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wei_YouMVOS_An_Actor-Centric_Multi-Shot_Video_Object_Segmentation_Dataset_CVPR_2022_paper.pdf) / [code & dataset](https://donglaiw.github.io/proj/youMVOS/) - YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset
---
### AAAI 2022
:blue_square: `SVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20009) / [code](https://github.com/LANMNG/SITVOS) - Siamese Network with Interactive Transformer for Video Object Segmentation
:blue_square: `SVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20200) / code - Reliable Propagation-Correction Modulation for Video Object Segmentation
:orange_square: `RVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20017) / code - You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation
:green_square: `UVOS` - [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20011) / code - Iteratively Selecting an Easy Reference Frame Makes Unsupervised Video Object Segmentation Easier
---
### Journals 2022
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/document/9745367) / code - `TPAMI` Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/document/9875116) / code - `TIP` From Pixels to Semantics: Self-Supervised Video Object Segmentation With Multiperspective Feature Mining
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/document/9904497) / code - `TIP` Delving Deeper Into Mask Utilization in Video Object Segmentation
:blue_square: `SVOS` - [paper](https://ieeexplore.ieee.org/document/9942927) / code - `TIP` Adaptive Online Mutual Learning Bi-Decoders for Video Object Segmentation
---
End of the list. :seedling:
VOS papers and datasets before 2022 can be found below:
>Deep Learning for Video Object Segmentation: A Review / [paper](https://link.springer.com/content/pdf/10.1007/s10462-022-10176-7.pdf) / [project page](https://github.com/gaomingqi/VOS-Review)