{"id":28392676,"url":"https://github.com/gaomingqi/awesome-video-object-segmentation","last_synced_at":"2026-02-19T06:31:49.036Z","repository":{"id":153236922,"uuid":"622287968","full_name":"gaomingqi/Awesome-Video-Object-Segmentation","owner":"gaomingqi","description":"🔥 Latest advances in Video Object Segmentation (VOS) – papers, datasets, and projects.","archived":false,"fork":false,"pushed_at":"2026-02-05T12:32:49.000Z","size":3248,"stargazers_count":460,"open_issues_count":0,"forks_count":17,"subscribers_count":25,"default_branch":"master","last_synced_at":"2026-02-05T23:55:11.271Z","etag":null,"topics":["audio-visual-segmentation","awesome-papers","awesome-papers-for-video-object-segmentation","referring-video-object-segmentation","semi-supervised-video-object-segmentation","video-matting","video-object-segmentation","video-reasoning-segmentation"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gaomingqi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-04-01T17:04:55.000Z","updated_at":"2026-02-05T12:33:42.000Z","dependencies_parsed_at":"2023-09-29T16:30:33.154Z","dependency_job_id":"e3df2f3e-8d36-443a-bff8-10ddee288eac","html_url":"https://github.com/gaomingqi/Awesome-Video-Object-Segmentation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gaomingqi/Awesome-Video-Object-Segmentation","repository_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaomingqi%2FAwesome-Video-Object-Segmentation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaomingqi%2FAwesome-Video-Object-Segmentation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaomingqi%2FAwesome-Video-Object-Segmentation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaomingqi%2FAwesome-Video-Object-Segmentation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gaomingqi","download_url":"https://codeload.github.com/gaomingqi/Awesome-Video-Object-Segmentation/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gaomingqi%2FAwesome-Video-Object-Segmentation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29604786,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T05:11:50.834Z","status":"ssl_error","status_checked_at":"2026-02-19T05:11:38.921Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-visual-segmentation","awesome-papers","awesome-papers-for-video-object-segmentation","referring-video-object-segmentation","semi-supervised-video-object-segmentation","video-matting","video-object-segmentation","video-reasoning-segmentation"],"created_at":"2025-05-31T14:46:02.613Z","updated_at":"2026-02-19T06:31:49.028Z","avatar_url":"https://github.com/gaomingqi.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=center\u003e\n\u003cimg src=\"./data/assets/logo.jpg\" width=\"350em\"/\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"\" alt=\"\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/commit-activity/m/gaomingqi/awesome-video-object-segmentation?colorB=b74e45\" /\u003e\u003c/a\u003e\n    \u003ca href=\"\" alt=\"\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/last-commit/gaomingqi/awesome-video-object-segmentation?colorB=54b345\" /\u003e\u003c/a\u003e\n    \u003ca src=\"https://img.shields.io/badge/survey_paper-PDF-a7c7e7?style=flat-square\" href=\"https://link.springer.com/article/10.1007/s10462-022-10176-7\"\u003e\n        \u003cimg src=\"https://img.shields.io/badge/survey_paper-PDF-a7c7e7\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nLatest Advances in Video Object Segmentation (VOS). 
VOS works before 2022 can be found in our survey paper:\n\n\u003eDeep Learning for Video Object Segmentation: A Review / [paper](https://link.springer.com/content/pdf/10.1007/s10462-022-10176-7.pdf) / [project page](https://github.com/gaomingqi/VOS-Review) \u003cdetails\u003e\u003csummary\u003eBibTex\u003c/summary\u003e @article{gao2023deep,\n  title={Deep learning for video object segmentation: a review},\n  author={Gao, Mingqi and Zheng, Feng and Yu, James JQ and Shan, Caifeng and Ding, Guiguang and Han, Jungong},\n  journal={Artificial Intelligence Review},\n  volume={56},\n  number={1},\n  pages={457--531},\n  year={2023},\n  publisher={Springer}\n}\n\n---\n\n:teddy_bear: We mark different VOS tasks with coloured squares:\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd style=\"width: 20%;\"\u003e:blue_square:\u003ccode\u003eSVOS\u003c/code\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 30%;\"\u003e\u003cimg src=\"data/assets/svos.gif\" alt=\"SVOS\" style=\"max-width: 100%;\" /\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 20%;\"\u003e:orange_square:\u003ccode\u003eRVOS\u003c/code\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 30%;\"\u003e\u003cimg src=\"data/assets/rvos.gif\" alt=\"RVOS\" style=\"max-width: 100%;\" /\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd style=\"width: 20%;\"\u003e:green_square:\u003ccode\u003eUVOS\u003c/code\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 30%;\"\u003e\u003cimg src=\"data/assets/uvos.gif\" alt=\"UVOS\" style=\"max-width: 100%;\" /\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 20%;\"\u003e:red_square:\u003ccode\u003eAVOS\u003c/code\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 30%;\"\u003e\u003cimg src=\"data/assets/avos.gif\" alt=\"AVOS\" style=\"max-width: 100%;\" /\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd style=\"width: 
20%;\"\u003e:diamond_shape_with_a_dot_inside:\u003ccode\u003eVMAT\u003c/code\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 30%;\"\u003e\u003cimg src=\"data/assets/vmat.gif\" alt=\"VMAT\" style=\"max-width: 100%;\" /\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 20%;\"\u003e:white_large_square:\u003ccode\u003eXVOS\u003c/code\u003e\u003c/td\u003e\n        \u003ctd style=\"width: 30%;\"\u003eOther types of VOS\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n:teddy_bear: Please feel free to send us pull requests to add VOS works.\n\n---\n\nLinks for a quick jump: [ArXiv (last 6 months)](#arxiv), 🔥[ICLR 2026](#iclr26)🔥, [AAAI 2026](#aaai26), [NeurIPS 2025](#nips25), [ACM MM 2025](#mm25), [SIGGRAPH 2025](#sig25), [ICCV 2025](#iccv25), [CVPR 2025](#cvpr25), [ICLR 2025](#iclr25), [AAAI 2025](#aaai25), [Journals 2025](#j25), [Earlier ArXiv 2025](#a25), [NeurIPS 2024](#nips24), [ACMMM 2024](#acmmm24), [ECCV 2024](#eccv24), [CVPR 2024](#cvpr24), [AAAI 2024](#aaai24), [Journals 2024](#j24), [Earlier ArXiv 2024](#earxiv24), [EMNLP 2023](#emnlp23), [NeurIPS 2023](#nips23), [ACMMM 2023](#mm23), [ICCV 2023](#iccv23), [CVPR 2023](#cvpr23), [IJCAI 2023](#ijcai23), [AAAI 2023](#aaai23), [Journals 2023](#j23), [Earlier ArXiv 2023](#earxiv23), [NeurIPS 2022](#neurips22), [ECCV 2022](#eccv22), [CVPR 2022](#cvpr22), [AAAI 2022](#aaai22), [Journals 2022](#j22)\n\n---\n### 🏁 \u003cspan id=\"workshopschallenges\"\u003eVOS Workshops and Challenges\u003c/span\u003e\n\n\u003cdetails\u003e\u003csummary\u003eNone active - Click to see history\u003c/summary\u003e\n    \n:blue_square: `SVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [LSVOS @ICCV 2025](https://lsvos.github.io/) \n\u003c/details\u003e\n\n---\n### :floppy_disk: \u003cspan id=\"dataset\"\u003eVOS Dataset\u003c/span\u003e\n\n\u003cdetails\u003e\u003csummary\u003eClick to expand\u003c/summary\u003e\n\n:blue_square: `SVOS`: 
[MOSEv2](https://www.codabench.org/competitions/10062/) (2025), [SA-V](https://ai.meta.com/datasets/segment-anything-video/) (2024), [LVOS](https://lingyihongfd.github.io/lvos.github.io/dataset.html) (2023), [MOSEv1](https://henghuiding.github.io/MOSE/) (2023), [VOST](https://www.vostdataset.org/) (2023), [VISOR](https://epic-kitchens.github.io/VISOR/) (2022), [YouTube-VOS](https://youtube-vos.org/) (2018/2019), [DAVIS](https://davischallenge.org/index.html) (2016/2017)\n\n:orange_square: `RVOS`: [MeViSv2](https://henghuiding.com/MeViS/#dataset) (2025), [ReVOS](https://github.com/cilinyan/ReVOS-api) (2024), [MeViS](https://henghuiding.github.io/MeViS/) (2023), [Ref-YouTube-VOS](https://youtube-vos.org/dataset/rvos/) (2020), [Ref-DAVIS](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/video-segmentation/video-object-segmentation-with-language-referring-expressions) (2018), [J-HMDB-Sentences](https://kgavrilyuk.github.io/publication/actor_action/) (2018), [A2D-Sentences](https://kgavrilyuk.github.io/publication/actor_action/) (2018)\n\n:green_square: `UVOS`: [DAVIS](https://davischallenge.org/index.html) (2016)\n\n:red_square: `AVOS`: [AVSBench](https://opennlplab.github.io/AVSBench/) (2022)\n\n:diamond_shape_with_a_dot_inside: `VMAT`: [VideoMatte240K](https://grail.cs.washington.edu/projects/background-matting-v2/#/datasets) (2021), [CRGNN](https://github.com/TiantianWang/VideoMatting-CRGNN) (2021)\n\n\u003c/details\u003e\n\n---\n\n### \u003cspan id=\"arxiv\"\u003eArXiv (Last 6 months)\u003c/span\u003e\n\n:orange_square: `RVOS` `Feb` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2602.12173) / [code](https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation\n\n:orange_square: `RVOS` `Feb` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2602.04454) / [code](https://github.com/iSEE-Laboratory/Seg-ReSearch) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search (`Reasoning VOS via Outside Knowledge!`)\n\n:red_square: `AVOS` `Feb` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2602.03892) / [code](https://github.com/jasongief/MQA-RefAVS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation\n\n:orange_square: `RVOS` `Feb` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2602.03595) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation\n\n:diamond_shape_with_a_dot_inside: `VMAT` `Jan` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2601.14255) / [code](https://github.com/cvlab-kaist/VideoMaMa) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VideoMaMa: Mask-Guided Video Matting via Generative Prior\n\n:blue_square: `SVOS` :orange_square: `RVOS` `Jan` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2601.09699) / [code](https://github.com/FudanCVL/SAM3-DMS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3\n\n:blue_square: `SVOS` `Jan` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2601.08831) / [project page](https://jayisaking.github.io/3AM-Page/) 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 3AM: 3egment Anything with Geometric Consistency in Videos\n\n:diamond_shape_with_a_dot_inside: `VMAT` `Dec` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2512.11782) / [project page](https://pq-yang.github.io/projects/MatAnyone2/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator\n\n:red_square: `AVOS` `Dec` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2512.20117) / [project page](https://trilarflagz.github.io/DDAVS-page/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation\n\n:orange_square: `RVOS` :red_square: `AVOS` `Dec` `TPAMI` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2512.10945) / [project page and dataset](https://henghuiding.com/MeViS/index.html) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation\n\n:blue_square: `SVOS` `Dec` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2512.08406) / [code](https://github.com/gaomingqi/sam-body4d) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos (`VOS-driven Human Mesh Recovery`)\n\n:orange_square: `RVOS` `Dec` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2512.02835) / [code](https://github.com/Clementine24/ReVSeg) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning\n\n:blue_square: `SVOS` `Nov` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2511.16618) / [code](https://github.com/jinlab-imvr/SAM2S) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking\n\n:blue_square: `SVOS` `Nov` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2511.20886) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; V2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence\n\n:orange_square: `RVOS` `Nov` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2511.21139) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ReVSeg: Referring Video Object Segmentation with Cross-Modality Proxy Queries\n\n:red_square: `AVOS` `Oct` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2510.10051) / [code](https://github.com/SitongGong/CCFormer) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Complementary and Contrastive Learning for Audio-Visual Segmentation\n\n:orange_square: `RVOS` `Oct` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2510.09274) / [code](https://github.com/Dmmm1997/MomentSeg) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding\n\n:orange_square: `RVOS` `Oct` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2510.08305) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation\n\n:orange_square: `RVOS` `Oct` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2510.07319) / 
code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Temporal Prompting Matters: Rethinking Referring Video Object Segmentation\n\n:orange_square: `RVOS` `Oct` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2510.06139) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Deforming Videos to Masks: Flow Matching for Referring Video Segmentation\n\n:red_square: `AVOS` `Sep` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2509.22740) / [code](https://github.com/jinbae-s/ACVIS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation\n\n:red_square: `AVOS` `Sep` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2509.18912) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation\n\n:orange_square: `RVOS` `Sep` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2509.13722) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Mitigating Query Selection Bias in Referring Video Object Segmentation\n\n:orange_square: `RVOS` `Sep` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2509.05751) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation\n\n:orange_square: `RVOS` :blue_square: `SVOS` `Aug` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.21809) / [code](https://github.com/google-deepmind/vocap) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VoCap: 
Video Object Captioning and Segmentation from Any Prompt\n\n:orange_square: `RVOS` `Aug` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.11955) / [code](https://github.com/Seung-Hun-Lee/SAMDWICH) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation\n\n:orange_square: `RVOS` `Aug` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.13584) / [code](https://github.com/qianqiaoai/HCD) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model\n\n:orange_square: `RVOS` `Aug` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.11538) / [code](https://github.com/SitongGong/Veason-R1) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Reinforcing Video Reasoning Segmentation to Think Before It Segments\n\n:red_square: `AVOS` `Aug` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.02149) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation\n\n:green_square: `UVOS` `Jul` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.19790) / [code](https://github.com/suhwan-cho/DepthFlow) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation\n\n:blue_square: `SVOS` `Jul` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2507.18921) / code 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback\n\n:blue_square: `SVOS` `Jul` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.07603) / [code](https://github.com/LouisFinner/HiM2SAM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking\n\n:white_large_square: `XVOS` `Jul` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.07519) / [dataset](https://volumetric-repository.labs.b-com.com/#/muvod) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation\n\n:diamond_shape_with_a_dot_inside: `VMAT` `Jul` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.04456) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; BiVM: Accurate Binarized Neural Network for Efficient Video Matting\n\n:diamond_shape_with_a_dot_inside: `VMAT` `Jun` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2506.10840) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Post-Training Quantization for Video Matting\n\n:red_square: `AVOS` `Jun` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2506.11436) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models\n\n:orange_square: `RVOS` `Jun` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2506.02356) / 
[project](https://cvlab-kaist.github.io/InterRVOS/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; InterRVOS: Interaction-aware Referring Video Object Segmentation\n\n:red_square: `AVOS` `Jun` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2506.01015) / [code](https://github.com/yyliu01/AuralSAM2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting\n\n---\n### \u003cspan id=\"iclr26\"\u003eICLR 2026\u003c/span\u003e\n:blue_square: `SVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://scontent-lhr8-1.xx.fbcdn.net/v/t39.2365-6/586186898_724834017304937_2869787384130329011_n.pdf?_nc_cat=107\u0026ccb=1-7\u0026_nc_sid=3c67a6\u0026_nc_ohc=V0ySxB4YEecQ7kNvwE9bO7k\u0026_nc_oc=Adn7bAjFdDo3pyjopE-tHSmOV5lgvaoxLNYmJtRbE9Op6gSQHJoEsg2ANithdVk2hm5mJvm_jjjSyhT0TiB9Go0h\u0026_nc_zt=14\u0026_nc_ht=scontent-lhr8-1.xx\u0026_nc_gid=Oa338lN6JufBYl5-MfIdMg\u0026oh=00_Afj4QvVDmMqGVUembmWPdxu9nWcksZ6Rjruxy28TYOm3PA\u0026oe=6923F072) / [code](https://github.com/facebookresearch/sam3) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM 3: Segment Anything with Concepts\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.15852) / [code](https://github.com/OpenIXCLab/SeC) / [dataset](https://huggingface.co/datasets/OpenIXCLab/SeCVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2602.08224) / [code](https://github.com/jingjing0419/Efficient-SAM2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Efficient-SAM2: Accelerating SAM2 with 
Object-Aware Visual Encoding and Memory Retrieval\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2510.19592) / [code](https://github.com/HYUNJS/DecAF) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2510.06139) / [code](https://github.com/xmz111/FlowRVS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Deforming Videos to Masks: Flow Matching for Referring Video Segmentation\n\n:diamond_shape_with_a_dot_inside: `VMAT` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openreview.net/pdf?id=6K08FPo2cf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Matting Anything 2: Towards Video Matting for Anything\n\n---\n\n### \u003cspan id=\"aaai26\"\u003eAAAI 2026\u003c/span\u003e\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.04418) / [code](https://github.com/jasongief/TGS-Agent) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2511.13715) / [code](https://github.com/FudanCVL/SAAS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Segment Anything Across Shots: A Method and Benchmark\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2511.19475) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Tracking and Segmenting Anything in Any Modality\n\n:orange_square: `RVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2511.16077) / [code](https://github.com/euyis1019/VideoSeg-R1) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning\n\n---\n\n### \u003cspan id=\"nips25\"\u003eNeurIPS 2025\u003c/span\u003e\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2509.18094) / [code](https://github.com/PolyU-ChenLab/UniPixel) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openreview.net/pdf?id=z9xyREqxzq) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention\n\n---\n\n### \u003cspan id=\"mm25\"\u003eACM MM 2025\u003c/span\u003e\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.22465) / [code](https://github.com/ZhengxyFlow/HMHI-Net) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation\n\n---\n\n### \u003cspan id=\"sig25\"\u003eSIGGRAPH 2025\u003c/span\u003e\n\n:diamond_shape_with_a_dot_inside: `VMAT` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2508.07905) / [code](https://github.com/aim-uofa/GVM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Generative Video Matting\n\n---\n### \u003cspan id=\"iccv25\"\u003eICCV 2025\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://arxiv.org/abs/2507.18944) / [code](https://github.com/jinlab-imvr/OASIS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2410.16268) / [code](https://github.com/Mark12Ding/SAM2Long) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/ICCV2025/papers/Baek_EVOLVE_Event-Guided_Deformable_Feature_Transfer_and_Dual-Memory_Refinement_for_Low-Light_ICCV_2025_paper.pdf) / [code](https://github.com/whdgusdl48/EVOLVE) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; EVOLVE: Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/ICCV2025/papers/Rong_MPG-SAM_2_Adapting_SAM_2_with_Mask_Priors_and_Global_ICCV_2025_paper.pdf) / [code](https://github.com/rongfu-dsb/MPG-SAM2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.19599) / [code](https://github.com/qirui-chen/RGA3-release) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Object-centric Video Question Answering with Visual Grounding and Referring (`Video LLM with applications on RVOS`)\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://arxiv.org/abs/2501.14607) / [code](https://github.com/iSEE-Laboratory/ReferDINO) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.14006) / [code](https://github.com/congvvc/InstructSeg) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.22061) / [code](https://github.com/FudanCVL/MOVE) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MOVE: Motion-Guided Few-Shot Video Object Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.22886) / [code](https://github.com/FudanCVL/OmniAVS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2507.20740) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Implicit Counterfactual Learning for Audio-Visual Segmentation\n\n---\n### \u003cspan id=\"cvpr25\"\u003eCVPR 2025\u003c/span\u003e\n\n:diamond_shape_with_a_dot_inside: `VMAT` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.14677) / [code](https://github.com/pq-yang/MatAnyone) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Stable Video Matting with Consistent Memory Propagation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://arxiv.org/abs/2411.17646) / [code](https://github.com/ClaudiaCuttano/SAMWISE) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.08549) / [code](https://github.com/SitongGong/VRS-HQ) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; The Devil is in Temporal Token: High Quality Video Reasoning Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.09754) / [code](https://github.com/Ali2500/ViCaS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.09921) / [code](https://github.com/dengandong/GroundMoRe) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2504.07962) / [code](https://github.com/GLUS-video/GLUS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Pan_Semantic_and_Sequential_Alignment_for_Referring_Video_Object_Segmentation_CVPR_2025_paper.pdf) / [code](https://github.com/tavarich/SSA) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Semantic and Sequential 
Alignment for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Fang_Decoupled_Motion_Expression_Video_Segmentation_CVPR_2025_paper.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Decoupled Motion Expression Video Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.17576) / [code](https://github.com/jovanavidenovic/DAM4SAM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; A Distractor-Aware Memory for Visual Object Tracking with SAM2\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.02818) / [code](https://github.com/uncbiag/LiVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; LiVOS: Light Video Object Segmentation with Gated Linear Matching\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2502.04144) / [project page](https://hd-epic.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; HD-EPIC: A Highly-Detailed Egocentric Video Dataset (`with long-term SVOS dataset`)\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.13803) / [project page](https://zixuan-chen.github.io/M-cube-VOS.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; M3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (`SVOS with phase transitions for embodied AI`)\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2506.01558) / [code](https://github.com/VoyageWang/SAM2LOVE) 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.12840) / [code](https://github.com/YenanLiu/DDESeg) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2506.23623) / [code](https://github.com/spyflying/VCT_AVS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Revisiting Audio-Visual Segmentation with Vision-Centric Transformer\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.12847) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Radman_TSAM_Temporal_SAM_Augmented_with_Multimodal_Prompts_for_Referring_Audio-Visual_CVPR_2025_paper.pdf) / [project](https://abdurad.github.io/TSAM/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.04623) / [code](https://github.com/Kaihua-Chen/diffusion-vas) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Using Diffusion Priors for Video Amodal Segmentation (`segment both visible and invisible (e.g., occluded) video objects`)\n\n:white_large_square: `XVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2506.01304) / [code](https://github.com/showlab/SAM-I2V) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.22268) / [code](https://github.com/nnanhuang/SegAnyMo) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Segment Any Motion in Videos\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2504.05468) / [code](https://github.com/thanosDelatolas/diff-zvos) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Studying Image Diffusion Features for Zero-Shot Video Object Segmentation\n\n---\n### \u003cspan id=\"iclr25\"\u003eICLR 2025\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/) / [code](https://github.com/facebookresearch/segment-anything-2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAM 2: Segment Anything in Images and Videos\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2410.18538) / [code](https://github.com/alimohammadiamirhossein/smite/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SMITE: Segment Me In TimE\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2407.07760) / [code](https://github.com/yahooo-m/S3) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning Spatial-Semantic Features for Robust Video Object Segmentation\n\n---\n### \u003cspan 
id=\"aaai25\"\u003eAAAI 2025\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.01471) / [project page](https://cvlab-kaist.github.io/MUG-VOS/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Multi-Granularity Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/32706) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Holistic Correction with Object Prototype for Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/32626) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Beyond Pixel and Object: Part Feature as Reference for Few-Shot Video Object Segmentation\n\n:red_square: `AVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2408.15876) / [code](https://github.com/appletea233/AL-Ref-SAM2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation\n\n---\n### \u003cspan id=\"j25\"\u003eJournals 2025\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/11311151) / [code](https://github.com/zaplm/DC-SAM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency\n\n:orange_square: `RVOS` :red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/11184493) / 
[code](https://github.com/yongliu20/MRVS_SOC) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` Semantic-Assisted Object Clustering for Multi-Modal Referring Video Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2502.12975) / [code](https://github.com/danqu130/EvInsMOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `IJCV` Instance-Level Moving Object Segmentation from a Single Image with Events\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.07806) / [code](https://github.com/hy0523/MTNet) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TNNLS` Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/10933555) / [code](https://github.com/yk-pku/Low-shot-VOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` Low-shot Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/10949703) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` JointFormer: A Unified Framework with Joint Modeling for Video Object Segmentation\n\n---\n### \u003cspan id=\"a25\"\u003eEarlier Arxiv 2025\u003c/span\u003e\n\n:orange_square: `RVOS` `May` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2505.08581) / [code](https://github.com/jinlab-imvr/ReSurgSAM2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking\n\n:orange_square: `RVOS` `May` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2505.18561) / [code](https://github.com/DanielSHKao/ThinkVideo) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts\n\n:orange_square: `RVOS` `May` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2505.12702) / [code](https://isee-laboratory.github.io/Long-RVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation\n\n:red_square: `AVOS` `May` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2505.01448) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models\n\n:blue_square: `SVOS` `May` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2505.00739) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection\n\n:blue_square: `SVOS` `Apr` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2504.16471) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory\n\n:green_square: `UVOS` `Apr` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2504.05904) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation\n\n:orange_square: `RVOS` `Mar` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.21056) 
/ code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Online Reasoning Video Segmentation with Just-in-Time Digital Twins\n\n:diamond_shape_with_a_dot_inside: `VMAT` `Mar` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.10678) / [project page](https://bio.lehanyang.info/VRMDiff.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion\n\n:diamond_shape_with_a_dot_inside: `VMAT` `Mar` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.01262) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Object-Aware Video Matting with Cross-Frame Guidance\n\n:orange_square: `RVOS` `Mar` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2503.03492) / [code](https://github.com/suhwan-cho/FindTrack) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation\n\n:orange_square: `RVOS` `Jan` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.04001) / [code](https://github.com/magic-research/Sa2VA) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2502.09660) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Towards Fine-grained Interactive Segmentation in Images and Videos\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2502.00358) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Do 
Audio-Visual Segmentation Models Truly Segment Sounding Objects?\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.13667) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.07256) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; EdgeTAM: On-Device Track Anything Model\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.07806) / [code](https://github.com/SitongGong/AVS-Mamba) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.04939) / [code](https://github.com/Choi58/MTCM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation\n\n---\n### \u003cspan id=\"nips24\"\u003eNeurIPS 2024\u003c/span\u003e\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2409.19603) / [code](https://github.com/showlab/VideoLISA) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.19806) / [code](https://github.com/SkyworkAI/Vitron) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VITRON: A Unified Pixel-level Vision LLM for 
Understanding, Generating, Segmenting, Editing (`with applications in SVOS`)\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2501.12392) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning segmentation from point trajectories\n\n---\n### \u003cspan id=\"acmmm24\"\u003eACMMM 2024\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2409.19342) / [code](https://github.com/PinxueGuo/X-Prompt) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation\n\n---\n### \u003cspan id=\"eccv24\"\u003eECCV 2024\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.06265) / [code](https://github.com/yahooo-m/VOS-Solution) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Spatial-Temporal Multi-level Association for Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.08682) / [code](https://github.com/L599wy/OneVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.12303) / [code \u0026 dataset](https://github.com/shilinyan99/PanoVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2407.11325) / [code](https://github.com/cilinyan/VISA) 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VISA: Reasoning Video Object Segmentation via Large Language Model\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.12042) / [code](https://github.com/buxiangzhiren/VD-IT) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2407.07402) / [code](https://github.com/ut-vision/ActionVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ActionVOS: Actions as Prompts for Video Object Segmentation\n\n:orange_square: `RVOS` :red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.04924) / [code \u0026 dataset](https://github.com/lxa9867/r2bench) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; R2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations\n\n:orange_square: `RVOS` :red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2407.10957) / [code](https://github.com/GeWu-Lab/Ref-AVS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2407.11820) / [code](https://github.com/GeWu-Lab/Stepping-Stones) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2311.17893) / 
[code](https://github.com/shvdiwnkozbw/SSL-UVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation\n\n---\n### \u003cspan id=\"cvpr24\"\u003eCVPR 2024\u003c/span\u003e\n\n:diamond_shape_with_a_dot_inside: `VMAT` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.16035) / [code](https://github.com/hmchuong/MaGGIe) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MaGGIe: Masked Guided Gradual Human Instance Matting\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2402.05917) / [code](https://pointvos.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Point-VOS: Pointing Up Video Object Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2310.00132) / [code](https://github.com/lxa9867/QSD) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.06462) / [code](https://github.com/yannqi/COMBO-AVS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2304.02970) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; A Closer Look at Audio-Visual Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.04258) / 
[code](https://github.com/NiFangBaAGe/DATTT) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Depth-aware Test-Time Training for Zero-shot Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2211.12036) / [code](https://github.com/Hydragon516/DPA) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Dual Prototype Attention for Unsupervised Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2303.08314) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Guided Slot Attention for Unsupervised Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.03645) / [code](https://github.com/heshuting555/DsHmp) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2306.08736) / [code](https://github.com/LinfengYuan1997/Losh) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.01623) / [code](https://github.com/workforai/UniLSeg) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Universal Segmentation at Arbitrary Granularity with Language Instruction\n\n:blue_square: `SVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2402.18115) / [code](https://github.com/MinghanLi/UniVS) 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; UniVS: Unified and Universal Video Segmentation with Prompts as Queries\n\n:blue_square: `SVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.09158) / [code](https://github.com/FoundationVision/GLEE) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; General Object Foundation Model for Images and Videos at Scale\n\n:blue_square: `SVOS` :green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2406.04221) / [code](https://github.com/siyuanliii/masa) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Matching Anything By Segmenting Anything\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2406.08476) / [code](https://github.com/Restricted-Memory/RMem) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; RMem: Restricted Memory Banks Improve Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.01945) / [code](https://github.com/HebeiFast/EventLowLightVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Event-assisted Low-Light Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2310.12982) / [code](https://github.com/hkchengrex/Cutie) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Putting the Object Back into Video Object Segmentation\n\n---\n### \u003cspan id=\"aaai24\"\u003eAAAI 2024\u003c/span\u003e\n:orange_square: `RVOS` :red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2305.16318.pdf) / [code](https://github.com/OpenGVLab/MUTR) 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/28295) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Generalizable Fourier Augmentation for Unsupervised Video Object Segmentation\n\n---\n### \u003cspan id=\"j24\"\u003eJournals 2024\u003c/span\u003e\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/10694805) / [code](https://github.com/Yxxxb/LAVT-RS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` Language-Aware Vision Transformer for Referring Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/10713285) / [code](https://github.com/BIT-Vision/ECOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` Continuous-time Object Segmentation using High Temporal Resolution Event Camera\n\n---\n### \u003cspan id=\"earxiv24\"\u003eEarlier Arxiv 2024\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.19761) / [project page](https://genprop.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Generative Video Propagation (`with applications in SVOS`)\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.08161) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://arxiv.org/abs/2412.04930) / [project page](https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Video Decomposition Prior: A Methodology to Decompose Videos into Layers (`with applications in UVOS`)\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2412.01136) / [project page](https://cvlab-kaist.github.io/SOLA/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Referring Video Object Segmentation via Language-aligned Track Selection\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.18977) / [code](https://github.com/motern88/Det-SAM2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Det-SAM2: Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.19141) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; On Moving Object Segmentation from Monocular Video with Transformers\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.19210) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.18933) / [code](https://github.com/yformer/EfficientTAM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Efficient Track Anything\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2411.11922) / 
[code](https://github.com/yangchris11/samurai) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2410.23287) / [project page](https://miccooper9.github.io/projects/ReferEverything/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ReferEverything: Towards Segmenting Everything We Can Speak of in Videos\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2409.18653) / [code](https://github.com/zhoustan/SAM2-VCOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2409.14343) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Memory Matching is not Enough: Jointly Improving Memory Matching and Decoding for Video Object Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2408.01708) / [code](https://github.com/MarkXCloud/AVESFormer) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2408.00169) / [code](https://github.com/Vujas-Eteph/LazyXMem) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Strike the Balance: On-the-Fly Uncertainty based User Interactions for Long-Term Video Object Segmentation\n\n:orange_square: `RVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2407.14500) / [code](https://github.com/rkzheng99/ViLLa) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ViLLa: Video Reasoning Segmentation with Large Language Model\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2407.11714) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Improving Unsupervised Video Object Segmentation via Fake Flow Generation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2406.02345) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Progressive Confident Masking Attention Network for Audio-Visual Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2406.06163) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2406.12834) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation\n\n---\n### \u003cspan id=\"emnlp23\"\u003eEMNLP 2023\u003c/span\u003e\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://aclanthology.org/2023.emnlp-main.140.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text (``Spoken language as referring guidance``)\n\n---\n### \u003cspan id=\"nips23\"\u003eNeurIPS 2023\u003c/span\u003e\n:orange_square: `RVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2305.17011) / [code](https://github.com/RobertLuo1/NeurIPS2023_SOC) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openreview.net/pdf?id=9QsdPQlWiE) / [code](https://github.com/ttt-matching-based-vos/ttt_matching_vos) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Test-time Training for Matching-based Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openreview.net/pdf?id=jfsjKBDB1z) / [code](https://github.com/BGU-CS-VIL/Training-Free-VOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; From ViT Features to Training-free Video Object Segmentation via Streaming-data Mixture Models\n\n---\n### \u003cspan id=\"mm23\"\u003eACM MM 2023\u003c/span\u003e\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611804) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3612017) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Temporally Efficient Gabor Transformer for Unsupervised Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611827) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Exploring the Adversarial Robustness 
of Video Object Segmentation via One-shot Adversarial Attacks\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3611724) / [code](https://github.com/aspirinone/CATR.github.io) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://dl.acm.org/doi/pdf/10.1145/3581783.3612373) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics\n\n---\n\n### \u003cspan id=\"iccv23\"\u003eICCV 2023\u003c/span\u003e\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.11160) / [code](https://github.com/nankepan/VIPMT) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.11796) / [code](https://github.com/SMSD75/Timetuning) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations (`self-supervised learning for UVOS`)\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.06693) / [code](https://github.com/DLUT-yyc/Isomer) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Su_Unsupervised_Video_Object_Segmentation_with_Online_Adversarial_Self-Tuning_ICCV_2023_paper.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning\n\n:green_square: `UVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.03903) / [code](https://github.com/hkchengrex/Tracking-Anything-with-DEVA) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; DEVA: Tracking Anything with Decoupled Video Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.03473) / [code](https://github.com/Toneyaya/TempCD) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Temporal Collection and Distribution for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2207.01203) / [code](https://github.com/lxa9867/R2VOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Robust Referring Video Object Segmentation with Cyclic Structural Consensus\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.13537) / [code](https://github.com/bo-miao/SgMg) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Spectrum-guided Multi-granularity Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.09356) / [code](https://github.com/wudongming97/OnlineRefer) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation\n\n:orange_square: `RVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.02041) / [code](https://github.com/hengliusky/Few_shot_RVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Han_HTML_Hybrid_Temporal-scale_Multimodal_Learning_Framework_for_Referring_Video_Object_ICCV_2023_paper.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.08544) / [code \u0026 dataset](https://henghuiding.github.io/MeViS/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.13266) / [code](https://github.com/yoxu515/MITS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.15958) / [code](https://github.com/max810/XMem2) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; XMem++: Production-level Video Segmentation From Few Annotated Frames\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.09903) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Scalable Video Object 
Segmentation with Simplified Framework\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Sun_Alignment_Before_Aggregation_Trajectory_Memory_Retrieval_Network_for_Video_Object_ICCV_2023_paper.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Alignment Before Aggregation: Trajectory Memory Retrieval Network for Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2304.03284.pdf) / [code](https://github.com/baaivision/Painter) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SegGPT: Segmenting Everything In Context\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2211.10181.pdf) / [code \u0026 dataset](https://lingyihongfd.github.io/lvos.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; LVOS: A Benchmark for Long-term Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2302.01872) / [code \u0026 dataset](https://github.com/henghuiding/MOSE-api) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MOSE: A New Dataset for Video Object Segmentation in Complex Scenes\n\n---\n\n### \u003cspan id=\"cvpr23\"\u003eCVPR 2023\u003c/span\u003e\n\n:diamond_shape_with_a_dot_inside: `VMAT` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2304.06018) / [code](https://github.com/microsoft/AdaM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Adaptive Human Matting for Dynamic Videos\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2304.05930.pdf) / [code](https://rkyuca.github.io/medvt/) 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2304.06211.pdf) / [code](https://github.com/wenguanwang/VOS_Correspondence) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Boosting Video Object Segmentation via Space-time Correspondence Learning\n\n:blue_square: `SVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2303.06674.pdf) / [code](https://github.com/MasterBin-IIAU/UNINEXT) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Universal Instance Perception as Object Discovery and Retrieval \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2023/papers/Athar_TarViS_A_Unified_Approach_for_Target-Based_Video_Segmentation_CVPR_2023_paper.pdf) / [code](https://github.com/Ali2500/TarViS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; TarViS: A Unified Approach for Target-Based Video Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2303.12078.pdf) / [code](https://github.com/yk-pku/Two-shot-Video-Object-Segmentation) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Two-shot Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2303.07815.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://arxiv.org/pdf/2212.06826.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Look Before You Match: Instance Understanding Matters in Video Object Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2212.06200.pdf) / [code \u0026 dataset](https://www.vostdataset.org/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Breaking the “Object” in Video Object Segmentation\n\n---\n\n### \u003cspan id=\"ijcai23\"\u003eIJCAI 2023\u003c/span\u003e\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.09501) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2305.04470) / [code \u0026 dataset](https://github.com/yoxu515/VIPOSeg-Benchmark) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Video Object Segmentation in Panoptic Wild Scenes\n\n---\n\n### \u003cspan id=\"aaai23\"\u003eAAAI 2023\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2212.02112.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning to Learn Better for Video Object Segmentation\n\n---\n\n### \u003cspan id=\"j23\"\u003eJournals 2023\u003c/span\u003e\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/10298026) / [code](https://github.com/ZSVOS/HGPU) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TIP` Hierarchical Graph Pattern Understanding for Zero-Shot Video Object Segmentation\n\n:green_square: `UVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/10159996) / [code](https://github.com/xilin1991/CluterNet) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TCSVT` Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/10105896) / [code](https://github.com/NUST-Machine-Intelligence-Laboratory/HCPN) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TIP` Hierarchical Co-Attention Propagation Network for Zero-Shot Video Object Segmentation \n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/9932025) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` VLT: Vision-Language Transformer and Query Generation for Referring Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/abstract/document/10083244) / [code](https://github.com/leonnnop/Locater) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` Local-Global Context Aware Transformer for Language-Guided Video Segmentation\n\n\n### \u003cspan id=\"earxiv23\"\u003eEarlier Arxiv 2023\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.19326) / [code and dataset](https://lingyihongfd.github.io/lvos.github.io/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; LVOS (v2, with more data): A Benchmark for Large-scale Long-term Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2405.10610) / code 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Driving Referring Video Object Segmentation with Vision-Language Pre-trained Models\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2405.14010) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; One-shot Training for Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2405.07031) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Global Motion Understanding in Large-Scale Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2405.08715) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.13505) / [code](https://github.com/NUST-Machine-Intelligence-Laboratory/HVC) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp;  Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2404.12389) / [code](https://github.com/Jyxarthur/flowsam) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Moving Object Segmentation: All You Need Is SAM (and Flow)\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.19407) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Towards Temporally Consistent Referring Video Object Segmentation\n\n:red_square: `AVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.14203) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Unsupervised Audio-Visual Segmentation with Modality Alignment\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.17937) / [code](https://github.com/Amshaker/MAVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Efficient Video Object Segmentation via Modulated Cross-Attention Memory\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2403.06130) / [code](https://github.com/PinxueGuo/ClickVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; ClickVOS: Click Video Object Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2402.02327) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2401.14168) / [code](https://github.com/scott-yjyang/Vivim) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Vivim: a Video Vision Mamba for Medical Video Object Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2401.13937) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Self-supervised Video Object Segmentation with Distillation Learning of Deformable Attention\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2401.12480) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Explore Synergistic 
Interaction Across Frames for Interactive Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.17448) / [code](https://github.com/jiawen-zhu/TrackGPT) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Tracking with Human-Intent Reasoning\n\n:blue_square: `SVOS` :orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.15715) / [code](https://github.com/FoundationVision/UniRef) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces\n\n:green_square: `UVOS` `Dec` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.11463) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Appearance-based Refinement for Object-Centric Motion Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2312.08514) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; M3T: Multi-Scale Memory Matching for Video Object Segmentation and Tracking\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2311.18837) / [code](https://github.com/ChenHsing/VIDiff) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models \n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2311.07261) / [code](https://github.com/YRlin-12/Sketch-VOS-datasets) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Sketch-based Video Object Segmentation: Benchmark and Analysis\n\n:white_large_square: `XVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2311.04414) / [code](https://eva-vos.compute.dtu.dk/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning the What and How of Annotation in Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2310.03967) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Sub-token ViT Embedding via Stochastic Resonance Transformers (`supports SVOS`)\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2309.14786) / [code](https://github.com/suhwan-cho/TMO) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation\n\n:red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2310.00132) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.13505) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation\n\n:orange_square: `RVOS` :red_square: `AVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2308.04162) / [code](https://github.com/lab206/EPCFormer) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://arxiv.org/abs/2308.02162) / [code](https://github.com/wangbo-zhao/WRVOS/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning Referring Video Object Segmentation from Weak Annotation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2305.12659.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2305.06558) / [code](https://github.com/z-x-yang/Segment-and-Track-Anything) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Segment and Track Anything\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2304.11968) / [code](https://github.com/gaomingqi/Track-Anything) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp;  Track Anything: Segment Anything Meets Videos\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2303.14384.pdf) / [code](https://github.com/mkg1204/RHMNet-for-SSVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Reliability-Hierarchical Memory Network for Scribble-Supervised Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.13974) / [code](https://github.com/jiawen-zhu/HQTrack) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Tracking Anything in High Quality\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.00536) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.00997) / [code](https://github.com/LancasterLi/RefSAM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/abs/2307.01197) / [code](https://github.com/SysCV/sam-pt) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Segment Anything Meets Point Tracking\n\n---\n\n### \u003cspan id=\"neurips22\"\u003eNeurIPS 2022\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2210.09782.pdf) / [code](https://github.com/z-x-yang/AOT)  \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Decoupling Features in Hierarchical Propagation for Video Object Segmentation\n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://arxiv.org/pdf/2210.12733.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Self-supervised Amodal Video Object Segmentation\n\n---\n\n### \u003cspan id=\"eccv22\"\u003eECCV 2022\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880633.pdf) / [code](https://github.com/hkchengrex/XMem) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; 
[paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890603.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890462.pdf) / [code](https://github.com/workforai/QDMN) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Learning Quality-aware Dynamic Memory for Video Object Segmentation \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136820434.pdf) / [code](https://github.com/suhwan-cho/TBD) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Tackling Background Distraction in Video Object Segmentation\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890639.pdf) / [code](https://github.com/workforai/GSFM) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Global Spectral Filter Memory Network for Video Object Segmentation\n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136940584.pdf)  / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation \n\n---\n\n### \u003cspan id=\"cvpr22\"\u003eCVPR 2022\u003c/span\u003e\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Botach_End-to-End_Referring_Video_Object_Segmentation_With_Multimodal_Transformers_CVPR_2022_paper.pdf) / 
[code](https://github.com/mttr2021/MTTR) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; End-to-End Referring Video Object Segmentation With Multimodal Transformers \n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Language_As_Queries_for_Referring_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/wjn922/ReferFormer) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Language As Queries for Referring Video Object Segmentation\n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Ding_Language-Bridged_Spatial-Temporal_Interaction_for_Referring_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/dzh19990407/LBDT) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation \n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wu_Multi-Level_Representation_Learning_With_Semantic_Alignment_for_Referring_Video_Object_CVPR_2022_paper.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Multi-Level Representation Learning With Semantic Alignment for Referring Video Object Segmentation \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Recurrent_Dynamic_Embedding_for_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/Limingxing00/RDE-VOS-CVPR2022) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Recurrent Dynamic Embedding for Video Object Segmentation\n\n:blue_square: `SVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Xu_Accelerating_Video_Object_Segmentation_With_Compressed_Video_CVPR_2022_paper.pdf) / [code](https://github.com/kai422/CoVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Accelerating Video Object Segmentation With Compressed Video \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Lin_SWEM_Towards_Real-Time_Video_Object_Segmentation_With_Sequential_Weighted_Expectation-Maximization_CVPR_2022_paper.pdf) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Park_Per-Clip_Video_Object_Segmentation_CVPR_2022_paper.pdf) / [code](https://github.com/pkyong95/PCVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Per-Clip Video Object Segmentation \n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Pan_Wnet_Audio-Guided_Video_Object_Segmentation_via_Wavelet-Based_Cross-Modal_Denoising_Networks_CVPR_2022_paper.pdf) / [code](https://github.com/asudahkzj/Wnet) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks \n\n:white_large_square: `XVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Wei_YouMVOS_An_Actor-Centric_Multi-Shot_Video_Object_Segmentation_Dataset_CVPR_2022_paper.pdf) / [code \u0026 
dataset](https://donglaiw.github.io/proj/youMVOS/) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset\n\n---\n\n### \u003cspan id=\"aaai22\"\u003eAAAI 2022\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20009) / [code](https://github.com/LANMNG/SITVOS) \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Siamese Network with Interactive Transformer for Video Object Segmentation \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20200) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Reliable Propagation-Correction Modulation for Video Object Segmentation \n\n:orange_square: `RVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20017) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation \n\n:green_square: `UVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ojs.aaai.org/index.php/AAAI/article/view/20011) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; Iteratively Selecting an Easy Reference Frame Makes Unsupervised Video Object Segmentation Easier \n\n---\n\n### \u003cspan id=\"j22\"\u003eJournals 2022\u003c/span\u003e\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/9745367) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TPAMI` Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels\n\n:blue_square: `SVOS` 
\u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/9875116) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TIP` From Pixels to Semantics: Self-Supervised Video Object Segmentation With Multiperspective Feature Mining\n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/9904497) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TIP` Delving Deeper Into Mask Utilization in Video Object Segmentation \n\n:blue_square: `SVOS` \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; [paper](https://ieeexplore.ieee.org/document/9942927) / code \u0026nbsp;\u0026nbsp;\u0026nbsp;-\u0026nbsp;\u0026nbsp;\u0026nbsp; `TIP` Adaptive Online Mutual Learning Bi-Decoders for Video Object Segmentation \n\n---\n\nEnd of the list. :seedling: \n\nVOS papers and datasets from before 2022 can be found below:\n\n\u003eDeep Learning for Video Object Segmentation: A Review / [paper](https://link.springer.com/content/pdf/10.1007/s10462-022-10176-7.pdf) / [project page](https://github.com/gaomingqi/VOS-Review) \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgaomingqi%2Fawesome-video-object-segmentation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgaomingqi%2Fawesome-video-object-segmentation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgaomingqi%2Fawesome-video-object-segmentation/lists"}