awesome-rvos
Referring Video Object Segmentation / Multi-Object Tracking Repo
https://github.com/JerryX1110/awesome-rvos
- Host: GitHub
- URL: https://github.com/JerryX1110/awesome-rvos
- Owner: JerryX1110
- License: mit
- Created: 2021-12-11T09:07:05.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-07-27T05:08:14.000Z (almost 2 years ago)
- Last Synced: 2024-05-21T03:12:20.310Z (about 1 year ago)
- Topics: image, linguistic, multi-modal, multimodal-deep-learning, refer-segmentation, refer-vos, refering-seg, rvos, segmentation, text, video, visual-grounding, youtube-vos
- Language: Python
- Homepage:
- Size: 79.1 KB
- Stars: 81
- Watchers: 6
- Forks: 4
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- ultimate-awesome - awesome-rvos - Referring Video Object Segmentation / Multi-Object Tracking Repo. (Other Lists / Julia Lists)
README
# Awesome-Referring-Video-Object-Segmentation / Tracking [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
Welcome to star ⭐, comment 💬 & share 🔗 !!
```diff
- 2021.12.12: Recent papers (from 2021)
- Contributions are welcome if any information is missing.
```

---
## Introduction

**Referring video object segmentation (RVOS)** aims at **segmenting an object in a video with language expressions**.
Unlike conventional video object segmentation, the task exploits a different type of supervision, language expressions, **to identify and segment the object referred to by the given language expression in a video**. A detailed explanation of the task can be found in the following paper:
* Seonguk Seo, Joon-Young Lee, Bohyung Han, "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark", [ECCV 2020].
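To make the input/output contract concrete, here is a minimal sketch of it in Python; the `model` object and its `predict()` call are hypothetical placeholders, not the API of any repo listed below:

```python
def segment_referred_object(model, frames, expression):
    """Illustrate the RVOS input/output contract: one video plus one
    language expression in, one binary mask per frame out.

    model:      any RVOS model exposing a per-frame predict() call
                (hypothetical interface, not from a specific repo)
    frames:     list of H x W x 3 uint8 arrays (the video frames)
    expression: free-form string, e.g. "the man in the red shirt"
    returns:    list of H x W boolean masks, one per frame
    """
    masks = []
    for frame in frames:
        # Real RVOS models fuse visual and linguistic features and use
        # temporal context across frames, not frame-by-frame prediction.
        mask = model.predict(frame, expression)  # hypothetical call
        masks.append(mask.astype(bool))
    return masks
```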
## Impressive Works Related to Referring Video Object Segmentation (RVOS)
* **R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency** [ICCV 2023]: [Repo](https://github.com/lxa9867/R2VOS)
* **Spectrum-guided Multi-granularity Referring Video Object Segmentation** [ICCV 2023]:

* **OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation** [ICCV 2023]:

* **Decoupling Multimodal Transformers for Referring Video Object Segmentation** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10147907)
* **Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10146303)
* **Referring Video Segmentation with (Optional) Arbitrary Modality as Query for Fusion** [arXiv](https://arxiv.org/pdf/2207.05580.pdf)

* **VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [PAMI23]**
* **Multi-Attention Network for Compressed Video Referring Object Segmentation** [ACM MM 2022]:
* **Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation [CVPR 2022]**:

* **Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation [CVPR 2022]**

* **Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [CVPR 2022]**:
* **Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [arXiv 2022]**:
* **Local-Global Context Aware Transformer for Language-Guided Video Segmentation [arXiv 2022]**:

* **ReferFormer [CVPR 2022]**:

* **MTTR [CVPR 2022]**:
* **YOFO [AAAI 2022]**:
You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation

* **ClawCraneNet [arXiv]**:

* **PMINet [CVPRW 2021]**:
* **RVOS challenge 1st-place model [CVPRW 2021]**:

* **CMPC-V [PAMI 2021]**:
Cross-modal progressive comprehension for referring segmentation:
* **HINet [BMVC 2021]**:

* **URVOS [ECCV 2020]**:
## Impressive Works Related to Referring Image Segmentation (RIS)
* **LAVT: Language-Aware Vision Transformer for Referring Image Segmentation**:


* **SeqTR: A Simple yet Universal Network for Visual Grounding**:


## Impressive Works Related to Referring Multi-Object Tracking (RMOT)
* **Referring Multi-Object Tracking** [CVPR 2023]:
## Benchmark
[The 3rd Large-scale Video Object Segmentation - Track 3: Referring Video Object Segmentation](https://competitions.codalab.org/competitions/29139#results)
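Submissions are ranked by region similarity J and contour accuracy F (reported together as J&F). As a rough illustration only (not the official evaluation code), the J term is the per-frame intersection-over-union of predicted and ground-truth masks, averaged over frames:

```python
import numpy as np

def region_similarity_j(pred_masks, gt_masks):
    """Mean intersection-over-union (the J term of J&F) over
    per-frame boolean masks. Frames where prediction and ground
    truth are both empty count as a perfect score."""
    scores = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        scores.append(1.0 if union == 0 else inter / union)
    return float(np.mean(scores))
```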
## Datasets

[Refer-YouTube-VOS-datasets](https://drive.google.com/drive/folders/1J45ubR8Y24wQ6dzKOTkfpd9GS_F9A2kb)
* **YouTube-VOS**:
```shell
# use the raw URL so wget fetches the script itself rather than the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_YTVOS_w_refer.py
python down_YTVOS_w_refer.py
```

Folder structure:
```latex
${current_path}/
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
```
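Once downloaded, the text annotations are easy to inspect. A minimal sketch, assuming the standard Refer-YouTube-VOS schema in which a top-level "videos" dict maps each video id to its "frames" list and "expressions" dict:

```python
import json

# Assumes the folder structure above.
with open("refer_youtube_vos/meta_expressions/train/meta_expressions.json") as f:
    meta = json.load(f)

# Print the expressions attached to the first few videos.
for video_id, video in list(meta["videos"].items())[:3]:
    print(video_id, f"({len(video['frames'])} frames)")
    for exp_id, exp in video["expressions"].items():
        print("  ", exp_id, "->", exp["exp"])
```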
* **A2D-Sentences**:
REPO:
paper:

Citation:
```latex
@misc{gavrilyuk2018actor,
title={Actor and Action Video Segmentation from a Sentence},
author={Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},
year={2018},
eprint={1803.07485},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
License: The dataset may not be republished in any form without the written consent of the authors. ([README](https://web.eecs.umich.edu/~jjcorso/r/a2d/files/README))
Dataset and Annotation (version 1.0, 1.9GB, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz))
Evaluation Toolkit (version 1.0, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_eval_1_0.tar.bz))

```shell
mkdir a2d_sentences
cd a2d_sentences
wget https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz
tar jxvf A2D_main_1_0.tar.bz
mkdir text_annotations
cd text_annotations
wget https://kgavrilyuk.github.io/actor_action/a2d_annotation.txt
wget https://kgavrilyuk.github.io/actor_action/a2d_missed_videos.txt
# use the raw URL so wget fetches the script itself rather than the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_a2d_annotation_with_instances.py
python down_a2d_annotation_with_instances.py
unzip a2d_annotation_with_instances.zip
#rm a2d_annotation_with_instances.zip
cd ..
cd ..
```
Folder structure:
```latex
${current_path}/
└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4 (video files)
    └── text_annotations/
        ├── a2d_annotation.txt (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotations files)
```
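The sentence annotations in a2d_annotation.txt are plain CSV. A minimal reading sketch, assuming the released header columns video_id, instance_id, and query (verify against your copy):

```python
import csv

# Assumes the folder structure above.
with open("a2d_sentences/text_annotations/a2d_annotation.txt") as f:
    for row in csv.DictReader(f):
        # Each row ties one object instance in one video to one sentence.
        print(row["video_id"], row["instance_id"], row["query"])
```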
Citation:
```latex
@inproceedings{YaXuCaCVPR2017,
author = {Yan, Y. and Xu, C. and Cai, D. and {\bf Corso}, {\bf J. J.}},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking},
year = {2017}
}
@inproceedings{XuCoCVPR2016,
author = {Xu, C. and {\bf Corso}, {\bf J. J.}},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Actor-Action Semantic Segmentation with Grouping-Process Models},
year = {2016}
}
@inproceedings{XuHsXiCVPR2015,
author = {Xu, C. and Hsieh, S.-H. and Xiong, C. and {\bf Corso}, {\bf J. J.}},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
poster = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D_poster.pdf},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Can Humans Fly? {Action} Understanding with Multiple Classes of Actors},
url = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D.pdf},
year = {2015}
}
```

* **J-HMDB**:

Downloading script:
```shell
mkdir jhmdb_sentences
cd jhmdb_sentences
wget http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz
wget https://kgavrilyuk.github.io/actor_action/jhmdb_annotation.txt
wget http://files.is.tue.mpg.de/jhmdb/puppet_mask.zip
tar -xzvf Rename_Images.tar.gz
unzip puppet_mask.zip
cd ..
```
Folder structure:
```latex
${current_path}/
└── jhmdb_sentences/
    ├── Rename_Images/ (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/ (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt (text annotations)
```
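The puppet masks ship as MATLAB files, one folder per action and video. A minimal loading sketch with SciPy, assuming each video folder holds a puppet_mask.mat whose part_mask array stacks one binary mask per frame (the key commonly used by RVOS data loaders; verify against your download). The path below is a hypothetical example:

```python
from scipy.io import loadmat

# Hypothetical example path: puppet_mask/<action>/<video>/puppet_mask.mat;
# substitute any video folder from your download.
mat = loadmat("jhmdb_sentences/puppet_mask/brush_hair/some_video/puppet_mask.mat")
masks = mat["part_mask"]          # assumed key; shape H x W x num_frames
print(masks.shape, masks.dtype)   # one binary mask per frame
```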
Citation:
```latex
@inproceedings{Jhuang:ICCV:2013,
title = {Towards understanding action recognition},
author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
booktitle = {International Conf. on Computer Vision (ICCV)},
month = Dec,
pages = {3192-3199},
year = {2013}
}
```

* **refer-DAVIS16/17**: [arXiv](https://arxiv.org/pdf/1803.08006.pdf)