# Awesome-Referring-Video-Object-Segmentation / Tracking [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

Welcome to star ⭐, comment 💹 & share 😀 !!

```diff
- 2021.12.12: Recent papers (from 2021)
- Contributions are welcome if any information is missing. 😎
```

---

## Introduction

![image](https://user-images.githubusercontent.com/65257938/145671552-f3d3dad7-77e4-4f12-98de-016cc1184976.png)

**Referring video object segmentation** aims at **segmenting an object in a video given a natural language expression**.

Unlike conventional video object segmentation, this task exploits a different type of supervision, language expressions, **to identify and segment the object referred to by the given expression in a video**. A detailed explanation of the task can be found in the following paper; a minimal input/output sketch is given after the reference.

* Seonguk Seo, Joon-Young Lee, Bohyung Han, "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark", [ECCV 2020]
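
To make the task's input/output contract concrete, here is a minimal sketch. Everything in it is illustrative rather than taken from any paper's codebase: the `model.predict` call is a hypothetical stand-in for an RVOS model, and `region_similarity` computes the standard J measure (mask IoU) used to score predictions on the benchmarks below.

```python
import numpy as np

def segment_referred_object(frames, expression, model):
    """Illustrative RVOS interface; `model.predict` is hypothetical.

    frames:     list of H x W x 3 uint8 arrays (the video, in frame order)
    expression: a referring expression, e.g. "a person on the left riding a bike"
    returns:    one H x W boolean mask per frame for the referred object
    """
    masks = model.predict(frames, expression)
    assert len(masks) == len(frames), "RVOS predicts a mask for every frame"
    return masks

def region_similarity(pred_mask, gt_mask):
    """J metric (mask IoU), the region measure used by RVOS benchmarks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```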

## Impressive Works Related to Referring Video Object Segmentation (RVOS)

* **R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency** [ICCV 2023]: [Repo](https://github.com/lxa9867/R2VOS)

* **Spectrum-guided Multi-granularity Referring Video Object Segmentation** [ICCV 2023]:

![Screenshot 2023-07-27 130509](https://github.com/JerryX1110/awesome-rvos/assets/65257938/f433a61b-28a7-4567-bcb4-0c47c56b46a0)

* **OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation** [ICCV 2023]:

![Screenshot 2023-07-27 130535](https://github.com/JerryX1110/awesome-rvos/assets/65257938/6514e4c0-8b74-4952-9a1c-3068009c76ba)

* **Decoupling Multimodal Transformers for Referring Video Object Segmentation** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10147907)

* **Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10146303)

* **Referring Video Segmentation with (Optional) Arbitrary Modality as Query for Fusion** [arXiv](https://arxiv.org/pdf/2207.05580.pdf)

![image](https://github.com/JerryX1110/awesome-rvos/assets/65257938/9ef9a74d-ade9-4ed4-86ac-0b7a830402f0)

* **VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [PAMI23]**
![VLT_TPAMI](https://user-images.githubusercontent.com/65257938/224544601-94279d80-bcbd-483d-8dce-eb50d4a936e5.png)

* **Multi-Attention Network for Compressed Video Referring Object Segmentation** [ACM MM 2022]:
![image](https://user-images.githubusercontent.com/65257938/181689600-61961aa6-98d7-4234-8bc7-35935a13223c.png)

* **Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation [CVPR 2022]**:

![image](https://user-images.githubusercontent.com/65257938/172786658-559618b5-0163-454b-a3d5-086cb2dc1030.png)

* **Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation [CVPR 2022]**

![image](https://user-images.githubusercontent.com/65257938/170071420-cc703191-ff41-4c8b-982c-4416d5456d46.png)

* **Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/162581724-48d9afe6-71a4-4987-a0e2-9bb818068608.png)

* **Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [arXiv 2022]**:
![image](https://user-images.githubusercontent.com/65257938/161317831-4b11f548-d0bc-48cf-92cd-3010a374abdf.png)

* **Local-Global Context Aware Transformer for Language-Guided Video Segmentation [arXiv 2022]**:

![image](https://user-images.githubusercontent.com/65257938/159480646-02525835-13df-44ec-86ab-4256edd45993.png)
![image](https://user-images.githubusercontent.com/65257938/159480719-e40eeac2-1e08-43e8-989f-e48f25cb05bd.png)

* **ReferFormer [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/148010130-db43e7e1-464e-4858-9aec-6a487c87b170.png)
![image](https://user-images.githubusercontent.com/65257938/148010352-9f642f2b-eca4-46a8-b131-98847f0c5237.png)

* **MTTR [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/145671132-1a2c014e-6563-4f2e-91bd-cd58ed999a0a.png)

* **YOFO (You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation) [AAAI 2022]**:
![image](https://user-images.githubusercontent.com/65257938/155121320-338d9b02-edac-4288-ae52-bf9e6a7f90d7.png)
![image](https://user-images.githubusercontent.com/65257938/155121513-37b6edd2-bd0a-45fc-8ce7-8d5beefc1bf6.png)

* **ClawCraneNet [arXiv]**:
![image](https://user-images.githubusercontent.com/65257938/157188461-437c2360-55a8-4c7d-89d1-8b3819a323f0.png)
![image](https://user-images.githubusercontent.com/65257938/157188239-2a9d25f6-ae1b-4727-9250-060414dab17d.png)

* **PMINet [CVPRW 2021]**:
![image](https://user-images.githubusercontent.com/65257938/145671186-0515bf89-1d71-4155-b3f9-27d6903e3f31.png)

* **RVOS challenge 1st model [CVPRW 2021]**:

![image](https://user-images.githubusercontent.com/65257938/155835526-ec3410a4-4004-410d-8e5f-c24e31404b1e.png)
![image](https://user-images.githubusercontent.com/65257938/155835539-8ba742e4-4cc4-4c7d-9d70-135578602936.png)

* **CMPC-V [PAMI 2021]**:

Cross-modal progressive comprehension for referring segmentation:
![image](https://user-images.githubusercontent.com/65257938/145671302-40924570-9cd2-4ffa-84d3-5bd11b95358d.png)

* **HINet [BMVC 2021]**:
![image](https://user-images.githubusercontent.com/65257938/151321471-05d4be4d-1dde-4ea2-a68c-ce3bf19552f8.png)
![image](https://user-images.githubusercontent.com/65257938/151321516-d7e8649a-6eba-460a-af12-f3c6e54e1271.png)

* **URVOS [ECCV 2020]**:
![image](https://user-images.githubusercontent.com/65257938/145671358-229d8e56-8d40-4cc1-bb4f-58bbff38a452.png)

## Impressive Works Related to Referring Image Segmentation (RIS)
* **LAVT: Language-Aware Vision Transformer for Referring Image Segmentation**:
![image](https://user-images.githubusercontent.com/65257938/162582078-46912f60-875b-4c0c-9c81-1643465b6b18.png)
![image](https://user-images.githubusercontent.com/65257938/162582091-44609242-de0b-4526-bad6-1b036902f9c9.png)
![image](https://user-images.githubusercontent.com/65257938/162582107-0fcca4fa-b3a6-4e41-8783-41cd6a422183.png)

* **SeqTR: A Simple yet Universal Network for Visual Grounding**:

![image](https://user-images.githubusercontent.com/65257938/166140318-56247577-b12a-4a5f-bcb7-1391dc6e1fce.png)

![image](https://user-images.githubusercontent.com/65257938/166140328-28ee3209-51c1-4eb8-99b1-3fef9af05d7f.png)

## Impressive Works Related to Referring Multi-Object Tracking (RMOT)
* **Referring Multi-Object Tracking** [CVPR 2023]:

![image](https://user-images.githubusercontent.com/65257938/224093075-c48774c0-a17d-4e51-8821-7b74889f1c90.png)

## Benchmark
[The 3rd Large-scale Video Object Segmentation - Track 3: Referring Video Object Segmentation](https://competitions.codalab.org/competitions/29139#results)

## Datasets

![image](https://user-images.githubusercontent.com/65257938/148003637-0384d2a7-9836-488e-96c3-5a282c01c102.png)

[Refer-YouTube-VOS-datasets](https://drive.google.com/drive/folders/1J45ubR8Y24wQ6dzKOTkfpd9GS_F9A2kb)

* **YouTube-VOS**:
```shell
# use the raw URL so wget fetches the script itself, not the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_YTVOS_w_refer.py
python down_YTVOS_w_refer.py
```

Folder structure:
```latex
${current_path}/
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
```
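
As a quick sanity check after downloading, the expressions file can be inspected directly. The sketch below assumes the usual Refer-YouTube-VOS schema (a top-level `videos` dict mapping each video ID to its `expressions` and `frames`); if your copy differs, adjust the keys accordingly.

```python
import json
from pathlib import Path

root = Path("refer_youtube_vos")
with open(root / "meta_expressions" / "train" / "meta_expressions.json") as f:
    meta = json.load(f)

# assumed schema: {"videos": {video_id: {"expressions": {eid: {"exp": ...}},
#                                        "frames": [frame_id, ...]}}}
videos = meta["videos"]
print(f"{len(videos)} videos with referring expressions")

video_id, info = next(iter(videos.items()))
for eid, e in info["expressions"].items():
    print(f"{video_id} / expression {eid}: {e['exp']}")
print(f"frames: {info['frames'][:5]} ...")
```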

* **A2D-Sentences**:

REPO:

paper: [arXiv:1803.07485](https://arxiv.org/abs/1803.07485) (see citation below)

![image](https://user-images.githubusercontent.com/65257938/147182456-d4f25e64-a8a0-4e18-9d56-8bbdacae6f80.png)

Citation:
```latex
@misc{gavrilyuk2018actor,
  title={Actor and Action Video Segmentation from a Sentence},
  author={Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},
  year={2018},
  eprint={1803.07485},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
License: The dataset may not be republished in any form without the written consent of the authors.

[README](https://web.eecs.umich.edu/~jjcorso/r/a2d/files/README)
Dataset and Annotation (version 1.0, 1.9GB, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz))
Evaluation Toolkit (version 1.0, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_eval_1_0.tar.bz))

```shell
mkdir a2d_sentences
cd a2d_sentences
wget https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz
tar jxvf A2D_main_1_0.tar.bz
mkdir text_annotations

cd text_annotations
wget https://kgavrilyuk.github.io/actor_action/a2d_annotation.txt
wget https://kgavrilyuk.github.io/actor_action/a2d_missed_videos.txt
# use the raw URL so wget fetches the script itself, not the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_a2d_annotation_with_instances.py
python down_a2d_annotation_with_instances.py
unzip a2d_annotation_with_instances.zip
#rm a2d_annotation_with_instances.zip
cd ..

cd ..

```

Folder structure:
```latex
${current_path}/
└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4 (video files)
    └── text_annotations/
        ├── a2d_annotation.txt (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotation files)
```
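
A minimal sketch for loading these annotations, assuming `a2d_annotation.txt` is a comma-separated file with a `video_id,instance_id,query` header (as distributed by the A2D-Sentences authors); the exact datasets inside the per-video `.h5` files are listed rather than assumed.

```python
import csv
import glob

import h5py  # pip install h5py

# each row pairs a video/instance with its referring expression
with open("a2d_sentences/text_annotations/a2d_annotation.txt") as f:
    rows = list(csv.DictReader(f))
print(f"{len(rows)} referring expressions")
print(rows[0])  # e.g. {'video_id': ..., 'instance_id': ..., 'query': ...}

# inspect one instance-level annotation file without assuming its layout
h5_files = glob.glob(
    "a2d_sentences/text_annotations/a2d_annotation_with_instances/*/*.h5"
)
with h5py.File(h5_files[0], "r") as h5:
    print(list(h5.keys()))  # dataset names, e.g. instance masks and ids
```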

Citation:
```latex
@inproceedings{YaXuCaCVPR2017,
  author = {Yan, Y. and Xu, C. and Cai, D. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking},
  year = {2017}
}
@inproceedings{XuCoCVPR2016,
  author = {Xu, C. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Actor-Action Semantic Segmentation with Grouping-Process Models},
  year = {2016}
}
@inproceedings{XuHsXiCVPR2015,
  author = {Xu, C. and Hsieh, S.-H. and Xiong, C. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
  poster = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D_poster.pdf},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Can Humans Fly? {Action} Understanding with Multiple Classes of Actors},
  url = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D.pdf},
  year = {2015}
}
```

* **J-HMDB**:

![image](https://user-images.githubusercontent.com/65257938/147182575-9ee87a7d-c78d-4ce8-90fe-1109204643da.png)

Downloading script:
```shell
mkdir jhmdb_sentences
cd jhmdb_sentences
wget http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz
wget https://kgavrilyuk.github.io/actor_action/jhmdb_annotation.txt
wget http://files.is.tue.mpg.de/jhmdb/puppet_mask.zip
tar -xzvf Rename_Images.tar.gz
unzip puppet_mask.zip
cd ..
```

Folder structure:
```latex
${current_path}/
└── jhmdb_sentences/
    ├── Rename_Images/ (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/ (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt (text annotations)
```
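
The puppet masks ship as MATLAB `.mat` files, one per clip, nested under the action directories. A small sketch, assuming `scipy` is available and the usual `puppet_mask/<action>/<clip>/puppet_mask.mat` layout; the variable name inside each file is printed rather than hard-coded, since it is an assumption here.

```python
import glob

from scipy.io import loadmat  # pip install scipy

# pick one clip's mask file
mat_files = glob.glob("jhmdb_sentences/puppet_mask/*/*/puppet_mask.mat")
data = loadmat(mat_files[0])

# list the stored variables instead of assuming a key
keys = [k for k in data if not k.startswith("__")]
print(keys)

mask = data[keys[0]]  # typically H x W x T: one 2-D mask per frame
print(mask.shape, mask.dtype)
```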

Citation:
```latex
@inproceedings{Jhuang:ICCV:2013,
  title = {Towards understanding action recognition},
  author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
  booktitle = {International Conf. on Computer Vision (ICCV)},
  month = dec,
  pages = {3192-3199},
  year = {2013}
}
```

* **refer-DAVIS16/17**: [paper](https://arxiv.org/pdf/1803.08006.pdf)
![image](https://user-images.githubusercontent.com/65257938/148004515-5a099e89-9665-4181-a046-92e33fe975e9.png)

![image](https://user-images.githubusercontent.com/65257938/148004081-0558f83c-404d-4d0f-aaf8-856ab3f462e5.png)
![image](https://user-images.githubusercontent.com/65257938/148004251-7602955f-6a05-4f18-84ff-e18a523a0475.png)
![image](https://user-images.githubusercontent.com/65257938/148004319-b9287160-5e37-4e97-b58c-330be7678a67.png)