# Awesome-Referring-Video-Object-Segmentation / Tracking [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

Welcome to star ⭐, comment 💹 & share 😀 !!

```diff
- 2021.12.12: Recent papers (from 2021)
- Contributions are welcome if any information is missing. 😎
```

---

## Introduction

![image](https://user-images.githubusercontent.com/65257938/145671552-f3d3dad7-77e4-4f12-98de-016cc1184976.png)

**Referring video object segmentation** aims at **segmenting an object in a video given a natural language expression**.

Unlike conventional video object segmentation, this task exploits a different type of supervision, language expressions, **to identify and segment the object referred to by the given expression in a video**. A detailed explanation of the task can be found in the following paper; a minimal input/output sketch is given after the reference.

* Seonguk Seo, Joon-Young Lee, Bohyung Han, "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark", [ECCV 2020]
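
To make the task's input/output contract concrete, here is a minimal sketch. Everything in it is illustrative rather than taken from any paper's codebase: the `model.predict` call is a hypothetical stand-in for an RVOS model, and `region_similarity` computes the standard J measure (mask IoU) used to score predictions on the benchmarks below.

```python
import numpy as np

def segment_referred_object(frames, expression, model):
    """Illustrative RVOS interface; `model.predict` is hypothetical.

    frames:     list of H x W x 3 uint8 arrays (the video, in frame order)
    expression: a referring expression, e.g. "a person on the left riding a bike"
    returns:    one H x W boolean mask per frame for the referred object
    """
    masks = model.predict(frames, expression)
    assert len(masks) == len(frames), "RVOS predicts a mask for every frame"
    return masks

def region_similarity(pred_mask, gt_mask):
    """J metric (mask IoU), the region measure used by RVOS benchmarks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```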

## Impressive Works Related to Referring Video Object Segmentation (RVOS)

* **R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency** [ICCV 2023]: [Repo](https://github.com/lxa9867/R2VOS)

* **Spectrum-guided Multi-granularity Referring Video Object Segmentation** [ICCV 2023]:

![Screenshot 2023-07-27 130509](https://github.com/JerryX1110/awesome-rvos/assets/65257938/f433a61b-28a7-4567-bcb4-0c47c56b46a0)

* **OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation** [ICCV 2023]:

![Screenshot 2023-07-27 130535](https://github.com/JerryX1110/awesome-rvos/assets/65257938/6514e4c0-8b74-4952-9a1c-3068009c76ba)

* **Decoupling Multimodal Transformers for Referring Video Object Segmentation** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10147907)

* **Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10146303)

* **Referring Video Segmentation with (Optional) Arbitrary Modality as Query for Fusion** [arXiv](https://arxiv.org/pdf/2207.05580.pdf)

![image](https://github.com/JerryX1110/awesome-rvos/assets/65257938/9ef9a74d-ade9-4ed4-86ac-0b7a830402f0)

* **VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [PAMI23]**
![VLT_TPAMI](https://user-images.githubusercontent.com/65257938/224544601-94279d80-bcbd-483d-8dce-eb50d4a936e5.png)

* **Multi-Attention Network for Compressed Video Referring Object Segmentation** [ACM MM 2022]:
![image](https://user-images.githubusercontent.com/65257938/181689600-61961aa6-98d7-4234-8bc7-35935a13223c.png)

* **Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation [CVPR 2022]**:

![image](https://user-images.githubusercontent.com/65257938/172786658-559618b5-0163-454b-a3d5-086cb2dc1030.png)

* **Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation [CVPR 2022]**

![image](https://user-images.githubusercontent.com/65257938/170071420-cc703191-ff41-4c8b-982c-4416d5456d46.png)

* **Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/162581724-48d9afe6-71a4-4987-a0e2-9bb818068608.png)

* **Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [arXiv 2022]**:
![image](https://user-images.githubusercontent.com/65257938/161317831-4b11f548-d0bc-48cf-92cd-3010a374abdf.png)

* **Local-Global Context Aware Transformer for Language-Guided Video Segmentation [arXiv 2022]**:

![image](https://user-images.githubusercontent.com/65257938/159480646-02525835-13df-44ec-86ab-4256edd45993.png)
![image](https://user-images.githubusercontent.com/65257938/159480719-e40eeac2-1e08-43e8-989f-e48f25cb05bd.png)

* **ReferFormer [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/148010130-db43e7e1-464e-4858-9aec-6a487c87b170.png)
![image](https://user-images.githubusercontent.com/65257938/148010352-9f642f2b-eca4-46a8-b131-98847f0c5237.png)

* **MTTR [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/145671132-1a2c014e-6563-4f2e-91bd-cd58ed999a0a.png)

* **YOFO (You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation) [AAAI 2022]**:
![image](https://user-images.githubusercontent.com/65257938/155121320-338d9b02-edac-4288-ae52-bf9e6a7f90d7.png)
![image](https://user-images.githubusercontent.com/65257938/155121513-37b6edd2-bd0a-45fc-8ce7-8d5beefc1bf6.png)

* **ClawCraneNet [arXiv]**:
![image](https://user-images.githubusercontent.com/65257938/157188461-437c2360-55a8-4c7d-89d1-8b3819a323f0.png)
![image](https://user-images.githubusercontent.com/65257938/157188239-2a9d25f6-ae1b-4727-9250-060414dab17d.png)

* **PMINet [CVPRW 2021]**:
![image](https://user-images.githubusercontent.com/65257938/145671186-0515bf89-1d71-4155-b3f9-27d6903e3f31.png)

* **RVOS challenge 1st model [CVPRW 2021]**:

![image](https://user-images.githubusercontent.com/65257938/155835526-ec3410a4-4004-410d-8e5f-c24e31404b1e.png)
![image](https://user-images.githubusercontent.com/65257938/155835539-8ba742e4-4cc4-4c7d-9d70-135578602936.png)

* **CMPC-V [PAMI 2021]**:

Cross-modal progressive comprehension for referring segmentation:
![image](https://user-images.githubusercontent.com/65257938/145671302-40924570-9cd2-4ffa-84d3-5bd11b95358d.png)

* **HINet [BMVC 2021]**:
![image](https://user-images.githubusercontent.com/65257938/151321471-05d4be4d-1dde-4ea2-a68c-ce3bf19552f8.png)
![image](https://user-images.githubusercontent.com/65257938/151321516-d7e8649a-6eba-460a-af12-f3c6e54e1271.png)

* **URVOS [ECCV 2020]**:
![image](https://user-images.githubusercontent.com/65257938/145671358-229d8e56-8d40-4cc1-bb4f-58bbff38a452.png)

## Impressive Works Related to Referring Image Segmentation (RIS)
* **LAVT: Language-Aware Vision Transformer for Referring Image Segmentation**:
![image](https://user-images.githubusercontent.com/65257938/162582078-46912f60-875b-4c0c-9c81-1643465b6b18.png)
![image](https://user-images.githubusercontent.com/65257938/162582091-44609242-de0b-4526-bad6-1b036902f9c9.png)
![image](https://user-images.githubusercontent.com/65257938/162582107-0fcca4fa-b3a6-4e41-8783-41cd6a422183.png)

* **SeqTR: A Simple yet Universal Network for Visual Grounding**:

![image](https://user-images.githubusercontent.com/65257938/166140318-56247577-b12a-4a5f-bcb7-1391dc6e1fce.png)

![image](https://user-images.githubusercontent.com/65257938/166140328-28ee3209-51c1-4eb8-99b1-3fef9af05d7f.png)

## Impressive Works Related to Referring Multi-Object Tracking (RMOT)
* **Referring Multi-Object Tracking** [CVPR 2023]:

![image](https://user-images.githubusercontent.com/65257938/224093075-c48774c0-a17d-4e51-8821-7b74889f1c90.png)

## Benchmark
[The 3rd Large-scale Video Object Segmentation - Track 3: Referring Video Object Segmentation](https://competitions.codalab.org/competitions/29139#results)

## Datasets

![image](https://user-images.githubusercontent.com/65257938/148003637-0384d2a7-9836-488e-96c3-5a282c01c102.png)

[Refer-YouTube-VOS-datasets](https://drive.google.com/drive/folders/1J45ubR8Y24wQ6dzKOTkfpd9GS_F9A2kb)

* **YouTube-VOS**:
```shell
# use the raw URL so wget fetches the script itself, not the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_YTVOS_w_refer.py
python down_YTVOS_w_refer.py
```

Folder structure:
```latex
${current_path}/
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
```
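
As a quick sanity check after downloading, the expressions file can be inspected directly. The sketch below assumes the usual Refer-YouTube-VOS schema (a top-level `videos` dict mapping each video ID to its `expressions` and `frames`); if your copy differs, adjust the keys accordingly.

```python
import json
from pathlib import Path

root = Path("refer_youtube_vos")
with open(root / "meta_expressions" / "train" / "meta_expressions.json") as f:
    meta = json.load(f)

# assumed schema: {"videos": {video_id: {"expressions": {eid: {"exp": ...}},
#                                        "frames": [frame_id, ...]}}}
videos = meta["videos"]
print(f"{len(videos)} videos with referring expressions")

video_id, info = next(iter(videos.items()))
for eid, e in info["expressions"].items():
    print(f"{video_id} / expression {eid}: {e['exp']}")
print(f"frames: {info['frames'][:5]} ...")
```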

* **A2D-Sentences**:

REPO:

paper: [arXiv:1803.07485](https://arxiv.org/abs/1803.07485) (see citation below)

![image](https://user-images.githubusercontent.com/65257938/147182456-d4f25e64-a8a0-4e18-9d56-8bbdacae6f80.png)

Citation:
```latex
@misc{gavrilyuk2018actor,
  title={Actor and Action Video Segmentation from a Sentence},
  author={Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},
  year={2018},
  eprint={1803.07485},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
License: The dataset may not be republished in any form without the written consent of the authors.

[README](https://web.eecs.umich.edu/~jjcorso/r/a2d/files/README)
Dataset and Annotation (version 1.0, 1.9GB, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz))
Evaluation Toolkit (version 1.0, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_eval_1_0.tar.bz))

```shell
mkdir a2d_sentences
cd a2d_sentences
wget https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz
tar jxvf A2D_main_1_0.tar.bz
mkdir text_annotations

cd text_annotations
wget https://kgavrilyuk.github.io/actor_action/a2d_annotation.txt
wget https://kgavrilyuk.github.io/actor_action/a2d_missed_videos.txt
# use the raw URL so wget fetches the script itself, not the GitHub HTML page
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_a2d_annotation_with_instances.py
python down_a2d_annotation_with_instances.py
unzip a2d_annotation_with_instances.zip
#rm a2d_annotation_with_instances.zip
cd ..

cd ..

```

Folder structure:
```latex
${current_path}/
└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4 (video files)
    └── text_annotations/
        ├── a2d_annotation.txt (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotation files)
```
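
A minimal sketch for loading these annotations, assuming `a2d_annotation.txt` is a comma-separated file with a `video_id,instance_id,query` header (as distributed by the A2D-Sentences authors); the exact datasets inside the per-video `.h5` files are listed rather than assumed.

```python
import csv
import glob

import h5py  # pip install h5py

# each row pairs a video/instance with its referring expression
with open("a2d_sentences/text_annotations/a2d_annotation.txt") as f:
    rows = list(csv.DictReader(f))
print(f"{len(rows)} referring expressions")
print(rows[0])  # e.g. {'video_id': ..., 'instance_id': ..., 'query': ...}

# inspect one instance-level annotation file without assuming its layout
h5_files = glob.glob(
    "a2d_sentences/text_annotations/a2d_annotation_with_instances/*/*.h5"
)
with h5py.File(h5_files[0], "r") as h5:
    print(list(h5.keys()))  # dataset names, e.g. instance masks and ids
```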

Citation:
```latex
@inproceedings{YaXuCaCVPR2017,
  author = {Yan, Y. and Xu, C. and Cai, D. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking},
  year = {2017}
}
@inproceedings{XuCoCVPR2016,
  author = {Xu, C. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Actor-Action Semantic Segmentation with Grouping-Process Models},
  year = {2016}
}
@inproceedings{XuHsXiCVPR2015,
  author = {Xu, C. and Hsieh, S.-H. and Xiong, C. and {\bf Corso}, {\bf J. J.}},
  booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
  datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
  poster = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D_poster.pdf},
  tags = {computer vision, activity recognition, video understanding, semantic segmentation},
  title = {Can Humans Fly? {Action} Understanding with Multiple Classes of Actors},
  url = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D.pdf},
  year = {2015}
}
```

* **J-HMDB**:

![image](https://user-images.githubusercontent.com/65257938/147182575-9ee87a7d-c78d-4ce8-90fe-1109204643da.png)

Downloading script:
```shell
mkdir jhmdb_sentences
cd jhmdb_sentences
wget http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz
wget https://kgavrilyuk.github.io/actor_action/jhmdb_annotation.txt
wget http://files.is.tue.mpg.de/jhmdb/puppet_mask.zip
tar -xzvf Rename_Images.tar.gz
unzip puppet_mask.zip
cd ..
```

Folder structure:
```latex
${current_path}/
└── jhmdb_sentences/
    ├── Rename_Images/ (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/ (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt (text annotations)
```
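
The puppet masks ship as MATLAB `.mat` files, one per clip, nested under the action directories. A small sketch, assuming `scipy` is available and the usual `puppet_mask/<action>/<clip>/puppet_mask.mat` layout; the variable name inside each file is printed rather than hard-coded, since it is an assumption here.

```python
import glob

from scipy.io import loadmat  # pip install scipy

# pick one clip's mask file
mat_files = glob.glob("jhmdb_sentences/puppet_mask/*/*/puppet_mask.mat")
data = loadmat(mat_files[0])

# list the stored variables instead of assuming a key
keys = [k for k in data if not k.startswith("__")]
print(keys)

mask = data[keys[0]]  # typically H x W x T: one 2-D mask per frame
print(mask.shape, mask.dtype)
```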

Citation:
```latex
@inproceedings{Jhuang:ICCV:2013,
  title = {Towards understanding action recognition},
  author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
  booktitle = {International Conf. on Computer Vision (ICCV)},
  month = dec,
  pages = {3192-3199},
  year = {2013}
}
```

* **refer-DAVIS16/17**: [paper](https://arxiv.org/pdf/1803.08006.pdf)
![image](https://user-images.githubusercontent.com/65257938/148004515-5a099e89-9665-4181-a046-92e33fe975e9.png)

![image](https://user-images.githubusercontent.com/65257938/148004081-0558f83c-404d-4d0f-aaf8-856ab3f462e5.png)
![image](https://user-images.githubusercontent.com/65257938/148004251-7602955f-6a05-4f18-84ff-e18a523a0475.png)
![image](https://user-images.githubusercontent.com/65257938/148004319-b9287160-5e37-4e97-b58c-330be7678a67.png)