# Awesome-Referring-Video-Object-Segmentation / Tracking [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
Stars, comments, and sharing are welcome!
```diff
- 2021.12.12: Recent papers (from 2021)
- Contributions are welcome if any information is missing.
```
---
## Introduction
![image](https://user-images.githubusercontent.com/65257938/145671552-f3d3dad7-77e4-4f12-98de-016cc1184976.png)
**Referring video object segmentation** aims at **segmenting an object in video with language expressions**.
Unlike conventional video object segmentation, the task uses a different type of supervision, language expressions, **to identify and segment the object referred to by the given language expressions in a video**. A detailed explanation of the task can be found in the following paper.
* Seonguk Seo, Joon-Young Lee, Bohyung Han, "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark" [ECCV20]
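For intuition, the task's input/output contract can be sketched as follows. This is a toy illustration only; the function name and shapes are assumptions for exposition, not any paper's API:

```python
import numpy as np

def referring_vos(frames: np.ndarray, expression: str) -> np.ndarray:
    """Toy RVOS signature: T RGB frames plus one language expression in,
    one binary mask per frame for the referred object out. A real model
    fuses visual and linguistic features; this stub only fixes the shapes."""
    t, h, w, _ = frames.shape
    return np.zeros((t, h, w), dtype=bool)  # placeholder masks

clip = np.zeros((8, 480, 854, 3), dtype=np.uint8)  # 8 frames at DAVIS-like 480p
masks = referring_vos(clip, "the brown dog on the left")
print(masks.shape)  # (8, 480, 854)
```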
## Impressive Works Related to Referring Video Object Segmentation (RVOS)
* **R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency**[ICCV 2023]: [Repo](https://github.com/lxa9867/R2VOS)
* **Spectrum-guided Multi-granularity Referring Video Object Segmentation**[ICCV 2023]:
![Screenshot 2023-07-27 130509](https://github.com/JerryX1110/awesome-rvos/assets/65257938/f433a61b-28a7-4567-bcb4-0c47c56b46a0)
* **OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation**[ICCV 2023]:
![Screenshot 2023-07-27 130535](https://github.com/JerryX1110/awesome-rvos/assets/65257938/6514e4c0-8b74-4952-9a1c-3068009c76ba)
* **Decoupling Multimodal Transformers for Referring Video Object Segmentation** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10147907)
* **Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning** [TCSVT23](https://ieeexplore.ieee.org/abstract/document/10146303)
* **Referring Video Segmentation with (Optional) Arbitrary Modality as Query for Fusion** [arXiv](https://arxiv.org/pdf/2207.05580.pdf)
![image](https://github.com/JerryX1110/awesome-rvos/assets/65257938/9ef9a74d-ade9-4ed4-86ac-0b7a830402f0)
* **VLT: Vision-Language Transformer and Query Generation for Referring Segmentation [PAMI23]**
![VLT_TPAMI](https://user-images.githubusercontent.com/65257938/224544601-94279d80-bcbd-483d-8dce-eb50d4a936e5.png)
* **Multi-Attention Network for Compressed Video Referring Object Segmentation**[ACM MM 2022]:
![image](https://user-images.githubusercontent.com/65257938/181689600-61961aa6-98d7-4234-8bc7-35935a13223c.png)
* **Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/172786658-559618b5-0163-454b-a3d5-086cb2dc1030.png)
* **Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation [CVPR 2022]**
![image](https://user-images.githubusercontent.com/65257938/170071420-cc703191-ff41-4c8b-982c-4416d5456d46.png)
* **Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/162581724-48d9afe6-71a4-4987-a0e2-9bb818068608.png)
* **Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [ArXiv 2022]**:
![image](https://user-images.githubusercontent.com/65257938/161317831-4b11f548-d0bc-48cf-92cd-3010a374abdf.png)
* **Local-Global Context Aware Transformer for Language-Guided Video Segmentation [ArXiv 2022]**:
![image](https://user-images.githubusercontent.com/65257938/159480646-02525835-13df-44ec-86ab-4256edd45993.png)
![image](https://user-images.githubusercontent.com/65257938/159480719-e40eeac2-1e08-43e8-989f-e48f25cb05bd.png)
* **ReferFormer [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/148010130-db43e7e1-464e-4858-9aec-6a487c87b170.png)
![image](https://user-images.githubusercontent.com/65257938/148010352-9f642f2b-eca4-46a8-b131-98847f0c5237.png)
* **MTTR [CVPR 2022]**:
![image](https://user-images.githubusercontent.com/65257938/145671132-1a2c014e-6563-4f2e-91bd-cd58ed999a0a.png)
* **YOFO [AAAI 2022]**: You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation
![image](https://user-images.githubusercontent.com/65257938/155121320-338d9b02-edac-4288-ae52-bf9e6a7f90d7.png)
![image](https://user-images.githubusercontent.com/65257938/155121513-37b6edd2-bd0a-45fc-8ce7-8d5beefc1bf6.png)
* **ClawCraneNet [ArXiv]**:
![image](https://user-images.githubusercontent.com/65257938/157188461-437c2360-55a8-4c7d-89d1-8b3819a323f0.png)
![image](https://user-images.githubusercontent.com/65257938/157188239-2a9d25f6-ae1b-4727-9250-060414dab17d.png)
* **PMINet [CVPRW 2021]**:
![image](https://user-images.githubusercontent.com/65257938/145671186-0515bf89-1d71-4155-b3f9-27d6903e3f31.png)
* **RVOS challenge 1st model [CVPRW 2021]**:
![image](https://user-images.githubusercontent.com/65257938/155835526-ec3410a4-4004-410d-8e5f-c24e31404b1e.png)
![image](https://user-images.githubusercontent.com/65257938/155835539-8ba742e4-4cc4-4c7d-9d70-135578602936.png)
* **CMPC-V [PAMI 2021]**: Cross-Modal Progressive Comprehension for Referring Segmentation
![image](https://user-images.githubusercontent.com/65257938/145671302-40924570-9cd2-4ffa-84d3-5bd11b95358d.png)
* **HINet [BMVC 2021]**:
![image](https://user-images.githubusercontent.com/65257938/151321471-05d4be4d-1dde-4ea2-a68c-ce3bf19552f8.png)
![image](https://user-images.githubusercontent.com/65257938/151321516-d7e8649a-6eba-460a-af12-f3c6e54e1271.png)
* **URVOS [ECCV 2020]**:
![image](https://user-images.githubusercontent.com/65257938/145671358-229d8e56-8d40-4cc1-bb4f-58bbff38a452.png)
## Impressive Works Related to Referring Image Segmentation (Refer-Image-Segmentation)
* **LAVT: Language-Aware Vision Transformer for Referring Image Segmentation**:
![image](https://user-images.githubusercontent.com/65257938/162582078-46912f60-875b-4c0c-9c81-1643465b6b18.png)
![image](https://user-images.githubusercontent.com/65257938/162582091-44609242-de0b-4526-bad6-1b036902f9c9.png)
![image](https://user-images.githubusercontent.com/65257938/162582107-0fcca4fa-b3a6-4e41-8783-41cd6a422183.png)
* **SeqTR: A Simple yet Universal Network for Visual Grounding**:
![image](https://user-images.githubusercontent.com/65257938/166140318-56247577-b12a-4a5f-bcb7-1391dc6e1fce.png)
![image](https://user-images.githubusercontent.com/65257938/166140328-28ee3209-51c1-4eb8-99b1-3fef9af05d7f.png)
## Impressive Works Related to Referring Multi-Object Tracking (RMOT)
* **Referring Multi-Object Tracking**[CVPR 23]:
![image](https://user-images.githubusercontent.com/65257938/224093075-c48774c0-a17d-4e51-8821-7b74889f1c90.png)
## Benchmark
[The 3rd Large-scale Video Object Segmentation - Track 3: Referring Video Object Segmentation](https://competitions.codalab.org/competitions/29139#results)
## Datasets
![image](https://user-images.githubusercontent.com/65257938/148003637-0384d2a7-9836-488e-96c3-5a282c01c102.png)
[Refer-YouTube-VOS-datasets](https://drive.google.com/drive/folders/1J45ubR8Y24wQ6dzKOTkfpd9GS_F9A2kb)
* **YouTube-VOS**:
```shell
# fetch the raw helper script (a github.com/.../blob/... URL would download the HTML page)
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_YTVOS_w_refer.py
python down_YTVOS_w_refer.py
```
Folder structure:
```
${current_path}/
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
```
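Once downloaded, the expressions can be read straight from `meta_expressions.json`. A minimal sketch, assuming the standard Refer-YouTube-VOS schema (`{"videos": {video_id: {"expressions": {exp_id: {"exp": ...}}, "frames": [...]}}}`):

```python
import json

# Minimal sketch, assuming the standard Refer-YouTube-VOS schema shown above.
with open("refer_youtube_vos/meta_expressions/train/meta_expressions.json") as f:
    meta = json.load(f)

for video_id, info in list(meta["videos"].items())[:3]:
    print(video_id, f"({len(info['frames'])} frames)")
    for exp_id, exp in info["expressions"].items():
        print(" ", exp_id, exp["exp"])  # each expression refers to one object
```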
* **A2D-Sentences**:
REPO:
paper:
![image](https://user-images.githubusercontent.com/65257938/147182456-d4f25e64-a8a0-4e18-9d56-8bbdacae6f80.png)
Citation:
```latex
@misc{gavrilyuk2018actor,
title={Actor and Action Video Segmentation from a Sentence},
author={Kirill Gavrilyuk and Amir Ghodrati and Zhenyang Li and Cees G. M. Snoek},
year={2018},
eprint={1803.07485},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
License: The dataset may not be republished in any form without the written consent of the authors. [README](https://web.eecs.umich.edu/~jjcorso/r/a2d/files/README)
Dataset and Annotation (version 1.0, 1.9GB, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz))
Evaluation Toolkit (version 1.0, [tar.bz](https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_eval_1_0.tar.bz))
```shell
mkdir a2d_sentences
cd a2d_sentences
wget https://web.eecs.umich.edu/~jjcorso/bigshare/A2D_main_1_0.tar.bz
tar jxvf A2D_main_1_0.tar.bz
mkdir text_annotations
cd text_annotations
wget https://kgavrilyuk.github.io/actor_action/a2d_annotation.txt
wget https://kgavrilyuk.github.io/actor_action/a2d_missed_videos.txt
# fetch the raw helper script (a blob URL would download the HTML page instead)
wget https://raw.githubusercontent.com/JerryX1110/awesome-rvos/main/down_a2d_annotation_with_instances.py
python down_a2d_annotation_with_instances.py
unzip a2d_annotation_with_instances.zip
#rm a2d_annotation_with_instances.zip
cd ..
cd ..
```
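A quick way to sanity-check the extraction is to compare the clips on disk against the rows of `videoset.csv`. This is a small sketch of my own, not part of the dataset tooling; it only counts files and lines:

```python
from pathlib import Path

# Count extracted clips and metadata rows; the two numbers should roughly
# agree if the download and extraction succeeded.
root = Path("a2d_sentences")
clips = sorted((root / "Release" / "CLIPS320").glob("*.mp4"))
rows = (root / "Release" / "videoset.csv").read_text().strip().splitlines()
print(f"{len(clips)} clips on disk, {len(rows)} rows in videoset.csv")
```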
Folder structure:
```
${current_path}/
└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4 (video files)
    └── text_annotations/
        ├── a2d_annotation.txt (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotations files)
```
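The per-video `.h5` files carry the instance-level annotations. Since key names can differ between releases, the sketch below just opens one file and lists whatever datasets it contains rather than assuming them:

```python
from pathlib import Path
import h5py

# Inspect one annotation file instead of hard-coding its internal key names.
h5_path = next(Path("a2d_sentences/text_annotations/a2d_annotation_with_instances").rglob("*.h5"))
with h5py.File(h5_path, "r") as f:
    for key in f:
        shape = getattr(f[key], "shape", None)  # groups have no shape
        print(key, shape)
```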
Citation:
```latex
@inproceedings{YaXuCaCVPR2017,
author = {Yan, Y. and Xu, C. and Cai, D. and Corso, J. J.},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Weakly Supervised Actor-Action Segmentation via Robust Multi-Task Ranking},
year = {2017}
}
@inproceedings{XuCoCVPR2016,
author = {Xu, C. and Corso, J. J.},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Actor-Action Semantic Segmentation with Grouping-Process Models},
year = {2016}
}
@inproceedings{XuHsXiCVPR2015,
author = {Xu, C. and Hsieh, S.-H. and Xiong, C. and Corso, J. J.},
booktitle = {{Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}},
datadownload = {http://web.eecs.umich.edu/~jjcorso/r/a2d},
poster = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D_poster.pdf},
tags = {computer vision, activity recognition, video understanding, semantic segmentation},
title = {Can Humans Fly? {Action} Understanding with Multiple Classes of Actors},
url = {http://web.eecs.umich.edu/~jjcorso/pubs/xu_corso_CVPR2015_A2D.pdf},
year = {2015}
}
```
* **J-HMDB**:
![image](https://user-images.githubusercontent.com/65257938/147182575-9ee87a7d-c78d-4ce8-90fe-1109204643da.png)
Downloading script:
```shell
mkdir jhmdb_sentences
cd jhmdb_sentences
wget http://files.is.tue.mpg.de/jhmdb/Rename_Images.tar.gz
wget https://kgavrilyuk.github.io/actor_action/jhmdb_annotation.txt
wget http://files.is.tue.mpg.de/jhmdb/puppet_mask.zip
tar -xzvf Rename_Images.tar.gz
unzip puppet_mask.zip
cd ..
```
Folder structure:
```
${current_path}/
└── jhmdb_sentences/
    ├── Rename_Images/ (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/ (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt (text annotations)
```
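The puppet masks are MATLAB `.mat` files, one per video. A small sketch that inspects a single file with scipy; the variable names are read from the file itself, not assumed:

```python
from pathlib import Path
from scipy.io import loadmat

# Load one puppet mask file and report its MATLAB variables and shapes.
mat_path = next(Path("jhmdb_sentences/puppet_mask").rglob("*.mat"))
mat = loadmat(mat_path)
for name, value in mat.items():
    if not name.startswith("__"):  # skip loadmat's metadata entries
        print(name, getattr(value, "shape", type(value)))
```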
Citation:
```latex
@inproceedings{Jhuang:ICCV:2013,
title = {Towards understanding action recognition},
author = {H. Jhuang and J. Gall and S. Zuffi and C. Schmid and M. J. Black},
booktitle = {International Conf. on Computer Vision (ICCV)},
month = Dec,
pages = {3192-3199},
year = {2013}
}
```
* **refer-DAVIS16/17**: [paper](https://arxiv.org/pdf/1803.08006.pdf)
![image](https://user-images.githubusercontent.com/65257938/148004515-5a099e89-9665-4181-a046-92e33fe975e9.png)
![image](https://user-images.githubusercontent.com/65257938/148004081-0558f83c-404d-4d0f-aaf8-856ab3f462e5.png)
![image](https://user-images.githubusercontent.com/65257938/148004251-7602955f-6a05-4f18-84ff-e18a523a0475.png)
![image](https://user-images.githubusercontent.com/65257938/148004319-b9287160-5e37-4e97-b58c-330be7678a67.png)