Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://henghuiding.github.io/MeViS/
[ICCV 2023] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
mevis-dataset mose-dataset multimodal-learning referring-expression-comprehension referring-expression-segmentation referring-video-object-segmentation video-understanding
Last synced: 3 months ago
- Host: GitHub
- URL: https://henghuiding.github.io/MeViS/
- Owner: henghuiding
- License: mit
- Created: 2023-08-01T13:44:24.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-24T11:00:25.000Z (5 months ago)
- Last Synced: 2024-06-24T12:33:59.175Z (5 months ago)
- Topics: mevis-dataset, mose-dataset, multimodal-learning, referring-expression-comprehension, referring-expression-segmentation, referring-video-object-segmentation, video-understanding
- Language: Python
- Homepage: https://henghuiding.github.io/MeViS/
- Size: 52.2 MB
- Stars: 467
- Watchers: 7
- Forks: 18
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-Referring-Image-Segmentation - RVOS Challenge: [Large-scale Video Object Segmentation Challenge](https://lsvos.github.io/) | Aug 2024 | [[CodaLab]](https://codalab.lisn.upsaclay.fr/competitions/19583) | (2. Challenges)
README
# MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
[![PyTorch](https://img.shields.io/badge/PyTorch-1.11.0-%23EE4C2C.svg?style=&logo=PyTorch&logoColor=white)](https://pytorch.org/)
[![Python](https://img.shields.io/badge/Python-3.7%20|%203.8%20|%203.9-blue.svg?style=&logo=python&logoColor=ffdd54)](https://www.python.org/downloads/)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mevis-a-large-scale-benchmark-for-video/referring-video-object-segmentation-on-mevis)](https://paperswithcode.com/sota/referring-video-object-segmentation-on-mevis?p=mevis-a-large-scale-benchmark-for-video)

**[[Project page]](https://henghuiding.github.io/MeViS/)** | **[[arXiv]](https://arxiv.org/abs/2308.08544)** | **[[PDF]](https://drive.google.com/file/d/1WRanGRaYPpaNfrwq4xRq0sfmiJLSr9-b/view?usp=sharing)** | **[[Dataset Download]](https://codalab.lisn.upsaclay.fr/competitions/15094)** | **[[Evaluation Server]](https://codalab.lisn.upsaclay.fr/competitions/15094)**
This repository contains the code for the **ICCV 2023** paper:
> [MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions](https://arxiv.org/abs/2308.08544)
> Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Chen Change Loy
> ICCV 2023
### Abstract
This work strives for motion-expression-guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. Existing referring video object segmentation datasets downplay the importance of motion in video content for language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions that indicate target objects in complex environments. The goal of the MeViS benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms that leverage motion expressions as a primary cue for object segmentation in complex video scenes.
Figure 1. Examples of video clips from Motion expressions Video Segmentation (MeViS) illustrate the dataset's nature and complexity. The expressions in MeViS primarily focus on motion attributes, and the referred target objects cannot be identified by examining a single frame alone. For instance, the first example features three parrots with similar appearances, and the target object is identified as "The bird flying away"; this object can only be recognized by capturing its motion throughout the video.
TABLE 1. Scale comparison between MeViS and existing language-guided video segmentation datasets.

| Dataset | Pub. & Year | Videos | Objects | Expressions | Masks | Obj/Video | Obj/Expn | Target |
|---|---|---|---|---|---|---|---|---|
| A2D Sentence | CVPR 2018 | 3,782 | 4,825 | 6,656 | 58k | 1.28 | 1 | Actor |
| DAVIS17-RVOS | ACCV 2018 | 90 | 205 | 205 | 13.5k | 2.27 | 1 | Object |
| ReferYoutubeVOS | ECCV 2020 | 3,978 | 7,451 | 15,009 | 131k | 1.86 | 1 | Object |
| MeViS (ours) | ICCV 2023 | 2,006 | 8,171 | 28,570 | 443k | 4.28 | 1.59 | Object(s) |
## MeViS Dataset Download
[Download the dataset from here](https://codalab.lisn.upsaclay.fr/competitions/15094).
**Dataset Split**
* 2,006 videos & 28,570 sentences in total;
* **Train set:** 1,662 videos & 23,051 sentences, used for training;
* **Valu set:** 50 videos & 793 sentences, used for offline evaluation (e.g., ablation studies) by users during training;
* **Val set:** 140 videos & 2,236 sentences, used for [**CodaLab online evaluation**](https://codalab.lisn.upsaclay.fr/competitions/15094);
* **Test set:** 154 videos & 2,490 sentences (not released yet), used for evaluation during the competition periods.

It is suggested to report results on both the **Valu set** and the **Val set**.

## Online Evaluation
Please submit your results of **Val set** on
- [**CodaLab**](https://codalab.lisn.upsaclay.fr/competitions/15094).

It is strongly suggested to first evaluate your model locally on the **Valu** set before submitting your **Val** set results to the online evaluation system.
## File Structure
The dataset follows a structure similar to [Refer-YouTube-VOS](https://youtube-vos.org/dataset/rvos/). Each split consists of three parts: `JPEGImages`, which holds the frame images; `meta_expressions.json`, which provides the referring expressions and video metadata; and `mask_dict.json`, which contains the ground-truth object masks. Ground-truth segmentation masks are saved in COCO RLE format, and expressions are organized in the same way as in Refer-YouTube-VOS.
Please note that while annotations are provided for all frames in the **Train** set and the **Valu** set, the **Val** set only provides frame images and referring expressions for inference.
```
mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── ...
│   │   ├── ...
│   │   └── ...
│   ├── mask_dict.json
│   └── meta_expressions.json
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── ...
│   ├── mask_dict.json
│   └── meta_expressions.json
└── valid                       // Split Val
    ├── JPEGImages
    │   └── ...
    └── meta_expressions.json
```
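For a quick sanity check of the annotation files, here is a minimal sketch of loading them in Python. It assumes the Refer-YouTube-VOS-style layout described above and that `mask_dict.json` maps annotation ids to per-frame COCO RLEs; the exact key names (`videos`, `expressions`, `exp`, `anno_id`) are assumptions to verify against the released JSON files.

```python
import json

from pycocotools import mask as mask_utils  # pip install pycocotools

# Paths are illustrative; point them at your local copy of the Valu split.
with open("mevis/valid_u/meta_expressions.json") as f:
    meta = json.load(f)
with open("mevis/valid_u/mask_dict.json") as f:
    mask_dict = json.load(f)

# Take the first video and its first expression (key names assumed from the
# Refer-YouTube-VOS convention; check the released files for the exact schema).
video_id, video = next(iter(meta["videos"].items()))
exp_id, exp = next(iter(video["expressions"].items()))
print(video_id, exp_id, exp["exp"])

# Each annotation id is assumed to map to a list of per-frame COCO RLEs
# (None where the object is absent in that frame).
for anno_id in exp["anno_id"]:
    for frame_idx, rle in enumerate(mask_dict[str(anno_id)]):
        if rle is None:
            continue
        binary_mask = mask_utils.decode(rle)  # H x W uint8 array of {0, 1}
        print(frame_idx, binary_mask.shape, int(binary_mask.sum()))
        break
```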
## Method Code Installation:
Please see [INSTALL.md](https://github.com/henghuiding/MeViS/blob/main/INSTALL.md)
## Inference
### 1. Valu set
Obtain the output masks of Valu set:
```
python train_net_lmpm.py \
--config-file configs/lmpm_SWIN_bs8.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
```
Obtain the J&F results on Valu set:
```
python tools/eval_mevis.py
```
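For reference, the region-similarity part of the J&F metric is the intersection-over-union of predicted and ground-truth masks. The sketch below restates it for a single pair of binary masks; the official `tools/eval_mevis.py` remains the authoritative implementation and also computes the boundary measure F, which is omitted here.

```python
import numpy as np

def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: IoU of two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match (DAVIS-style convention).
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)
```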
### 2. Val set
Obtain the output masks of Val set for [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/15094) online evaluation:
```
python train_net_lmpm.py \
--config-file configs/lmpm_SWIN_bs8.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir] DATASETS.TEST '("mevis_test",)'
```
### CodaLab Evaluation Submission Guideline

The submission should be a **.zip** file containing the predicted .png masks of the **Val set** (for the current competition stage).

You can use the following command to prepare the .zip submission file:
```
cd [output_dir]
zip -r ../xxx.zip *
```
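If you keep your raw predictions in memory, a small helper like the sketch below can lay them out in the expected `<video>/<expression_id>/<frame>.png` folder structure before zipping. The pixel convention (binary masks saved as 0/255 grayscale PNGs) is an assumption; compare against the sample submission shown below to confirm the exact format.

```python
import os

import numpy as np
from PIL import Image

def save_prediction(output_dir: str, video: str, exp_id: str,
                    frame_idx: int, mask: np.ndarray) -> None:
    """Write one predicted mask to <output_dir>/<video>/<exp_id>/<frame_idx:05d>.png."""
    frame_dir = os.path.join(output_dir, video, exp_id)
    os.makedirs(frame_dir, exist_ok=True)
    # Assumption: binary mask stored as 0/255 grayscale; check
    # sample_submission_valid.zip for the exact pixel convention.
    Image.fromarray(mask.astype(np.uint8) * 255).save(
        os.path.join(frame_dir, f"{frame_idx:05d}.png"))

# e.g. save_prediction("[output_dir]", "0ab4afe7fb46", "0", 0, mask)
```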
A submission example named *sample_submission_valid.zip* can be found on [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/15094).
```
sample_submission_valid.zip       // .zip file that directly packages the 140 val video folders
├── 0ab4afe7fb46                  // video folder name
│   ├── 0                         // expression_id folder name
│   │   ├── 00000.png             // .png files
│   │   ├── 00001.png
│   │   └── ....
│   ├── 1
│   │   ├── 00000.png
│   │   └── ....
│   └── ....
├── 0fea0cb75a25
│   ├── 0
│   │   ├── 00000.png
│   │   └── ....
│   └── ....
└── ....
```

## Training
First, download the backbone weights (`model_final_86143f.pkl`) and convert them using the provided script:
```
wget https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_swin_tiny_bs16_50ep/model_final_86143f.pkl
python tools/process_ckpt.py
```

Then start training:
```
python train_net_lmpm.py \
--config-file configs/lmpm_SWIN_bs8.yaml \
--num-gpus 8 --dist-url auto \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
```

Note: We also support training ReferFormer by providing [`ReferFormer_dataset.py`](https://github.com/henghuiding/MeViS/blob/main/ReferFormer_dataset.py).
## Models
Our results on the Valu set and the Val set of the MeViS dataset.
* The Valu set is used for offline evaluation by users themselves, e.g., for ablation studies;
* The Val set is used for CodaLab online evaluation by the MeViS dataset organizers.
| Backbone | Valu J&F | Valu J | Valu F | Val J&F | Val J | Val F | Checkpoint |
|---|---|---|---|---|---|---|---|
| Swin-Tiny & RoBERTa | 40.23 | 36.51 | 43.90 | 37.21 | 34.25 | 40.17 | [Google Drive](https://drive.google.com/file/d/1djNwwNAyAIEJMZIQQHV_NYnlc8TeA4wU/view?usp=drive_link) |
## Acknowledgement
This project is based on [VITA](https://github.com/sukjunhwang/VITA), [GRES](https://github.com/henghuiding/ReLA), [Mask2Former](https://github.com/facebookresearch/Mask2Former), and [VLT](https://github.com/henghuiding/Vision-Language-Transformer). Many thanks to the authors for their great works!
## BibTeX
Please consider citing MeViS if it helps your research.

```latex
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
```

```latex
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
```

```latex
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}
```
A majority of videos in MeViS are from [MOSE: Complex Video Object Segmentation Dataset](https://henghuiding.github.io/MOSE/).
```latex
@inproceedings{MOSE,
title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
booktitle={ICCV},
year={2023}
}
```
MeViS is licensed under a CC BY-NC-SA 4.0 License. The data of MeViS is released for non-commercial research purposes only.