https://github.com/henghuiding/rela
[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation
- Host: GitHub
- URL: https://github.com/henghuiding/rela
- Owner: henghuiding
- License: mit
- Created: 2023-03-11T13:13:50.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-09-05T03:41:53.000Z (almost 3 years ago)
- Last Synced: 2025-03-28T15:04:47.226Z (about 1 year ago)
- Topics: cvpr2023, multimodal-learning, referring-expression-comprehension, referring-expression-segmentation, referring-image-segmentation, vision-language-transformer
- Language: Python
- Homepage: https://henghuiding.github.io/GRES/
- Size: 2.06 MB
- Stars: 692
- Watchers: 5
- Forks: 19
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
# GRES: Generalized Referring Expression Segmentation
**[🏠[Project page]](https://henghuiding.github.io/GRES/)** **[📄[arXiv]](https://arxiv.org/abs/2306.00968)** **[📄[PDF]](https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_GRES_Generalized_Referring_Expression_Segmentation_CVPR_2023_paper.pdf)** **[🔥[New Dataset Download]](https://github.com/henghuiding/gRefCOCO)**
This repository contains the code for the **CVPR 2023** paper:
> [GRES: Generalized Referring Expression Segmentation](https://arxiv.org/abs/2306.00968)
> Chang Liu, Henghui Ding, Xudong Jiang
> CVPR 2023 Highlight, Acceptance Rate 2.5%
## Update
- **(2023/08/29)** We have updated and reorganized the dataset file. Please download the latest version for train/val/testA/testB! (Note: the training expressions are unchanged, so this does not affect training. However, some `ref_id` and `sent_id` values have been renumbered for better organization.)
- **(2023/08/16)** A new large-scale referring video segmentation dataset [MeViS](https://henghuiding.github.io/MeViS/) is released.
## Installation
The code is tested under CUDA 11.8, PyTorch 1.11.0 and Detectron2 0.6.
1. Install [Detectron2](https://github.com/facebookresearch/detectron2) following the [manual](https://detectron2.readthedocs.io/en/latest/).
2. Run `sh make.sh` under `gres_model/modeling/pixel_decoder/ops`
3. Install other required packages: `pip install -r requirements.txt`
4. Prepare the dataset following `datasets/DATASET.md`
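
Before building the ops, it can help to confirm that the installed packages match the tested versions listed above. The following is a minimal sketch, not part of this repository; the helper names `installed_version` and `check_environment` are hypothetical, and other versions may work even if they are flagged as different.

```python
from importlib import metadata

# Versions reported as tested in this README; other versions may also work.
TESTED = {"torch": "1.11.0", "detectron2": "0.6"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(tested=TESTED):
    """Map each package name to (installed_version, matches_tested_version)."""
    report = {}
    for package, wanted in tested.items():
        found = installed_version(package)
        # Local build tags such as "1.11.0+cu113" still count as a match.
        matches = found is not None and found.split("+")[0] == wanted
        report[package] = (found, matches)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_environment().items():
        status = "ok" if ok else "differs from tested version"
        print(f"{package}: {found or 'not installed'} ({status})")
```

The check reads package metadata only, so it does not import PyTorch or Detectron2 and runs even when they are missing.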
## Inference
```
python train_net.py \
--config-file configs/referring_swin_base.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
```
## Training
First, download the Swin-Base backbone weights (`swin_base_patch4_window12_384_22k.pth`) and convert them into Detectron2 format (`swin_base_patch4_window12_384_22k.pkl`) using the script:
```
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22k.pth
python tools/convert-pretrained-swin-model-to-d2.py swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl
```
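
For orientation, the sketch below shows what such a conversion typically amounts to. It is an assumption based on the upstream Mask2Former conversion script, not a copy of the script in this repository: the state dict is wrapped in a small dictionary that Detectron2's checkpointer recognizes as third-party weights and then pickled. The function names are hypothetical, and a plain `dict` stands in for the result of `torch.load(path, map_location="cpu")["model"]`.

```python
import pickle


def wrap_for_detectron2(state_dict):
    """Wrap a backbone state dict in the layout Detectron2 expects for .pkl files."""
    return {
        "model": state_dict,
        # Marks the weights as coming from outside Detectron2.
        "__author__": "third_party",
        # Asks the checkpointer to match parameter names heuristically.
        "matching_heuristics": True,
    }


def save_as_pkl(state_dict, path):
    """Pickle the wrapped state dict to `path`."""
    with open(path, "wb") as f:
        pickle.dump(wrap_for_detectron2(state_dict), f)
```

In practice, use the provided `tools/convert-pretrained-swin-model-to-d2.py`; the sketch only illustrates the file layout it is assumed to produce.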
Then start training:
```
python train_net.py \
--config-file configs/referring_swin_base.yaml \
--num-gpus 8 --dist-url auto \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
```
Note: You can append your own configuration overrides to the end of the training command for customized options. For example:
```
SOLVER.IMS_PER_BATCH 48
SOLVER.BASE_LR 0.00001
```
For the full list of base configs, see `configs/referring_R50.yaml` and `configs/Base-COCO-InstanceSegmentation.yaml`.
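
Trailing `KEY VALUE` pairs such as the ones above are conventionally merged into the config in the Detectron2 style (`cfg.merge_from_list`). The sketch below imitates that behavior on a plain nested `dict` so the mechanics are visible without Detectron2 installed. It is an illustration under that assumption, not the repository's code; `apply_overrides` is a hypothetical helper and the base values in the usage example are placeholders.

```python
import ast
import copy


def _decode(raw):
    """Turn a command-line string into a Python literal when possible."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # keep paths and other free-form strings as-is


def apply_overrides(config, opts):
    """Return a copy of `config` with [KEY, VALUE, KEY, VALUE, ...] applied.

    Keys use dotted paths (e.g. "SOLVER.BASE_LR") and must already exist.
    """
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    merged = copy.deepcopy(config)
    for dotted_key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            node = node[name]
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = _decode(raw)
    return merged


if __name__ == "__main__":
    base = {"SOLVER": {"IMS_PER_BATCH": 16, "BASE_LR": 0.0001}}  # placeholder values
    opts = ["SOLVER.IMS_PER_BATCH", "48", "SOLVER.BASE_LR", "0.00001"]
    print(apply_overrides(base, opts))
```

Rejecting unknown keys means a misspelled option fails immediately instead of being silently ignored.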
## Models
Update: We have added support for ResNet-50 and Swin-Tiny backbones! Feel free to use and report these resource-friendly models in your work.
| Backbone | cIoU | gIoU |
|---|---|---|
| ResNet-50 | 39.53 | 38.62 |
| Swin-Tiny | 57.73 | 56.86 |
| Swin-Base | 62.42 | 63.60 |
All models can be downloaded from:
[Onedrive](https://entuedu-my.sharepoint.com/:f:/g/personal/liuc0058_e_ntu_edu_sg/EqyL6nftLjdIihQG2rYirPoB1Sm3HBJwuZgtPII8WcevQw?e=pI1rrg)
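
As a reading aid for the cIoU and gIoU columns in the table, here is a minimal sketch of the two metrics as they are described in the GRES paper: cIoU divides the total intersection by the total union over all samples, while gIoU averages the per-sample IoU and scores a no-target sample as 1 when the prediction is also empty and 0 otherwise. This is not the repository's evaluator. For simplicity, masks are represented as sets of `(row, col)` pixel coordinates, and the function names are hypothetical.

```python
def ciou(samples):
    """Cumulative IoU: total intersection over total union across samples."""
    total_inter = sum(len(pred & gt) for pred, gt in samples)
    total_union = sum(len(pred | gt) for pred, gt in samples)
    return total_inter / total_union if total_union else 0.0


def giou(samples):
    """Generalized IoU: mean per-sample IoU with no-target handling."""
    scores = []
    for pred, gt in samples:
        if not gt:
            # No-target sample: correct only if the prediction is empty too.
            scores.append(1.0 if not pred else 0.0)
        else:
            scores.append(len(pred & gt) / len(pred | gt))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    samples = [
        ({(0, 0), (0, 1)}, {(0, 1), (1, 1)}),  # partial overlap
        (set(), set()),                        # no-target, correctly empty
        ({(0, 0)}, set()),                     # no-target, false prediction
    ]
    print(f"cIoU={ciou(samples):.4f} gIoU={giou(samples):.4f}")
```

Because cIoU pools pixels across samples, large objects dominate it, whereas gIoU weights every sample equally; this is why the two columns can rank models differently.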
## Acknowledgement
This project is based on [refer](https://github.com/lichengunc/refer), [Mask2Former](https://github.com/facebookresearch/Mask2Former), [Detectron2](https://github.com/facebookresearch/detectron2), and [VLT](https://github.com/henghuiding/Vision-Language-Transformer). Many thanks to the authors for their great work!
## BibTeX
Please consider citing GRES if it helps your research.
```bibtex
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
```