# πŸ“· EVF-SAM

**Early Vision-Language Fusion for Text-Prompted Segment Anything Model**

[Yuxuan Zhang](https://github.com/CoderZhangYx)1,\*, [Tianheng Cheng](https://scholar.google.com/citations?user=PH8rJHYAAAAJ&hl=zh-CN)1,\*, Lei Liu2, Heng Liu2, Longjin Ran2, Xiaoxin Chen2, [Wenyu Liu](http://eic.hust.edu.cn/professor/liuwenyu)1, [Xinggang Wang](https://xwcv.github.io/)1,πŸ“§

1 Huazhong University of Science and Technology, 2 vivo AI Lab

(\* equal contribution, πŸ“§ corresponding author)

[![arxiv paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2406.20076)
[![πŸ€— HuggingFace models](https://img.shields.io/badge/HuggingFaceπŸ€—-Models-orange)](https://huggingface.co/YxZhang/)
[![πŸ€— HuggingFace Demo](https://img.shields.io/badge/EVF_SAM-πŸ€—_HF_Demo-orange)](https://huggingface.co/spaces/wondervictor/evf-sam)
[![πŸ€— HuggingFace Demo](https://img.shields.io/badge/EVF_SAM_2-πŸ€—_HF_Demo-orange)](https://huggingface.co/spaces/wondervictor/evf-sam2)
[![colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hustvl/EVF-SAM/blob/main/inference_image.ipynb)

## News
* 2025.1: Preview! EVF-SAM v2 is on the way. It will support salient object segmentation, salient object matting, and referring matting, and it further improves performance on the original tasks!
* 2024.9: We have extended EVF-SAM to the powerful [SAM-2](https://github.com/facebookresearch/segment-anything-2). Besides having fewer parameters and improved image prediction, the new model also performs well on video prediction (powered by SAM-2). Although it is trained only with a simple image training process on RES datasets, we find that EVF-SAM has zero-shot text-prompted video segmentation capability. Try our code!

## Highlight



* EVF-SAM extends SAM's capabilities with text-prompted segmentation, achieving high accuracy in Referring Expression Segmentation.
* EVF-SAM is designed for efficient computation, enabling rapid inference in a few seconds per image on a T4 GPU.

## Updates
- [x] Release code
- [x] Release weights
- [x] Release demo πŸ‘‰ [πŸ€— evf-sam](https://huggingface.co/spaces/wondervictor/evf-sam)
- [x] Release code and weights based on SAM-2
- [x] Update demo supporting SAM-2πŸ‘‰ [πŸ€— evf-sam2](https://huggingface.co/spaces/wondervictor/evf-sam2)
- [x] Release new checkpoints supporting body-part segmentation and semantic-level segmentation
- [x] Update demo to support multitask segmentation

## Visualization

Example prompts from the qualitative results table (input images and output masks omitted here):

* "zebra top left"
* "a pizza with a yellow sign on top of it"
* "the broccoli closest to the ketchup bottle"
* "[semantic] hair"
* "[semantic] sea"

## Installation
1. Clone this repository
2. Install [PyTorch](https://pytorch.org/) matching your CUDA version. **Note** that torch>=2.0.0 is needed if you want to use SAM-2, and torch>=2.2 is needed if you want to enable flash-attention. (We use torch==2.0.1 with CUDA 11.7 and it works fine; a quick environment check is sketched after this list.)
3. `pip install -r requirements.txt`
4. If you want to use the video prediction function, run:
```
cd model/segment_anything_2
python setup.py build_ext --inplace
```
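
Before running inference, you can sanity-check that your environment matches the version notes above. A minimal sketch using only standard `torch` attributes:

```python
import torch

# Check the PyTorch / CUDA setup against the version notes above:
# torch>=2.0.0 for SAM-2, torch>=2.2 for flash-attention.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)  # e.g. 11.7 in the reference setup above

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```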

## Weights

| Name | SAM | BEIT-3 | Params | Prompt Encoder & Mask Decoder | Reference Score |
| :--- | :-- | :----- | :----- | :---------------------------- | :-------------- |
| EVF-SAM-multitask | SAM-H | BEIT-3-L | 1.32B | train | 84.2 |
| EVF-SAM2-multitask | SAM-2-L | BEIT-3-L | 898M | freeze | 83.2 |
| EVF-SAM | SAM-H | BEIT-3-L | 1.32B | train | 83.7 |
| EVF-SAM2 | SAM-2-L | BEIT-3-L | 898M | freeze | 83.6 |
| EVF-Effi-SAM-L | EfficientSAM-S | BEIT-3-L | 700M | train | 83.5 |
| EVF-Effi-SAM-B | EfficientSAM-T | BEIT-3-B | 232M | train | 80.0 |

1. The -multitask checkpoints only work with commits >= 9d00853, while the other checkpoints only work with commits < 9d00853.

2. The -multitask checkpoints are jointly trained on the Ref, ADE20k, Object365, PartImageNet, humanparsing, and pascal part datasets. These checkpoints can segment parts (e.g., hair, arm), background objects (e.g., sky, ground), and semantic-level masks (by adding the special token "\[semantic\] " in front of your prompt).
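
The checkpoints are hosted on the Hugging Face Hub (see the badge above). As a hedged convenience sketch, you can pre-download a checkpoint with `huggingface_hub` and point `--version` at it; the repo id `YxZhang/evf-sam2` is taken from the inference example below, and the local directory name is arbitrary:

```python
from huggingface_hub import snapshot_download

# Download an EVF-SAM checkpoint from the Hugging Face Hub.
# Swap in another checkpoint name from the table above as needed.
local_dir = snapshot_download(
    repo_id="YxZhang/evf-sam2",
    local_dir="checkpoints/evf-sam2",  # arbitrary local path
)
print("checkpoint downloaded to:", local_dir)
```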

## Inference
### 1. Image prediction
```
python inference.py \
--version <path to checkpoint or Hugging Face repo id> \
--precision='fp16' \
--vis_save_path "<path to output directory>" \
--model_type <"ori" or "effi" or "sam2", depending on your loaded ckpt> \
--image_path <path to input image> \
--prompt <text prompt>
```
`--load_in_8bit` and `--load_in_4bit` are **optional**.
For example:
```
python inference.py \
--version YxZhang/evf-sam2 \
--precision='fp16' \
--vis_save_path "vis" \
--model_type sam2 \
--image_path "assets/zebra.jpg" \
--prompt "zebra top left"
```
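
To run the same command over a folder of images, here is a minimal, hedged sketch that wraps the CLI above with `subprocess`; the image folder and prompt are placeholders to substitute with your own:

```python
import subprocess
from pathlib import Path

# Batch driver (illustrative): runs inference.py once per image,
# reusing exactly the CLI flags documented above.
image_dir = Path("assets")   # placeholder: folder of input images
prompt = "zebra top left"    # placeholder: your text prompt

for image_path in sorted(image_dir.glob("*.jpg")):
    subprocess.run(
        [
            "python", "inference.py",
            "--version", "YxZhang/evf-sam2",
            "--precision=fp16",
            "--vis_save_path", "vis",
            "--model_type", "sam2",
            "--image_path", str(image_path),
            "--prompt", prompt,
        ],
        check=True,
    )
```

Note that this reloads the model for every image; it is only a convenience wrapper around the documented command line.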

### 2. Video prediction
First, slice the video into frames:
```
ffmpeg -i <your video>.mp4 -q:v 2 -start_number 0 <frame directory>/'%05d.jpg'
```
Then run:
```
python inference_video.py \
--version <path to checkpoint or Hugging Face repo id> \
--precision='fp16' \
--vis_save_path "vis/" \
--image_path <path to frame directory> \
--prompt <text prompt> \
--model_type sam2
```
You can use frame2video.py to concatenate the predicted frames into a video.
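
Alternatively, the predicted frames can be stitched back together with ffmpeg directly. A hedged sketch, assuming the predicted frames keep the '%05d.jpg' naming used when slicing and are written under `vis/`; the frame rate and output name are placeholders:

```python
import subprocess

# Reassemble predicted frames (00000.jpg, 00001.jpg, ...) into a video.
subprocess.run(
    [
        "ffmpeg",
        "-framerate", "30",      # assumption: match your source video's frame rate
        "-i", "vis/%05d.jpg",    # assumption: predicted frames saved under vis/
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "output.mp4",
    ],
    check=True,
)
```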

## Demo
Image demo:
```
python demo.py
```
Video demo:
```
python demo_video.py
```

## Data preparation
Referring segmentation datasets: [refCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip), [refCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip), [refCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip), [refCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip) ([saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip)), and [COCO2014train](http://images.cocodataset.org/zips/train2014.zip). Arrange them as follows:
```
β”œβ”€β”€ dataset
β”‚   β”œβ”€β”€ refer_seg
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”‚   β”œβ”€β”€ saiapr_tc-12
β”‚   β”‚   β”‚   └── mscoco
β”‚   β”‚   β”‚       └── images
β”‚   β”‚   β”‚           └── train2014
β”‚   β”‚   β”œβ”€β”€ refclef
β”‚   β”‚   β”œβ”€β”€ refcoco
β”‚   β”‚   β”œβ”€β”€ refcoco+
β”‚   β”‚   └── refcocog
```
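
A small, hedged sketch to check that the directories from the tree above exist before training or evaluation; the relative location of `dataset/` under the repository root is an assumption:

```python
from pathlib import Path

# Verify the dataset layout sketched above.
root = Path("dataset/refer_seg")  # assumption: dataset/ sits in the repo root
expected = [
    root / "images" / "saiapr_tc-12",
    root / "images" / "mscoco" / "images" / "train2014",
    root / "refclef",
    root / "refcoco",
    root / "refcoco+",
    root / "refcocog",
]

for path in expected:
    status = "ok" if path.is_dir() else "MISSING"
    print(f"{status:>7}  {path}")
```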

## Evaluation
```
torchrun --standalone --nproc_per_node <number of GPUs> eval.py \
--version <path to checkpoint or Hugging Face repo id> \
--dataset_dir <path to dataset root> \
--val_dataset "refcoco|unc|val" \
--model_type <"ori" or "effi" or "sam2", depending on your loaded ckpt>
```
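
A hedged convenience sketch that launches the evaluation command above with one process per visible GPU; the checkpoint and dataset root below are assumptions taken from the inference example and the data preparation section:

```python
import subprocess

import torch

# Launch distributed evaluation with one process per visible GPU.
num_gpus = max(torch.cuda.device_count(), 1)
subprocess.run(
    [
        "torchrun", "--standalone", f"--nproc_per_node={num_gpus}", "eval.py",
        "--version", "YxZhang/evf-sam2",   # assumption: checkpoint from the weights table / HF Hub
        "--dataset_dir", "dataset",        # assumption: root from the data preparation section
        "--val_dataset", "refcoco|unc|val",
        "--model_type", "sam2",
    ],
    check=True,
)
```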

## Acknowledgement
We borrow some code from [LISA](https://github.com/dvlab-research/LISA/tree/main), [unilm](https://github.com/microsoft/unilm), [SAM](https://github.com/facebookresearch/segment-anything), [EfficientSAM](https://github.com/yformer/EfficientSAM), and [SAM-2](https://github.com/facebookresearch/segment-anything-2).

## Citation
```bibtex
@article{zhang2024evfsamearlyvisionlanguagefusion,
  title={EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model},
  author={Yuxuan Zhang and Tianheng Cheng and Rui Hu and Lei Liu and Heng Liu and Longjin Ran and Xiaoxin Chen and Wenyu Liu and Xinggang Wang},
  year={2024},
  eprint={2406.20076},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.20076},
}
```