https://github.com/wjpoom/SPEC
[CVPR '24] The official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
clip compositionality computer-vision fine-grained multimodal vision-language vision-language-model
- Host: GitHub
- URL: https://github.com/wjpoom/SPEC
- Owner: wjpoom
- Created: 2023-11-27T07:55:15.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-13T17:07:29.000Z (over 1 year ago)
- Last Synced: 2024-04-14T09:41:28.893Z (over 1 year ago)
- Topics: clip, compositionality, computer-vision, fine-grained, multimodal, vision-language, vision-language-model
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/abs/2312.00081
- Size: 11.4 MB
- Stars: 15
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-Segment-Anything
README
SPEC: Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
† Corresponding author

## :fire: News
* `Jun. 17, 2025` 🔥 We have released the [checkpoints](https://huggingface.co/wjpoom/SPEC-CLIP-ViT-B-32) of our fine-tuned model.
* `Apr. 13, 2024` We released the SPEC dataset and the code for evaluation, sorry for the delay :relaxed:.
* `Feb. 28, 2024` Our work has been accepted by [CVPR 2024](https://cvpr.thecvf.com/) :tada:.

## :mag: SPEC Benchmark
To evaluate the understanding capability of vision-language models on fine-grained concepts, we propose a new benchmark, SPEC,
which consists of six distinct subsets distributed across the dimensions of **S**ize, **P**osition, **E**xistence, and **C**ount.
Each test case consists of an image candidate set, whose images differ only in a certain visual concept, and a text candidate set,
whose captions differ only in the corresponding language concept.
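For intuition on how such candidate sets are typically scored, here is a minimal sketch (not the repository's evaluation code): it assumes one test case yields a square image-text similarity matrix with the correct pairings on the diagonal, and computes image-to-text and text-to-image accuracy for that case.

```python
import torch

def candidate_set_accuracy(similarity: torch.Tensor):
    """similarity: (num_images, num_texts) scores for one test case,
    assuming image i is paired with text i (square matrix)."""
    gt = torch.arange(similarity.size(0))
    i2t = (similarity.argmax(dim=1) == gt).float().mean().item()  # pick a text for each image
    t2i = (similarity.argmax(dim=0) == gt).float().mean().item()  # pick an image for each text
    return i2t, t2i

# Toy 3x3 case: the diagonal holds the correct image-text pairs
sim = torch.tensor([[0.9, 0.2, 0.1],
                    [0.1, 0.8, 0.3],
                    [0.2, 0.1, 0.7]])
print(candidate_set_accuracy(sim))  # (1.0, 1.0)
```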
## :wrench: Usage
### install
``` shell
git clone https://github.com/wjpoom/SPEC.git
cd SPEC/
pip install -e .
```
### prepare data
* Run the following code in a Python shell, replacing `/path/to/save/data` with the directory where you want to store the data.
```python
import zipfile
import os
from huggingface_hub import hf_hub_download

data_root = '/path/to/save/data'
hf_hub_download(repo_id='wjpoom/SPEC', repo_type='dataset', filename='data.zip', local_dir=data_root)

with zipfile.ZipFile(os.path.join(data_root, 'data.zip'), 'r') as zip_ref:
    zip_ref.extractall(os.path.join(data_root))
os.remove(os.path.join(data_root, 'data.zip'))
```
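Optionally, a quick sanity check that the archive unpacked where you expect (not part of the official instructions; the folder names are simply whatever ships in `data.zip`):

```python
import os

data_root = '/path/to/save/data'
# List whatever subset folders/files were extracted from data.zip
for name in sorted(os.listdir(data_root)):
    print(name)
```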
### explore the dataset
* We provide a 📓notebook that enables you to visually explore the test samples in the SPEC dataset.
* Run this notebook either [locally](https://github.com/wjpoom/SPEC/blob/main/notebooks/explore_spec_local.ipynb) or online using [Colab](https://colab.research.google.com/github/wjpoom/SPEC/blob/main/notebooks/explore_spec_colab.ipynb).

### reproduce the results
* In our paper, we evaluated four popular VLMs on the SPEC dataset: CLIP, BLIP, FLAVA, and CoCa.
* To reproduce the results with these VLMs, you can run [this script](https://github.com/wjpoom/SPEC/blob/main/spec/run_eval.sh).
* You can also reproduce the results with this [local notebook](https://github.com/wjpoom/SPEC/blob/main/notebooks/evaluate_example_local.ipynb) or the online [Colab notebook](https://colab.research.google.com/github/wjpoom/SPEC/blob/main/notebooks/evaluate_example_colab.ipynb).

### evaluate custom VLMs
* If you want to evaluate your custom model on SPEC, you can follow the instructions in [this document](https://github.com/wjpoom/SPEC/blob/main/docs/evaluate_custom_model.md).

## :space_invader: Model weights

The checkpoint of our fine-tuned CLIP model is available at [wjpoom/SPEC-CLIP-ViT-B-32](https://huggingface.co/wjpoom/SPEC-CLIP-ViT-B-32). Download it and load it with `open_clip`:
```shell
pip install open_clip_torch
mkdir checkpoints
huggingface-cli download wjpoom/SPEC-CLIP-ViT-B-32 --local-dir checkpoints/SPEC-CLIP-ViT-B-32
```
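If you prefer to stay in Python, `huggingface_hub.snapshot_download` fetches the same files as the CLI command above (an equivalent alternative, not taken from the README):

```python
from huggingface_hub import snapshot_download

# Download every file in the model repo into the local checkpoints directory
snapshot_download(repo_id='wjpoom/SPEC-CLIP-ViT-B-32', local_dir='checkpoints/SPEC-CLIP-ViT-B-32')
```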
```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='checkpoints/SPEC-CLIP-ViT-B-32', load_weights_only=False)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open("assets/image.png")).unsqueeze(0)
text = tokenizer([
"the broccoli is situated above the backpack.",
"the broccoli is situated to the right of the backpack",
"the broccoli is positioned on the left of the backpack.",
"the broccoli is placed beneath the backpack."
])

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

## :memo: TODO
- [x] Release the checkpoint of fine-tuned model
- [x] Release the testing set of SPEC benchmark
- [x] Release the evaluation code of SPEC

## :clap: Acknowledgement
Part of this repository is built upon [ARO](https://github.com/mertyg/vision-language-models-are-bows); thanks for the well-organized codebase.

## Contact Us
Feel free to contact us if you have any questions or suggestions.

Email (Wujian Peng): wjpeng24@m.fudan.edu.cn
## :black_nib: Citation
If you use our code or data in this repo, or find our work helpful, please consider citing our paper:

``` bibtex
@inproceedings{peng2024synthesize,
title={Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding},
author={Peng, Wujian and Xie, Sicheng and You, Zuyao and Lan, Shiyi and Wu, Zuxuan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13279--13288},
year={2024}
}
```