[AAAI 2025] Does VLM Classification Benefit from LLM Description Semantics?
https://github.com/compvis/disclip


# 🦙 Does VLM Classification Benefit from LLM Description Semantics?

Pingchuan Ma\* · Lennart Rietdorf\* · Dmytro Kotovenko · Vincent Tao Hu · Björn Ommer

CompVis Group @ LMU Munich, MCML

\* equal contribution

Paper

## 📖 Overview
Describing images accurately through text is key to explainability. Vision-Language Models (VLMs) such as CLIP align images and texts in a shared embedding space, and descriptions generated by Large Language Models (LLMs) can further improve their classification performance. However, it remains unclear whether these performance gains stem from true semantics or from semantic-agnostic ensembling effects, as questioned by several prior works. To address this, we propose an alternative evaluation scenario that isolates the discriminative power of descriptions, and we introduce a training-free method for selecting discriminative descriptions. This method improves classification accuracy across datasets by leveraging CLIP’s local label neighborhood, offering insights into description-based classification and explainability in VLMs. [Figure 1](#fig1) depicts this procedure.

This repository is the official implementation of the paper **"Does VLM Classification Benefit from LLM Description Semantics?"**. It enables the evaluation of Vision-Language Model (VLM) classification accuracy across different datasets, leveraging the semantics of descriptions generated by Large Language Models (LLMs).
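As background for the pipeline, description-based classification can be sketched as follows: an image is assigned to the class whose description embeddings are, on average, most similar to the image embedding. The toy example below uses plain Python lists in place of CLIP embeddings; the function names and 2-D vectors are illustrative only and not part of this codebase.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_by_descriptions(image_emb, class_descriptions):
    # class_descriptions: {class_name: [description_embedding, ...]}
    # score each class by the mean similarity of its descriptions to the image
    scores = {
        name: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for name, descs in class_descriptions.items()
    }
    return max(scores, key=scores.get)

# toy 2-D "embeddings": the image embedding is closest to the "rose" pool
image = [0.9, 0.1]
pools = {
    "rose":  [[1.0, 0.0], [0.8, 0.2]],
    "tulip": [[0.1, 0.9], [0.0, 1.0]],
}
print(classify_by_descriptions(image, pools))  # prints "rose"
```

The description selection step evaluated in this repository then asks which descriptions in each pool actually carry discriminative signal for this scoring, rather than averaging over all of them.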



<a id="fig1"></a>

*Figure 1: Depiction of the suggested approach.*


## 🛠️ Setup
### Environment
Results were obtained using `Ubuntu 22.04.5 LTS`, `CUDA 11.8`, and `Python 3.10.14`.

Install the necessary dependencies manually via
```bash
conda create -n disclip python=3.10.14
conda activate disclip
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm
pip install torchmetrics
pip install git+https://github.com/modestyachts/ImageNetV2_pytorch
pip install pyyaml
pip install git+https://github.com/openai/CLIP.git
pip install requests
```
The resulting Python environment corresponds to `requirements.txt`.
### Datasets

The datasets supported by this implementation are:
- **Flowers102**
- **DTD (Describable Textures Dataset)**
- **Places365**
- **EuroSAT**
- **Oxford Pets**
- **Food101**
- **CUB-200**
- **ImageNet**
- **ImageNet V2**

Most of these datasets will be downloaded automatically as `torchvision` datasets and stored in `./datasets` during the first run of `main.py`. Instructions for the datasets that must be set up manually can be found below.

### CUB-200 Dataset
The CUB-200 dataset must be downloaded manually, e.g. from https://data.caltech.edu/records/65de6-vp158 via
```bash
wget "https://data.caltech.edu/records/65de6-vp158/files/CUB_200_2011.tgz?download=1" -O CUB_200_2011.tgz
```
After that, create a directory `./datasets/cub_200` where you unpack `CUB_200_2011.tgz`. The dataset is then ready for embedding.
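If you prefer to script the unpacking step, the following standard-library sketch creates the target directory and extracts the archive into it; the helper name is ours, but the paths mirror those above.

```python
import tarfile
from pathlib import Path

def extract_cub(archive="CUB_200_2011.tgz", target="./datasets/cub_200"):
    # create the dataset directory and unpack the downloaded archive into it
    Path(target).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)

# usage (after downloading the archive as shown above):
# extract_cub()
```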

### ImageNet Dataset
Follow the instructions at https://pytorch.org/vision/main/generated/torchvision.datasets.ImageNet.html to download ImageNet's dataset files, and save them to `./datasets/ilsvrc`. The dataset is then ready to be used and embedded by the `main.py` script.

### ImageNetV2 Dataset
ImageNetV2 is an additional test dataset for the ImageNet training dataset.
This dataset requires the `ImageNetV2_pytorch` package installed in the _Environment_ section above.
The dataset files will be downloaded automatically.

## 🦙 Description Pools
Available description pools can be found under `./descriptions`. DClip descriptions are taken from https://github.com/sachit-menon/classify_by_description_release.
The description pools supported by this implementation are:
- **DClip**
- **Contrastive Llama**

Assignments of selected descriptions will be saved as JSON files to `./saved_descriptions`.
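As an illustration only, a class-to-descriptions mapping could be written and reloaded as below; the file name and JSON structure here are assumptions for the sketch, not the repository's actual output format.

```python
import json
from pathlib import Path

# hypothetical example: class names mapped to their selected descriptions
# (the actual schema is produced by main.py and may differ)
selected = {
    "sunflower": ["large yellow petals", "dark circular center"],
    "daisy": ["white petals", "yellow center"],
}

out_dir = Path("./saved_descriptions")
out_dir.mkdir(exist_ok=True)
out_file = out_dir / "flowers_dclip_example.json"
out_file.write_text(json.dumps(selected, indent=2))

# reload for inspection
loaded = json.loads(out_file.read_text())
assert loaded == selected
```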

## 🔢 Embeddings
On the first run of `main.py`, the dataset is embedded by the chosen CLIP backbone before the description selection pipeline depicted in [Figure 1](#fig1) is executed. The image embeddings are stored in `./image_embeddings` for reuse, which speeds up subsequent runs of the script.
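The caching follows the usual compute-once, load-thereafter pattern, sketched below with `pickle` and plain lists standing in for the actual CLIP tensors; the helper name and file layout are illustrative, not the repository's.

```python
import pickle
from pathlib import Path

def get_image_embeddings(dataset_name, compute_fn, cache_dir="./image_embeddings"):
    # load cached embeddings if present, otherwise compute and store them
    cache = Path(cache_dir) / f"{dataset_name}.pkl"
    if cache.exists():
        return pickle.loads(cache.read_bytes())
    embeddings = compute_fn()  # expensive step: run the image encoder over the dataset
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_bytes(pickle.dumps(embeddings))
    return embeddings
```

On the second call for the same dataset, `compute_fn` is never invoked and the embeddings come straight from disk.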

## 🚀 Usage
### Run
To run the whole pipeline as depicted in [Figure 1](#fig1), call the script `main.py`. As stated above, a dataset is downloaded and embedded on its first run. Use the following command with the options described below:

```bash
python main.py --dataset <dataset> --pool <pool> --encoding_device <id> --calculation_device <id> --backbone <backbone>

# example: evaluate Flowers102 with DClip descriptions, encoding on GPU 0 and evaluating on GPU 1
python main.py --dataset flowers --pool dclip --encoding_device 0 --calculation_device 1
```

### Arguments

`--dataset`
Choose the dataset to evaluate. Available options are:
- **flowers**
- **dtd**
- **eurosat**
- **places**
- **food**
- **pets**
- **cub**
- **ilsvrc**
- **imagenet_v2**

_Be aware that downloading and embedding the Places365 dataset may take a long time._

**Default:** `flowers`

`--pool`
Select the description pool to use for the evaluation. Available options are:
- **dclip**
- **con_llama**

**Default:** `dclip`

`--encoding_device` and `--calculation_device`
CUDA device IDs (integers): `--encoding_device` selects the device used for encoding images and texts, `--calculation_device` the device used for the evaluation.

**Default:** `0` and `1`

`--backbone`
Select the OpenAI CLIP ViT backbone. Available options are:
- **b32**
- **b16**
- **l14**
- **l14@336px**

**Default:** `b32`
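Pieced together, the documented interface corresponds to an argument parser along the following lines. This is a sketch of the flags as described above; the actual definitions in `main.py` may differ.

```python
import argparse

def build_parser():
    # mirrors the command-line options documented in this README
    p = argparse.ArgumentParser(description="Description-based VLM classification")
    p.add_argument("--dataset", default="flowers",
                   choices=["flowers", "dtd", "eurosat", "places", "food",
                            "pets", "cub", "ilsvrc", "imagenet_v2"])
    p.add_argument("--pool", default="dclip", choices=["dclip", "con_llama"])
    p.add_argument("--encoding_device", type=int, default=0)
    p.add_argument("--calculation_device", type=int, default=1)
    p.add_argument("--backbone", default="b32",
                   choices=["b32", "b16", "l14", "l14@336px"])
    return p

# example: build_parser().parse_args(["--dataset", "cub", "--pool", "con_llama"])
```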

## 📈 Results
Our evaluation demonstrates that the proposed method significantly outperforms baselines in the classname-free setup, which minimizes artificial gains from the ensembling effect. We further show that these improvements transfer to the conventional evaluation setup, achieving competitive results while requiring substantially fewer descriptions and offering better interpretability.
![Results](/assets/tab_wo_cls.jpg)
![Results](/assets/tab_cls.jpg)

## 🎓 Citation

If you use this codebase or otherwise find our work valuable, please cite our paper:

```bibtex
@inproceedings{ma2025does,
  title={Does VLM Classification Benefit from LLM Description Semantics?},
  author={Ma, Pingchuan and Rietdorf, Lennart and Kotovenko, Dmytro and Hu, Vincent Tao and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={6},
  pages={5973--5981},
  year={2025}
}
```