https://github.com/orshkuri/vqa-qformer-comparison
A benchmark and analysis of QFormer, Cross Attention, and Concat models for binary Visual Question Answering (VQA) using CLIP and BERT+ViT-CLIP encoders.
- Host: GitHub
- URL: https://github.com/orshkuri/vqa-qformer-comparison
- Owner: orshkuri
- License: mit
- Created: 2025-04-24T16:37:55.000Z (6 months ago)
- Default Branch: master
- Last Pushed: 2025-05-04T12:15:02.000Z (5 months ago)
- Last Synced: 2025-05-04T12:30:33.412Z (5 months ago)
- Topics: bert, clip, deep-learning-multimodal, pytorch, pytorch-lightning, transformers, vqa
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# VQA Model Comparison: QFormer, Cross Attention & Concat
This repository presents a comparative study of three deep learning architectures for Visual Question Answering (VQA) on a binary yes/no classification dataset.
A detailed step-by-step guide to our Q-Former implementation is available in [this Medium article](https://medium.com/@ofekirsh/build-blip-2-step-by-step-59dc5a041827).
For more context, refer to the original [BLIP-2 paper](https://lnkd.in/dbFZtHqk) which inspired this work.
## Project Overview
The goal is to benchmark the following models:
- **QFormer**: A multi-task transformer with cross-modal attention and diverse objectives.
- **Cross Attention**: A model that directly applies cross-attention between visual and textual embeddings.
- **Concat Model**: A simpler baseline concatenating visual and text features before classification.

Each model supports two encoder types (a hedged encoding sketch follows this list):
- **CLIP**: For both image and text.
- **BERT + ViT**: BERT for text and ViT-CLIP for image encoding.
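The encoders supply the image and question representations that the fusion models operate on. As a rough, hedged illustration (not code from this repository), the snippet below shows how the CLIP option could produce pooled embeddings for both modalities via Hugging Face `transformers`; the checkpoint name, image path, and question are placeholders.

```python
# Hedged sketch: joint image/text embeddings from CLIP (not this repo's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")              # placeholder image path
question = "Is there a dog in the picture?"    # placeholder yes/no question

inputs = processor(text=[question], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])      # (1, 512)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])    # (1, 512)
```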
## Dataset

The dataset used for this project was sourced from the following repository:
[Visual_Question_Answering by dinhquy-nguyen-1704](https://github.com/dinhquy-nguyen-1704/Visual_Question_Answering)

Split sizes (see the loading sketch after this list):

- **Train**: 7,846 samples
- **Validation**: 1,952 samples
- **Test**: 2,022 samples
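The data format itself isn't documented in this README, so the following is only a hypothetical sketch of wrapping such a yes/no split in a PyTorch `Dataset`; the CSV columns (`image`, `question`, `answer`) and directory layout are assumptions, not the linked dataset's actual schema.

```python
# Hypothetical sketch of a binary yes/no VQA dataset wrapper (file layout assumed).
import csv
from PIL import Image
from torch.utils.data import Dataset

class BinaryVQADataset(Dataset):
    def __init__(self, annotations_csv, image_dir, transform=None):
        with open(annotations_csv, newline="") as f:
            self.rows = list(csv.DictReader(f))  # assumed columns: image, question, answer
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = 1 if row["answer"].strip().lower() == "yes" else 0  # binary target
        return image, row["question"], label
```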
## Installation

```bash
git clone https://github.com/orshkuri/vqa-qformer-comparison.git
cd vqa-qformer-comparison
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
pip install -r requirements.txt
```

## Usage
### Run All Experiments
```bash
python scripts/run.py --gpus 0,1 --models all --encoders all
```

You can customize the script using the following command-line arguments; a hedged sketch of per-run seeding (`--num_runs`, `--base_seed`) follows the table:

| Argument | Description | Default |
|----------|-------------|---------|
| `--results_dir` | Directory to save results | `../results` |
| `--models_dir` | Directory to save trained models | `../saved_models` |
| `--config_dir` | Directory containing model config files | `../configs` |
| `--data_dir` | Directory containing dataset files | `../data/vqa` |
| `--gpus` | Comma-separated list of GPU IDs to use | `0` |
| `--models` | Which models to run (`all`, `qformer`, `cross_attention`, `concat`) | `all` |
| `--encoders` | Which text encoders to use (`all`, `clip`, `bert`) | `all` |
| `--num_runs` | Number of runs per experiment | `5` |
| `--base_seed` | Base seed for random number generation | `42` |
| `--save_learning_curves` | Save learning curves for the first run (`true` or `false`) | `True` |
| `--generate_plots` | Generate plots after experiments (`true` or `false`) | `True` |
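The script's internal run loop isn't shown in this README; the snippet below is only a hedged guess at one common convention for combining `--num_runs` and `--base_seed`, seeding run *i* with `base_seed + i` via PyTorch Lightning's `seed_everything`. `train_fn` is a placeholder for a single training run.

```python
# Hedged illustration (not scripts/run.py): per-run seeds derived from a base seed.
from pytorch_lightning import seed_everything

def run_experiments(train_fn, num_runs: int = 5, base_seed: int = 42):
    results = []
    for run_idx in range(num_runs):
        seed_everything(base_seed + run_idx, workers=True)  # reproducible per-run seed
        results.append(train_fn())                          # one full training run
    return results
```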
#### Example with Custom Parameters

```bash
python scripts/run.py --gpus 1 --models cross_attention --encoders clip --num_runs 3 --base_seed 123 --save_learning_curves True --generate_plots True
```
---

### Generate Comparison Plots
```bash
python scripts/make_plots.py --results_dir results --output_dir plots
```
You can customize the script using the following command-line arguments:

| Argument | Description | Default |
|----------|-------------|---------|
| `--results_dir` | Directory containing results files | `../results` |
| `--output_dir` | Directory to save generated plots | `../plots` |

#### Example with Custom Parameters
```bash
python scripts/make_plots.py --results_dir ../results --output_dir ../plots
```
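`scripts/make_plots.py` itself isn't reproduced here; the snippet below only sketches, under assumed file names and a made-up JSON format, how per-run test accuracies could be aggregated into the bar chart described in the next section.

```python
# Hypothetical sketch (not scripts/make_plots.py): mean test accuracy per configuration.
import json
import matplotlib.pyplot as plt

def plot_test_accuracy(results_json="results/test_accuracy.json",
                       output_png="plots/test_accuracy.png"):
    # Assumed format: {"qformer_clip": [0.81, 0.80, ...], "concat_bert": [...], ...}
    with open(results_json) as f:
        runs = json.load(f)
    means = {name: sum(vals) / len(vals) for name, vals in runs.items()}

    plt.figure(figsize=(8, 4))
    plt.bar(list(means.keys()), list(means.values()))
    plt.ylabel("Test accuracy")
    plt.xticks(rotation=30, ha="right")
    plt.tight_layout()
    plt.savefig(output_png)
```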
## Results and Visualizations
**Figure:** Test accuracy across model and encoder configurations.
Additional plots are available in the [`images/`](images/) directory.
### QFormer
Multi-objective model trained with the following objectives (a hedged loss sketch follows this list):
- Answer classification
- Image-Text Contrastive (ITC)
- Image-Text Matching (ITM)
- Image-Grounded Text (IGT)
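As a hedged illustration of the multi-objective setup (not this repository's implementation), the sketch below shows a standard ITC term and a weighted sum of the four losses; the weights and temperature are placeholder assumptions.

```python
# Hedged sketch: an ITC term and a weighted combination of the four objectives.
import torch
import torch.nn.functional as F

def itc_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # Image-Text Contrastive: matched pairs lie on the diagonal of the similarity matrix.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                 # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def qformer_loss(loss_answer, loss_itc, loss_itm, loss_igt, weights=(1.0, 1.0, 1.0, 1.0)):
    # Total training loss: weighted sum of the four terms (equal weights assumed).
    return sum(w * l for w, l in zip(weights, (loss_answer, loss_itc, loss_itm, loss_igt)))
```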
### Cross Attention
Focuses on cross-attention mechanisms between image and question embeddings.
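A minimal sketch of this idea (dimensions, pooling, and classifier head are assumptions, not the repository's exact architecture):

```python
# Hedged sketch: question tokens attend to image tokens, then a linear yes/no head.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)  # single logit for the yes/no decision

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # text_tokens: (B, T_text, D) queries; image_tokens: (B, T_img, D) keys/values.
        fused, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.classifier(fused.mean(dim=1)).squeeze(-1)  # (B,) logits
```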
### Concat Model
Simple concatenation of features followed by classification layers.
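A comparable sketch of the baseline (feature sizes and hidden width are assumptions):

```python
# Hedged sketch: concatenate pooled image/text embeddings, classify with a small MLP.
import torch
import torch.nn as nn

class ConcatClassifier(nn.Module):
    def __init__(self, image_dim: int = 512, text_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit for yes/no
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor):
        return self.mlp(torch.cat([image_emb, text_emb], dim=-1)).squeeze(-1)  # (B,) logits
```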
## License
Distributed under the MIT License. See `LICENSE` for more information.