https://github.com/amazon-science/qa-vit


# Question Aware Vision Transformer for Multimodal Reasoning




Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman



### Installation

First, clone this repository:

```bash
git clone https://github.com/amazon-science/QA-ViT.git
cd QA-ViT
```

Next, to install the requirements in a new conda environment, run:

```bash
conda env create -f qavit.yml
conda activate qavit
```

### Data preparation

Download the following datasets from their official websites and organize them under `data/` as follows:

    QA-ViT
    ├── configs
    │   ├── ...
    ├── data
    │   ├── textvqa
    │   ├── stvqa
    │   ├── OCRVQA
    │   ├── vqav2
    │   ├── vg
    │   ├── textcaps
    │   ├── docvqa
    │   ├── infovqa
    │   ├── vizwiz
    ├── models
    │   ├── ...
    ├── ...
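
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of the repository: the helper name `missing_dataset_dirs` is hypothetical, and it only assumes the dataset folder names listed in the tree, checked relative to the repository root.

```python
from pathlib import Path

# Dataset folder names copied from the directory tree above.
EXPECTED_DATASETS = [
    "textvqa", "stvqa", "OCRVQA", "vqav2", "vg",
    "textcaps", "docvqa", "infovqa", "vizwiz",
]


def missing_dataset_dirs(repo_root="."):
    """Return the expected dataset folders that are absent under <repo_root>/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_DATASETS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs(".")
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All dataset folders found.")
```

The check only looks at folder names; it does not validate the contents of each dataset.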

### DeepSpeed Configuration

Our framework relies on DeepSpeed ZeRO stage 2, which is configured through Hugging Face Accelerate:

```bash
accelerate config
```

The `accelerate config` command opens an interactive dialog. Answer it according to the table below:

| Model | DeepSpeed stage | Grad accumulation | Grad clipping | Dtype |
| --- | :---: | :---: | :---: | :---: |
| ViT+T5 base | 2 | ❌ | 1.0 | bf16 |
| ViT+T5 large | 2 | ❌ | 1.0 | bf16 |
| ViT+T5 xl | 2 | 2 | 1.0 | bf16 |
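
For scripted setups, the table can be kept as data and looked up per model size. This is only an illustrative sketch: the dictionary keys and the helper `dialog_answers` are hypothetical names, not Accelerate's configuration schema, and the ❌ entries are read here as "no gradient accumulation".

```python
# Values copied from the table above. None for gradient accumulation mirrors
# the ❌ entries, interpreted here as "gradient accumulation not used".
DEEPSPEED_SETTINGS = {
    "ViT+T5 base": {"stage": 2, "grad_accumulation": None, "grad_clipping": 1.0, "dtype": "bf16"},
    "ViT+T5 large": {"stage": 2, "grad_accumulation": None, "grad_clipping": 1.0, "dtype": "bf16"},
    "ViT+T5 xl": {"stage": 2, "grad_accumulation": 2, "grad_clipping": 1.0, "dtype": "bf16"},
}


def dialog_answers(model):
    """Return a one-line summary of the answers to give in `accelerate config`."""
    s = DEEPSPEED_SETTINGS[model]
    accumulation = "none" if s["grad_accumulation"] is None else str(s["grad_accumulation"])
    return (
        f"DeepSpeed stage: {s['stage']}, gradient accumulation: {accumulation}, "
        f"gradient clipping: {s['grad_clipping']}, dtype: {s['dtype']}"
    )


for name in DEEPSPEED_SETTINGS:
    print(f"{name}: {dialog_answers(name)}")
```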


### Training
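After setting up DeepSpeed, run the following command to train QA-ViT:

```bash
accelerate launch run_train.py --config <config> --seed <seed>
```

where `<config>` and `<seed>` specify the desired training configuration and the random seed, respectively.
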
### Evaluation

After setting up DeepSpeed, run the following command to evaluate a trained model:

```bash
accelerate launch run_eval.py --config <config> --ckpt <ckpt>
```

where `<config>` and `<ckpt>` specify the desired evaluation configuration and trained model checkpoint, respectively.
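
When running several seeds or checkpoints, the training and evaluation commands above can be assembled programmatically. The sketch below uses only the flags shown in this README (`--config`, `--seed`, `--ckpt`). The helper names and the example paths are hypothetical, and passing the seed as an integer is an assumption.

```python
import shlex


def train_command(config, seed):
    """Build the training command from the README as an argument list."""
    return ["accelerate", "launch", "run_train.py",
            "--config", str(config), "--seed", str(seed)]


def eval_command(config, ckpt):
    """Build the evaluation command from the README as an argument list."""
    return ["accelerate", "launch", "run_eval.py",
            "--config", str(config), "--ckpt", str(ckpt)]


if __name__ == "__main__":
    # Hypothetical placeholder paths; substitute a real config and checkpoint.
    config_path = "path/to/config"
    ckpt_path = "path/to/checkpoint"

    for seed in (0, 1, 2):
        print(shlex.join(train_command(config_path, seed)))
    print(shlex.join(eval_command(config_path, ckpt_path)))
    # To execute a command instead of printing it, pass the list to
    # subprocess.run(cmd, check=True).
```

Building the command as a list avoids shell-quoting problems if a path contains spaces.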

### Trained Checkpoints

We provide trained checkpoints of QA-ViT in the table below:

| ViT+T5 base | ViT+T5 large | ViT+T5 xl |
| --- | :---: | :---: |
| Download | Download | Download |

LLaVA's checkpoints will be uploaded soon.

### Main Results

| Method | VQAv2 (vqa-score) | COCO (CIDEr) | VQA<sup>T</sup> (vqa-score) | VQA<sup>ST</sup> (ANLS) | TextCaps (CIDEr) | VizWiz (vqa-score) | General Average | Scene-Text Average |
|--------------------|---------------------------------|------------------|------------------|----------------------------|-----------|---------|----------------------|-------------------------|
| ViT+T5-base | 66.5 | 110.0 | 40.2 | 47.6 | 86.3 | 23.7 | 88.3 | 65.1 |
| + QA-ViT | 71.7 | 114.9 | 45.0 | 51.1 | 96.1 | 23.9 | 93.3 | 72.1 |
| Δ |+5.2 | +4.9 | +4.8 | +3.5 | +9.8 | +0.2 | +5.0 | +7.0 |
| ViT+T5-large | 70.0 | 114.3 | 44.7 | 50.6 | 96.0 | 24.6 | 92.2 | 71.8 |
| + QA-ViT | 72.0 | 118.7 | 48.7 | 54.4 | 106.2 | 26.0 | 95.4 | 78.9 |
| Δ | +2.0 | +4.4 | +4.0 | +3.8 | +10.2 | +1.4 | +3.2 | +7.1 |
| ViT+T5-xl | 72.7 | 115.5 | 48.0 | 52.7 | 103.5 | 27.0 | 94.1 | 77.0 |
| + QA-ViT | 73.5 | 116.5 | 50.3 | 54.9 | 108.2 | 28.3 | 95.0 | 80.4 |
| Δ | +0.8 | +1.0 | +2.3 | +2.2 | +4.7 | +1.3 | +0.9 | +3.4 |
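
Each Δ row is the element-wise difference between the "+ QA-ViT" row and its baseline. As a small worked check, the sketch below recomputes those differences from the numbers in the table. The variable and function names are illustrative only, and no other quantity (such as how the averages are formed) is assumed.

```python
# (baseline row, + QA-ViT row) pairs copied from the table above, in column order:
# VQAv2, COCO, VQA^T, VQA^ST, TextCaps, VizWiz, General Average, Scene-Text Average.
RESULTS = {
    "ViT+T5-base": (
        [66.5, 110.0, 40.2, 47.6, 86.3, 23.7, 88.3, 65.1],
        [71.7, 114.9, 45.0, 51.1, 96.1, 23.9, 93.3, 72.1],
    ),
    "ViT+T5-large": (
        [70.0, 114.3, 44.7, 50.6, 96.0, 24.6, 92.2, 71.8],
        [72.0, 118.7, 48.7, 54.4, 106.2, 26.0, 95.4, 78.9],
    ),
    "ViT+T5-xl": (
        [72.7, 115.5, 48.0, 52.7, 103.5, 27.0, 94.1, 77.0],
        [73.5, 116.5, 50.3, 54.9, 108.2, 28.3, 95.0, 80.4],
    ),
}


def deltas(baseline, with_qavit):
    """Per-column improvement, rounded to one decimal like the table."""
    return [round(q - b, 1) for b, q in zip(baseline, with_qavit)]


for model, (baseline, with_qavit) in RESULTS.items():
    row = " | ".join(f"{d:+.1f}" for d in deltas(baseline, with_qavit))
    print(f"{model}: {row}")
```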

### Citation
If you find this code or data to be useful for your research, please consider citing it.

    @article{ganz2024question,
      title={Question Aware Vision Transformer for Multimodal Reasoning},
      author={Ganz, Roy and Kittenplon, Yair and Aberdam, Aviad and Avraham, Elad Ben and Nuriel, Oren and Mazor, Shai and Litman, Ron},
      journal={arXiv preprint arXiv:2402.05472},
      year={2024}
    }