https://github.com/amazon-science/qa-vit
https://github.com/amazon-science/qa-vit
Last synced: about 1 year ago
JSON representation
- Host: GitHub
- URL: https://github.com/amazon-science/qa-vit
- Owner: amazon-science
- License: apache-2.0
- Created: 2024-03-07T11:23:18.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-17T14:23:06.000Z (almost 2 years ago)
- Last Synced: 2025-03-31T05:11:15.534Z (about 1 year ago)
- Language: Python
- Size: 410 KB
- Stars: 64
- Watchers: 5
- Forks: 7
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
README
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz •
Yair Kittenplon •
Aviad Aberdam •
Elad Ben Avraham
Oren Nuriel •
Shai Mazor •
Ron Litman
### Installation
First, clone this repository:
```bash
git clone https://github.com/amazon-science/QA-ViT.git
cd QA-ViT
```
Next, to install the requirements in a new conda environment, run:
```bash
conda env create -f qavit.yml
conda activate qavit
```
### Data preparation
Download the following datasets from the official websites, and organize them as follows:
QA-ViT
├── configs
│ ├── ...
├── data
│ ├── textvqa
│ ├── stvqa
│ ├── OCRVQA
│ ├── vqav2
│ ├── vg
│ ├── textcaps
│ ├── docvqa
│ ├── infovqa
│ ├── vizwiz
├── models
│ ├── ...
├── ...
### DeepSpeed Configuration
Our framework is based on deepspeed stage 2 and should be configured accordingly:
```bash
accelerate config
```
The `accelerate config` opens a dialog and should be set as follows:
Model | DeepSpeed stage | Grad accumulation | Grad clipping | Dtype
--- | :---: | :---: |:-------------:| :---:
ViT+T5 base | 2 | ❌ | 1.0 | bf16 |
ViT+T5 large | 2 | ❌ | 1.0 | bf16 |
ViT+T5 xl | 2 | 2 | 1.0 | bf16 |
### Training
After setting up DeepSpeed, run the following command to train QA-ViT:
```bash
accelerate launch run_train.py --config --seed
```
### Evaluation
After setting up DeepSpeed, run the following command to evaluate a trained model:
```bash
accelerate launch run_eval.py --config --ckpt
```
where `` and `` specify the desired evaluation configuration and trained model checkpoint, respectively.
### Trained Checkpoints
We provide trained checkpoints of QA-ViT in the table below:
ViT+T5 base | ViT+T5 large | ViT+T5 xl |
--- | :---: | :---: |
Download | Download | Download
LLaVA's checkpoints will be uploaded soon.
### Main Results
| Method | VQAv2
vqa-score | COCO
CIDEr | VQAT
vqa-score | VQAST
ANLS | TextCaps
CIDEr | VizWiz
vqa-score | General
Average | Scene-Text
Average |
|--------------------|---------------------------------|------------------|------------------|----------------------------|-----------|---------|----------------------|-------------------------|
| ViT+T5-base | 66.5 | 110.0 | 40.2 | 47.6 | 86.3 | 23.7 | 88.3 | 65.1 |
| + QA-ViT | 71.7 | 114.9 | 45.0 | 51.1 | 96.1 | 23.9 | 93.3 | 72.1 |
| Δ |+5.2 | +4.9 | +4.8 | +3.5 | +9.8 | +0.2 | +5.0 | +7.0 |
| ViT+T5-large | 70.0 | 114.3 | 44.7 | 50.6 | 96.0 | 24.6 | 92.2 | 71.8 |
| + QA-ViT | 72.0 | 118.7 | 48.7 | 54.4 | 106.2 | 26.0 | 95.4 | 78.9 |
| Δ | +2.0 | +4.4 | +4.0 | +3.8 | +10.2 | +1.4 | +3.2 | +7.1 |
| ViT+T5-xl | 72.7 | 115.5 | 48.0 | 52.7 | 103.5 | 27.0 | 94.1 | 77.0 |
| + QA-ViT | 73.5 | 116.5 | 50.3 | 54.9 | 108.2 | 28.3 | 95.0 | 80.4 |
| Δ | +0.8 | +1.0 | +2.3 | +2.2 | +4.7 | +1.3 | +0.9 | +3.4 |
### Citation
If you find this code or data to be useful for your research, please consider citing it.
@article{ganz2024question,
title={Question Aware Vision Transformer for Multimodal Reasoning},
author={Ganz, Roy and Kittenplon, Yair and Aberdam, Aviad and Avraham, Elad Ben and Nuriel, Oren and Mazor, Shai and Litman, Ron},
journal={arXiv preprint arXiv:2402.05472},
year={2024}
}