https://github.com/amazon-science/qa-vit


Host: GitHub
URL: https://github.com/amazon-science/qa-vit
Owner: amazon-science
License: apache-2.0
Created: 2024-03-07T11:23:18.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-07-17T14:23:06.000Z (almost 2 years ago)
Last Synced: 2025-03-31T05:11:15.534Z (about 1 year ago)
Language: Python
Size: 410 KB
Stars: 64
Watchers: 5
Forks: 7
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# Question Aware Vision Transformer for Multimodal Reasoning

  






Roy Ganz • Yair Kittenplon • Aviad Aberdam • Elad Ben Avraham • Oren Nuriel • Shai Mazor • Ron Litman





  



### Installation

First, clone this repository:

```bash
git clone https://github.com/amazon-science/QA-ViT.git
cd QA-ViT
```

Next, install the requirements in a new conda environment:

```bash
conda env create -f qavit.yml
conda activate qavit
```

### Data preparation 

Download the following datasets from the official websites, and organize them as follows:

    QA-ViT
    ├── configs
    │   ├── ...
    ├── data
    │   ├── textvqa
    │   ├── stvqa
    │   ├── OCRVQA
    │   ├── vqav2
    │   ├── vg
    │   ├── textcaps
    │   ├── docvqa
    │   ├── infovqa
    │   ├── vizwiz
    ├── models
    │   ├── ...
    ├── ...
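
Before launching a run, it can help to confirm that the `data` folder matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `missing_datasets` is hypothetical, and it only checks that each dataset folder exists, not that its contents are complete.

```python
from pathlib import Path

# Dataset folder names copied from the layout above.
EXPECTED_DATASETS = [
    "textvqa", "stvqa", "OCRVQA", "vqav2", "vg",
    "textcaps", "docvqa", "infovqa", "vizwiz",
]


def missing_datasets(repo_root):
    """Return the expected dataset folders that are absent under <repo_root>/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_DATASETS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_datasets(".")
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All expected dataset folders are present.")
```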

### DeepSpeed Configuration

Our framework uses DeepSpeed stage 2, which is configured through `accelerate`:

```bash
accelerate config
```

Running `accelerate config` opens an interactive dialog. Answer it according to the model you plan to train:

| Model | DeepSpeed stage | Grad accumulation | Grad clipping | Dtype |
| --- | :---: | :---: | :---: | :---: |
| ViT+T5 base | 2 | ❌ | 1.0 | bf16 |
| ViT+T5 large | 2 | ❌ | 1.0 | bf16 |
| ViT+T5 xl | 2 | 2 | 1.0 | bf16 |
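
For readers who prefer to see these settings as data, the sketch below maps each table row onto the corresponding keys of DeepSpeed's JSON configuration format (`zero_optimization.stage`, `gradient_accumulation_steps`, `gradient_clipping`, `bf16.enabled`). It is only an illustration: the repository itself is configured through the `accelerate config` dialog, the `deepspeed_config` helper is hypothetical, and treating ❌ as an accumulation of 1 step is an assumption.

```python
import json

# Settings copied from the table above; None stands for the ❌ entries.
TABLE = {
    "base": {"stage": 2, "grad_accumulation": None, "grad_clipping": 1.0, "dtype": "bf16"},
    "large": {"stage": 2, "grad_accumulation": None, "grad_clipping": 1.0, "dtype": "bf16"},
    "xl": {"stage": 2, "grad_accumulation": 2, "grad_clipping": 1.0, "dtype": "bf16"},
}


def deepspeed_config(model_size):
    """Translate one table row into a DeepSpeed-style config dictionary."""
    row = TABLE[model_size]
    return {
        "zero_optimization": {"stage": row["stage"]},
        # Assumption: "no gradient accumulation" is expressed as 1 step.
        "gradient_accumulation_steps": row["grad_accumulation"] or 1,
        "gradient_clipping": row["grad_clipping"],
        "bf16": {"enabled": row["dtype"] == "bf16"},
    }


if __name__ == "__main__":
    print(json.dumps(deepspeed_config("xl"), indent=2))
```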

 

### Training

After setting up DeepSpeed, run the following command to train QA-ViT:

```bash
accelerate launch run_train.py --config <config> --seed <seed>
```
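
To repeat training over several seeds, the command can be assembled programmatically. This is a minimal sketch under stated assumptions: `train_command` is a hypothetical helper, `configs/example.yaml` is a placeholder path rather than a file known to exist in the repository, and the script only prints the commands instead of running them.

```python
import shlex


def train_command(config_path, seed):
    """Build the argument list for one training run with the given config and seed."""
    return [
        "accelerate", "launch", "run_train.py",
        "--config", str(config_path),
        "--seed", str(seed),
    ]


if __name__ == "__main__":
    # Hypothetical config path; replace it with a real file from `configs/`.
    for seed in (0, 1, 2):
        print(shlex.join(train_command("configs/example.yaml", seed)))
```

Passing the returned list to `subprocess.run(..., check=True)` would execute each run in turn.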

### Evaluation

After setting up DeepSpeed, run the following command to evaluate a trained model:

```bash
accelerate launch run_eval.py --config <config> --ckpt <ckpt>
```

where `<config>` and `<ckpt>` specify the desired evaluation configuration and the trained model checkpoint, respectively.

### Trained Checkpoints

We provide trained checkpoints of QA-ViT in the table below:

| ViT+T5 base | ViT+T5 large | ViT+T5 xl |
| :---: | :---: | :---: |
| Download | Download | Download |

LLaVA's checkpoints will be uploaded soon.

### Main Results

| Method | VQA^v2 (vqa-score) | COCO (CIDEr) | VQA^T (vqa-score) | VQA^ST (ANLS) | TextCaps (CIDEr) | VizWiz (vqa-score) | General (Average) | Scene-Text (Average) |
|--------------|------|-------|------|------|-------|------|------|------|
| ViT+T5-base  | 66.5 | 110.0 | 40.2 | 47.6 | 86.3  | 23.7 | 88.3 | 65.1 |
| + QA-ViT     | 71.7 | 114.9 | 45.0 | 51.1 | 96.1  | 23.9 | 93.3 | 72.1 |
| Δ            | +5.2 | +4.9  | +4.8 | +3.5 | +9.8  | +0.2 | +5.0 | +7.0 |
| ViT+T5-large | 70.0 | 114.3 | 44.7 | 50.6 | 96.0  | 24.6 | 92.2 | 71.8 |
| + QA-ViT     | 72.0 | 118.7 | 48.7 | 54.4 | 106.2 | 26.0 | 95.4 | 78.9 |
| Δ            | +2.0 | +4.4  | +4.0 | +3.8 | +10.2 | +1.4 | +3.2 | +7.1 |
| ViT+T5-xl    | 72.7 | 115.5 | 48.0 | 52.7 | 103.5 | 27.0 | 94.1 | 77.0 |
| + QA-ViT     | 73.5 | 116.5 | 50.3 | 54.9 | 108.2 | 28.3 | 95.0 | 80.4 |
| Δ            | +0.8 | +1.0  | +2.3 | +2.2 | +4.7  | +1.3 | +0.9 | +3.4 |
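
Each Δ row is the QA-ViT score minus the baseline score in the same column. The sketch below recomputes the Δ rows from the numbers reported above. The `RESULTS` dictionary and the `deltas` helper are illustrative names, not part of the repository, and the column labels are shortened for readability.

```python
# Column labels, shortened from the table header above.
COLUMNS = [
    "VQAv2", "COCO", "VQA-T", "VQA-ST",
    "TextCaps", "VizWiz", "General avg", "Scene-Text avg",
]

# (baseline scores, scores with QA-ViT), copied from the table above.
RESULTS = {
    "base": (
        [66.5, 110.0, 40.2, 47.6, 86.3, 23.7, 88.3, 65.1],
        [71.7, 114.9, 45.0, 51.1, 96.1, 23.9, 93.3, 72.1],
    ),
    "large": (
        [70.0, 114.3, 44.7, 50.6, 96.0, 24.6, 92.2, 71.8],
        [72.0, 118.7, 48.7, 54.4, 106.2, 26.0, 95.4, 78.9],
    ),
    "xl": (
        [72.7, 115.5, 48.0, 52.7, 103.5, 27.0, 94.1, 77.0],
        [73.5, 116.5, 50.3, 54.9, 108.2, 28.3, 95.0, 80.4],
    ),
}


def deltas(baseline, with_qavit):
    """Per-column improvement of QA-ViT over the baseline, rounded to one decimal."""
    return [round(q - b, 1) for b, q in zip(baseline, with_qavit)]


if __name__ == "__main__":
    for size, (baseline, with_qavit) in RESULTS.items():
        row = ", ".join(
            f"{col}: {d:+.1f}" for col, d in zip(COLUMNS, deltas(baseline, with_qavit))
        )
        print(f"ViT+T5-{size}: {row}")
```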

### Citation

If you find this code or data to be useful for your research, please consider citing it.

    @article{ganz2024question,
      title={Question Aware Vision Transformer for Multimodal Reasoning},
      author={Ganz, Roy and Kittenplon, Yair and Aberdam, Aviad and Avraham, Elad Ben and Nuriel, Oren and Mazor, Shai and Litman, Ron},
      journal={arXiv preprint arXiv:2402.05472},
      year={2024}
    }
