https://github.com/showlab/lova3

(NeurIPS 2024) Official PyTorch implementation of LOVA3

LOVA3: Learning to Visual Question Answering, Asking and Assessment







TL;DR: LOVA3 is a new training paradigm that advances multimodal training by adding two capabilities, asking questions and assessing VQA triplets, without any hyperparameter changes or extra data annotation.

### Overall Performance Improvements



## Abstract

Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. However, current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. In this study, we introduce LOVA3, a framework designed to equip MLLMs with these additional capabilities.

## 📢 Update
* [03/03/2025] We released four more models from the paper for testing. Have fun!
* [10/16/2024] We released the [webpage](https://zhaohengyuan1.github.io/lova3.github.io/).
* [09/26/2024] LOVA3 was accepted to NeurIPS 2024.
* [07/01/2024] Related work [Genixer](https://github.com/zhaohengyuan1/Genixer) was accepted to ECCV 2024.
* [05/24/2024] We release the code of LOVA3, the [EvalQABench](https://huggingface.co/datasets/hhenryz/EvalQABench), the training dataset [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M), and the checkpoint [LOVA3-llava-v1.5-7b](https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b).
* [05/23/2024] We release the LOVA3 [paper](https://arxiv.org/abs/2405.14974).

## 🌺 To Do List

- [x] Use Gemini-1.5-Flash to create larger, higher-quality EvalQA training data.

- [x] Apply LOVA3 to the smaller language model Phi-1.5.

## 🚀 Quick Start (Training)

If you are using the [LLaVA](https://github.com/haotian-liu/LLaVA) codebase, simply point `--data_path` at [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M) to get the performance improvement.

```bash
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path checkpoints/vicuna-7b-v1.5 \
--version v1 \
--data_path ./data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl \
...
```

## ⚒️ Install (Optional)

If you already have a Python environment for [LLaVA](https://github.com/haotian-liu/LLaVA), skip this step.

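```shell
# Clone this repository first; `pip install -e .` must run from its root.
git clone https://github.com/showlab/lova3.git
cd lova3
conda create -n LOVA python=3.10
conda activate LOVA
pip install --upgrade pip
pip install -e .
```

## Model weights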

|Model Name|Size|Checkpoint|EvalQA Data Generated By|
|-|-|-|-|
|LOVA3-llava-v1.5-7b|7B|[checkpoint](https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b) | Fuyu-8B |
|LOVA3-llava-v1.5-7b-gemini|7B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-7b-gemini)| Gemini-1.5-Flash |
|LOVA3-llava-v1.5-phi1.5-baseline|1.5B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-phi1.5-baseline)| - |
|LOVA3-llava-v1.5-phi1.5-fuyu|1.5B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-phi1.5-fuyu) | Fuyu-8B |
|LOVA3-llava-v1.5-phi1.5-gemini|1.5B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-phi1.5-gemini)| Gemini-1.5-Flash |

Download from Hugging Face (the weight files are stored with Git LFS, so run `git lfs install` once beforehand):
```shell
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
```

## Data Preparation

### Download the Data JSON
* Training Data: [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M).

* EvalQABench Data: [EvalQABench](https://huggingface.co/datasets/hhenryz/EvalQABench)
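
Before training, it can help to look at a few records of the downloaded file. The sketch below assumes, based only on the `.jsonl` extension, that the file holds one JSON object per line; it makes no assumption about field names and simply prints the top-level keys. The helper name `peek_jsonl` is hypothetical, and the path follows the `./data/` location used in the Quick Start command.

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as Python objects."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        # Skip blank lines so stray empty lines do not break json.loads.
        non_empty = (line for line in f if line.strip())
        for line in islice(non_empty, n):
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Assumed location; adjust to wherever you saved the file.
    sample_path = Path("data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl")
    if sample_path.exists():
        for record in peek_jsonl(sample_path):
            # The schema is not assumed here, so only the keys are shown.
            if isinstance(record, dict):
                print(sorted(record))
            else:
                print(type(record).__name__)
```

Reading lazily with `islice` avoids loading the full training file into memory just to inspect its head.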

### Image Datasets

Please download the images from the constituent datasets:

- COCO: [train2014](http://images.cocodataset.org/zips/train2014.zip)
- GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**
- AOKVQA: [download script](https://github.com/allenai/aokvqa?tab=readme-ov-file#downloading-the-dataset)
- TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
- VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
- LLaVA-Instruct: [huggingface](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
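
Since the OCR-VQA files are expected to be saved as `.jpg`, a quick scan for files with any other extension can catch problems before training starts. This is a minimal sketch, not part of the repository: the function name `find_non_jpg` and the folder `data/ocr_vqa/images` are hypothetical, so substitute the directory you actually use. The check is deliberately case-sensitive, on the assumption that the data file refers to lowercase `.jpg` names.

```python
from pathlib import Path


def find_non_jpg(folder):
    """Return all files under folder whose extension is not exactly '.jpg'."""
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix != ".jpg"
    )


if __name__ == "__main__":
    # Hypothetical folder; point this at your OCR-VQA image directory.
    ocr_vqa_dir = Path("data/ocr_vqa/images")
    if ocr_vqa_dir.is_dir():
        leftovers = find_non_jpg(ocr_vqa_dir)
        print(f"{len(leftovers)} file(s) do not end in .jpg")
        for path in leftovers[:10]:
            print(path)
```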

## 💃 Evaluation

1. Download [LOVA3-llava-v1.5-7b](https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b) into the `checkpoints` folder.

2. Download the CLIP vision encoder [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) into the `checkpoints` folder.

3. Run the evaluation scripts in the `scripts/v1_5/eval` folder, which cover 12 multimodal datasets and benchmarks.

Taking VizWiz as an example, the commands are as follows:

```bash
modelname=LOVA3-llava-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
```
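
Between the two commands above, it may be worth confirming that inference finished for every question before converting the results for submission. The sketch below compares record counts in the question and answers files. It assumes both are JSON Lines files and that the inference step writes exactly one answer per question, which the source does not state explicitly; the helper names are hypothetical.

```python
from pathlib import Path


def count_records(path):
    """Count the non-empty lines in a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def answers_complete(question_file, answers_file):
    """Return True if the answers file has as many records as the question file."""
    return count_records(question_file) == count_records(answers_file)


if __name__ == "__main__":
    modelname = "LOVA3-llava-v1.5-7b"
    base = Path("playground/data/eval/vizwiz")
    questions = base / "llava_test.jsonl"
    answers = base / "answers" / f"{modelname}.jsonl"
    if questions.exists() and answers.exists():
        status = "complete" if answers_complete(questions, answers) else "incomplete"
        print(f"{answers}: {status}")
```

A count mismatch usually points to an interrupted run, in which case rerunning the first command is safer than submitting a partial file.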

## Training

1. Download the pretrained MLP adapter weights [llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5) and put them in the `checkpoints` folder.

2. Download the model weights [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) into the `checkpoints` folder.

3. Download the model weights [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) into the `checkpoints` folder.

4. Download the training data [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M) into the `data` folder.

5. Run the training script.

```bash
bash scripts/v1_5/finetune.sh
```
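
Because a missing download typically only surfaces after the training job has started, a small pre-flight check can save time. The sketch below lists which of the artifacts from steps 1 to 4 are absent. The relative paths are assumptions derived from the download names and the `checkpoints` and `data` folders mentioned in those steps; the paths that `scripts/v1_5/finetune.sh` actually reads may differ, so compare against the script before relying on this.

```python
from pathlib import Path

# Assumed on-disk names, derived from the download steps in this section.
REQUIRED = [
    "checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5",
    "checkpoints/clip-vit-large-patch14-336",
    "checkpoints/vicuna-7b-v1.5",
    "data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl",
]


def missing_paths(root, required=REQUIRED):
    """Return the entries of required that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_paths(".")
    if missing:
        print("Missing before training:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files and folders are present.")
```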

## 🙏 Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): The codebase we built upon.
- [LAVIS](https://github.com/salesforce/LAVIS): We downloaded some datasets using its scripts.

## 🎓 Citation

If you find LOVA3 useful, please cite using this BibTeX:

```bibtex
@misc{zhao2024lova3learningvisualquestion,
title={LOVA3: Learning to Visual Question Answering, Asking and Assessment},
author={Henry Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
year={2024},
eprint={2405.14974},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2405.14974},
}
```