{"id":19279764,"url":"https://github.com/showlab/lova3","last_synced_at":"2025-04-09T21:14:51.135Z","repository":{"id":251722110,"uuid":"802793640","full_name":"showlab/LOVA3","owner":"showlab","description":"(NeurIPS 2024) Official PyTorch implementation of LOVA3","archived":false,"fork":false,"pushed_at":"2025-03-21T19:30:46.000Z","size":6299,"stargazers_count":81,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-09T21:14:44.599Z","etag":null,"topics":["benchmark","large-multimodal-models","multimodal-large-language-models","visual-question-answering","visual-question-generation"],"latest_commit_sha":null,"homepage":"https://zhaohengyuan1.github.io/lova3.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/showlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-19T09:33:33.000Z","updated_at":"2025-03-30T04:34:35.000Z","dependencies_parsed_at":"2025-01-03T03:07:35.171Z","dependency_job_id":"4b75f2ed-219e-4074-931f-755e5d9f502c","html_url":"https://github.com/showlab/LOVA3","commit_stats":{"total_commits":34,"total_committers":1,"mean_commits":34.0,"dds":0.0,"last_synced_commit":"950b09acfa08fb7794c35a039910ca69c49042f7"},"previous_names":["showlab/lova3"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FLOVA3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FLOVA3/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/showlab%2FLOVA3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FLOVA3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/showlab","download_url":"https://codeload.github.com/showlab/LOVA3/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248111971,"owners_count":21049578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","large-multimodal-models","multimodal-large-language-models","visual-question-answering","visual-question-generation"],"created_at":"2024-11-09T21:16:01.793Z","updated_at":"2025-04-09T21:14:51.109Z","avatar_url":"https://github.com/showlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\n  \u003ch1 align=\"center\"\u003eLOVA3: Learning to Visual Question Answering, Asking and Assessment\u003c/h1\u003e\n  \u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n        \u003ca href=\"https://arxiv.org/abs/2405.14974\"\u003e\u003cimg src='https://img.shields.io/badge/arXiv-LOVA3-red' alt='Paper PDF'\u003e\u003c/a\u003e\n        \u003ca href='https://zhaohengyuan1.github.io/lova3.github.io/'\u003e\u003cimg src='https://img.shields.io/badge/Project_Page-LOVA3-green' alt='Project Page'\u003e\u003c/a\u003e\n        \u003ca href=\"https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b\"\u003e\u003cimg src='https://img.shields.io/badge/Model-LOVA3-blue' alt='Models'\u003e\u003c/a\u003e\n        \u003ca 
href=\"https://huggingface.co/datasets/hhenryz/EvalQABench\"\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-EvalQABench-yellow' alt='EvalQABench'\u003e\u003c/a\u003e\n        \u003ca href=\"https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M\"\u003e\u003cimg src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-TrainingData-yellow' alt='Dataset'\u003e\u003c/a\u003e\n    \u003cbr\u003e\n    \u003cb\u003eTL;DR: No hyperparameter modification and extra data annotation; LOVA3 is a new training paradigm for advancing multimodal training by incorporating new capabilities: asking questions and assessing vqa triplets.\u003c/b\u003e\n  \u003c/p\u003e\n\n\u003c/p\u003e\n\n### Overall Performance Improvements\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/comprehensive_comparison.png\"\u003e\n\u003c/p\u003e\n\n## Abstract\n\nQuestion answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. However, current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. 
In this study, we introduce LOVA3, a framework designed to equip MLLMs with these additional capabilities.\n\n## 📢 Update\n* [03/03/2025] We released four models from the paper for testing. Have fun!\n* [10/16/2024] We release the [webpage](https://zhaohengyuan1.github.io/lova3.github.io/).\n* [09/26/2024] LOVA3 is accepted by NeurIPS 2024.\n* [07/01/2024] Related work [Genixer](https://github.com/zhaohengyuan1/Genixer) is accepted by ECCV 2024.\n* [05/24/2024] We release the code of LOVA3, the [EvalQABench](https://huggingface.co/datasets/hhenryz/EvalQABench), the training dataset [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M), and the checkpoint [LOVA3-llava-v1.5-7b](https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b).\n* [05/23/2024] We release the LOVA3 [paper](https://arxiv.org/abs/2405.14974).\n\n## 🌺 To Do List\n\n- [x] Use Gemini-1.5-Flash to create larger, higher-quality EvalQA training data.\n\n- [x] Apply LOVA3 to the smaller language model Phi-1.5.\n\n\n\u003c!-- ## 💡Key Contributions:\n\n* **LOVA3** - To the best of our knowledge, LOVA3 is the first effort to instill asking and assessment abilities when training a robust and intelligent MLLM, inspired by human learning mechanisms.\n* **EvalQABench** - We build a new benchmark EvalQABench for the VQA correction evaluation as the first effort to advance the development of future research.\n\n* **Performance Improvement** - Training with our proposed LOVA3 framework, we observe consistent improvement on 10 representative benchmarks.\n\n\n**Usage and License Notices**: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA and Vicuna. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. 
\n\n## GenQA: Learn to generate diverse VQA pairs for unlabeled images\n\nIf one MLLM is able to successfully generate high-quality question-answer pairs based on visual input, it indicates a stronger problem-solving ability. To enable the MLLM to ask questions, we carefully define five main multimodal data types as listed in following table.\n\u003cp align=\"center\"\u003e\u003cimg src=\"./assets/GenQAData.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e\n\n\n## EvalQA: Learn to assess the correctness of VQA triplet\n\n### Automatic Data Generation Pipeline\nIllustration of the proposed pipeline for generating negative answers and feedback.\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/EvalqaPipeline.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e\n\n### Selected examples from EvalQABench\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/evalqa_visual.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e\n\n### EvalQABench Results\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/evalqabenchresult.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e\n\n## Main Results\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/result1.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/result2.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"assets/result3.png\" alt=\"pipeline\"/\u003e\u003c/p\u003e --\u003e\n\n## 🚀 Quick Start (Training)\n\nIf you are using the codebase [LLaVA](https://github.com/haotian-liu/LLaVA), just replace the `--data_path` with [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M) to enjoy the performance improvement.\n\n```bash\ndeepspeed llava/train/train_mem.py \\\n    --deepspeed ./scripts/zero3.json \\\n    --model_name_or_path checkpoints/vicuna-7b-v1.5 \\\n    --version v1 \\\n    --data_path ./data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl \\\n    ...\n```\n\n## ⚒️ Install (Optional)\n\nIf you have the python 
environments for [LLaVA](https://github.com/haotian-liu/LLaVA), please skip this step.\n\n```shell\nconda create -n LOVA python=3.10\nconda activate LOVA\npip install --upgrade pip\npip install -e .\n```\n\n## Model weights\n\n|Model Name|Size|Checkpoint|EvalQA Data Generated By|\n|-|-|-|-|\n|LOVA3-llava-v1.5-7b|7B|[checkpoint](https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b) | Fuyu-8B |\n|LOVA3-llava-v1.5-7b-gemini|7B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-7b-gemini)| Gemini-1.5-Flash |\n|LOVA3-llava-v1.5-phi1.5-baseline|1.5B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-phi1.5-baseline)| - |\n|LOVA3-llava-v1.5-phi1.5-fuyu|1.5B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-phi1.5-fuyu) | Fuyu-8B |\n|LOVA3-llava-v1.5-phi1.5-gemini|1.5B|[checkpoint](https://huggingface.co/ZechenBai/LOVA3-llava-v1.5-phi1.5-gemini)| Gemini-1.5-Flash |\n\nDownload from Hugging Face:\n```\ngit clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b\n```\n\n## Data Preparation\n\n### Download the Data JSON\n* Training Data: [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M).\n\n* EvalQABench Data: [EvalQABench](https://huggingface.co/datasets/hhenryz/EvalQABench)\n\n### Image Datasets\n\nPlease download the images from the constituent datasets:\n\n- COCO: [train2014](http://images.cocodataset.org/zips/train2014.zip)\n- GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)\n- OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**\n- AOKVQA: [download script](https://github.com/allenai/aokvqa?tab=readme-ov-file#downloading-the-dataset)\n- TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)\n- VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), 
[part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)\n- LLaVA-Instruct: [huggingface](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)\n\n\n## 💃 Evaluation\n\n1. Download [LOVA3-llava-v1.5-7b](https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b) under the folder `checkpoints`.\n\n2. Download the CLIP vision encoder [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) under the folder `checkpoints`.\n\n3. Run the evaluation scripts under the folder `scripts/v1_5/eval`. Scripts are provided for 12 multimodal datasets and benchmarks.\n\nTaking VizWiz as an example, the commands are as follows:\n\n```\nmodelname=LOVA3-llava-v1.5-7b\n\npython -m llava.eval.model_vqa_loader \\\n    --model-path checkpoints/$modelname \\\n    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \\\n    --image-folder /yourpath/vizwiz/test/ \\\n    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \\\n    --temperature 0 \\\n    --conv-mode vicuna_v1\n\npython scripts/convert_vizwiz_for_submission.py \\\n    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \\\n    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \\\n    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json\n\n```\n\n## Training\n\n1. Download the pretrained MLP adapter weights [llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5) and put them under the folder `checkpoints`.\n\n2. Download the model weights [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) under the folder `checkpoints`.\n\n3. Download the model weights [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) under the folder `checkpoints`.\n\n4. 
Download the training data [Mixed_VQA_GenQA_EvalQA_1.5M.jsonl](https://huggingface.co/datasets/hhenryz/Mixed_VQA_GenQA_EvalQA_1.5M) under the folder `data`.\n\n5. Run the training script.\n\n```\nbash scripts/v1_5/finetune.sh\n```\n\n## 🙏 Acknowledgement\n\n- [LLaVA](https://github.com/haotian-liu/LLaVA): The codebase we built upon. \n- [LAVIS](https://github.com/salesforce/LAVIS): We downloaded some datasets using its scripts.\n\n## 🎓 Citation\n\nIf you find LOVA3 useful, please cite using this BibTeX:\n\n```bibtex\n@misc{zhao2024lova3learningvisualquestion,\n      title={LOVA3: Learning to Visual Question Answering, Asking and Assessment}, \n      author={Henry Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},\n      year={2024},\n      eprint={2405.14974},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https://arxiv.org/abs/2405.14974}, \n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Flova3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshowlab%2Flova3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Flova3/lists"}