{"id":14964673,"url":"https://github.com/aimagelab/llava-more","last_synced_at":"2025-04-06T02:07:17.866Z","repository":{"id":251199327,"uuid":"836347646","full_name":"aimagelab/LLaVA-MORE","owner":"aimagelab","description":"LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning","archived":false,"fork":false,"pushed_at":"2025-04-01T06:24:33.000Z","size":2744,"stargazers_count":124,"open_issues_count":2,"forks_count":8,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-06T02:06:37.481Z","etag":null,"topics":["deepseek-r1","gemma-2","llama3","llama3-1","llama3-vision","llava","llava-llama3","llms","multimodal-llms","siglip","siglip2","vision-and-language"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aimagelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-31T16:49:46.000Z","updated_at":"2025-04-03T07:37:44.000Z","dependencies_parsed_at":"2024-08-01T13:05:33.291Z","dependency_job_id":"3217247d-6a8d-4c3e-95b2-f920ae168a9a","html_url":"https://github.com/aimagelab/LLaVA-MORE","commit_stats":{"total_commits":17,"total_committers":5,"mean_commits":3.4,"dds":0.5294117647058824,"last_synced_commit":"68602512158043f019186bbac4768a8610fb627a"},"previous_names":["aimagelab/llava-more"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FLLaVA-MORE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/aimagelab%2FLLaVA-MORE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FLLaVA-MORE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimagelab%2FLLaVA-MORE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aimagelab","download_url":"https://codeload.github.com/aimagelab/LLaVA-MORE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247423513,"owners_count":20936626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deepseek-r1","gemma-2","llama3","llama3-1","llama3-vision","llava","llava-llama3","llms","multimodal-llms","siglip","siglip2","vision-and-language"],"created_at":"2024-09-24T13:33:36.675Z","updated_at":"2025-04-06T02:07:17.847Z","avatar_url":"https://github.com/aimagelab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=  \"images/image_no_back.png\"\n   width=\"200\" height=\"200\"\u003e\n  \u003ch1\u003e  🔥 LLaVA-MORE 🔥\n    \n A Comparative Study of LLMs and Visual Backbones \u003cbr\u003efor Enhanced Visual Instruction Tuning\n  \u003c/h1\u003e  
\n\n[![HuggingFace](https://img.shields.io/badge/🤗_LLaVA_MORE-1d8c0a)](https://huggingface.co/collections/aimagelab/llava-more-66aa6c49167e190bf27e7be4)\n[![Paper](https://img.shields.io/badge/Paper-arxiv.2503.15621-B31B1B.svg)](https://arxiv.org/abs/2503.15621)\n[![Website](https://img.shields.io/badge/AImageLab-red)](https://aimagelab.ing.unimore.it/imagelab)\n\n\u003c/div\u003e\n\n\n\u003cdiv align='center'\u003e\n\n#### [Federico Cocchi](https://federico1-creator.github.io/Federico_Cocchi/), [Nicholas Moratelli](https://nicholasmoratelli.github.io), [Davide Caffagni](https://github.com/dcaffo98), [Sara Sarto](https://github.com/sarasarto),\n#### [Lorenzo Baraldi](https://www.lorenzobaraldi.com/), [Marcella Cornia](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90) and [Rita Cucchiara](https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=1)\n\n\u003c/div\u003e\n\n## Citation\nIf you make use of our work, please cite our repo:\n\n```bibtex\n@inproceedings{cocchi2025llava,\n      title={{LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning}},\n      author={Cocchi, Federico and Moratelli, Nicholas and Caffagni, Davide and Sarto, Sara and Baraldi, Lorenzo and Cornia, Marcella and Cucchiara, Rita},\n      booktitle={arXiv preprint arXiv:2503.15621},\n      year={2025}\n}\n```\n\n\n\n## 📢 Latest Updates\n- [2025/03/21] 🔜 Training and release of our LLaVA-MORE checkpoints with different LLMs and Visual Backbones\n- [2025/03/21] 📚 Check out [our latest paper](https://arxiv.org/abs/2503.15621)\n- [2025/03/18] 🔥 LLaVA-MORE 8B is now available on [Ollama](https://ollama.com/aimagelab/llava-more-8b)!\n- [2024/08/16] 📌 Improved LLaVA-MORE 8B model, considering advanced image backbones.\n- [2024/08/01] 🔥 First release of our LLaVA-MORE 8B, based on LLaMA 3.1.\n- [2024/08/01] 🔎 If you are interested in this area of research, check out [our survey](https://arxiv.org/abs/2402.12451) on the revolution of 
Multimodal LLMs, recently published in ACL (Findings).\n- [2024/08/01] 📚 Check out the latest research from [AImageLab](https://aimagelab.ing.unimore.it/imagelab/).\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Performance](#performance)\n3. [Checkpoints](#checkpoints)\n4. [Installation](#installation)\n5. [Training](#training)\n6. [Inference](#inference)\n7. [Acknowledgments](#acknowledgments)\n\n## Overview\n\n```LLaVA-MORE``` is a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures.\n\nTo further support the research community in enhancing Multimodal LLM performance, we are also releasing the training code and scripts for distributed training.\n\nRemember to star the repository to stay updated on future releases 🤗!\n\n## Performance\nIn this section, we present the performance of our models compared to other versions of LLaVA across different multimodal datasets.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"images/radar_plot.png\" width=\"500\"\u003e\n\u003c/div\u003e\n\n### Benchmarks and Comparisons on Instruction Multimodal Datasets in the Literature\n\n\u003cdiv align=\"center\"\u003e\n\n|       Model Name     |  Text-VQA*  |  Science-QA  |  AI2D  |  SEED-vid  |  SEED-all  |  SEED-img  |  MMMU  |  MMBench-Cn  |  MMBench-En  |  POPE  |  GQA  |   MME-P  |  MME-C  |\n|----------------------|:----------: |:------------:|:------:|:----------:|:----------:|:----------:|:------:|:------------:|:------------:|:------:|:-----:|:--------:|:-------:|\n|    LLaVA-v1.5-7B              |    58.2      |     69.0     |  56.4     |    42.0    |    61.6    |    66.8     |  34.2     |      56.5     |      65.3     |  85.6     |  62.4     |  1474.3     |  314.6     |\n| LLaVA-v1.5-LLaMA3-8B          |    57.6      |     74.2     |  60.7     |    42.0    |    64.3    |    70.1     |  37.3     |      65.4     | 
     70.3     |  85.4     |  63.5     |  1544.4     |  330.3     |\n|  **LLaVA-MORE-8B**            |    58.4      |     76.3     |  61.8     |    42.4    |    64.1    |    69.8     |  39.4     |      **68.2** |      72.4     |  85.1     |  63.6     |  1531.5     |  **353.3** |\n|  **LLaVA-MORE-8B-S2**         |    60.9      |     76.7     |  62.2     |    42.3    |    64.2    |    69.9     |  38.7     |      65.8     |      71.1     |  **86.5** |  64.5     |  **1563.8** |  293.2     |\n|  **LLaVA-MORE-8B-siglip**     |    62.1      |     **77.5** |  **63.6** |  **46.1**  |   **65.8** |    **71.0** |  39.8     |      **68.2** |      **73.1** |  86.1     |  64.6     |  1531.0     |  315.4     |\n|  **LLaVA-MORE-8B-S2-siglip**  |    **63.5**  |     77.1     |  62.7     |    44.7    |    65.5    |    **71.0** |  **40.0** |      68.0     |      71.8     |  86.0     |  **64.9** |  1541.4     |  336.4     |\n\n\u003c/div\u003e\n\n*\\* The results of TextVQA are computed with OCR tokens in the input prompt.*\n\n## Checkpoints\n\nIn the table below, you can find links to our 🤗 Hugging Face models.\n\n|         Model Name        |      🤗 Hugging Face      |             Summary                            |\n|---------------------------|:-------------------------:|------------------------------------------------|\n| LLaVA_MORE-llama_3_1-8B-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-pretrain)  | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone            |\n| LLaVA_MORE-llama_3_1-8B-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning)  | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM 
backbone         |\n| LLaVA_MORE-llama_3_1-8B-S2-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-pretrain)  | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone            |\n| LLaVA_MORE-llama_3_1-8B-S2-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-finetuning)  | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone         |\n| LLaVA_MORE-llama_3_1-8B-siglip-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-pretrain)  | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone            |\n| LLaVA_MORE-llama_3_1-8B-siglip-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-siglip-finetuning)  | Finetuned on [LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone         |\n| LLaVA_MORE-llama_3_1-8B-S2-siglip-pretrain | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-pretrain)  | Pretrained on [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone            |\n| LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning | [Hugging Face Model](https://huggingface.co/aimagelab/LLaVA_MORE-llama_3_1-8B-S2-siglip-finetuning)  | Finetuned on 
[LLaVA-Instruct-665K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and using [LLaMA 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) as LLM backbone         |\n\n\n## Installation\nTo create the conda environment named ```more``` use the following instructions.\nWith this environment you will have all the packages to run the code in this repo. \n```\nconda create -n more python==3.8.16\nconda activate more\npip install -r requirements.txt\n```\n\nNote that the requirements are heavily inspired by the original [LLaVA](https://github.com/haotian-liu/LLaVA.git) repo.\n\n## Training\nTo help the community in training complex systems in distributed scenarios, we are publicly releasing not only the source code but also the bash scripts needed to train ```LLaVA-MORE``` on HPC facilities with a SLURM scheduler.\n\nTo further extend the reproducibility of our approach, we are also releasing the [wandb logs](https://api.wandb.ai/links/aimagelab/kq668y5l) of the training runs.\n\n**Pretraining**\n\n``` bash\nsbatch scripts/more/11_pretrain_llama_31_acc_st_1.sh\n```\n**Finetuning**\n``` bash\nsbatch scripts/more/12_finetuning_llama_31_acc_st_1.sh\n```\n\n### Visual Backbones\n\nAs mentioned before, ```LLaVA-MORE``` introduces the use of LLaMA 3.1 within the LLaVA architecture for the first time. 
However, this repository goes beyond that single enhancement.\nWe have also incorporated the ability to use different visual backbones, such as SigLIP, and various methods for managing image resolutions (S2).\n\nConsidering that, you can view this repo as an effort to expand the study of Multimodal LLMs in multiple directions and as a \nstarting point for developing new features to improve the connection between images and language.\n\nYou can find more references in this folder: ```scripts/more```.\n\n\n## Inference\nYou can try our ```LLaVA-MORE``` with LLaMA 3.1 in the Image-To-Text task using the following script.\n``` bash\nsource activate more\ncd local/path/LLaVA-MORE\nexport PYTHONPATH=.\n\n# tokenizer_model_path\nexport HF_TOKEN=hf_read_token\nexport TOKENIZER_PATH=aimagelab/LLaVA_MORE-llama_3_1-8B-finetuning \n\npython -u llava/eval/run_llava.py\n```\nIf you run into out-of-memory problems, consider loading the model weights in 8-bit (```load_in_8bit=True```).\n\n## Acknowledgments\nWe thank the [LLaVA](https://github.com/haotian-liu/LLaVA.git) team for open-sourcing a modular codebase to extend and train different models within the LLaVA family.\nWe are also happy users of the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval.git) library, which has significantly reduced the evaluation time of our checkpoints across different datasets.\n\nWe also thank [CINECA](https://www.hpc.cineca.it/systems/hardware/leonardo/) for the availability of high-performance computing resources used to train ```LLaVA-MORE```. 
This work is supported by the PNRR-M4C2 project [FAIR - Future Artificial Intelligence Research](https://fondazione-fair.it/) and by the PNRR project [ITSERR - Italian Strengthening of Esfri RI Resilience](https://www.itserr.it/).\n\n\nIn case you face any issues or have any questions, please feel free to create an issue.\nAdditionally, we welcome you to open a pull request to integrate new features and contribute to our project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimagelab%2Fllava-more","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faimagelab%2Fllava-more","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimagelab%2Fllava-more/lists"}