{"id":18832940,"url":"https://github.com/declare-lab/mm-instructeval","last_synced_at":"2025-04-14T04:31:46.658Z","repository":{"id":200392163,"uuid":"704364202","full_name":"declare-lab/MM-InstructEval","owner":"declare-lab","description":"This repository contains code to evaluate various multimodal large language models using different instructions across multiple multimodal content comprehension tasks.","archived":false,"fork":false,"pushed_at":"2025-03-09T10:29:38.000Z","size":34221,"stargazers_count":27,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T18:21:34.982Z","etag":null,"topics":["multimodal-content-comprehension-tasks","multimodal-large-language-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/declare-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-13T05:31:24.000Z","updated_at":"2025-03-27T01:56:34.000Z","dependencies_parsed_at":"2023-10-16T20:08:08.387Z","dependency_job_id":"f75de30f-7851-44cf-8256-b039cce7ac58","html_url":"https://github.com/declare-lab/MM-InstructEval","commit_stats":null,"previous_names":["declare-lab/mm-bigbench","declare-lab/mm-instructeval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FMM-InstructEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FMM-InstructEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/declare-lab%2FMM-InstructEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/declare-lab%2FMM-InstructEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/declare-lab","download_url":"https://codeload.github.com/declare-lab/MM-InstructEval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248821742,"owners_count":21166948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multimodal-content-comprehension-tasks","multimodal-large-language-models"],"created_at":"2024-11-08T01:59:34.401Z","updated_at":"2025-04-14T04:31:46.634Z","avatar_url":"https://github.com/declare-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks\n[Paper](https://arxiv.org/abs/2310.09036) | [Data](https://github.com/declare-lab/MM-BigBench/tree/main/multimodal_data) | [Leaderboard](https://declare-lab.github.io/MM-InstructEval/)\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"Figure/mm-bigbench.png\" alt=\"\" width=\"200\" height=\"300\"\u003e\n\u003c/p\u003e\n\n# Why?\n\nThe popularity of **multimodal large language models (MLLMs)** has triggered a recent surge in research efforts dedicated to evaluating these models. 
Nevertheless, existing evaluation studies of MLLMs, such as [MME](https://arxiv.org/abs/2306.13394), [SEED-Bench](https://arxiv.org/abs/2307.16125), [LVLM-eHub](https://arxiv.org/abs/2306.09265), and [MM-Vet](https://arxiv.org/abs/2308.02490), primarily focus on the comprehension and reasoning of unimodal (vision) content and neglect performance evaluation on multimodal (vision-language) content understanding. Beyond multimodal reasoning, multimodal content comprehension tasks require a deep understanding of multimodal contexts, in which the final answer is obtained through interaction between the modalities.\n\nIn this project, we introduce a comprehensive assessment framework called **MM-BigBench**, which uses a diverse range of metrics to evaluate the performance of **various models and instructions** across a wide spectrum of **multimodal content comprehension tasks**, including Multimodal Sentiment Analysis (MSA), Multimodal Aspect-Based Sentiment Analysis (MABSA), Multimodal Hateful Memes Recognition (MHMR), Multimodal Sarcasm Recognition (MSR), Multimodal Relation Extraction (MRE), and Visual Question Answering (VQA) with text context. 
Consequently, our work complements existing research on MLLMs by covering multimodal comprehension tasks, giving a more comprehensive and holistic evaluation of MLLMs.\n\n**MM-BigBench** uses a range of diverse metrics to provide a thorough evaluation of different models and instructions, including the Best Performance metric, the Mean Relative Gain metric, the Stability metric, and the Adaptability metric.\n\n\n## Evaluated Models (14 MLLMs)\n\n|Model Name| Modality   | Model/Code         | Paper         | PLM           | PVM      |Total-Paras | Training-Paras |\n|----------|------------|---------------------|---------------|---------------|----------|------------|----------------|\n|ChatGPT | Text | [ChatGPT](https://openai.com/blog/chatgpt) | [Paper](https://arxiv.org/abs/2303.08774) | gpt-3.5-turbo | - | - | - |\n|LLaMA1-7B | Text | [LLaMA-1](https://github.com/facebookresearch/llama/tree/llama_v1) | [Paper](https://arxiv.org/abs/2302.13971) | LLaMA-V1-7B | - | 6.74B | 6.74B |\n|LLaMA1-13B | Text | [LLaMA-1](https://github.com/facebookresearch/llama/tree/llama_v1) | [Paper](https://arxiv.org/abs/2302.13971) | LLaMA-V1-13B | - | 13.02B | 13.02B |\n|LLaMA2-7B | Text | [LLaMA-2](https://github.com/facebookresearch/llama) and [llama-recipes](https://github.com/facebookresearch/llama-recipes/) | [Paper](https://arxiv.org/abs/2307.09288) | LLaMA-V2-7B | - | 6.74B | 6.74B |\n|LLaMA2-13B | Text | [LLaMA-2](https://github.com/facebookresearch/llama) and [llama-recipes](https://github.com/facebookresearch/llama-recipes/) | [Paper](https://arxiv.org/abs/2307.09288) | LLaMA-V2-13B | - | 13.02B | 13.02B |\n|Flan-T5-XXL | Text | [Flan-T5-XXL](https://huggingface.co/google/flan-t5-xxl) | [Paper](https://arxiv.org/abs/2210.11416) | Flan-T5-XXL | - | 11.14B | 11.14B |\n|OpenFlamingo | 
Multimodal | [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) | [Paper](https://openreview.net/forum?id=EbMuimAbPbs) | LLaMA-7B | ViT-L/14 | 8.34B | 1.31B |\n|Fromage | Multimodal | [Fromage](https://github.com/kohjingyu/fromage) | [Paper](https://dl.acm.org/doi/10.5555/3618408.3619119) | OPT-6.7B | ViT-L/14 | 6.97B | 0.21B |\n|LLaVA-7B | Multimodal | [LLaVA-7B](https://github.com/haotian-liu/LLaVA) | [Paper](https://arxiv.org/abs/2304.08485) | LLaMA-7B | ViT-L/14 | 6.74B | 6.74B |\n|LLaVA-13B | Multimodal | [LLaVA-13B](https://github.com/haotian-liu/LLaVA) | [Paper](https://arxiv.org/abs/2304.08485) | LLaMA-13B | ViT-L/14 | 13.02B | 13.02B |\n|MiniGPT4 | Multimodal | [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4) | [Paper](https://arxiv.org/abs/2304.10592) | Vicuna-13B | ViT-g/14 | 14.11B | 0.04B |\n|mPLUG-Owl | Multimodal | [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl) | [Paper](https://arxiv.org/abs/2304.14178) | LLaMA-7B | ViT-L/14 | 7.12B | 7.12B |\n|LLaMA-Adapter V2 | Multimodal | [LLaMA-Adapter V2](https://github.com/ZrrSkywalker/LLaMA-Adapter) | [Paper](https://www.arxiv-vanity.com/papers/2304.15010/) | LLaMA-7B | ViT-L/14 | 7.23B | 7.23B |\n|VPGTrans | Multimodal | [VPGTrans](https://github.com/VPGTrans/VPGTrans) | [Paper](https://arxiv.org/abs/2305.01278) | Vicuna-7B | - | 7.83B | 0.11B |\n|Multimodal-GPT | Multimodal | [Multimodal-GPT](https://github.com/open-mmlab/Multimodal-GPT) | [Paper](https://arxiv.org/abs/2305.04790) | LLaMA-7B | ViT-L/14 | 8.37B | 0.02B |\n|LaVIN-7B | Multimodal | [LaVIN-7B](https://github.com/luogen1996/LaVIN) | [Paper](https://arxiv.org/abs/2305.15023) | LLaMA-7B | ViT-L/14 | 7.17B | 7.17B |\n|LaVIN-13B | Multimodal | [LaVIN-13B](https://github.com/luogen1996/LaVIN) | [Paper](https://arxiv.org/abs/2305.15023) | LLaMA-13B | ViT-L/14 | 13.36B | 13.36B |\n|Lynx | Multimodal | [Lynx](https://github.com/bytedance/lynx-llm) | [Paper](https://arxiv.org/abs/2307.02469) | Vicuna-7B | Eva-ViT-1b | 8.41B | 0.69B 
|\n|BLIP-2 |Multimodal|[BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | [Paper](https://arxiv.org/abs/2301.12597) | FlanT5-XXL | ViT-g/14 | 12.23B | 0.11B |\n|InstructBLIP | Multimodal|[InstructBLIP](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip#instructblip-towards-general-purpose-vision-language-models-with-instruction-tuning) | [Paper](https://arxiv.org/abs/2305.06500) | FlanT5-XXL | ViT-g/14 | 12.31B | 0.45B |\n\n\nNote: Refer to [Setting Models](multimodal_eval_main/models/README.md) for more information.\n\n### Results\n\nFor detailed results, please go to our [MM-BigBench leaderboard](https://declare-lab.github.io/MM-BigBench/)\n\n## Setup\n\nInstall dependencies and download data.\n\n```\nconda create -n mm-bigbench python=3.8 -y\nconda activate mm-bigbench\npip install -r requirements.txt\n```\n\n## Fast Start\n\n### Select evaluated dataset, model, prompt_type\n\n1. Select the evaluated task and dataset in the \"*.sh\" file;\n2. Select the model to be evaluated in the \"*.sh\" file;\n3. Select the prompt type, from \"1\" to \"10\" in the \"*.sh\" file;\n4. 
Run the corresponding \"*.sh\" file.\n\n\n### Running inference for different models\n\nFor the ChatGPT, LLaMA-V1-7B, LLaMA-V1-13B, LLaMA-V2-7B, LLaMA-V2-13B, Text-FlanT5-XXL, BLIP-2, InstructBLIP, Fromage, OpenFlamingo, Multimodal-GPT, mPLUG-Owl, MiniGPT4, LLaMA-Adapterv2, VPGTrans, LLaVA-7B, and LLaVA-13B models:\n```\nsh test_scripts/run_scripts.sh\n\n## Set \"model_name\" in the \".sh\" file to the value that corresponds to the model: 'chatgpt', 'decapoda-llama-7b-hf', 'decapoda-llamab-hf', 'meta-llama2-7b-hf', 'meta-llama2-13b-hf', 'text_flan-t5-xxl', 'blip2_t5', 'blip2_instruct_flant5xxl', 'fromage', 'openflamingo', 'mmgpt', 'mplug_owl', 'minigpt4', 'llama_adapterv2', 'vpgtrans', 'llava_7b', 'llava_13b'.\n\n\n```\n\nFor the LaVIN model:\n```\nsh test_scripts/run_LaVIN_zero_shot.sh\n```\n\nFor the Lynx model:\n```\nsh test_scripts/run_lynx_llm_zero_shot.sh\n```\n\n\n### Evaluate the results to get the accuracy metric\n\n```\nsh test_scripts/eval_scripts.sh\n```\n\n## Diverse Metrics\n\nMetrics used in our paper can be found in [Evaluation Metrics](evaluation).\n\n\n## Citation\n\n```bibtex\n@inproceedings{Yang2023MMBigBenchEM,\n  title={MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks},\n  author={Xiaocui Yang and Wenfang Wu and Shi Feng and Ming Wang and Daling Wang and Yang Li and Qi Sun and Yifei Zhang and Xiaoming Fu and Soujanya Poria},\n  year={2023},\n  url={https://api.semanticscholar.org/CorpusID:264127863}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeclare-lab%2Fmm-instructeval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeclare-lab%2Fmm-instructeval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeclare-lab%2Fmm-instructeval/lists"}