{"id":14567393,"url":"https://github.com/OpenGVLab/MMIU","last_synced_at":"2025-09-04T09:32:07.560Z","repository":{"id":251781730,"uuid":"838410273","full_name":"OpenGVLab/MMIU","owner":"OpenGVLab","description":"MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models","archived":false,"fork":false,"pushed_at":"2024-09-14T09:52:09.000Z","size":1363,"stargazers_count":35,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-09-14T20:31:25.398Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://mmiu-bench.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-05T15:28:20.000Z","updated_at":"2024-09-14T09:52:12.000Z","dependencies_parsed_at":"2024-08-05T19:03:43.432Z","dependency_job_id":"c3b5a999-6475-4d5a-8ac8-83dcb1814389","html_url":"https://github.com/OpenGVLab/MMIU","commit_stats":null,"previous_names":["opengvlab/mmiu"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMMIU","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMMIU/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMMIU/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMMIU/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/MMIU/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":231949213,"owners_count":18450456,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-07T05:01:15.346Z","updated_at":"2024-12-31T05:32:13.828Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":["Multi-modal Large Language Models (MLLMs) Datasets \u003ca id=\"multi-modal-large-language-models-mllms-datasets\"\u003e\u003c/a\u003e"],"sub_categories":["Evaluation Datasets \u003ca id=\"evaluation02\"\u003e\u003c/a\u003e"],"readme":"# Best Practice\n\n**We strongly recommend using [VLMEevalKit](https://github.com/open-compass/VLMEvalKit) for its useful features and ready-to-use LVLM implementations**.\n\n# MMIU\n\n\u003cp align=\"left\"\u003e\n  \u003ca href=\"#🚀-quick-start\"\u003e\u003cb\u003eQuick Start\u003c/b\u003e\u003c/a\u003e |\n  \u003ca href=\"https://mmiu-bench.github.io/\"\u003e\u003cb\u003eHomePage\u003c/b\u003e\u003c/a\u003e |\n  \u003ca 
href=\"https://arxiv.org/abs/2408.02718\"\u003e\u003cb\u003earXiv\u003c/b\u003e\u003c/a\u003e |\n  \u003ca href=\"https://huggingface.co/datasets/FanqingM/MMIU-Benchmark\"\u003e\u003cb\u003eDataset\u003c/b\u003e\u003c/a\u003e |\n  \u003ca href=\"#🖊️-citation\"\u003e\u003cb\u003eCitation\u003c/b\u003e\u003c/a\u003e \u003cbr\u003e\n\u003c/p\u003e\n\n\nThis repository is the official implementation of [MMIU](https://arxiv.org/abs/2408.02718). \n\n\u003e [MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models](https://arxiv.org/abs/2408.02718)  \n\u003e Fanqing Meng\u003csup\u003e\\*\u003c/sup\u003e, Jin Wang\u003csup\u003e\\*\u003c/sup\u003e, Chuanhao Li\u003csup\u003e\\*\u003c/sup\u003e, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai,  Yu Qiao, Ping Luo, Kaipeng Zhang\u003csup\u003e\\#\u003c/sup\u003e, Wenqi Shao\u003csup\u003e\\#\u003c/sup\u003e  \n\u003e \u003csup\u003e\\*\u003c/sup\u003e MFQ, WJ and LCH contribute equally.  \n\u003e \u003csup\u003e\\#\u003c/sup\u003e SWQ (shaowenqi@pjlab.org.cn) and ZKP (zhangkaipeng@pjlab.org.cn) are correponding authors. \n\n## 💡 News\n\n- `2024/08/13`: We have released the codes. \n\n- `2024/08/08`: We have released the dataset at https://huggingface.co/datasets/FanqingM/MMIU-Benchmark 🔥🔥🔥\n\n- `2024/08/05`: The datasets and codes are coming soon! 🔥🔥🔥\n\n- `2024/08/05`: The technical report of [MMIU](https://arxiv.org/abs/2408.02718) is released! And check our [project page](https://mmiu-bench.github.io/)! 🔥🔥🔥\n\n\n## Introduction\nMultimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. \n![overview](assets/overview.jpg)\n\n\n\n\n\n\n\n\n## Evaluation Results Overview\n- The closed-source proprietary model GPT-4o from OpenAI has taken a leading position in MMIU, surpassing other models such as InternVL2-pro, InternVL1.5-chat, Claude3.5-Sonnet, and Gemini1.5 flash. 
## Evaluation Results Overview
- The closed-source proprietary model GPT-4o from OpenAI takes the leading position on MMIU, surpassing other models such as InternVL2-pro, InternVL1.5-chat, Claude3.5-Sonnet, and Gemini1.5 flash. Note that the open-source model InternVL2-pro ranks just behind these proprietary models.
- Some powerful LVLMs such as InternVL1.5 and GLM4V, whose pre-training data contain no multi-image content, even outperform many multi-image models that undergo multi-image supervised fine-tuning (SFT), indicating that strong single-image understanding is the foundation of multi-image comprehension.
- Comparing performance at the level of image relationships, we conclude that LVLMs excel at understanding semantic content in multi-image scenarios but are weaker at comprehending temporal and spatial relationships across images.
- The task-map analysis reveals that models perform better on high-level, in-domain understanding tasks such as video captioning, but struggle with out-of-domain tasks such as 3D perception (e.g., 3D detection) and temporal reasoning (e.g., image ordering).
- The task learning-difficulty analysis shows that tasks involving ordering, retrieval, and large numbers of images cannot be mastered by simple SFT, suggesting that additional pre-training data or new training techniques are needed for improvement.
![taskmap](assets/taskmap.jpg)


## 🏆 Leaderboard

| Rank | Model | Score |
| ---- | ---------------------- | ----- |
| **1** | **GPT4o** | **55.72** |
| 2 | Gemini | 53.41 |
| 3 | Claude3 | 53.38 |
| **4** | **InternVL2** | **50.30** |
| 5 | Mantis | 45.58 |
| 6 | Gemini1.0 | 40.25 |
| 7 | internvl1.5-chat | 37.39 |
| 8 | Llava-interleave | 32.37 |
| 9 | idefics2_8b | 27.80 |
| 10 | glm-4v-9b | 27.02 |
| 11 | deepseek_vl_7b | 24.64 |
| 12 | XComposer2_1.8b | 23.46 |
| 13 | deepseek_vl_1.3b | 23.21 |
| 14 | flamingov2 | 22.26 |
| 15 | llava_next_vicuna_7b | 22.25 |
| 16 | XComposer2 | 21.91 |
| 17 | MiniCPM-Llama3-V-2_5 | 21.61 |
| 18 | llava_v1.5_7b | 19.19 |
| 19 | sharegpt4v_7b | 18.52 |
| 20 | sharecaptioner | 16.10 |
| 21 | qwen_chat | 15.92 |
| 22 | monkey-chat | 13.74 |
| 23 | idefics_9b_instruct | 12.84 |
| 24 | qwen_base | 5.16 |
| - | Frequency Guess | 31.5 |
| - | Random Guess | 27.4 |


## 🚀 Quick Start

We mainly use the VLMEvalKit framework for testing, with a few models tested separately. For multi-image models, we include the following (grouped by the `transformers` version they require):

**transformers == 4.33.0**

- `XComposer2`
- `XComposer2_1.8b`
- `qwen_base`
- `idefics_9b_instruct`
- `qwen_chat`
- `flamingov2`

**transformers == 4.37.0**

- `deepseek_vl_1.3b`
- `deepseek_vl_7b`

**transformers == 4.40.0**

- `idefics2_8b`

For single-image models, we include the following:

**transformers == 4.33.0**

- `sharecaptioner`
- `monkey-chat`

**transformers == 4.37.0**

- `sharegpt4v_7b`
- `llava_v1.5_7b`
- `glm-4v-9b`

**transformers == 4.40.0**

- `llava_next_vicuna_7b`
- `MiniCPM-Llama3-V-2_5`

Testing is driven by VLMEvalKit; you can refer to the code in `VLMEvalKit/test_models.py`. Additionally, for closed-source models, adapt the generation call following this example (a fuller, hypothetical loop is sketched after this section):

```python
response = model.generate(tmp)  # tmp = image_paths + [question]
```

For other open-source models, we provide reference code for `Mantis` and `InternVL1.5-chat`. For `LLava-Interleave`, please refer to the original repository.
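For orientation, here is a minimal sketch of how such a multi-image, multiple-choice evaluation loop could look. It is not the repository's actual test script: `evaluate`, the sample field names, and the option-letter matching rule are hypothetical placeholders standing in for the logic in `VLMEvalKit/test_models.py`.

```python
# Hypothetical sketch of a multi-image evaluation loop in the spirit of
# VLMEvalKit/test_models.py. The sample fields ("question", "options",
# "image_paths", "answer") and the A-H matching rule are placeholders,
# not the repository's actual API.
import re


def evaluate(model, samples):
    """Ask the model each multiple-choice question and report accuracy."""
    correct = 0
    for sample in samples:
        # Build the prompt as image paths followed by the question text,
        # mirroring `tmp = image_paths + [question]` from the snippet above.
        question = sample["question"] + "\n" + sample["options"]
        tmp = sample["image_paths"] + [question]
        response = model.generate(tmp)

        # Extract the first standalone option letter from the response.
        match = re.search(r"\b([A-H])\b", response)
        predicted = match.group(1) if match else None
        correct += int(predicted == sample["answer"])
    return correct / max(len(samples), 1)
```

In the real pipeline, VLMEvalKit handles prompt construction and answer extraction; this sketch only illustrates the overall flow.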
## 💐 Acknowledgement

We express our sincere gratitude to the following projects:
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) provides useful out-of-the-box tools and implements many advanced LVLMs. Thanks for their selfless dedication.
- The InternVL team for their APIs.


## 📧 Contact
If you have any questions, feel free to contact Fanqing Meng at mengfanqing33@gmail.com.


## 🖊️ Citation
If you find MMIU useful in your project or research, please kindly use the following BibTeX entry to cite our paper. Thanks!

```
@article{meng2024mmiu,
  title={MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models},
  author={Meng, Fanqing and Wang, Jin and Li, Chuanhao and Lu, Quanfeng and Tian, Hao and Liao, Jiaqi and Zhu, Xizhou and Dai, Jifeng and Qiao, Yu and Luo, Ping and others},
  journal={arXiv preprint arXiv:2408.02718},
  year={2024}
}
```