{"id":13628896,"url":"https://github.com/OpenGVLab/MM-NIAH","last_synced_at":"2025-04-17T04:32:32.886Z","repository":{"id":243988096,"uuid":"810632597","full_name":"OpenGVLab/MM-NIAH","owner":"OpenGVLab","description":"[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. ","archived":false,"fork":false,"pushed_at":"2024-11-25T04:59:21.000Z","size":2963,"stargazers_count":102,"open_issues_count":1,"forks_count":6,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-25T05:26:31.248Z","etag":null,"topics":["benchmark","long-context","multimodal-large-language-models","vision-language-model"],"latest_commit_sha":null,"homepage":"https://mm-niah.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-05T04:32:38.000Z","updated_at":"2024-11-25T04:59:24.000Z","dependencies_parsed_at":"2024-11-08T19:43:04.289Z","dependency_job_id":null,"html_url":"https://github.com/OpenGVLab/MM-NIAH","commit_stats":null,"previous_names":["opengvlab/mm-niah"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMM-NIAH","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMM-NIAH/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMM-NIAH/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2FMM-NIAH/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/MM-NIAH/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249315998,"owners_count":21249871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","long-context","multimodal-large-language-models","vision-language-model"],"created_at":"2024-08-01T22:00:59.132Z","updated_at":"2025-04-17T04:32:30.466Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":["Multi-modal Large Language Models (MLLMs) Datasets \u003ca id=\"multi-modal-large-language-models-mllms-datasets\"\u003e\u003c/a\u003e"],"sub_categories":["Evaluation Datasets \u003ca id=\"evaluation02\"\u003e\u003c/a\u003e"],"readme":"# \u003cimg width=\"60\" alt=\"image\" src=\"assets/logo.png\"\u003e Needle In A Multimodal Haystack\n\n[[Project Page](https://mm-niah.github.io/)]\n[[arXiv Paper](http://arxiv.org/abs/2406.07230)]\n[[Dataset](https://huggingface.co/datasets/OpenGVLab/MM-NIAH)]\n[[Leaderboard](https://mm-niah.github.io/#overall_test_leaderboard)]\n\u003c!-- [[Github](https://github.com/OpenGVLab/MM-NIAH)] --\u003e\n\n## News🚀🚀🚀\n- `2024/10/15`: [LMDeploy](https://github.com/InternLM/lmdeploy) is now supported for the evaluation of MM-NIAH, thanks to [ttguoguo3](https://github.com/ttguoguo3)!\n- `2024/09/27`: MM-NIAH is accepted to NeurIPS 2024 Track Datasets and 
Benchmarks! 🎉\n- `2024/07/04`: 🚀We have updated the performance of [InternVL2-Pro](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) on our leaderboard and support the evaluation of InternVL2-Pro.\n- `2024/06/13`: 🚀We release Needle In A Multimodal Haystack ([MM-NIAH](https://huggingface.co/OpenGVLab/MM-NIAH)), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.\n**Experimental results show that the performance of Gemini-1.5 on tasks involving image needles is no better than random guessing.**\n\n## Introduction\n\nNeedle In A Multimodal Haystack (MM-NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.\nThis benchmark requires the model to answer specific questions according to the key information scattered throughout the multimodal document.\nThe evaluation data in MM-NIAH consists of three tasks: `retrieval`, `counting`, and `reasoning`. The needles are inserted into either text or images in the documents. Those inserted into text are termed `text needles`, whereas those within images are referred to as `image needles`.\nPlease see [our paper](http://arxiv.org/abs/2406.07230) for more details.\n\n\u003cimg width=\"800\" alt=\"image\" src=\"assets/data_examples.jpg\"\u003e\n\n## Main Findings\n\nBased on our benchmark, we conducted a series of experiments. The main findings are summarized as follows:\n\n- The most advanced MLLMs (e.g. 
Gemini-1.5) still struggle to comprehend multimodal documents.\n\n- **All MLLMs exhibit poor performance on image needles.**\n\n- MLLMs fail to recognize the exact number of images in the document.\n\n- Models pre-trained on image-text interleaved data do not exhibit superior performance.\n\n- Training on background documents does not boost performance on MM-NIAH.\n\n- The \"Lost in the Middle\" problem also exists in MLLMs.\n\n- Long context capability of LLMs is NOT retained in MLLMs.\n\n- RAG boosts Text Needle Retrieval but not Image Needle Retrieval.\n\n- Placing questions before context does NOT improve model performance.\n\n- Humans achieve near-perfect performance on MM-NIAH.\n\n\nPlease see [our paper](http://arxiv.org/abs/2406.07230) for more detailed analyses.\n\n## Experimental Results\n\nFor the retrieval and reasoning tasks, we utilize Accuracy as the evaluation metric.\n\nFor the counting task, we use Soft Accuracy, defined as $\\frac{1}{N} \\sum_{i=1}^{N} \\frac{m_i}{M_i}$, where $m_i$ is the number of matched elements in the corresponding positions between the predicted and ground-truth lists and $M_i$ is the number of elements in the ground-truth list for the $i$-th sample. Note that the required output for this task is a list.\n\n\u003cimg width=\"800\" alt=\"image\" src=\"assets/main_table.jpg\"\u003e\n\n\u003c!-- \u003cdetails\u003e --\u003e\n\u003c!-- \u003csummary\u003eHeatmaps (click to expand)\u003c/summary\u003e --\u003e\n\u003cimg width=\"800\" alt=\"image\" src=\"assets/main_heatmap.jpg\"\u003e\n\u003c!-- \u003c/details\u003e --\u003e\n\n\u003c!-- \u003cdetails\u003e --\u003e\n\u003c!-- \u003csummary\u003eTables (click to expand)\u003c/summary\u003e --\u003e\n\u003cimg width=\"800\" alt=\"image\" src=\"assets/subtasks_table.jpg\"\u003e\n\u003c!-- \u003c/details\u003e --\u003e\n\n## Evaluation\n\nTo calculate the scores, please prepare the model responses in jsonl format, like this [example](outputs_example/example-retrieval-text.jsonl). 
Then you can place all jsonl files in a single folder and execute our script [calculate_scores.py](calculate_scores.py) to get the heatmaps and scores.\n\n```shell\npython calculate_scores.py --outputs-dir /path/to/your/responses\n```\n\nFor example, if you want to reproduce the experimental results of [InternVL-1.5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), you should first install the environment following [the document](https://github.com/OpenGVLab/InternVL/blob/main/INSTALLATION.md) and download [the checkpoints](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5). Then you can execute the evaluation script [eval_internvl.py](eval_internvl.py) for InternVL to obtain the results, using the following commands:\n\n```shell\nsh shells/eval_internvl.sh\npython calculate_scores.py --outputs-dir ./outputs/\n```\n\nIf you want to reproduce the results of InternVL-1.5-RAG, please first prepare the retrieved segments using the following commands:\n\n```shell\nsh shells/prepare_rag.sh\n```\n\nThen, run these commands to obtain the results of InternVL-1.5-RAG:\n\n```shell\nsh shells/eval_internvl_rag.sh\npython calculate_scores.py --outputs-dir ./outputs/\n```\n\nIf you want to evaluate the model with LMDeploy, run the following command to obtain the result jsonl file for a single task:\n\n```shell\nsrun -p VC5 --gres=gpu:8 --ntasks=1 --ntasks-per-node=1 python ./mmniah_lmdeploy.py --file-name=retrieval-image --model-path=/path/to/your/model/ --file-dir=/path/to/MMNIAH/jsonl/dir/ --image-dir=/path/to/MMNIAH/image/dir/ --save-dir=/path/to/save/dir/\n```\n\nAfter obtaining all six result jsonl files, run the following command to get the final test results:\n\n```shell\npython ./val_mmniah.py --file-dir /path/to/result/dir/\n```\n\n`NOTE`: Make sure that [flash-attention](https://github.com/Dao-AILab/flash-attention) is installed successfully; otherwise you will encounter a torch.cuda.OutOfMemoryError.\n\n## Leaderboard\n\n🚨🚨 The leaderboard is continuously being updated.\n\nTo submit 
your results to the leaderboard on MM-NIAH, please send your result jsonl files for each task to [this email](mailto:wangweiyun@pjlab.org.cn), following the template file [example-retrieval-text.jsonl](outputs_example/example-retrieval-text.jsonl).\nPlease organize the result jsonl files as follows:\n\n```\n├── ${model_name}_retrieval-text-val.jsonl\n├── ${model_name}_retrieval-image-val.jsonl\n├── ${model_name}_counting-text-val.jsonl\n├── ${model_name}_counting-image-val.jsonl\n├── ${model_name}_reasoning-text-val.jsonl\n├── ${model_name}_reasoning-image-val.jsonl\n├──\n├── ${model_name}_retrieval-text-test.jsonl\n├── ${model_name}_retrieval-image-test.jsonl\n├── ${model_name}_counting-text-test.jsonl\n├── ${model_name}_counting-image-test.jsonl\n├── ${model_name}_reasoning-text-test.jsonl\n└── ${model_name}_reasoning-image-test.jsonl\n```\n\n## Visualization\n\nIf you want to visualize samples in MM-NIAH, please install `gradio==3.43.2` and run this script [visualization.py](visualization.py).\n\n## Data Format\n\n```python\n{\n    # int, starting from 0, each task type has independent ids.\n    \"id\": xxx,\n    # List of length N, where N is the number of images. Each element is a string representing the relative path of the image. The image contained in the \"choices\" is not included here, only the images in the \"context\" and \"question\" are recorded.\n    \"images_list\": [\n        \"xxx\",\n        \"xxx\",\n        \"xxx\"\n    ],\n    # str, multimodal haystack, \"\u003cimage\u003e\" is used as the image placeholder.\n    \"context\": \"xxx\",\n    # str, question\n    \"question\": \"xxx\",\n    # Union[str, int, List], records the standard answer. Open-ended questions are str or List (counting task), multiple-choice questions are int\n    \"answer\": \"xxx\",\n    # meta_info, records various statistics\n    \"meta\": {\n        # Union[float, List[float]], range [0,1], position of the needle. 
If multiple needles are inserted, it is List[float].\n        \"placed_depth\": xxx,\n        # int, number of text and visual tokens\n        \"context_length\": xxx,\n        # int, number of text tokens\n        \"context_length_text\": xxx,\n        # int, number of image tokens\n        \"context_length_image\": xxx,\n        # int, number of images\n        \"num_images\": xxx,\n        # List[str], inserted needles. If it is a text needle, record the text; if it is an image needle, record the relative path of the image.\n        \"needles\": [xxx, ..., xxx],\n        # List[str], candidate text answers. If it is not a multiple-choice question or there are no text candidates, write None.\n        \"choices\": [xxx, ..., xxx],\n        # List[str], candidate image answers. The relative path of the image. If it is not a multiple-choice question or there are no image candidates, write None.\n        \"choices_image_path\": [xxx, ..., xxx],\n    }\n}\n```\n\n`NOTE 1`: The number of `\u003cimage\u003e` placeholders in the context and question equals the length of the `images_list`.\n\n`NOTE 2`: Save as a jsonl file, where each line is a `Dict`.\n\n\n## Contact\n- Weiyun Wang: wangweiyun@pjlab.org.cn\n- Wenhai Wang: wangwenhai@pjlab.org.cn\n- Wenqi Shao: shaowenqi@pjlab.org.cn\n\n## Acknowledgement\n\nThe multimodal haystack of MM-NIAH is built upon the documents from [OBELICS](https://github.com/huggingface/OBELICS).\nBesides, our project page is adapted from [Nerfies](https://github.com/nerfies/nerfies.github.io) and [MathVista](https://github.com/lupantech/MathVista).\n\nThanks for their awesome work!\n\n## Citation\n```BibTex\n@article{wang2024needle,\n  title={Needle In A Multimodal Haystack},\n  author={Wang, Weiyun and Zhang, Shuibo and Ren, Yiming and Duan, Yuchen and Li, Tiantong and Liu, Shuo and Hu, Mengkang and Chen, Zhe and Zhang, Kaipeng and Lu, Lewei and others},\n  journal={arXiv preprint arXiv:2406.07230},\n  
year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGVLab%2FMM-NIAH","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenGVLab%2FMM-NIAH","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenGVLab%2FMM-NIAH/lists"}