{"id":13653158,"url":"https://github.com/FudanDISC/ReForm-Eval","last_synced_at":"2025-04-23T06:31:20.859Z","repository":{"id":198143433,"uuid":"667289800","full_name":"FudanDISC/ReForm-Eval","owner":"FudanDISC","description":"An benchmark for evaluating the capabilities of large vision-language models (LVLMs)","archived":false,"fork":false,"pushed_at":"2023-11-17T13:02:05.000Z","size":10524,"stargazers_count":33,"open_issues_count":2,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-11-10T04:36:21.854Z","etag":null,"topics":["benchmark","embodied-ai","gpt4","in-context-learning","instruction-following","instruction-tuning","large-language-models","large-vision-language-models","llm","multimodal","multimodal-chain-of-thought","pre-training","reformulation","visual-chain-of-thought"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FudanDISC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-07-17T07:10:44.000Z","updated_at":"2024-10-31T13:48:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"63e9e3f9-d18f-4f52-a5c5-d7a331024a74","html_url":"https://github.com/FudanDISC/ReForm-Eval","commit_stats":null,"previous_names":["fudandisc/reform-eval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FudanDISC%2FReForm-Eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FudanDISC%2FReForm-Eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FudanDISC%2FReForm-Eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FudanDISC%2FReForm-Eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FudanDISC","download_url":"https://codeload.github.com/FudanDISC/ReForm-Eval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250384908,"owners_count":21421814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","embodied-ai","gpt4","in-context-learning","instruction-following","instruction-tuning","large-language-models","large-vision-language-models","llm","multimodal","multimodal-chain-of-thought","pre-training","reformulation","visual-chain-of-thought"],"created_at":"2024-08-02T02:01:06.553Z","updated_at":"2025-04-23T06:31:15.843Z","avatar_url":"https://github.com/FudanDISC.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1 style=\"display: inline-block; font-size: 48px;\"\u003eReForm-Eval\u003c/h1\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg 
src=\"https://avatars.githubusercontent.com/u/100903507?s=200\u0026v=4\" alt=\"Fudan Disc Logo\" style=\"display: inline-block; vertical-align: middle; height: 48px;\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Version-v1.0-Green\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Licence-Apache_2.0-Green\" /\u003e\n    \u003ca href=\"https://github.com/FudanDISC\"\u003e\u003cimg src=\"https://img.shields.io/badge/DISC-Repositories-blue\" /\u003e\u003c/a\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/FudanDISC/ReForm-Eval?label=Stars\" /\u003e\n    \u003ca href=\"https://hits.seeyoufarm.com\"\u003e\u003cimg src=\"https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FFudanDISC%2FReForm-Eval\u0026count_bg=%23D8659B\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=Visitors\u0026edge_flat=false\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://arxiv.org/pdf/2310.02569.pdf\"\u003e\u003cimg src=\"https://img.shields.io/badge/Paper-PDF-red\" /\u003e\u003c/a\u003e\n    \u003ca href=\"https://arxiv.org/abs/2310.02569\"\u003e\u003cimg src=\"https://img.shields.io/badge/Paper-Arxiv-red\" /\u003e\u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/datasets/Aweminus/ReForm-Eval-Data/tree/main\"\u003e\u003cimg src=\"https://img.shields.io/badge/🤗_Hugging_Face-Dataset-orange\" /\u003e\u003c/a\u003e\n    \u003ca href=\"https://drive.google.com/file/d/1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI/view\"\u003e\u003cimg src=\"https://img.shields.io/badge/Google_Drive-Dataset-orange?logo=googledrive\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003ch2 \u003eReForm-Eval: EVALUATING LARGE VISION LANGUAGE MODELS VIA UNIFIED RE-FORMULATION OF TASK-ORIENTED BENCHMARKS\u003c/h2\u003e\n\u003c/div\u003e\n\n\n\u003cp align=\"center\"\u003e\u003cstrong\u003eZejun Li\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e†\u003c/sup\u003e , Ye Wang\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e†\u003c/sup\u003e , Mengfei Du\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e†\u003c/sup\u003e , Qingwen Liu\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e†\u003c/sup\u003e , Binhao Wu\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e†\u003c/sup\u003e , Jiwen Zhang\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e†\u003c/sup\u003e , Chengxing Zhou\u003csup\u003e2\u003c/sup\u003e , Zhihao Fan\u003csup\u003e3\u003c/sup\u003e , Jie Fu\u003csup\u003e4\u003c/sup\u003e , Jingjing Chen\u003csup\u003e1\u003c/sup\u003e , Xuanjing Huang\u003csup\u003e1\u003c/sup\u003e , Zhongyu Wei\u003csup\u003e1\u003c/sup\u003e\u003csup\u003e*\u003c/sup\u003e.\n \u003c/strong\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003csup\u003e1\u003c/sup\u003eFudan University      \u003csup\u003e2\u003c/sup\u003eNortheastern University      \u003csup\u003e3\u003c/sup\u003eAlibaba Group        \u003csup\u003e4\u003c/sup\u003eHong Kong University of Science and Technology\u003c/p\u003e \n\u003cp align=\"center\"\u003e\u003csup\u003e†\u003c/sup\u003eEqual Contribution        \u003csup\u003e*\u003c/sup\u003eCorresponding Author\u003c/p\u003e \n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2310.02569v1\"\u003eReForm-Eval Paper\u003c/a\u003e | \u003ca href=\"https://huggingface.co/datasets/Aweminus/ReForm-Eval/tree/main\"\u003e🤗ReForm-Eval-Data\u003c/a\u003e | \u003ca 
href=\"https://drive.google.com/file/d/1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI/view\"\u003e☁️Google Drive\u003c/a\u003e\n\u003c/p\u003e\n\n\n\u003eRecent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from the strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, posing great challenges to automatically assess the freeform text output of LVLMs. To effectively leverage the annotations available in existing benchmarks and reduce the manual effort required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM compatible formats. Through systematic data collection and reformulation, we present the ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Based on ReForm-Eval, we conduct extensive experiments, thoroughly analyze the strengths and weaknesses of existing LVLMs, and identify the underlying factors. Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.\n\nWe explore ways of re-formulating existing benchmarks into unified formats that are compatible with LVLMs. \n\n\u003cp align=\"center\"\u003e\u003cimg src=\"./short.png\" /\u003e\u003c/p\u003e\n\n\u003cspan style=\"font-size:larger;\"\u003e**Existing LVLMs Evaluation:**\u003c/span\u003e\n\n- **No Quantification**: The capabilities of existing LVLMs are mainly demonstrated only by qualitative examples.\n- **Task-Oriented**: Most existing multi-modal benchmarks cannot be directly utilized to evaluate LVLMs since they are designed for specific tasks and rely on structured input-output formats for evaluation, even need to be fine-tuned or learn task-specific parameters.\n- **Limited Samples**: Limited manual annotation such as around 100 samples per dimension in **MME** and **MMBench** could potentially introduce evaluation bias into the results.\n\n\u003cspan style=\"font-size:larger;\"\u003e**Based on the re-formulation framework, we present our unified multi-modal benchmark, ReForm-Eval:**\u003c/span\u003e\n- **Larger Data Scale**: ReForm-Eval provides a dataset scale almost **100 times larger** than existing benchmarks, allowing models to be comprehensively evaluated across various dimensions.\n\n- **Without Manual Annotation**: ReForm-Eval leverages publicly open resources, reducing annotation costs while providing a larger-scale dataset.\n\n- **Universal Evaluation**: Unlike **LVLM-ehub** which requires designing complex and dataset-specific evaluation strategies, ReForm-Eval offers greater scalability and a more universally applicable and efficient evaluation approach.\n\n- **Comprehensive Evaluation**: We re-formulate **61 benchmark datasets** based on existing data resources, the evaluation dimensions range from basic visual perception to high-level visual reasoning and dialog.\n\n- **Unified Re-formulation**: Multi-modal benchmark datasets are re-formulated as **multiple-choice problems** or specialized **text generation problems**. Additionally, **generation-based black-box** and **likelihood-based white-box approaches** are implemented for evaluation.\n\nThe unified formulation enables universal and comprehensive evaluation. 
For each formulation, we design a consistent and reliable evaluation method. As mentioned in ([Fu et al., 2023](https://arxiv.org/abs/2306.13394)), current LVLMs may struggle to follow multiple-choice instructions, we propose both black-box and white-box approaches to assist: \n\n**(1)** Guiding LVLMs to output in desired formats through in-context learning; \n\n**(2)** Directly calculating the generation probability for options and selecting the one with the highest value. \n\nConsidering the sensitivity of LVLMs to the input prompts ([Zeng et al., 2023](https://arxiv.org/abs/2307.02469)), we additionally design an instability-aware evaluation strategy and introduce a metric to characterize such instability. \n\n**🔧🔧🔧 ReForm-Eval serves as a reliable tool for quantitative analysis of LVLMs, aiding in the research and development of LVLMs. 🔧🔧🔧**\n\n**🙌🙌🙌 We welcome a diverse range of large vision and language models to participate in ReForm-Eval benchmark evaluation!!! 🙌🙌🙌**\n\n## 📣 Update\n**If you have any questions, please send us an email or leave a github issue!**\n**`Email: yewang22@m.fudan.edu.cn`**\n\n- **[2023-11]** We added `BLEU`, `Meteor`, and `Rouge-L` metrics for the **Generation** task, and update `Ground IC15`, `FUNSD` dataset.\n- **[2023-10]** We released the initial version of the [ReForm-Eval](https://arxiv.org/abs/2310.02569), containing interfaces of 16 models and 61 converted reformulated datasets [🤗ReForm-Eval-Data](https://huggingface.co/datasets/Aweminus/ReForm-Eval-Data/tree/main)!\n\n## 📖 Contents\n- [Model Performance](#🦾-model-performance)\n- [Getting Start](#🔥-getting-start)\n  - [Install](#install)\n  - [Pipeline](#pipeline)\n  - [Load Data](#load-data)\n  - [Create Your Own Model Interface](#create-your-own-model-interface)\n- [Evaluation](#🚀-evaluation)\n  - [Demo](#demo)\n  - [Parameters](#parameters)\n  - [Model Usage](#model-usage)\n  - [Data Usage](#data-usage)\n  - [Output Result](#output-result)\n- [Citation](#🖋-citation)\n- [Acknowledgements](#🤝-acknowledgements)\n- [Related Projects](#🔏-related-projects)\n\n## 🦾 Model Performance\nWe list the average ranking and the score of the model under Generation Evaluation and Likelihood Evaluation in the table below. 
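To make the white-box idea in **(2)** concrete, here is a minimal, framework-agnostic sketch of likelihood-based option selection for a plain causal language model. It is an illustration only, with `gpt2` as a placeholder checkpoint; the actual per-model implementations live in `models/interfaces/` and are described later in this README.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint; any causal LM illustrates the idea
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def pick_option(prompt, candidates):
    """Return the index of the candidate with the lowest summed negative log-likelihood."""
    nlls = []
    for cand in candidates:
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + " " + cand, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, :prompt_len] = -100  # score only the candidate tokens
        loss = model(full_ids, labels=labels).loss  # mean NLL over candidate tokens
        nlls.append(loss.item() * (labels != -100).sum().item())  # back to a summed NLL
    return min(range(len(nlls)), key=nlls.__getitem__)

# e.g., pick_option("Question: What color is the truck? Answer:", ["orange", "blue"])
```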
Considering the sensitivity of LVLMs to the input prompts ([Zeng et al., 2023](https://arxiv.org/abs/2307.02469)), we additionally design an instability-aware evaluation strategy and introduce a metric to characterize such instability.

**🔧🔧🔧 ReForm-Eval serves as a reliable tool for quantitative analysis of LVLMs, aiding in the research and development of LVLMs. 🔧🔧🔧**

**🙌🙌🙌 We welcome a diverse range of large vision and language models to participate in ReForm-Eval benchmark evaluation!!! 🙌🙌🙌**

## 📣 Update
**If you have any questions, please send us an email or leave a GitHub issue!**
**`Email: yewang22@m.fudan.edu.cn`**

- **[2023-11]** We added `BLEU`, `Meteor`, and `Rouge-L` metrics for the **Generation** task, and updated the `Ground IC15` and `FUNSD` datasets.
- **[2023-10]** We released the initial version of [ReForm-Eval](https://arxiv.org/abs/2310.02569), containing interfaces for 16 models and 61 reformulated datasets [🤗ReForm-Eval-Data](https://huggingface.co/datasets/Aweminus/ReForm-Eval-Data/tree/main)!

## 📖 Contents
- [Model Performance](#🦾-model-performance)
- [Getting Started](#🔥-getting-started)
  - [Install](#install)
  - [Pipeline](#pipeline)
  - [Load Data](#load-data)
  - [Create Your Own Model Interface](#create-your-own-model-interface)
- [Evaluation](#🚀-evaluation)
  - [Demo](#demo)
  - [Parameters](#parameters)
  - [Model Usage](#model-usage)
  - [Data Usage](#data-usage)
  - [Output Result](#output-result)
- [Citation](#🖋-citation)
- [Acknowledgements](#🤝-acknowledgements)
- [Related Projects](#🔏-related-projects)

## 🦾 Model Performance
We list the average rank and score of each model under Generation Evaluation and Likelihood Evaluation in the table below.

**If you get results on our benchmark using the new LVLM interface, please contact us to add your model to this table.**
**`Email: yewang22@m.fudan.edu.cn`**

| Model          | Gen-Avg-Rank | Gen-Avg-Score | Like-Avg-Rank | Like-Avg-Score |
|----------------|--------------|---------------|---------------|----------------|
| **BLIP-2**     | *2.3*        | **62.94**     | 4.3           | 62.92          |
| **InstructBLIP_F** | **2.0**  | *60.77*       | 4.0           | 63.48          |
| **InstructBLIP_V** | 4.4      | 52.20         | 3.0           | *64.37*        |
| **LLaVA_V**    | 11.1         | 34.24         | 8.7           | 55.49          |
| **LLaVA_L2**   | 5.9          | 45.78         | 11.2          | 52.97          |
| **MiniGPT4**   | 7.3          | 43.12         | 7.8           | 56.15          |
| **mPLUG-Owl**  | 10.6         | 37.95         | 10.3          | 53.69          |
| **PandaGPT**   | 13.9         | 26.84         | 15.8          | 41.80          |
| **IB-LLM**     | 13.0         | 30.24         | 14.5          | 47.58          |
| **LA-V2**      | 12.5         | 32.60         | 12.2          | 50.00          |
| **mmGPT**      | 14.4         | 29.38         | 12.8          | 50.92          |
| **Shikra**     | 11.0         | 36.14         | 7.0           | 58.40          |
| **Lynx**       | 5.0          | 50.00         | *2.8*         | 63.93          |
| **Cheetor_V**  | 6.8          | 44.74         | 8.2           | 56.73          |
| **Cheetor_L2** | 7.9          | 41.75         | 10.7          | 52.43          |
| **BLIVA**      | 7.9          | 42.40         | **2.7**       | **64.92**      |

`Gen-Avg-Rank` and `Like-Avg-Rank` represent the average rank under Generation and Likelihood evaluation, while `Gen-Avg-Score` and `Like-Avg-Score` are the corresponding average scores. **Bold** marks the best result in each column and *italic* the second best.


## 🔥 Getting Started

### Install
**1. Git clone our repository and install the requirements via the following commands**
```bash
git clone https://github.com/FudanDISC/ReForm-Eval.git
cd ReForm-Eval
pip install -r requirements.txt
```

If you want to test all 16 existing models, you need to clone with submodules instead:
```bash
git clone https://github.com/FudanDISC/ReForm-Eval.git --recursive
cd ReForm-Eval
pip install -r requirements.txt
```

**2. Build from source**
```bash
git clone https://github.com/FudanDISC/ReForm-Eval.git
cd ReForm-Eval
pip install .
```

The advantage of building from source is that you can directly replace `python run_eval.py` and `python run_loader_eval.py` with the `run_eval` and `run_loader_eval` commands defined in the package config, and they can be executed from any path; the same holds for the dataloader function `load_reform_dataset`.

Open your shell configuration file.
```bash
vim ~/.bashrc
```
Add the following line at the end of the file:
```bash
export PYTHONPATH=/path/to/ReForm-Eval:$PYTHONPATH
```

**Note:** Once you use `run_eval` or `run_loader_eval` from other paths, the parameters related to file directories should be set to absolute paths.
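As a quick sanity check (our suggestion, not an official script), you can verify that the package resolves from an arbitrary directory; `load_reform_dataset` is introduced in [Load Data](#load-data):

```bash
# should print the confirmation from any working directory once PYTHONPATH is set
cd ~ && python -c "from build import load_reform_dataset; print('ReForm-Eval import OK')"
```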
### Pipeline
Our benchmark provides accuracy and instability as metrics for each task to quantify model performance. We provide two methods:

**(A)** Create the model interface in our framework and run it directly.

**(B)** Use the Data Loader we provide, output the inference results, and then evaluate them with our benchmark through a new script that takes the problem formulation and the output JSON file as input.

#### Method A

**Step 1:** Use an existing model interface or create a new one based on the ReForm-Eval framework; refer to [Create Your Own Model Interface](#create-your-own-model-interface).

**Step 2:** Create the conda env corresponding to the model and install the necessary packages.

**Step 3:** Switch to the corresponding conda env, run `run_eval.py` in the root path of this repository, and add the necessary parameters.

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name VisDial --output_dir output/lynx/VisDial/test_generation/ \
    --per_gpu_eval_batch_size 4 --formulation SingleChoice \
    --infer_method generation --do_eval --half_evaluation  --dataset_duplication 1 \
    --in_context_sample --option_mark upper \
    --dataset_config build/configs/VisDial_val_v1.2.yaml
```

**Step 4:** Check the inference progress and results in the terminal. The accuracy (and the format hit rate or instability) can also be viewed in `output_dir/log.txt`.

#### Method B

**Step 1:** Build a dataset using our Data Loader and process each sample into a string in the desired format of the corresponding model (see the example in [Load Data](#load-data)).

**Step 2:** Based on the dataset built in **Step 1**, let the model output a JSON file such as `/path/to/TDIUC_SingleChoice_likelihood_imagebindLLM_imagebindLLM.json`; a minimal sketch of these two steps follows.
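Below is a minimal sketch of Steps 1 and 2 put together. The `my_model.generate(image, prompt)` wrapper and the prompt template are placeholders for your own model; only the per-sample fields in the output JSON (see the note under Step 3) are prescribed by ReForm-Eval.

```python
import json
from build import load_reform_dataset

# Step 1: build the dataset with the Data Loader (arguments as in Load Data)
dataset = load_reform_dataset(
    dataset_name='TDIUC', formulation='SingleChoice',
    dataset_config='/path/to/ReForm-Eval/build/configs/TDIUC_scene.yaml',
    inference_method='generation', in_context_sample=True,
    random_instruct=True, data_duplication=5, shuffle_options=True,
)

# Step 2: run your model and dump the predictions in the required format
results = []
for sample in dataset:
    # assemble a dialogue-style prompt for your model (template is illustrative)
    prompt = f"{sample['instruct']}\nUser: <image> {sample['question_with_option']}\nBot: The answer is"
    prediction = my_model.generate(sample['image'], prompt)  # hypothetical wrapper
    results.append({
        'sample_id': sample['sample_id'],
        'answer': sample['answer'],
        'answer_options': sample['answer_options'],
        'prediction': prediction,  # a str like '(A) yes' for generation
    })

with open('TDIUC_SingleChoice_generation_mymodel_mymodel.json', 'w') as f:
    json.dump(results, f)
```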
**Step 3:** Run our new script `run_loader_eval.py`, taking the problem formulation and the output JSON file as the main input parameters.
```bash
python run_loader_eval.py --formulation SingleChoice --infer_method likelihood --eval_stability \
    --prediction_file test_output/SingleChoice/TDIUC_SingleChoice_likelihood_imagebindLLM_imagebindLLM.json
```

Or
```python
from run_loader_eval import loader_eval

results = loader_eval(formulation='SingleChoice',
            infer_method='likelihood',
            multi_round_eval=False,
            eval_stability=True,
            prediction_file='/path/to/TDIUC_SingleChoice_likelihood_imagebindLLM_imagebindLLM.json'
)
```

**Note:** There are four types of `Formulation`: `SingleChoice`, `Generation`, `OCROpenEnded`, and `KIEOpenEnded`. `eval_stability` and `multi_round_eval` can only be set when `--formulation SingleChoice`; that is, only SingleChoice can measure instability and be used for multi-round evaluation.

Notice that each sample in the output JSON is supposed to follow a specific format:
```python
{
  # dataset information
  'sample_id': 'VQA_0',
  'answer': 1,
  'answer_options': ['yes', 'no', 'maybe'],
  'prediction': '(A) yes' # the prediction
}
```

**Note:** During generation-based evaluation for multiple-choice questions, we only consider option marks in formats like (A), (a), or (1); if a prediction does not hit the format, it is considered wrong. A likelihood prediction is required to be an `int`, and a generation prediction a `str`.
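As an illustration of that hit format, a matcher along the following lines (a sketch only; the exact logic lives in our evaluation scripts) would accept `(A)`-style marks and reject free-form answers:

```python
import re

# accept option marks like (A), (a) or (1) at the start of a prediction
OPTION_MARK = re.compile(r'^\s*\(?([A-Za-z]|\d+)\)')

def extract_mark(prediction):
    """Return the option mark from a prediction such as '(A) yes', or None on a format miss."""
    match = OPTION_MARK.match(prediction)
    return match.group(1) if match else None

assert extract_mark('(A) yes') == 'A'
assert extract_mark('maybe yes') is None  # format miss: counted as wrong
```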
**Step 4:** The accuracy (and the format hit rate or instability) can be viewed in `output_dir/log.txt`.

### Load Data
There are two ways to load data: using our framework directly, or using the Data Loader.

**We most recommend using the Hugging Face data, which we call ReForm-Eval-Data. Below we introduce how to load ReForm-Eval-Data from the Hugging Face Hub or from a local path. If this still does not work, we also provide other loading methods; please refer to [Prepare Dataset](build/prepare_dataset.md#📥-prepare-dataset) for more details.**

Here is the Google Drive link for ReForm-Eval-Data; you can directly download it and load it from a local path!

**download URL**

[https://drive.google.com/file/d/1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI/view](https://drive.google.com/file/d/1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI/view)

**wget**
```
wget "https://drive.google.com/uc?export=download&id=1GjWvm0f6fkJ7VFySKyEfb2N_KyZxcdyI"
```

#### Using the ReForm-Eval Framework
If you load data through the ReForm-Eval framework, then when running `run_eval.py` and `run_loader_eval.py` you should set the data-related parameters, including `--dataset_name`, `--formulation`, `--dataset_config`, `--dataset_duplication`, `--in_context_sample`, and `--capitalize`.

**Please set `--hf` or `--offline_hf` if you would like to load ReForm-Eval-Data: `--hf` loads from the Hugging Face Hub, while `--offline_hf` loads ReForm-Eval-Data from a local path. If both are set, data will be loaded from the Hugging Face Hub.**

#### Using the Data Loader
ReForm-Eval provides a direct data loader if you would like to perform evaluation without our framework. Here is an example:
```python
from build import load_reform_dataset

# example for loading VQA v2
dataset = load_reform_dataset(
    # dataset config, please check Data Usage for available arguments
    dataset_name='VQA',
    formulation='SingleChoice',
    dataset_config='/path/to/ReForm-Eval/build/configs/VQA_vqa_v2_val.yaml',
    inference_method='generation', # inference method, generation / likelihood
    in_context_sample=True, # whether to include an in-context sample
    random_instruct=True, # whether to use different instructions for the same sample
    data_duplication=5, # number of repeated tests for the same sample
    shuffle_options=True, # whether to shuffle the options for the same sample
    load_from_hf=True, # (Optional) whether to load from Hugging Face
    option_mark='upper', # (Optional) the option mark to use, number/upper/lower/random
    offline_from_hf=False # (Optional) whether to load the Hugging Face data from a local path
)
```
Notice that each sample of the loaded dataset will be a dict containing all information, like:
```
{
    'sample_id': 'VQA_000',
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x484>,
    'question': 'Is there a cat in the image?',
    'answer': 2,
    'answer_options': ['yes', 'no', 'maybe'],
    'instruct': 'Based on the image, answer the question with the provided options.',
    'question_with_option': 'Is there a cat in the image? Options: (A) yes; (B) no; (C) maybe.'
}
```
You may need to process these fields into a string in the desired format; see the sketch at the end of this section. You may also be interested in the [Preprocessors](models/prepare_models.md#preprocessors) we used in ReForm-Eval to gather the information into a dialogue-like string as the input for your model. All valid datasets and their corresponding arguments are listed in [Data Usage](#data-usage).

**Please set `load_from_hf=True` or `offline_from_hf=True` if you would like to load ReForm-Eval-Data: `load_from_hf=True` loads from the Hugging Face Hub, while `offline_from_hf=True` loads ReForm-Eval-Data from a local path. If both are `True`, data will be loaded from the Hugging Face Hub.**
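For reference, a minimal helper that gathers these fields into a dialogue-like string might look as follows. This is a sketch mirroring the preprocessor output shown later in Step 4 of the next section; the exact template depends on your model.

```python
def to_prompt(sample):
    """Gather the loaded fields into a dialogue-like input string (illustrative template)."""
    return (
        f"{sample['instruct']}\n"
        f"User: <image> {sample['question_with_option']}\n"
        f"Bot: The answer is"
    )

# For the sample above, this yields:
# Based on the image, answer the question with the provided options.
# User: <image> Is there a cat in the image? Options: (A) yes; (B) no; (C) maybe.
# Bot: The answer is
```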
### Create Your Own Model Interface
To add new models, you need to create a corresponding model interface for the unified evaluation. For a general new model interface, please refer to the interface template in `/path/to/ReForm-Eval/models/interfaces/base_interface.py`. Here we provide a step-by-step guide for the convenience of your implementation (taking Lynx as an example).

#### Step 1: Configure the Code Path
Add the Lynx project as a submodule to `/path/to/ReForm-Eval/models/interfaces/`:
```bash
cd models/interfaces
git submodule add https://github.com/bytedance/lynx-llm.git
```

#### Step 2: Model Loading
Refer to the code for loading the model in the original Lynx project.
```python
def main(args, config):
    print("### Evaluating", flush=True)
    device = torch.device(args.device)

    seed = args.seed + utils.get_rank()
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    cudnn.benchmark = True

    print("config:", json.dumps(config), flush=True)
    print("output_path, ", args.output_path, flush=True)

    print("### Creating model", flush=True)
    from models.lynx import LynxBase
    model = LynxBase(config=config, freeze_vit=config['freeze_vit'], freeze_llm=config['freeze_llm'], load_bridge=False)
```

So, we can implement the `__init__` function for model loading in our interface:
```python
class Lynx_Interface(nn.Module):
    def __init__(self, model_config=None, device=None, half=False, inference_method='generation') -> None:
        super(Lynx_Interface, self).__init__()
        # setup the model device
        if device is None:
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        else:
            self.device = torch.device(device)
        
        # loading the model
        self.config = yaml.load(open(model_config, 'r'), Loader=yaml.Loader)
        self.model = LynxBase(config=self.config, freeze_vit=self.config['freeze_vit'], freeze_llm=self.config['freeze_llm'], load_bridge=False)
        
        # cast the model to half precision and move it to the target device if needed
        self.prec_half = half
        if self.prec_half:
            self.model = self.model.half()
        self.model = self.model.to(self.device)
        
        # setup the inference method
        self.inference_method = inference_method
```

#### Step 3: Implement the Inference Function
**Generation-based Black-Box Evaluation**

We provide the black-box generation-based inference method.
```
Black-box Generation-based Inference Method

Args:
    image (list[PIL.Image]):
        The batch of input images. Each element is loaded as PIL.Image.
    prompt (list[str]):
        The batch of input textual prompts. Prompts should be formulated as a dialogue by the
        model preprocessor (see utils/preprocessors.py)
    temperature (float, **optional**):
        A generation-related parameter: the temperature parameter in the generation process
        of language models.
    max_new_tokens (int, **optional**):
        A generation-related parameter: the maximal number of tokens a model can generate.
        
Returns:
    outputs (list[str]):
        The generated output response in text.

```

An example is provided below:

```python
>>> # An example of VQA for LLaVA
>>> from models.interfaces.llava_interface import LLaVA_Interface
>>> from PIL import Image

>>> image = Image.open(PATH_TO_IMAGE).convert('RGB')
>>> model = LLaVA_Interface(PATH_TO_LLAVA, device='cuda:0')

>>> prompt = "A chat between a curious human and an artificial intelligence assistant. The\
              assistant gives helpful, detailed, and polite answers to the human's questions.\
              ###Human: <image>\n Can you see the Image? Options: (A) yes; (B) no.\
              ###Assistant: The answer is (A) yes.\
              ###Human: What color is the truck? Options: (A) blue; (B) orange.\
              ###Assistant: The answer is"

>>> # Generation-based Inference
>>> outputs = model.raw_batch_generate([image], [prompt])
>>> outputs
['(B) orange.']
```

Then, find the generation-related code in the original Lynx project.
```python
@torch.no_grad()
def evaluation(model, data_loader, device, config):
    # test
    model.eval()
    result = []

    for n, (idx, vision_input, input_ids, input_atts) in enumerate(data_loader):
        vision_input = vision_input.to(device, non_blocking=True)
        input_ids = input_ids.to(device)
        input_atts = input_atts.to(device)

        text_outputs = model.generate(
            vision_input=vision_input,
            input_ids=input_ids, input_atts=input_atts,
            use_nucleus_sampling=config.get('use_nucleus_sampling', False),
            apply_lemmatizer=config['apply_lemmatizer'],
            num_beams=config['num_beams'],
            min_length=config['min_length'],
            length_penalty=config.get('length_penalty', 1.0),
            no_repeat_ngram_size=config.get('no_repeat_ngram_size', -1),
            top_p=config.get('top_p', 0.9),
            top_k=config.get('top_k', 3),
            max_new_tokens=config.get('max_new_tokens', 64))

        for i, output in zip(idx, text_outputs):
            result.append({"index": i, "text_output": output.strip()})

    return result
```

Therefore, in `lynx_interface.py`, we can implement the generation inference function as:
```python
    @torch.no_grad()
    def raw_generate(self, image, prompt, temperature=1, max_new_tokens=30):
        vision_input = self.load_vision_inp(image).unsqueeze(0)
        if self.prec_half:
            vision_input = vision_input.to(torch.float16)
        
        input_ids, input_atts = self.process_text(prompt)
        
        answer = self.model.generate(
            vision_input=vision_input,
            input_ids=input_ids, input_atts=input_atts,
            use_nucleus_sampling=self.config.get('use_nucleus_sampling', False),
            apply_lemmatizer=self.config['apply_lemmatizer'],
            num_beams=3, # self.config['num_beams'],
            min_length=self.config['min_length'],
            length_penalty=self.config.get('length_penalty', 1.0),
            no_repeat_ngram_size=self.config.get('no_repeat_ngram_size', -1),
            top_p=self.config.get('top_p', 0.9),
            top_k=self.config.get('top_k', 3),
            max_new_tokens=max_new_tokens,
            temperature=temperature)

        return answer[0]
```

In this function, you have to use the internal vision processor to get the vision input (open and preprocess the image), and the internal tokenizer to get the `input_ids` and `input_atts`.
All of this code can be directly found in and adapted from the original project.
```python
    def load_vision_inp(self, vision_inp):
        if vision_inp is None:
            return None

        elif isinstance(vision_inp, list) or isinstance(vision_inp, np.ndarray):
            return self._get_frames(vision_inp)

        elif isinstance(vision_inp, str):

            if os.path.exists(vision_inp):
                image = Image.open(vision_inp).convert('RGB')

            else:  # base64 encoding
                try:
                    image = Image.open(io.BytesIO(b64decode(vision_inp))).convert("RGB")
                except Exception as e:
                    raise ValueError(f"invalid vision input (not an existing path or base64 string): {vision_inp} {e}")
        else:
            image = vision_inp
        
        image = self.img_transform(image)

        return image.to(self.device)
    
    def process_text(self, text):
        text = text.strip()
        if self.lower_text:
            text = text.lower()
        input_ids = [self.tokenizer.bos_token] + self.tokenizer.tokenize(text)
        input_ids = self.tokenizer.convert_tokens_to_ids(input_ids)
        input_atts = torch.LongTensor([[1]*len(input_ids)])
        input_ids = torch.LongTensor([input_ids])
        return input_ids.to(self.device), input_atts.to(self.device)
```

**Likelihood-based White-Box Evaluation**

We provide the white-box likelihood-based inference method.
```
White-box Likelihood-based Inference Method

Args:
    image (list[PIL.Image]):
        The batch of input images. Each element is loaded as PIL.Image.
    prompt (list[str]):
        The batch of input textual prompts. Prompts should be formulated as a dialogue by the
        model preprocessor (see utils/preprocessors.py)
    candidates (list[list[str]]):
        The list of candidate lists, each element (candidates[i]) is the candidate list
        of the corresponding question.
        
Returns:
    outputs (list[int]):
        The generated output prediction index. Each element (outputs[i]) is the selected index
        of the corresponding candidates. The prediction is therefore (candidates[i][outputs[i]])
```

Here is an example:
```python
>>> # An example of VQA for LLaVA
>>> from models.interfaces.llava_interface import LLaVA_Interface
>>> from PIL import Image

>>> image = Image.open(PATH_TO_IMAGE).convert('RGB')
>>> model = LLaVA_Interface(PATH_TO_LLAVA, device='cuda:0')

>>> prompt = "A chat between a curious human and an artificial intelligence assistant. The\
              assistant gives helpful, detailed, and polite answers to the human's questions.\
              ###Human: What color is the truck?\
              ###Assistant:"
>>> candidates = ['orange', 'blue']

>>> # Likelihood-based Inference
>>> outputs = model.raw_batch_predict([image], [prompt], [candidates])
>>> outputs
[1]
```

To support the likelihood evaluation, we add the following function in our model file `/path/to/ReForm-Eval/models/interfaces/lynx/models/lynx.py` to calculate the loss (negative log-likelihood) for each sequence.
```python
    def forward_likelihood(self, vision_input, input_ids, input_atts, labels, likelihood_reduction='sum'):
        text_embeds = self.embed_tokens(input_ids)

        if vision_input is not None:
            vision_embeds, vision_atts = self.get_vision_embeds(vision_input)
            v2t_feats, v2t_atts = self.bridge(vision_embeds=vision_embeds, vision_atts=vision_atts)

            inputs_embeds = torch.cat([v2t_feats, text_embeds], dim=1)
            attention_mask = torch.cat([v2t_atts, input_atts], dim=1)

        else:
            inputs_embeds = text_embeds
            attention_mask = input_atts

        outputs = self.LLM(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            labels=labels,
            return_dict=True,
            reduction='none'
        )
        loss = outputs.loss.reshape(inputs_embeds.shape[0], -1)
        if likelihood_reduction == 'sum':
            loss = loss.sum(1)
        elif likelihood_reduction == 'mean':
            valid_num_targets = (loss > 0).sum(1)
            loss = loss.sum(1) / valid_num_targets
        elif likelihood_reduction == 'none':
            pass
        else:
            raise ValueError
        return loss
```

Hence, in `lynx_interface.py`, we can call `self.model.forward_likelihood` in the `raw_predict` function.
```python
    def raw_predict(self, image, prompt, candidates, likelihood_reduction='sum'):
        # loading the image-text pair
        vision_input = self.load_vision_inp(image).unsqueeze(0)
        if self.prec_half:
            vision_input = vision_input.to(torch.float16)
        
        input_ids, attention_mask = self.process_text(prompt)
        
        # get the embedding from the input
        num_cand = len(candidates)
        input_seq_len = input_ids.shape[1]

        # tokenize the candidates
        current_padding_side = self.tokenizer.padding_side
        current_truncation_side = self.tokenizer.truncation_side
        self.tokenizer.padding_side = 'right'
        self.tokenizer.truncation_side = 'right'
        if self.lower_text:
            candidates = [cand.lower() for cand in candidates]
        candidates_tokens = self.tokenizer(
            candidates,
            return_tensors='pt',
            padding='longest'
        ).to(self.device)
        self.tokenizer.padding_side = current_padding_side
        self.tokenizer.truncation_side = current_truncation_side

        # construct the input_ids and LM targets
        candidates_ids = candidates_tokens.input_ids[:, 1:] # remove the <s> token
        candidates_att = candidates_tokens.attention_mask[:, 1:] # remove the <s> token
        # mask <pad> positions in the LM targets with -100
        cand_targets = candidates_ids.clone()
        cand_targets = cand_targets.masked_fill(cand_targets == self.tokenizer.pad_token_id, -100)
        # mask the targets for the input part
        targets = torch.cat([-100*torch.ones(num_cand, input_seq_len+self.config["num_bridge_tokens"], dtype=torch.long, device=self.device), \
                             cand_targets], dim=1)
        # concatenate the inputs for the model
        attention_mask = torch.cat([attention_mask.repeat_interleave(num_cand, dim=0), candidates_att], dim=1)
        full_input_ids = torch.cat([input_ids.repeat_interleave(num_cand, dim=0), candidates_ids], dim=1)
        
        # calculate the loss (neg-log likelihood) for each candidate
        with torch.inference_mode():
            outputs = self.model.forward_likelihood(
                vision_input=vision_input.repeat_interleave(num_cand, dim=0),
                input_ids=full_input_ids,
                input_atts=attention_mask,
                labels=targets,
                likelihood_reduction=likelihood_reduction
            )
        neg_likelihood = outputs
        # select the one with the highest likelihood / lowest loss
        output_class_ranks = torch.argsort(neg_likelihood, dim=-1)[0].item()

        return output_class_ranks
```

#### Step 4: Implement the Preprocessor
Preprocessors are used to formulate the structural information into the correct form of dialogue. Our preprocessor is in `/path/to/ReForm-Eval/utils/preprocessors.py`.
```python
class ConvSingleChoiceProcessor(object):
    def __init__(self, sep, sep2=None, roles=['Question', 'Answer'], system_msg=None, first_query_fn=None, \
                 init_conv=None, sep_style='two', alphabet_choice=None, infer_method='generation', response_prefix=None):
        """
        Preprocessors to convert input information into a dialogue string
        
        Args:
            sep (str):
                The text separator-1.
            sep2 (str):
                The text separator-2.
            roles (list[str]):
                Role names of the dialogue, roles[0] is the role of users while 
                roles[1] is the name of assistants.
            system_msg (str, **optional**):
                The system message that appears at the beginning.
            first_query_fn (function, **optional**):
                The function to process the first query, mainly for adding <img> marks.
            init_conv (list[list[str]]):
                The initial conversation. Each element is a list[str, str] where the first
                is the role name and the second is the message. 
            sep_style (str):
                The dialogue style.
            alphabet_choice (str, **optional**):
                The option mark used for multiple-choice questions, defaults to "random"
            infer_method (str, **optional**):
                The inference method ("generation" or "likelihood")
            response_prefix (str, **optional**):
                The prefix text for the response of LVLM assistants, we use "The answer is"
                to help with multiple-choice questions.
                
        Returns:
            output (str):
                The constructed dialogue text.
        """
```

Here is an example of the `\n`-separated preprocessor:
```python
proc = ConvSingleChoiceProcessor('\n', roles=['User', 'Bot'], first_query_fn=lambda x: "<image> "+x,
                                 sep_style='one', infer_method=model_args['inference_method'], response_prefix='The answer is',
                                 system_msg="A chat between a curious human and an artificial intelligence assistant. "
                                            "The assistant gives helpful, detailed, and polite answers to the human's questions.")
```

The input sample is a json-style dict:
```
inputs = {'sample_id': '287626_3',
 'round_id': 3,
 'image': 'IMAGE_PATH.jpg',
 'question': 'Is there a cat in the image?',
 'answer': '2',
 'answer_options': ['yes', 'no', 'maybe'],
 'history': [{'from': 'human', 'value': 'Can you see the image? Options: (A) yes; (B) no'},
             {'from': 'assistant', 'value': 'The answer is (A) yes'}]
}
```

Therefore, the final content will be:
```
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
User: <image> Can you see the image? Options: (A) yes; (B) no.\n
Bot: The answer is (A) yes\n
User: Is there a cat in the image? Options: (A) yes; (B) no; (C) maybe.\n
Bot: The answer is
```

For other supported `sep_style` values, please refer to `/path/to/ReForm-Eval/utils/preprocessors.py`.
`init_conv` can also be used to add `<image>` marks; if it is `init_conv=[['User', "<image>"]]`, this means that a new conversation will be started:

```
User: <image>
User: ......
Bot: ......
```

#### Step 5: Add Model Loader
Implement the model loading function in `/path/to/ReForm-Eval/models/interfaces/lynx_interface.py`.
```python
def get_lynx(model_config=None):
    model_args = {}
    # map the general input arguments to the model-specific arguments
    if model_config is not None:
        valid_args = ['model_name', 'device', 'half', 'inference_method']
        target_args = ['model_config', 'device', 'half', 'inference_method']
        for i, arg in enumerate(valid_args):
            if arg in model_config:
                model_args[target_args[i]] = model_config[arg]
    # configure the dialogue preprocessor
    proc = ConvSingleChoiceProcessor('\n', roles=['User', 'Bot'], \
                                     sep_style='one', infer_method=model_args['inference_method'], response_prefix='The answer is')
    return Lynx_Interface(**model_args), proc
```

Additionally, you should add the following code in `/path/to/ReForm-Eval/models/__init__.py`.
```python
    elif model_name == 'lynx':
        from .interfaces.lynx_interface import get_lynx
        return get_lynx(model_config)
```

#### Done!
Finally, you can use the following model arguments in the main entry point to evaluate your model!
```bash
--model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml
```

If you have trouble incorporating new models into our framework, please let us know through GitHub issues or email. For more details about models and preprocessors, please refer to [Prepare Models](models/prepare_models.md#🤖-prepare-models).

## 🚀 Evaluation
Our benchmark supports multi-GPU evaluation. With half-precision evaluation enabled, a 7B model can be evaluated on a single card within 24 GB of CUDA memory, even under limited equipment conditions.

### Demo
We provide an example of running the benchmark test, using the Lynx model for VisDial evaluation.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name VisDial --output_dir output/lynx/VisDial/test_generation/ \
    --per_gpu_eval_batch_size 4 --formulation SingleChoice \
    --infer_method generation --do_eval --half_evaluation  --dataset_duplication 1 \
    --in_context_sample --option_mark upper \
    --dataset_config build/configs/VisDial_val_v1.2.yaml
```

- The number of processes in `--nproc_per_node` must be equal to the number of devices in `CUDA_VISIBLE_DEVICES`.
- `--output_dir` is the path for the output results.
- `--formulation` must be `Generation`, `SingleChoice`, `OCROpenEnded`, or `KIEOpenEnded`.
- `--infer_method` must be `generation` or `likelihood`.
- If you infer in generation mode, you should use `--in_context_sample` to help models generate option marks for most questions.
- `--dataset_config` is the path of the dataset config file.
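For reference, a single-GPU half-precision variant of the same demo (a sketch by analogy that we have not benchmarked; only the device count and batch size change):

```bash
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name VisDial --output_dir output/lynx/VisDial/test_generation/ \
    --per_gpu_eval_batch_size 1 --formulation SingleChoice \
    --infer_method generation --do_eval --half_evaluation --dataset_duplication 1 \
    --in_context_sample --option_mark upper \
    --dataset_config build/configs/VisDial_val_v1.2.yaml
```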

### Parameters
All parameters used are listed below; you can modify any of them to customize your evaluation settings.

```python
def main():
    parser = argparse.ArgumentParser()
    # model-related parameters
    parser.add_argument('--model', type=str, default=None, help='the model family name')
    parser.add_argument('--model_name', type=str, default=None, help='the model name to load')
    parser.add_argument('--model_type', type=str, default=None, help='the model type to set')
    # dataset-related parameters
    parser.add_argument('--dataset_name', type=str, default=None, help='the dataset name to evaluate on')
    parser.add_argument('--formulation', type=str, default=None, help='the problem formulation to perform, must be in ("SingleChoice", "Generation", "OCROpenEnded", "KIEOpenEnded")')
    parser.add_argument('--dataset_config', type=str, default=None, help='the config file path; the default path is used if not explicitly specified')
    parser.add_argument('--dataset_duplication', type=int, default=1, help='duplicate the samples for evaluating the stability')
    parser.add_argument('--in_context_sample', action='store_true', help='whether to provide in-context-learning samples')
    parser.add_argument('--capitalize', action='store_true', help='whether to capitalize the qa')
    # instruction-related parameters
    parser.add_argument('--yesno_instruct', action='store_true', help='whether to add "please answer yes or no" to the full instruction')
    parser.add_argument('--answer_space_instruct', action='store_true', help='whether to add the answer space to the full instruction')
    # running parameters
    parser.add_argument('--per_gpu_eval_batch_size', type=int, default=1, help='the batch size per GPU')
    parser.add_argument('--num_workers', type=int, default=4, help='workers in dataloader')
    parser.add_argument('--half_evaluation', action='store_true', help='whether to use half precision for evaluation')
    # general evaluation setup
    parser.add_argument('--do_eval', action='store_true', help='whether to evaluate the output')
    parser.add_argument('--eval_stability', action='store_true', help='whether to evaluate the stability')
    # parameters for model generation
    parser.add_argument('--temperature', type=float, default=None, help='the temperature for generation')
    parser.add_argument('--max_new_tokens', type=int, default=None, help='max new tokens to generate')
    # parameters for likelihood measurement
    parser.add_argument('--likelihood_reduction', type=str, default=None, help='the reduction method for likelihood measurement')
    # parameters for SingleChoice problems
    parser.add_argument('--infer_method', type=str, default='generation', help='the inference method to use, must be in ["generation", "likelihood"]')
    parser.add_argument('--option_mark', type=str, default=None, help='the index mark for options in single-choice questions, \
                        "number" for (1,2,3,4), "lower" for (a,b,c,d) while "upper" for (A,B,C,D)')
    # parameters for randomness control
    parser.add_argument('--random_instruct', action='store_true', help='whether to use random instructions')
    parser.add_argument('--shuffle_options', action='store_true', help='whether to shuffle options')
    # parameters for multi-round problems
    parser.add_argument('--options_in_history', action='store_true', help='whether to put options in history')
    parser.add_argument('--online_multi_round', action='store_true', help='make online updates to the history during dialog')
    parser.add_argument('--multi_round_eval', action='store_true', help='whether to evaluate multi-round performance')
    # output setup
    parser.add_argument('--output_dir', type=str, default='./output/', help='the path to save the output')
    # debug mode
    parser.add_argument('--dataset_debug', action='store_true', help='debug on the dataset setup')
    parser.add_argument('--dataset_subsample', type=int, default=None, help='only use n sub-samples of the dataset')
    # core datasets
    parser.add_argument('--core_eval', action='store_true', help='only eval on the core datasets')
    # hugging face
    parser.add_argument('--hf', action='store_true', help='whether to load the dataset directly from Hugging Face')
    parser.add_argument('--offline_hf', action='store_true', help='whether to load the Hugging Face data from the local path')
    args = parser.parse_args()
```

### Model Usage
When running the evaluation, these model-related parameters must be applied for specific models.

**Some models require an additional `forward_likelihood` function; please refer to `Likelihood-based White-Box Evaluation` in [Create Your Own Model Interface](#create-your-own-model-interface).**

We only list a few examples for BLIP-2 and InstructBLIP here. For the remaining models, please refer to the [Complete Model Usage](models/complete_model_usage.md#complete-model-usage).

#### BLIP-2 + InstructBLIP
```bash
# BLIP-2 flant5
--model blip2  --model_name blip2_t5  --model_type pretrain_flant5xl
# InstructBLIP flan-t5
--model blip2  --model_name blip2_t5_instruct  --model_type flant5xl
# InstructBLIP vicuna
--model blip2  --model_name blip2_vicuna_instruct  --model_type vicuna7b
```
You also have to put the `bert-base-uncased` and `google/flan-t5-xl` folders in the root directory of our repository.
```
|-- ReForm-Eval
    |-- bert-base-uncased
    |-- google
        |-- flan-t5-xl
        ...
    |-- build
    |-- commands
    |-- metrics
    |-- models
    ...
```
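If these folders are not yet available locally, one way to fetch the checkpoints (an assumption on our part; any way of placing them in the layout above works) is to clone them from the Hugging Face Hub with Git LFS:

```bash
# assumes git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/bert-base-uncased
git clone https://huggingface.co/google/flan-t5-xl google/flan-t5-xl
```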
torch.cat(results, dim=0)\n            except:\n                results = [res.tolist()[0] for res in results]\n\n            return results\n\n        return self._predict_class(samples, candidates, n_segments)\n\n    def _predict_class(\n        self,\n        samples,\n        candidates,\n        n_segments=1,\n    ):\n        \"\"\"\n        Args:\n            samples (dict): A dictionary containing the following keys:\n                - image (torch.Tensor): A tensor of shape (batch_size, 3, H, W)\n                - prompt: the instruction\n            candidates:\n                (list): A list of candidate class names;\n            n_segments:\n                (int): Split the candidates into n_segments and predict one by one. This is useful when the number of candidates is too large.\n        Returns:\n            output_class: predicted class index\n        \"\"\"\n\n        image = samples[\"image\"]\n        prompt = samples[\"prompt\"]\n\n        bs = image.size(0)\n\n        if isinstance(prompt, str):\n            prompt = [prompt] * bs\n        else:\n            assert len(prompt) == bs, \"The number of prompts must be equal to the batch size.\"\n\n        if \"text_input\" in samples.keys():\n            if type(samples[\"text_input\"][0]) == list:\n                prompt = [prompt[i].format(*samples[\"text_input\"][i]) for i in range(len(prompt))]\n            else:\n                prompt = [prompt[i].format(samples[\"text_input\"][i]) for i in range(len(prompt))]\n\n        # scienceqa\n        if 'context' in samples.keys() and samples['context'] != '':\n            prompt = [f'context: {samples[\"context\"][i]}. {prompt[i]}' for i in range(len(prompt))]\n\n        # visual dialog\n        if 'history' in samples.keys() and samples['history'][0] != '':\n            prompt = [f'dialog history: {samples[\"history\"][i]}\\n{prompt[i]}' for i in range(len(prompt))]\n\n        if 'caption' in samples.keys() and samples['caption'][0] != '':\n            prompt = [f'This image has the caption \"{samples[\"caption\"][i]}\". 
{prompt[i]}' for i in range(len(prompt))]\n\n        query_tokens = self.query_tokens.expand(bs, -1, -1)\n \n        if image.dim() == 5:\n            inputs_t5, atts_t5 = [], []\n            for j in range(image.size(2)):\n                this_frame = image[:,:,j,:,:]\n                with self.maybe_autocast():\n                    frame_embeds = self.ln_vision(self.visual_encoder(this_frame))\n                    frame_atts = torch.ones(frame_embeds.size()[:-1], dtype=torch.long).to(image.device)\n\n                frame_query_output = self.Qformer.bert(\n                    query_embeds=query_tokens,\n                    encoder_hidden_states=frame_embeds,\n                    encoder_attention_mask=frame_atts,\n                    return_dict=True,\n                )\n\n                frame_inputs_t5 = self.t5_proj(frame_query_output.last_hidden_state[:,:query_tokens.size(1),:])\n                frame_atts_t5 = torch.ones(frame_inputs_t5.size()[:-1], dtype=torch.long).to(image.device)\n                inputs_t5.append(frame_inputs_t5)\n                atts_t5.append(frame_atts_t5)\n            inputs_t5 = torch.cat(inputs_t5, dim=1)\n            atts_t5 = torch.cat(atts_t5, dim=1)\n        else:\n            with self.maybe_autocast():\n                image_embeds = self.ln_vision(self.visual_encoder(image))\n            image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)\n\n            query_output = self.Qformer.bert(\n                query_embeds=query_tokens,\n                encoder_hidden_states=image_embeds,\n                encoder_attention_mask=image_atts,\n                return_dict=True,\n            )\n\n            inputs_t5 = self.t5_proj(query_output.last_hidden_state[:,:query_tokens.size(1),:])\n            atts_t5 = torch.ones(inputs_t5.size()[:-1], dtype=torch.long).to(image.device)\n\n        input_tokens = self.t5_tokenizer(\n            prompt, padding=\"longest\", return_tensors=\"pt\"\n        ).to(image.device)\n        output_tokens = self.t5_tokenizer(\n            candidates, padding=\"longest\", return_tensors=\"pt\"\n        ).to(image.device)\n\n        encoder_atts = torch.cat([atts_t5, input_tokens.attention_mask], dim=1)\n\n        n_cands = len(candidates)\n\n        with self.maybe_autocast(dtype=torch.bfloat16):\n            inputs_embeds = self.t5_model.encoder.embed_tokens(input_tokens.input_ids)\n            inputs_embeds = torch.cat([inputs_t5, inputs_embeds], dim=1)\n\n            encoder_outputs = self.t5_model.encoder(\n                inputs_embeds=inputs_embeds,\n                attention_mask=encoder_atts,\n            )\n\n            all_losses = []\n            for n in range(n_segments):\n                seg_len = n_cands // n_segments\n                if n == (n_segments - 1):\n                    seg_len = n_cands - seg_len * (n_segments - 1)\n\n                # this_encoder_outputs = copy.deepcopy(encoder_outputs)\n                this_encoder_outputs = BaseModelOutput(\n                    last_hidden_state=encoder_outputs[0].clone(),\n                )\n\n                this_encoder_outputs['last_hidden_state'] = this_encoder_outputs[0].repeat_interleave(seg_len, dim=0)\n                this_encoder_atts = encoder_atts.repeat_interleave(seg_len, dim=0)\n\n                start_i = n * (n_cands // n_segments)\n                end_i = start_i + seg_len\n                this_output_tokens_ids = output_tokens.input_ids[start_i:end_i].repeat(bs, 1)\n                this_output_tokens_atts = 
output_tokens.attention_mask[start_i:end_i].repeat(bs, 1)\n\n                this_targets = this_output_tokens_ids.masked_fill(this_output_tokens_ids == self.t5_tokenizer.pad_token_id, -100)\n\n                outputs = self.t5_model(\n                    encoder_outputs=this_encoder_outputs,\n                    attention_mask=this_encoder_atts,\n                    decoder_attention_mask=this_output_tokens_atts,\n                    return_dict=True,\n                    labels=this_targets,\n                    reduction=\"none\",\n                )\n                loss = outputs.loss\n\n                loss = loss.reshape(bs, seg_len)\n                # output_class_ranks = torch.argsort(loss, dim=-1)\n                all_losses.append(loss)\n\n            all_losses = torch.cat(all_losses, dim=-1)\n            output_class_ranks = torch.argsort(all_losses, dim=-1)\n\n        return output_class_ranks\n```\n\nThen, run the following commands to install the modified LAVIS package so that the modification takes effect:\n```\ncd models/LAVIS\npip install -e .\n```
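\n\nWith the patch installed, the snippet below is a minimal usage sketch (not part of the repository) of the loss-based candidate ranking implemented above. It assumes the patched model exposes the method as `predict_class` (the public wrapper around `_predict_class` shown above) and uses LAVIS's `load_model_and_preprocess` loader; the model name, the dummy image, and the candidate strings are purely illustrative.\n```python\n# Minimal usage sketch (illustrative, not from the repository).\n# Assumes the patched LAVIS was installed with `pip install -e .`.\nimport torch\nfrom lavis.models import load_model_and_preprocess\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel, vis_processors, _ = load_model_and_preprocess(\n    name=\"blip2_t5\", model_type=\"pretrain_flant5xl\", is_eval=True, device=device\n)\n\ncandidates = [\"(A) a cat\", \"(B) a dog\", \"(C) a bird\"]  # illustrative options\nsamples = {\n    # In practice the image is produced by vis_processors[\"eval\"];\n    # a zero tensor of the expected (batch_size, 3, H, W) shape keeps\n    # the sketch self-contained.\n    \"image\": torch.zeros(1, 3, 224, 224).to(device),\n    \"prompt\": \"What is in the image? Options: (A) a cat (B) a dog (C) a bird. Answer:\",\n}\n\n# Each row ranks the candidate indices by ascending generation loss,\n# so column 0 holds the predicted option for that sample.\nranks = model.predict_class(samples, candidates, n_segments=1)\npredictions = [candidates[row[0]] for row in ranks.tolist()]\n```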
\n\n### Data Usage\nFor data-related parameters, we list the required flags for each task below; they are combined with the model-related arguments to form a complete evaluation command (see the example command after the Visually Grounded Reasoning subsection).\n\n#### Coarse-Grained Perception\nCoarse-grained perception (CG) is the ability to recognize the overall layout and main objects at the image level.\n\n##### Flowers102\n```bash\n--dataset_name Flowers102 --formulation SingleChoice --dataset_config build/configs/ImageClassification_flowers102_val.yaml\n```\n##### CIFAR10\n```bash\n--dataset_name CIFAR10 --formulation SingleChoice --dataset_config build/configs/ImageClassification_cifar10_val.yaml\n```\n##### ImageNet-1K\n```bash\n--dataset_name ImageNet-1K --formulation SingleChoice --dataset_config build/configs/ImageClassification_imagenet1k_val.yaml\n```\n##### Pets37\n```bash\n--dataset_name Pets37 --formulation SingleChoice --dataset_config build/configs/ImageClassification_pets37_val.yaml\n```\n##### VizWiz-yesno\n```bash\n--dataset_name VizWiz --formulation SingleChoice --dataset_config build/configs/ImageQuality_vizwiz_yesNo_val.yaml\n```\n##### VizWiz-singleChoice\n```bash\n--dataset_name VizWiz --formulation SingleChoice --dataset_config build/configs/ImageQuality_vizwiz_singleChoice_val.yaml\n```\n##### TDIUC-Sport\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_sport.yaml\n```\n##### TDIUC-Scene\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_scene.yaml\n```\n##### MEDIC\n```bash\n--dataset_name MEDIC --formulation SingleChoice --dataset_config build/configs/DisasterType_val.yaml\n```\n\n#### Fine-Grained Perception\nFine-grained perception (FG) requires detailed sensing at the object level.\n\n##### MSCOCO-MCI\n```bash\n--dataset_name MSCOCO --formulation SingleChoice --dataset_config build/configs/MulticlassIdentification_val.yaml\n```\n##### MSCOCO-GOI\n```bash\n--dataset_name MSCOCO --formulation SingleChoice --dataset_config build/configs/GroundedObjIdentification_val.yaml\n```\n##### MSCOCO-MOS\n```bash\n--dataset_name MSCOCO --formulation SingleChoice --dataset_config build/configs/MissingObjectSelection_val.yaml\n```\n\n##### TDIUC-Color\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_color.yaml\n```\n##### TDIUC-Utility\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_utility.yaml\n```\n##### TDIUC-Position\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_position.yaml\n```\n##### TDIUC-Detection\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_detection.yaml\n```\n##### TDIUC-Counting\n```bash\n--dataset_name TDIUC --formulation SingleChoice --dataset_config build/configs/TDIUC_counting.yaml\n```\n##### RefCOCO\n```bash\n--dataset_name RefCOCO --formulation SingleChoice --dataset_config build/configs/ReferringExpression_val.yaml\n```\n##### MSCOCO-OC\n```bash\n--dataset_name MSCOCO --formulation SingleChoice --dataset_config build/configs/ObjectCounting_mscoco_val.yaml\n```\n\n#### Visually Grounded Reasoning\nA reliable LVLM is supposed to perform reasoning based on multi-modal contextual information. To assess this capability, we adopt the commonly applied visual question answering (VQA) task and its variant, knowledge-based visual question answering (K-VQA), which further requires models to utilize internally stored knowledge.\n\n##### VQA v2\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_vqa_v2_val.yaml\n```\n\n##### GQA\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_gqa_val_v2.0.yaml\n```\n\n##### Whoops\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_whoops_val.yaml\n```\n##### OK-VQA\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_okvqa_val.yaml\n```\n\n##### ScienceQA\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_scienceqa_val_v2.0.yaml\n```\n\n##### VizWiz\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_vizwiz_val_v2.0.yaml\n```\n\n##### ViQuAE\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_viquae_val.yaml\n```\n\n##### K-ViQuAE\n```bash\n--dataset_name KVQA --formulation SingleChoice --dataset_config build/configs/KVQA_viquae_val.yaml\n```\n\n##### A-OKVQA\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_aokvqa_val.yaml\n```\n\n##### A-OKVQRA\n```bash\n--dataset_name VQRA --formulation SingleChoice --dataset_config build/configs/VQRA_aokvqa_val.yaml\n```\n\n##### A-OKVQAR\n```bash\n--dataset_name VQAR --formulation SingleChoice --dataset_config build/configs/VQAR_aokvqa_val.yaml\n```\n\n##### ImageNetVC\n```bash\n--dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_imagenetvc_val.yaml\n```
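\n\nAs referenced above, the following sketch shows how these data-related flags fit into a complete evaluation command, taking VQA v2 as an example. The entry script and the model-related arguments are placeholders standing in for the usage instructions earlier in this README; only the data-related flags are prescriptive.\n```bash\n# Illustrative skeleton: run_eval.py and the model flags are placeholders;\n# substitute the entry script and model arguments documented earlier in this README.\npython run_eval.py --model YOUR_MODEL --model_name YOUR_MODEL_NAME --dataset_name VQA --formulation SingleChoice --dataset_config build/configs/VQA_vqa_v2_val.yaml --output_dir output/vqa_v2\n```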
\n\n#### Spatial Understanding\nSpatial understanding is key to the real-life application of LVLMs in robotics. This task requires a comprehensive understanding of both object-object and object-observer relationships so that a model can act appropriately.\n\n##### CLEVR\n```bash\n--dataset_name CLEVR --formulation SingleChoice --dataset_config build/configs/Spatial_clevr_val.yaml\n```\n\n##### VSR\n```bash\n--dataset_name VSR --formulation SingleChoice --dataset_config build/configs/Spatial_vsr_val.yaml\n```\n\n##### MP3D\n```bash\n--dataset_name MP3D --formulation SingleChoice --dataset_config build/configs/Spatial_mp3d_val.yaml\n```\n\n#### Multi-Turn Dialogue\nReForm-Eval evaluates the performance of LVLMs in multi-turn dialogues.\n\n##### VQA-MT\n```bash\n--dataset_name VisDial --formulation SingleChoice --dataset_config build/configs/VQA_vqa_MultiRound_val.yaml --online_multi_round --num_workers 0\n```\n\n##### VisDial\n```bash\n--dataset_name VisDial --formulation SingleChoice --dataset_config build/configs/VisDial_val_v1.2.yaml --online_multi_round --num_workers 0\n```\n\nPlease refer to [Online Multi-round Dialogue](build/prepare_dataset.md#online-multi-round-dialogue) for details on setting up online multi-round dialogues.\n\n#### Cross-Modal Inference\nWe consider two tasks: image-text matching (ITM), which requires models to measure cross-modal similarity, and visual entailment (VE), which requires models to check whether the information is entailed across modalities.\n\n##### MSCOCO-ITM\n```bash\n--dataset_name MSCOCO --formulation SingleChoice --dataset_config build/configs/ImageTextMatching_val.yaml\n```\n##### MSCOCO-ITS\n```bash\n--dataset_name MSCOCO --formulation SingleChoice --dataset_config build/configs/ImageTextSelection_val.yaml\n```\n\n##### WikiHow\n```bash\n--dataset_name WikiHow --formulation SingleChoice --dataset_config build/configs/TemporalOrdering_val.yaml\n```\n\n##### Winoground\n```bash\n--dataset_name CaptionSelection --formulation SingleChoice --dataset_config build/configs/CaptionSelection_winoground_val.yaml\n```\n\n##### SNLI-VE\n```bash\n--dataset_name SNLI-VE --formulation SingleChoice --dataset_config build/configs/VisualEntailment_val.yaml\n```\n\n##### MOCHEG\n```bash\n--dataset_name MCV --formulation SingleChoice --dataset_config build/configs/MCV_mocheg_val.yaml\n```\n\n#### Scene Text Perception\nScene text perception enables LVLMs to identify, understand, and perform inference based on text in images.\n\n##### Grounded IC15\n```bash\n--dataset_name IC15 --formulation OCROpenEnded --dataset_config build/configs/GroundOCR_ic15_val.yaml\n```\n\n##### IC15\n```bash\n--dataset_name IC15 --formulation OCROpenEnded --dataset_config build/configs/OCR_ic15_val.yaml\n```\n\n##### Grounded COCO-Text\n```bash\n--dataset_name COCO_text --formulation OCROpenEnded --dataset_config build/configs/GroundOCR_cocotext_val.yaml\n```\n\n##### COCO-Text\n```bash\n--dataset_name COCO_text --formulation OCROpenEnded --dataset_config build/configs/OCR_cocotext_val.yaml\n```\n\n##### Grounded TextOCR\n```bash\n--dataset_name TextOCR --formulation OCROpenEnded --dataset_config build/configs/GroundOCR_textocr_val.yaml\n```\n\n##### TextOCR\n```bash\n--dataset_name TextOCR --formulation OCROpenEnded --dataset_config build/configs/OCR_textocr_val.yaml\n```\n\n##### CUTE80\n```bash\n--dataset_name CUTE80 --formulation OCROpenEnded --dataset_config build/configs/OCR_cute80_val.yaml\n```\n\n##### IIIT5K\n```bash\n--dataset_name IIIT5K --formulation OCROpenEnded --dataset_config build/configs/OCR_iiit5k_val.yaml\n```\n\n##### WordArt\n```bash\n--dataset_name WordArt --formulation 
OCROpenEnded --dataset_config build/configs/OCR_wordart_val.yaml\n```\n\n##### FUNSD\n```bash\n--dataset_name FUNSD --formulation KIEOpenEnded --dataset_config build/configs/KIE_funsd_val.yaml\n```\n\n##### POIE\n```bash\n--dataset_name POIE --formulation OCROpenEnded --dataset_config build/configs/KIE_poie_val.yaml\n```\n\n##### SROIE\n```bash\n--dataset_name SROIE --formulation OCROpenEnded --dataset_config build/configs/KIE_sroie_val.yaml\n```\n\n##### TextVQA\n```bash\n--dataset_name OCR --formulation OCROpenEnded --dataset_config build/configs/OCR_textvqa_val.yaml\n```\n\n##### DocVQA\n```bash\n--dataset_name OCR --formulation OCROpenEnded --dataset_config build/configs/OCR_docvqa_val.yaml\n```\n\n##### OCR-VQA\n```bash\n--dataset_name OCR --formulation OCROpenEnded --dataset_config build/configs/OCR_ocrvqa_val.yaml\n```\n\n#### Visual Description\nVisual description is an inherent capability of LVLMs as generative models.\n\n##### MSCOCO\n```bash\n--dataset_name MSCOCO --formulation Generation --dataset_config build/configs/Caption_MSCOCO_val.yaml\n```\n\n##### TextCaps\n```bash\n--dataset_name TextCaps --formulation Generation --dataset_config build/configs/Caption_TextCaps_val.yaml\n```\n\n##### NoCaps\n```bash\n--dataset_name NoCaps --formulation Generation --dataset_config build/configs/Caption_NoCaps_val.yaml\n```\n\n##### Flickr30K\n```bash\n--dataset_name Flickr30K --formulation Generation --dataset_config build/configs/Caption_Flickr30K_val.yaml\n```\n\n### Output Result\nThe output JSON file is written to the path given by `--output_dir`, and you can directly look up the corresponding JSON file for the final result. You can also inspect it interactively, e.g. in IPython:\n```python\nimport json\nres = json.load(open('/path/to/YOUR_PREDICTION_FILE.json'))  # load the output json file\nres[0]  # res[n] shows the n-th record among the generated results\n```\n\n## 🖋 Citation\nIf ReForm-Eval has been beneficial to your research and work, please cite our work using the following format:\n```latex\n@misc{li2023reformeval,\n      title={ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks}, \n      author={Zejun Li and Ye Wang and Mengfei Du and Qingwen Liu and Binhao Wu and Jiwen Zhang and Chengxing Zhou and Zhihao Fan and Jie Fu and Jingjing Chen and Xuanjing Huang and Zhongyu Wei},\n      year={2023},\n      eprint={2310.02569},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n## 🤝 Acknowledgements\nWe thank [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), [MMBench](https://github.com/open-compass/MMBench), [LVLM-eHub](http://lvlm-ehub.opengvlab.com/index.html), [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT), and other repositories that have made great contributions to multi-modal large model evaluation. 
In addition, we are grateful to the many LVLMs that have been open-sourced and taken part in our evaluation, enriching the results of our benchmark.\n\n## 🔏 Related Projects\n- [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation)\n- [MMBench: Is Your Multi-modal Model an All-around Player?](https://github.com/open-compass/MMBench)\n- [LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models](http://lvlm-ehub.opengvlab.com/index.html)\n- [M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning](https://huggingface.co/datasets/MMInstruction/M3IT)\n","funding_links":[],"categories":["Datasets-or-Benchmark"],"sub_categories":["多模态-跨模态"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFudanDISC%2FReForm-Eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFudanDISC%2FReForm-Eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFudanDISC%2FReForm-Eval/lists"}