Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/q-future/q-bench
①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A benchmark for multi-modality LLMs (MLLMs) on low-level vision and visual quality assessment.
- Host: GitHub
- URL: https://github.com/q-future/q-bench
- Owner: Q-Future
- Created: 2023-09-25T05:03:16.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2024-07-18T06:10:20.000Z (4 months ago)
- Last Synced: 2024-08-01T02:26:24.168Z (3 months ago)
- Topics: gpt-4, iclr, image-quality-assessment, large-language-models, low-level-vision, quality-assessment, vision-language-dataset, visual-large-language-models
- Language: Jupyter Notebook
- Homepage: https://q-future.github.io/Q-Bench/
- Size: 29.2 MB
- Stars: 222
- Watchers: 1
- Forks: 12
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-ChatGPT-repositories - Q-Bench - ①[ICLR2024 Spotlight] (GPT-4V/Gemini-Pro/Qwen-VL-Plus+16 OS MLLMs) A benchmark for multi-modality LLMs (MLLMs) on low-level vision and visual quality assessment. (NLP)
README
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
_How do multi-modality LLMs perform on low-level computer vision?_
¹Nanyang Technological University, ²Shanghai Jiao Tong University, ³SenseTime Research
*Equal contribution. #Corresponding author.
ICLR2024 Spotlight
Paper | Project Page | Github | Data (LLVisionQA) | Data (LLDescribe) | 质衡 (Chinese-Q-Bench)
The proposed Q-Bench includes three realms for low-level vision: perception (A1), description (A2), and assessment (A3).
- For perception (A1) and description (A2), we collect two benchmark datasets, LLVisionQA and LLDescribe.
- We are open to **submission-based evaluation** for the two tasks. The details for submission are provided below.
- For assessment (A3), as we use **public datasets**, we provide abstract evaluation code for arbitrary MLLMs that anyone can run.

## Use with `datasets` API
For the Q-Bench-A1 (with multi-choice questions), we have converted them into [HF-format datasets](https://huggingface.co/datasets/q-future/Q-Bench-HF) that can automatically be downloaded and used with `datasets` API. Please refer to the following instruction:
```shell
pip install datasets
```

### Q-Bench (single images)
```python
from datasets import load_dataset

ds = load_dataset("q-future/Q-Bench-HF")
print(ds["dev"][0])
### {'id': 0,
### 'image': ,
### 'question': 'How is the lighting of this building?',
### 'option0': 'High',
### 'option1': 'Low',
### 'option2': 'Medium',
### 'option3': 'N/A',
### 'question_type': 2,
### 'question_concern': 3,
### 'correct_choice': 'B'}
```

### Q-Bench2 (image pairs)
```python
from datasets import load_dataset

ds = load_dataset("q-future/Q-Bench2-HF")
print(ds["dev"][0])
### {'id': 0,
### 'image1': ,
### 'image2': ,
### 'question': 'Compared to the first image, how is the clarity of the second image?',
### 'option0': 'More blurry',
### 'option1': 'Clearer',
### 'option2': 'About the same',
### 'option3': 'N/A',
### 'question_type': 2,
### 'question_concern': 0,
### 'correct_choice': 'B'}
```
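As a minimal sketch (not part of the official evaluation scripts), the multi-choice fields above can be assembled into a prompt for your MLLM and the predicted letter compared against `correct_choice`. The helper below assumes the single-image Q-Bench-HF schema shown above and treats the `"N/A"` options as trailing padding; `query_mllm` is a hypothetical stand-in for your own inference call, and the pair version is analogous with `image1`/`image2`.

```python
# Minimal sketch: build an MCQ prompt from a Q-Bench-HF sample and check a prediction.
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def build_prompt(sample):
    # Keep only real options; the "N/A" entries appear to be trailing padding.
    options = [sample[f"option{i}"] for i in range(4) if sample[f"option{i}"] != "N/A"]
    lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    return sample["question"] + "\n" + "\n".join(lines) + "\nAnswer with the option's letter."

def is_correct(sample, predicted_letter):
    # Compare the model's predicted letter with the ground-truth `correct_choice`.
    return predicted_letter.strip().upper().startswith(sample["correct_choice"])

ds = load_dataset("q-future/Q-Bench-HF")
sample = ds["dev"][0]
print(build_prompt(sample))
# predicted = query_mllm(sample["image"], build_prompt(sample))  # hypothetical inference call
# print(is_correct(sample, predicted))
```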
## Release

- [2024/6/17]🔥 The **Q-Bench**, **Q-Bench2** ([Q-bench+](https://arxiv.org/abs/2402.07116)), and [**A-Bench**](https://github.com/Q-Future/A-Bench) have now joined [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), which makes it easier to test LMMs!!
- [2024/6/3] 🔥 [Github repo](https://github.com/Q-Future/A-Bench) for **A-Bench** is online. Do you want to find out if your LMM is a master at evaluating AI-generated images? Come and test on **A-Bench** !!
- [3/1] 🔥 We are releasing **Co-instruct**, *Towards Open-ended Visual Quality Comparison* [here](https://co-instruct.github.io/). More details are coming soon.
- [2/27] 🔥 Our work **Q-Instruct** has been accepted by CVPR 2024; check out the [details](https://github.com/Q-Future/Q-Instruct) about how to instruct MLLMs on low-level vision!
- [2/23] 🔥 The low-level vision compare task part of [Q-bench+](https://arxiv.org/abs/2402.07116) is now released at [Q-bench+(Dataset)](https://huggingface.co/datasets/q-future/q-bench2)!
- [2/10] 🔥 We are releasing the extended [Q-bench+](https://arxiv.org/abs/2402.07116), which challenges MLLMs with both single images and **image pairs** on low-level vision. The [LeaderBoard](https://huggingface.co/spaces/q-future/Q-Bench-Leaderboard) is onsite, check out the low-level vision ability for your favorite MLLMs!! More details coming soon.
- [1/16] 🔥 Our work ["Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision"](https://arxiv.org/abs/2309.14181) is accepted by **ICLR2024 as Spotlight Presentation**.
## Closed-source MLLMs (GPT-4V-Turbo, Gemini, Qwen-VL-Plus, GPT-4V)
We test three closed-source API models: GPT-4V-Turbo (`gpt-4-vision-preview`, replacing the no-longer-available *old version* GPT-4V results), Gemini Pro (`gemini-pro-vision`), and Qwen-VL-Plus (`qwen-vl-plus`). Slightly improved over the older version, GPT-4V still tops all MLLMs and almost matches a junior-level human's performance. Gemini Pro and Qwen-VL-Plus follow behind, yet remain better than the best open-source MLLMs (0.65 overall).

Update on [2024/7/18]: we are glad to release the new SOTA performance of **BlueImage-GPT** (closed-source).
**Perception, A1-Single**
|**Participant Name** | yes-or-no | what | how | distortion | others | in-context distortion | in-context others | overall |
| - | - | - | - | - | - | -| - | - |
| Qwen-VL-Plus (`qwen-vl-plus`) | 0.7574 | 0.7325 | 0.5733| 0.6488 | 0.7324 | 0.6867 | 0.7056 | 0.6893 |
| BlueImage-GPT (`from VIVO` *New Champion*) | **0.8467** | 0.8351 | **0.7469** | 0.7819 | **0.8594** | 0.7995 | 0.8240 | 0.8107 |
| Gemini-Pro (`gemini-pro-vision`) | 0.7221 | 0.7300 |0.6645 | 0.6530 | 0.7291 | 0.7082 | 0.7665 | 0.7058 |
| GPT-4V-Turbo (`gpt-4-vision-preview`) |0.7722 | 0.7839 | 0.6645 |0.7101 | 0.7107 | 0.7936 | 0.7891 | 0.7410 |
| GPT-4V (*old version*) | 0.7792 | 0.7918 | 0.6268 | 0.7058 | 0.7303 | 0.7466 | 0.7795 | 0.7336 |
| human-1-junior | 0.8248 | 0.7939 | 0.6029 | 0.7562 | 0.7208 | 0.7637 | 0.7300 | 0.7431 |
| human-2-senior | 0.8431 | **0.8894** | 0.7202 | **0.7965** | 0.7947 | **0.8390** | **0.8707** | **0.8174** |

**Perception, A1-Pair**
|**Participant Name** | yes-or-no | what | how | distortion | others | compare | joint | overall |
| - | - | - | - | - | - | -| - | - |
| Qwen-VL-Plus (`qwen-vl-plus`) | 0.6685 | 0.5579 | 0.5991 | 0.6246 | 0.5877 | 0.6217 | 0.5920 | 0.6148 |
| Qwen-VL-Max (`qwen-vl-max`) | 0.6765 | 0.6756 | 0.6535 | 0.6909 | 0.6118 | 0.6865 | 0.6129 | 0.6699 |
| BlueImage-GPT (`from VIVO` *New Champion*) | **0.8843** | 0.8033 | **0.7958** | **0.8464** | 0.8062 | 0.8462 | 0.7955 | 0.8348 |
| Gemini-Pro (`gemini-pro-vision`) | 0.6578 | 0.5661 | 0.5674 | 0.6042 | 0.6055 | 0.6046 | 0.6044 | 0.6046 |
| GPT-4V (`gpt-4-vision`) | 0.7975 | 0.6949 | 0.8442 | 0.7732 | 0.7993 | 0.8100 | 0.6800 | 0.7807 |
| Junior-level Human | 0.7811 | 0.7704 | 0.8233 | 0.7817 | 0.7722 | 0.8026 | 0.7639 | 0.8012 |
| Senior-level Human | 0.8300 | **0.8481** | 0.8985 | 0.8313 | **0.9078** | **0.8655** | **0.8225** | **0.8548** |

We have also evaluated several new open-source models recently and will release their results soon.
## Submission Guideline for A1/A2
### Option 1: Submit Results
#### Step 1: Download Images
We now provide two ways to download the datasets (LLVisionQA & LLDescribe):
- via GitHub Release: Please see our [release](https://github.com/Q-Future/Q-Bench/releases/tag/v1.0.1.1014datarelease) for details.
- via Huggingface Datasets: Please refer to the [data release notes](/data_release) to download the images.
#### Step 2: Test with Your Model
It is highly recommended to convert your model into Huggingface format to smoothly test on these data. See the [example scripts for Huggingface's IDEFICS-9B-Instruct](/example_code_for_idefics), and modify them to test your custom model.
**Please email `[email protected]` to submit your result in json format.**
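For reference, a results file can be produced with a plain `json.dump`; the field names below are only an illustration (follow whatever schema the organizers request), keeping each LLVisionQA question `id` together with the predicted option.

```python
# Hypothetical sketch of dumping A1 predictions to a JSON file for submission.
# The field names here are illustrative only, not an official schema.
import json

results = [
    {"id": 0, "predicted_choice": "B"},  # one entry per LLVisionQA question
    {"id": 1, "predicted_choice": "A"},
]

with open("my_model_llvisionqa_results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```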
### Option 2: Submit Model
You can also submit your model (either a Huggingface AutoModel or a ModelScope AutoModel) to us, alongside your custom evaluation scripts. Your custom scripts can be modified from the [template scripts](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa_qbench.py) that work for LLaVA-v1.5 (for A1/A2), and from [this one](example_code_for_idefics/a3_assessment_all.py) (for image quality assessment).
**Please email `[email protected]` to submit your model if you are _outside_ China Mainland.**
**Please email `[email protected]` to submit your model if you are _inside_ China Mainland.**

## A1: Perception
A snapshot of the LLVisionQA benchmark dataset for MLLM low-level perception ability is shown below. See the [leaderboard](leaderboards/) for results.
![Picture](llvisionqa_db.png)
We measure the answer accuracy of MLLMs (provided with the question and all choices) as the metric here.
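As a minimal sketch of this metric (assuming model outputs have already been mapped to option letters), accuracy is simply the fraction of questions answered with the correct letter:

```python
# Minimal sketch of the A1 metric: fraction of multi-choice questions answered correctly.
def accuracy(predictions, ground_truths):
    """Both arguments are lists of option letters, e.g. ["A", "C", ...]."""
    assert len(predictions) == len(ground_truths)
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

print(accuracy(["A", "B", "C"], ["A", "B", "D"]))  # 0.666...
```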
## A2: Description
A snapshot of the LLDescribe benchmark dataset for MLLM low-level description ability is shown below. See the [leaderboard](leaderboards/) for results.
![Picture](lldescribe.png)
We measure the _completeness_, _precision_, and _relevance_ of MLLM descriptions as the metric here.
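As a minimal sketch of how such ratings can be aggregated (assuming each description has already been rated per dimension by a GPT-assisted judge; the rating values below are made up for illustration):

```python
# Minimal sketch for aggregating A2 ratings; the per-description scores here are made up,
# and in practice they come from a GPT-assisted judge rating each dimension.
from statistics import mean

ratings = [
    {"completeness": 2, "precision": 1, "relevance": 2},
    {"completeness": 1, "precision": 2, "relevance": 2},
]

for dim in ("completeness", "precision", "relevance"):
    print(dim, mean(r[dim] for r in ratings))
print("sum", mean(sum(r.values()) for r in ratings))
```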
## A3: Assessment
_An exciting ability: MLLMs can predict quantitative scores for IQA!_
### Methodology
![Picture](llmiqa.png)
### Predict a Score
#### Pseudo Code
Similar to the above, as long as a model (based on a causal language model) provides the two methods `embed_image_and_text` (to allow multi-modality inputs) and `forward` (to compute logits), Image Quality Assessment (IQA) with the model can be done as follows:
```python
from PIL import Image
from my_mllm_model import Model, Tokenizer, embed_image_and_text

model, tokenizer = Model(), Tokenizer()
prompt = "##User: Rate the quality of the image.\n" \
"##Assistant: The quality of the image is" ### This line can be modified based on MLLM's default behaviour.good_idx, poor_idx = tokenizer(["good","poor"]).tolist()
image = Image.open("image_for_iqa.jpg")
input_embeds = embed_image_and_text(image, prompt)
output_logits = model(input_embeds=input_embeds).logits[0,-1]
q_pred = (output_logits[[good_idx, poor_idx]] / 100).softmax(0)[0]
```

\*Note that you can modify the second line of the prompt based on your model's default format, _e.g._ for [Shikra](https://github.com/shikras/shikra), "##Assistant: The quality of the image is" is modified to "##Assistant: The answer is". It is also fine if your MLLM first answers something like "Ok, I would like to help! The image quality is"; simply use that as the second line of the prompt.
#### Example Real Code for IDEFICS
We further provide a full implementation of IDEFICS on IQA. See [example](example_code_for_idefics/README.md) on how to run IQA with this MLLM. Other MLLMs can also be modified in the same way for use in IQA.
#### Compute SRCC/PLCC with IQA databases
We have prepared JSON-format human opinion scores (MOS) for the seven IQA databases evaluated in our benchmark.
Please see [IQA_databases](a3_iqa_databases/) for details.
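As a minimal sketch of this step (the JSON layout below is an assumption for illustration, not the exact format of the released files), SRCC and PLCC between predicted scores and MOS can be computed with `scipy`:

```python
# Minimal sketch: compute SRCC (Spearman) and PLCC (Pearson) between predicted quality
# scores and human MOS. The JSON layout here is assumed; adapt the parsing to the
# actual files under a3_iqa_databases/.
import json
from scipy.stats import spearmanr, pearsonr

with open("iqa_database_mos.json") as f:       # hypothetical: {"image_001.jpg": 3.42, ...}
    mos = json.load(f)
with open("my_model_predictions.json") as f:   # hypothetical: {"image_001.jpg": 0.71, ...}
    preds = json.load(f)

names = sorted(set(mos) & set(preds))
gt = [mos[n] for n in names]
pr = [preds[n] for n in names]

srcc, _ = spearmanr(pr, gt)
plcc, _ = pearsonr(pr, gt)
print(f"SRCC: {srcc:.4f}, PLCC: {plcc:.4f}")
```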
### Official Results on IQA Databases
Moved to [leaderboards](leaderboards/). Please click to see details.
## Contact
Please contact any of the first authors of this paper for queries.
- Haoning Wu, `[email protected]`, @teowu
- Zicheng Zhang, `[email protected]`, @zzc-1998
- Erli Zhang, `[email protected]`, @ZhangErliCarl

## Citation
If you find our work interesting, please feel free to cite our paper:
```bibtex
@inproceedings{wu2024qbench,
author = {Wu, Haoning and Zhang, Zicheng and Zhang, Erli and Chen, Chaofeng and Liao, Liang and Wang, Annan and Li, Chunyi and Sun, Wenxiu and Yan, Qiong and Zhai, Guangtao and Lin, Weisi},
title = {Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision},
booktitle = {ICLR},
year = {2024}
}
```