Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thu-ml/MMTrustEval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
benchmark claude fairness gpt-4 mllm multi-modal privacy robustness safety toolbox trustworthy-ai truthfulness
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/thu-ml/MMTrustEval
- Owner: thu-ml
- License: cc-by-sa-4.0
- Created: 2024-06-09T05:00:54.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-11-05T12:56:07.000Z (2 months ago)
- Last Synced: 2024-11-23T09:10:44.338Z (about 2 months ago)
- Topics: benchmark, claude, fairness, gpt-4, mllm, multi-modal, privacy, robustness, safety, toolbox, trustworthy-ai, truthfulness
- Language: Python
- Homepage: https://multi-trust.github.io/
- Size: 15.8 MB
- Stars: 108
- Watchers: 5
- Forks: 7
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-MLLM-Safety - https://github.com/thu-ml/MMTrustEval (Evaluation)
- Awesome-LLMs-Datasets - https://github.com/thu-ml/MMTrustEval
README
Project Page
arXiv Paper
Documentation
Dataset
🤗 Hugging Face
Leaderboard
---
> **MultiTrust** is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks to expose new trustworthiness challenges.
## News
* **`2024.11.05`** We have released the dataset of MultiTrust on 🤗 [Hugging Face](https://huggingface.co/datasets/thu-ml/MultiTrust). Feel free to download it and test your own model!
* **`2024.11.05`** We have updated the toolbox to support several recent models, e.g., [Phi-3.5](https://huggingface.co/microsoft/Phi-3.5-vision-instruct), [Cambrian-13B](https://huggingface.co/nyu-visionx/cambrian-13b), [Qwen2-VL-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), and [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). Their results have been uploaded to the [leaderboard](https://multi-trust.github.io/)!
* **`2024.09.26`** [Our paper](https://arxiv.org/abs/2406.07057) has been accepted by the Datasets and Benchmarks track at NeurIPS 2024! See you in Vancouver~
* **`2024.08.12`** We have released the latest results for [DeepSeek-VL](https://github.com/deepseek-ai/DeepSeek-VL) and [hunyuan-vision](https://hunyuan.tencent.com/) on our [project website](https://multi-trust.github.io/)!
* **`2024.07.07`** We have released the latest results for [GPT-4o](https://openai.com/index/hello-gpt-4o/), [Claude-3.5](https://www.anthropic.com/news/claude-3-5-sonnet), and [Phi-3](https://ollama.com/library/phi3) on our [project website](https://multi-trust.github.io/)!
* **`2024.06.07`** We have released [MultiTrust](https://multi-trust.github.io/), the first comprehensive and unified benchmark on the trustworthiness of MLLMs!
## Installation
The environment of this version has been updated to accommodate more recent models. To replicate the experimental results presented in the paper more precisely, switch to the branch [v0.1.0](https://github.com/thu-ml/MMTrustEval/tree/v0.1.0).
- Option A: Pip install
```shell
conda create -n multitrust python=3.9
conda activate multitrust

# Note: Tsinghua Source can be discarded.
pip install -r env/requirements.txt
```
- Option B: Docker
- (Optional) Commands to install Docker
```shell
# Our docker version:
# Client: Docker Engine - Community
# Version: 27.0.0-rc.1
# API version: 1.46
# Go version: go1.21.11
# OS/Arch: linux/amd64

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

sudo systemctl restart docker
sudo usermod -aG docker [your_username_here]
```
- Commands to install environment
```shell
# Note:
# [code] is an `absolute path` of project root: abspath(./)
# [data] and [playground] are `absolute paths` of data and model_playground(decompress our provided data/playground).
docker build -t multitrust:v0.0.1 -f env/Dockerfile .

docker run -it \
    --name multitrust \
    --gpus all \
    --privileged=true \
    --shm-size=10gb \
    -v /home/[your_user_name_here]/.cache/huggingface:/root/.cache/huggingface \
    -v /home/[your_user_name_here]/.cache/torch:/root/.cache/torch \
    -v [code]:/root/multitrust \
    -v [data]:/root/multitrust/data \
    -v [playground]:/root/multitrust/playground \
    -w /root/multitrust \
    -p 11180:22 \
    -p 8000:8000 \
    -d multitrust:v0.0.1 /bin/bash

# entering the container by docker exec
docker exec -it multitrust /bin/bash

# or entering by ssh
ssh -p 11180 root@[your_ip_here]
```
- Several tasks require the use of commercial APIs for auxiliary testing. Therefore, if you want to test all tasks, please add the corresponding model API keys in [env/apikey.yml](https://github.com/thu-ml/MMTrustEval/blob/v0.1.0/env/apikey.yml).
## :envelope: Dataset
### License
- The codebase is licensed under the **CC BY-SA 4.0** license.
- MultiTrust is only used for academic research. Commercial use in any form is prohibited.
- If there is any infringement in MultiTrust, please directly raise an issue, and we will remove it immediately.
### Data Preparation
Refer to [data4multitrust/README.md](data4multitrust/README.md) for detailed instructions.
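As one possible starting point, the Hugging Face release announced in the News section could be fetched programmatically. The snippet below is a minimal sketch, not the project's own preparation procedure: it assumes the `huggingface_hub` package is installed and that your account has access to the dataset repository. The function name `download_multitrust` and the target directory `data` are hypothetical choices, and the directory layout the toolbox actually expects is described in the linked instructions.

```python
from pathlib import Path

# Dataset repository named in the News section of this README.
REPO_ID = "thu-ml/MultiTrust"


def download_multitrust(local_dir: str = "data") -> Path:
    """Fetch a snapshot of the dataset repo into local_dir.

    ASSUMPTION: "data" is only an illustrative target directory; check the
    data preparation instructions for the layout the toolbox expects.
    """
    # Imported lazily so this module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the release is a dataset repo, not a model repo
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_multitrust("data")` returns the local path of the downloaded snapshot. Any decompression or restructuring steps would still follow the instructions referenced above.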
## Docs
Our documentation presents interface definitions for different modules and some tutorials on **how to extend modules**.
Running online at: https://thu-ml.github.io/MMTrustEval/
Run the following command to view the docs locally.
```shell
mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000
```
## Reproduce results in our paper
Running scripts under `scripts/run` can generate the model outputs of specific tasks and corresponding primary evaluation results in either a global or sample-wise manner.
### To Make Inference
```
# Description: Run scripts require a model_id to run inference tasks.
# Usage: bash scripts/run/*/*.sh

scripts/run
├── fairness_scripts
│   ├── f1-stereo-generation.sh
│   ├── f2-stereo-agreement.sh
│   ├── f3-stereo-classification.sh
│   ├── f3-stereo-topic-classification.sh
│   ├── f4-stereo-query.sh
│   ├── f5-vision-preference.sh
│   ├── f6-profession-pred.sh
│   └── f7-subjective-preference.sh
├── privacy_scripts
│   ├── p1-vispriv-recognition.sh
│   ├── p2-vqa-recognition-vispr.sh
│   ├── p3-infoflow.sh
│   ├── p4-pii-query.sh
│   ├── p5-visual-leakage.sh
│   └── p6-pii-leakage-in-conversation.sh
├── robustness_scripts
│   ├── r1-ood-artistic.sh
│   ├── r2-ood-sensor.sh
│   ├── r3-ood-text.sh
│   ├── r4-adversarial-untarget.sh
│   ├── r5-adversarial-target.sh
│   └── r6-adversarial-text.sh
├── safety_scripts
│   ├── s1-nsfw-image-description.sh
│   ├── s2-risk-identification.sh
│   ├── s3-toxic-content-generation.sh
│   ├── s4-typographic-jailbreaking.sh
│   ├── s5-multimodal-jailbreaking.sh
│   └── s6-crossmodal-jailbreaking.sh
└── truthfulness_scripts
    ├── t1-basic.sh
    ├── t2-advanced.sh
    ├── t3-instruction-enhancement.sh
    ├── t4-visual-assistance.sh
    ├── t5-text-misleading.sh
    ├── t6-visual-confusion.sh
    └── t7-visual-misleading.sh
```
### To Evaluate Results
After that, scripts under `scripts/score` can be used to calculate the statistical results based on the outputs and show the results reported in the paper.
```
# Description: Run scripts require a model_id to calculate statistical results.
# Usage: python scripts/score/*/*.py --model_id

scripts/score
├── fairness
│   ├── f1-stereo-generation.py
│   ├── f2-stereo-agreement.py
│   ├── f3-stereo-classification.py
│   ├── f3-stereo-topic-classification.py
│   ├── f4-stereo-query.py
│   ├── f5-vision-preference.py
│   ├── f6-profession-pred.py
│   └── f7-subjective-preference.py
├── privacy
│   ├── p1-vispriv-recognition.py
│   ├── p2-vqa-recognition-vispr.py
│   ├── p3-infoflow.py
│   ├── p4-pii-query.py
│   ├── p5-visual-leakage.py
│   └── p6-pii-leakage-in-conversation.py
├── robustness
│   ├── r1-ood_artistic.py
│   ├── r2-ood_sensor.py
│   ├── r3-ood_text.py
│   ├── r4-adversarial_untarget.py
│   ├── r5-adversarial_target.py
│   └── r6-adversarial_text.py
├── safefy
│   ├── s1-nsfw-image-description.py
│   ├── s2-risk-identification.py
│   ├── s3-toxic-content-generation.py
│   ├── s4-typographic-jailbreaking.py
│   ├── s5-multimodal-jailbreaking.py
│   └── s6-crossmodal-jailbreaking.py
└── truthfulness
    ├── t1-basic.py
    ├── t2-advanced.py
    ├── t3-instruction-enhancement.py
    ├── t4-visual-assistance.py
    ├── t5-text-misleading.py
    ├── t6-visual-confusion.py
    └── t7-visual-misleading.py
```
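The two listings above can be driven from a small helper. The following is a minimal sketch rather than part of the toolbox: the function names are hypothetical, and it assumes that the run scripts accept the model ID as their first positional argument (the usage comment above does not show how it is passed). The `--model_id` flag for the scoring scripts is taken from the usage comment. Scripts are discovered by globbing instead of being hard-coded, because the directory and file names differ between `scripts/run` and `scripts/score`.

```python
import subprocess
from pathlib import Path
from typing import List


def find_scripts(root: str, suffix: str) -> List[Path]:
    """Return all task scripts one directory below root, e.g. scripts/run/*/*.sh."""
    return sorted(Path(root).glob(f"*/*{suffix}"))


def run_command(script: Path, model_id: str) -> List[str]:
    # ASSUMPTION: the run scripts take model_id as the first positional argument.
    return ["bash", str(script), model_id]


def score_command(script: Path, model_id: str) -> List[str]:
    # The --model_id flag comes from the usage comment in the listing above.
    return ["python", str(script), "--model_id", model_id]


def run_task(run_script: Path, score_script: Path, model_id: str) -> None:
    """Run inference for one task, then compute its statistics."""
    subprocess.run(run_command(run_script, model_id), check=True)
    subprocess.run(score_command(score_script, model_id), check=True)
```

For example, `find_scripts("scripts/run", ".sh")` lists every inference script from the project root, and `run_task` could then be called with a matching pair of run and score scripts and a model ID supported by the toolbox.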
### Task List
All 32 tasks are listed below. Each task uses one of three evaluation modes: rule-based evaluation (e.g., keyword matching), automatic evaluation by GPT-4 or other classifiers, or a mixture of the two.
| ID | Task Name | Metrics | Task Type |
|-----|-----------------------------------|-----------------------------------------|-------------|
| T.1 | Basic World Understanding | Accuracy ($\uparrow$) | Dis. & Gen. |
| T.2 | Advanced Cognitive Inference | Accuracy ($\uparrow$) | Dis. |
| T.3 | VQA under Instruction Enhancement | Accuracy ($\uparrow$) | Gen. |
| T.4 | QA under Visual Assistance | Accuracy ($\uparrow$) | Gen. |
| T.5 | Text Misleading VQA | Accuracy ($\uparrow$) | Gen. |
| T.6 | Visual Confusion VQA | Accuracy ($\uparrow$) | Gen. |
| T.7 | Visual Misleading QA | Accuracy ($\uparrow$) | Dis. |
| S.1 | Risk Identification | Accuracy ($\uparrow$) | Dis. & Gen. |
| S.2 | Image Description | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. |
| S.3 | Toxicity Content Generation | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen. |
| S.4 | Plain Typographic Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. |
| S.5 | Optimized Multimodal Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. |
| S.6 | Cross-modal Influence on Jailbreaking | ASR ($\downarrow$), RtA ($\uparrow$) | Gen. |
| R.1 | VQA for Artistic Style images | Score ($\uparrow$) | Gen. |
| R.2 | VQA for Sensor Style images | Score ($\uparrow$) | Gen. |
| R.3 | Sentiment Analysis for OOD texts | Accuracy ($\uparrow$) | Dis. |
| R.4 | Image Captioning under Untarget attack | Accuracy ($\uparrow$) | Gen. |
| R.5 | Image Captioning under Target attack | Attack Success Rate ($\downarrow$) | Gen. |
| R.6 | Textual Adversarial Attack | Accuracy ($\uparrow$) | Dis. |
| F.1 | Stereotype Content Detection | Containing Rate ($\downarrow$) | Gen. |
| F.2 | Agreement on Stereotypes | Agreement Percentage ($\downarrow$) | Dis. |
| F.3 | Classification of Stereotypes | Accuracy ($\uparrow$) | Dis. |
| F.4 | Stereotype Query Test | RtA ($\uparrow$) | Gen. |
| F.5 | Preference Selection in VQA | RtA ($\uparrow$) | Gen. |
| F.6 | Profession Prediction | Pearson's correlation ($\uparrow$) | Gen. |
| F.7 | Preference Selection in QA | RtA ($\uparrow$) | Gen. |
| P.1 | Visual Privacy Recognition | Accuracy, F1 ($\uparrow$) | Dis. |
| P.2 | Privacy-sensitive QA Recognition | Accuracy, F1 ($\uparrow$) | Dis. |
| P.3 | InfoFlow Expectation | Pearson's correlation ($\uparrow$) | Gen. |
| P.4 | PII Query with Visual Cues | RtA ($\uparrow$) | Gen. |
| P.5 | Privacy Leakage in Vision | RtA ($\uparrow$), Accuracy ($\uparrow$) | Gen. |
| P.6 | PII Leakage in Conversations | RtA ($\uparrow$) | Gen. |
### Overall Results
- Proprietary models like GPT-4V and Claude3 demonstrate consistently top performance due to enhancements in alignment and safety filters compared with open-source models.
- A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, indicating that stronger general abilities may improve trustworthiness to some extent.
- Finer correlation analysis shows no significant link across different aspects of trustworthiness, highlighting the need for comprehensive aspect division and identifying gaps in achieving trustworthiness.
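To make the correlation figures above concrete, here is a minimal sketch of how a Pearson correlation coefficient between two sets of per-model scores could be computed. It assumes the reported coefficient is Pearson's (the text above does not say which coefficient is used for the global analysis, although Pearson's correlation is the metric listed for tasks F.6 and P.3). The score lists are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four hypothetical models (NOT results from the paper).
general_capability = [40.0, 55.0, 60.0, 75.0]
trustworthiness = [50.0, 52.0, 65.0, 70.0]
r = pearson(general_capability, trustworthiness)
print(f"Pearson r = {r:.2f}")
```

A value near 1 indicates that models ranking higher on one score tend to rank higher on the other, a value near 0 indicates no linear relationship, and a value near -1 indicates an inverse relationship.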
## :black_nib: Citation
If you find our work helpful for your research, please consider citing it.
```bibtex
@misc{zhang2024benchmarking,
title={Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study},
author={Yichi Zhang and Yao Huang and Yitong Sun and Chang Liu and Zhe Zhao and Zhengwei Fang and
Yifan Wang and Huanran Chen and Xiao Yang and Xingxing Wei and Hang Su and Yinpeng Dong and
Jun Zhu},
year={2024},
eprint={2406.07057},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```