{"id":14043900,"url":"https://github.com/thu-ml/MMTrustEval","last_synced_at":"2025-07-27T15:32:02.238Z","repository":{"id":243507447,"uuid":"812506883","full_name":"thu-ml/MMTrustEval","owner":"thu-ml","description":"A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks) ","archived":false,"fork":false,"pushed_at":"2024-11-05T12:56:07.000Z","size":16537,"stargazers_count":108,"open_issues_count":3,"forks_count":7,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-23T09:10:44.338Z","etag":null,"topics":["benchmark","claude","fairness","gpt-4","mllm","multi-modal","privacy","robustness","safety","toolbox","trustworthy-ai","truthfulness"],"latest_commit_sha":null,"homepage":"https://multi-trust.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-sa-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-09T05:00:54.000Z","updated_at":"2024-11-18T11:59:51.000Z","dependencies_parsed_at":"2024-09-11T12:39:51.493Z","dependency_job_id":"2e217ee5-0228-4c9e-b0a3-3b405ee27345","html_url":"https://github.com/thu-ml/MMTrustEval","commit_stats":null,"previous_names":["thu-ml/mmtrusteval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FMMTrustEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FMMTrustEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/thu-ml%2FMMTrustEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FMMTrustEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thu-ml","download_url":"https://codeload.github.com/thu-ml/MMTrustEval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227814494,"owners_count":17823912,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","claude","fairness","gpt-4","mllm","multi-modal","privacy","robustness","safety","toolbox","trustworthy-ai","truthfulness"],"created_at":"2024-08-12T08:06:37.166Z","updated_at":"2025-07-27T15:32:02.231Z","avatar_url":"https://github.com/thu-ml.png","language":"Python","funding_links":[],"categories":["Evaluation","Multi-modal Large Language Models (MLLMs) Datasets \u003ca id=\"multi-modal-large-language-models-mllms-datasets\"\u003e\u003c/a\u003e"],"sub_categories":["Evaluation Datasets \u003ca id=\"evaluation02\"\u003e\u003c/a\u003e"],"readme":"\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"docs/structure/background.png\" alt=\"background\" style=\"width: 90%;\"\u003e \n\u003c/div\u003e\n\n\u003cdiv align=\"center\" style=\"font-size: 16px;\"\u003e\n    🌐 \u003ca href=\"https://multi-trust.github.io/\"\u003eProject Page\u003c/a\u003e \u0026nbsp\u0026nbsp\n    📖 \u003ca href=\"https://arxiv.org/abs/2406.07057\"\u003earXiv Paper\u003c/a\u003e \u0026nbsp\u0026nbsp\n    📜 \u003ca href=\"https://thu-ml.github.io/MMTrustEval/\"\u003eDocumentation \u003c/a\u003e \u0026nbsp\u0026nbsp\n    📊 
\u003ca href=\"https://docs.google.com/forms/d/e/1FAIpQLSd9ZXKXzqszUoLhRT5fD9ggsSZtbmYNKgFPVekSaseYU69a_Q/viewform?usp=sf_link\"\u003eDataset\u003c/a\u003e \u0026nbsp\u0026nbsp\n    🤗 \u003ca href=\"https://huggingface.co/datasets/thu-ml/MultiTrust\"\u003eHugging Face\u003c/a\u003e \u0026nbsp\u0026nbsp\n    🏆 \u003ca href=\"https://multi-trust.github.io/#leaderboard\"\u003eLeaderboard\u003c/a\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Benchmark-Truthfulness-yellow\" alt=\"Truthfulness\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Benchmark-Safety-red\" alt=\"Safety\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Benchmark-Robustness-blue\" alt=\"Robustness\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Benchmark-Fairness-orange\" alt=\"Fairness\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Benchmark-Privacy-green\" alt=\"Privacy\" /\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n\n---\n\n\u003e **MultiTrust** is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks to expose new trustworthiness challenges. 
\n\n\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"docs/structure/framework.jpg\" alt=\"framework\" style=\"width: 90%;\"\u003e \n\u003c/div\u003e\n\n\n\n\n\n\n## 🚀 News\n* **`2025.03.03`** 🌟 We have released the latest results for [DeepSeek-VL2](https://huggingface.co/deepseek-ai/deepseek-vl2) on our [project website](https://multi-trust.github.io/) ！\n* **`2025.02.11`** 🌟 We have released the latest results for [DeepSeek-Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B), [CogVLM2-Llama3-Chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) and [GLM-4v-9B](https://huggingface.co/THUDM/glm-4v-9b) on our [project website](https://multi-trust.github.io/) ！\n* **`2024.11.05`** 🌟 We have released the dataset of MultiTrust on 🤗[Huggingface](https://huggingface.co/datasets/thu-ml/MultiTrust). Feel free to download and test your own model !\n* **`2024.11.05`** 🌟 We have updated the toolbox to support several latest models, e.g., [Phi-3.5](https://huggingface.co/microsoft/Phi-3.5-vision-instruct), [Cambrian-13B](https://huggingface.co/nyu-visionx/cambrian-13b), [Qwen2-VL-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), [Llama-3.2-11B-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct), and their results have been uploaded to the [leaderboard](https://multi-trust.github.io/) !\n* **`2024.09.26`** 🎉 [Our paper](https://arxiv.org/abs/2406.07057) has been accepted by the Datasets and Benchmarks track in NeurIPS 2024 ！See you in Vancouver ~\n* **`2024.08.12`** 🌟 We have released the latest results for [DeepSeek-VL](https://github.com/deepseek-ai/DeepSeek-VL), and [hunyuan-vision](https://hunyuan.tencent.com/) on our [project website](https://multi-trust.github.io/) ！\n* **`2024.07.07`** 🌟 We have released the latest results for [GPT-4o](https://openai.com/index/hello-gpt-4o/), [Claude-3.5](https://www.anthropic.com/news/claude-3-5-sonnet), and [Phi-3](https://ollama.com/library/phi3) on our [project 
website](https://multi-trust.github.io/) ！\n* **`2024.06.07`** 🌟 We have released [MultiTrust](https://multi-trust.github.io/), the first comprehensive and unified benchmark on the trustworthiness of MLLMs !\n\n## 🛠️ Installation\n\nThe environment of this version has been updated to accommodate more recent models. For a more precise replication of the experimental results presented in the paper, switch to the branch [v0.1.0](https://github.com/thu-ml/MMTrustEval/tree/v0.1.0).\n\n- Option A: UV install\n    ```shell\n    uv venv --python 3.9\n    source .venv/bin/activate\n\n    uv pip install setuptools\n    uv pip install torch==2.3.0\n    uv pip sync --no-build-isolation env/requirements.txt\n    ```\n\n- Option B: Docker\n    - How to install docker\n        ```shell\n        # Our docker version:\n        #     Client: Docker Engine - Community\n        #     Version:           27.0.0-rc.1\n        #     API version:       1.46\n        #     Go version:        go1.21.11\n        #     OS/Arch:           linux/amd64\n\n        distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID)\n        curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -\n        curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list\n\n        sudo apt-get update\n        sudo apt-get install -y nvidia-container-toolkit\n\n        sudo systemctl restart docker\n        sudo usermod -aG docker [your_username_here]\n        ```\n    - Get our image: \n        - B.1: Pull image from DockerHub\n            ```shell\n            docker pull jankinfstmrvv/multitrust:latest\n            ```\n\n        - B.2: Build from scratch\n            ```shell\n            #  Note: \n            # [data] is the absolute path to the data directory.\n\n            docker build --network=host -t multitrust:latest -f env/Dockerfile .\n            ```\n    \n    - Start a container:\n        ```shell\n        docker run -it \\\n            --name multitrust \\\n            --gpus all \\\n            --privileged=true \\\n            --shm-size=10gb \\\n            -v $HOME/.cache/huggingface:/root/.cache/huggingface \\\n            -v $HOME/.cache/torch:/root/.cache/torch \\\n            -v [data]:/root/MMTrustEval/data \\\n            -w /root/MMTrustEval \\\n            -d multitrust:latest /bin/bash\n\n        # entering the container\n        docker exec -it multitrust /bin/bash\n        ```\n  \n- Several tasks require the use of commercial APIs for auxiliary testing. Therefore, if you want to test all tasks, please add the corresponding model API keys in [env/apikey.yml](https://github.com/thu-ml/MMTrustEval/blob/v0.1.0/env/apikey.yml).\n\n## :envelope: Dataset\n\n### License\n- The codebase is licensed under the **CC BY-SA 4.0** license.\n\n- MultiTrust is only used for academic research. 
Commercial use in any form is prohibited.\n\n- If there is any infringement in MultiTrust, please raise an issue, and we will remove it immediately.\n\n### Data Preparation\n\nSee [here](data4multitrust/README.md) for detailed instructions.\n\n## 📚 Docs\nOur documentation presents interface definitions for the different modules and some tutorials on **how to extend modules**.\nIt is hosted online at: https://thu-ml.github.io/MMTrustEval/\n\nRun the following command to serve the docs locally.\n```shell\nmkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000\n```\n\n## 📈 Reproduce results in Our paper\n\nRunning the scripts under `scripts/run` generates the model outputs for specific tasks and the corresponding primary evaluation results, in either a global or sample-wise manner. \n### 📌 To Make Inference \n\n```\n# Description: The run scripts require a model_id to run inference tasks.\n# Usage: bash scripts/run/*/*.sh \u003cmodel_id\u003e\n\nscripts/run\n├── fairness_scripts\n│   ├── f1-stereo-generation.sh\n│   ├── f2-stereo-agreement.sh\n│   ├── f3-stereo-classification.sh\n│   ├── f3-stereo-topic-classification.sh\n│   ├── f4-stereo-query.sh\n│   ├── f5-vision-preference.sh\n│   ├── f6-profession-pred.sh\n│   └── f7-subjective-preference.sh\n├── privacy_scripts\n│   ├── p1-vispriv-recognition.sh\n│   ├── p2-vqa-recognition-vispr.sh\n│   ├── p3-infoflow.sh\n│   ├── p4-pii-query.sh\n│   ├── p5-visual-leakage.sh\n│   └── p6-pii-leakage-in-conversation.sh\n├── robustness_scripts\n│   ├── r1-ood-artistic.sh\n│   ├── r2-ood-sensor.sh\n│   ├── r3-ood-text.sh\n│   ├── r4-adversarial-untarget.sh\n│   ├── r5-adversarial-target.sh\n│   └── r6-adversarial-text.sh\n├── safety_scripts\n│   ├── s1-nsfw-image-description.sh\n│   ├── s2-risk-identification.sh\n│   ├── s3-toxic-content-generation.sh\n│   ├── s4-typographic-jailbreaking.sh\n│   ├── s5-multimodal-jailbreaking.sh\n│   └── s6-crossmodal-jailbreaking.sh\n└── truthfulness_scripts\n    ├── t1-basic.sh\n    ├── t2-advanced.sh\n    ├── 
t3-instruction-enhancement.sh\n    ├── t4-visual-assistance.sh\n    ├── t5-text-misleading.sh\n    ├── t6-visual-confusion.sh\n    └── t7-visual-misleading.sh\n```\n\n### 📌 To Evaluate Results\nAfter that, the scripts under `scripts/score` can be used to calculate the statistical results from the outputs and show the results reported in the paper.\n```\n# Description: The score scripts require a model_id to calculate statistical results.\n# Usage: python scripts/score/*/*.py --model_id \u003cmodel_id\u003e\n\nscripts/score\n├── fairness\n│   ├── f1-stereo-generation.py\n│   ├── f2-stereo-agreement.py\n│   ├── f3-stereo-classification.py\n│   ├── f3-stereo-topic-classification.py\n│   ├── f4-stereo-query.py\n│   ├── f5-vision-preference.py\n│   ├── f6-profession-pred.py\n│   └── f7-subjective-preference.py\n├── privacy\n│   ├── p1-vispriv-recognition.py\n│   ├── p2-vqa-recognition-vispr.py\n│   ├── p3-infoflow.py\n│   ├── p4-pii-query.py\n│   ├── p5-visual-leakage.py\n│   └── p6-pii-leakage-in-conversation.py\n├── robustness\n│   ├── r1-ood_artistic.py\n│   ├── r2-ood_sensor.py\n│   ├── r3-ood_text.py\n│   ├── r4-adversarial_untarget.py\n│   ├── r5-adversarial_target.py\n│   └── r6-adversarial_text.py\n├── safefy\n│   ├── s1-nsfw-image-description.py\n│   ├── s2-risk-identification.py\n│   ├── s3-toxic-content-generation.py\n│   ├── s4-typographic-jailbreaking.py\n│   ├── s5-multimodal-jailbreaking.py\n│   └── s6-crossmodal-jailbreaking.py\n└── truthfulness\n    ├── t1-basic.py\n    ├── t2-advanced.py\n    ├── t3-instruction-enhancement.py\n    ├── t4-visual-assistance.py\n    ├── t5-text-misleading.py\n    ├── t6-visual-confusion.py\n    └── t7-visual-misleading.py\n```\n### 📌 Task List\nAll 32 tasks are listed below. ○: rule-based evaluation (e.g., keyword matching); ●: automatic evaluation\nby GPT-4 or other classifiers; ◐: mixed evaluation.\n\n| ID  | Task Name                         | Metrics                                 | Task Type        | Eval       
|\n|-----|-----------------------------------|-----------------------------------------|-------------|--|\n| T.1 | Basic World Understanding         | Accuracy ($\\uparrow$)                   | Dis.\\\u0026Gen.  | ◐|\n| T.2 | Advanced Cognitive Inference      | Accuracy ($\\uparrow$)                   | Dis.        | ○|\n| T.3 | VQA under Instruction Enhancement | Accuracy ($\\uparrow$)                   | Gen.        | ●|\n| T.4 | QA under Visual Assistance        | Accuracy ($\\uparrow$)                   | Gen.        | ●|\n| T.5 | Text Misleading VQA               | Accuracy ($\\uparrow$)                   | Gen.        | ●|\n| T.6 | Visual Confusion VQA              | Accuracy ($\\uparrow$)                   | Gen.        | ○|\n| T.7 | Visual Misleading QA              | Accuracy ($\\uparrow$)                   | Dis.        | ●|\n| S.1 | Risk Identification               | Accuracy ($\\uparrow$)                   | Dis.\\\u0026Gen.  | ◐|\n| S.2 | Image Description                 | Toxicity Score ($\\downarrow$), RtA ($\\uparrow$)  | Gen.        | ●|\n| S.3 | Toxicity Content Generation       | Toxicity Score ($\\downarrow$), RtA ($\\uparrow$)  | Gen.        | ◐|\n| S.4 | Plain Typographic Jailbreaking    | ASR ($\\downarrow$), RtA ($\\uparrow$)             | Gen.        | ◐|\n| S.5 | Optimized Multimodal Jailbreaking | ASR ($\\downarrow$), RtA ($\\uparrow$)             | Gen.        | ◐|\n| S.6 | Cross-modal Influence on Jailbreaking | ASR ($\\downarrow$), RtA ($\\uparrow$)          | Gen.        | ◐|\n| R.1 | VQA for Artistic Style images     | Score ($\\uparrow$)                      | Gen.        | ◐|\n| R.2 | VQA for Sensor Style images       | Score ($\\uparrow$)                      | Gen.        | ●|\n| R.3 | Sentiment Analysis for OOD texts  | Accuracy ($\\uparrow$)                   | Dis.        | ○|\n| R.4 | Image Captioning under Untarget attack | Accuracy ($\\uparrow$)               | Gen.        
| ◐|\n| R.5 | Image Captioning under Target attack | Attack Success Rate ($\\downarrow$)    | Gen.        | ◐|\n| R.6 | Textual Adversarial Attack        | Accuracy ($\\uparrow$)                   | Dis.        | ○|\n| F.1 | Stereotype Content Detection      | Containing Rate ($\\downarrow$)          | Gen.        | ●|\n| F.2 | Agreement on Stereotypes          | Agreement Percentage ($\\downarrow$)     | Dis.        | ◐|\n| F.3 | Classification of Stereotypes     | Accuracy ($\\uparrow$)                   | Dis.        | ○|\n| F.4 | Stereotype Query Test             | RtA ($\\uparrow$)                        | Gen.        | ◐|\n| F.5 | Preference Selection in VQA       | RtA ($\\uparrow$)                        | Gen.        | ●|\n| F.6 | Profession Prediction             | Pearson’s correlation ($\\uparrow$)      | Gen.        | ◐|\n| F.7 | Preference Selection in QA        | RtA ($\\uparrow$)                        | Gen.        | ●|\n| P.1 | Visual Privacy Recognition        | Accuracy, F1 ($\\uparrow$)               | Dis.        | ○|\n| P.2 | Privacy-sensitive QA Recognition  | Accuracy, F1 ($\\uparrow$)               | Dis.        | ○|\n| P.3 | InfoFlow Expectation              | Pearson's Correlation ($\\uparrow$)      | Gen.        | ○|\n| P.4 | PII Query with Visual Cues        | RtA ($\\uparrow$)                        | Gen.        | ◐|\n| P.5 | Privacy Leakage in Vision         | RtA ($\\uparrow$), Accuracy ($\\uparrow$) | Gen.        | ◐|\n| P.6 | PII Leakage in Conversations      | RtA ($\\uparrow$) | Gen.        
| ◐|\n\n### ⚛️ Overall Results \n- Proprietary models such as GPT-4V and Claude3 consistently achieve top performance, owing to enhancements in alignment and safety filters compared with open-source models.\n- A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, indicating that stronger general abilities can improve trustworthiness to some extent.\n- Finer correlation analysis shows no significant link across different aspects of trustworthiness, highlighting the need for comprehensive aspect division and identifying gaps in achieving trustworthiness.\n\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"docs/structure/overall.png\" alt=\"result\" style=\"width: 90%;\"\u003e \n\u003c/div\u003e\n\n\n## :black_nib: Citation\nIf you find our work helpful for your research, please consider citing our work.\n\n```bibtex\n@article{zhang2024benchmarking,\n  title={Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study},\n  author={Zhang, Yichi and Huang, Yao and Sun, Yitong and Liu, Chang and Zhao, Zhe and Fang, Zhengwei and Wang, Yifan and Chen, Huanran and Yang, Xiao and Wei, Xingxing and others},\n  journal={arXiv preprint arXiv:2406.07057},\n  year={2024}\n}  \n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2FMMTrustEval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-ml%2FMMTrustEval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2FMMTrustEval/lists"}