https://github.com/yfzhang114/mmrlhf-eval
- Host: GitHub
- URL: https://github.com/yfzhang114/mmrlhf-eval
- Owner: yfzhang114
- License: other
- Created: 2025-02-06T01:36:39.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-02-21T07:56:21.000Z (4 months ago)
- Last Synced: 2025-03-31T00:03:03.380Z (3 months ago)
- Language: Python
- Size: 2.3 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - yfzhang114/mmrlhf-eval - The mmrlhf-eval project evaluates large language models (LLMs) on multimodal reinforcement learning from human feedback (RLHF) tasks. It provides a comprehensive evaluation framework, including datasets, evaluation metrics, and baseline models. Its distinguishing feature is multimodality: it handles image, text, and other input modalities, bringing evaluation closer to real-world application scenarios. It works by using a pretrained LLM as the policy network and fine-tuning it with an RLHF algorithm so that it responds better to human feedback. Evaluation metrics such as reward score and success rate are used to measure model performance. The project provides detailed experimental setups and reproduction steps, making it easy for researchers to run and compare experiments, and it is valuable for research on multimodal RLHF and on improving the agent capabilities of LLMs. It supports multiple LLM models and offers an extensible evaluation platform where users can define their own datasets and metrics. In short, mmrlhf-eval is a powerful tool for multimodal RLHF evaluation, aimed at advancing LLM applications in the agent domain. (Multimodal large models / resource transfer & download)
README
[[📖 arXiv Paper](https://arxiv.org/abs/2502.10391)]
[[📊 MM-RLHF Data](https://huggingface.co/datasets/yifanzhang114/MM-RLHF)]
[[📝 Homepage](https://mm-rlhf.github.io/)][[🏆 Reward Model](https://huggingface.co/yifanzhang114/MM-RLHF-Reward-7B-llava-ov-qwen)]
[[🔮 MM-RewardBench](https://huggingface.co/datasets/yifanzhang114/MM-RLHF-RewardBench)]
[[🔮 MM-SafetyBench](https://github.com/yfzhang114/mmrlhf-eval)]
[[📈 Evaluation Suite](https://github.com/yfzhang114/mmrlhf-eval)]

# The Evaluation Suite of Large Multimodal Models
Welcome to the docs for `mmrlhf-eval`: the evaluation suite for the [MM-RLHF](https://github.com/yfzhang114/MM-RLHF) project.
---
## Announcement
- [2025-03] 📝📝 This project is built upon the lmms_eval framework. We have established a dedicated *"Hallucination and Safety Tasks"* category, incorporating three key benchmarks: *AMBER, MMHal-Bench, and ObjectHallusion.* **Additionally, we introduce our novel MM-RLHF-SafetyBench task, a comprehensive safety evaluation protocol specifically designed for MLLMs.** Detailed specifications of MM-RLHF-SafetyBench are documented in [current_tasks](docs/current_tasks.md); a quick way to confirm these tasks are registered after installation is sketched below.
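As a quick sanity check after installation, lmms-eval can print its registered tasks. The sketch below assumes the standard `python3 -m lmms_eval --tasks list` entry point from lmms-eval; the grep pattern uses illustrative task-name fragments, so consult [current_tasks](docs/current_tasks.md) for the canonical identifiers.

```bash
# List every task registered with lmms-eval and filter for the hallucination/safety
# benchmarks. The pattern below is illustrative; see docs/current_tasks.md for the
# exact task names, including the MM-RLHF-SafetyBench identifier.
python3 -m lmms_eval --tasks list | grep -iE "amber|mmhal|hallusion|safety"
```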
## Installation
For development, you can install the package by cloning the repository and running the following command:
```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
```

If you want to test LLaVA, you will have to clone their repo from [LLaVA](https://github.com/haotian-liu/LLaVA) and install it from source:
```bash
# for llava 1.5
# git clone https://github.com/haotian-liu/LLaVA
# cd LLaVA
# pip install -e .

# for llava-next (1.6)
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
```

## Evaluation and Safety Benchmark
### AMBER Dataset
To run evaluations for the **AMBER dataset**, you need to download the image data from the following link and place it in the `lmms_eval/tasks/amber` folder:
[AMBER dataset image download](https://drive.google.com/file/d/1MaCHgtupcZUjf007anNl4_MV0o4DjXvl/view?usp=sharing)
Once the image data is downloaded and placed in the correct folder, you can proceed with evaluating AMBER-based tasks.
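A minimal invocation sketch follows, assuming the benchmark is registered under the task name `amber` and using a LLaVA-1.5 checkpoint as an example; both the task name and the model arguments are illustrative and should be adapted to your setup.

```bash
# Illustrative evaluation run: the task name "amber" and the checkpoint below are
# assumptions; check `python3 -m lmms_eval --tasks list` for the exact identifier.
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks amber \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```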
### CHAIR Metric for Object Hallucination and AMBER
For benchmarks that require calculation of the **CHAIR metric** (such as **Object Hallucination** and **AMBER**), you'll need to install and configure the required Natural Language Toolkit (NLTK) resources. Run the following commands to download the necessary NLTK data:
```bash
# Sketch of the NLTK setup; the exact resources (punkt, wordnet) are assumptions,
# so adjust them to what your CHAIR evaluation actually requires.
python3 - <<'EOF'
import nltk
nltk.download('punkt')
nltk.download('wordnet')
EOF
```