https://github.com/yfzhang114/mmrlhf-eval
- Host: GitHub
- URL: https://github.com/yfzhang114/mmrlhf-eval
- Owner: yfzhang114
- License: other
- Created: 2025-02-06T01:36:39.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-02-21T07:56:21.000Z (4 months ago)
- Last Synced: 2025-03-31T00:03:03.380Z (3 months ago)
- Language: Python
- Size: 2.3 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - yfzhang114/mmrlhf-eval - The mmrlhf-eval project evaluates large language models (LLMs) on multimodal reinforcement learning from human feedback (RLHF) tasks. It provides a comprehensive evaluation framework, including datasets, evaluation metrics, and baseline models. Its distinguishing feature is multimodality: it handles image, text, and other input modalities, bringing evaluation closer to real-world application scenarios. It works by using a pretrained LLM as the policy network and fine-tuning it with an RLHF algorithm so that it responds better to human feedback. Evaluation metrics such as reward score and success rate are used to measure model performance. The project provides detailed experimental setups and reproduction steps, making it easy for researchers to run and compare experiments, and it is valuable for research on multimodal RLHF and on improving the agent capabilities of LLMs. It supports multiple LLM models and offers an extensible evaluation platform where users can define their own datasets and metrics. In short, mmrlhf-eval is a powerful tool for multimodal RLHF evaluation, aimed at advancing LLM applications in the agent domain. (Multimodal large models / resource transfer & download)
README
[[📖 arXiv Paper](https://arxiv.org/abs/2502.10391)]
[[📊 MM-RLHF Data](https://huggingface.co/datasets/yifanzhang114/MM-RLHF)]
[[📝 Homepage](https://mm-rlhf.github.io/)][[🏆 Reward Model](https://huggingface.co/yifanzhang114/MM-RLHF-Reward-7B-llava-ov-qwen)]
[[🔮 MM-RewardBench](https://huggingface.co/datasets/yifanzhang114/MM-RLHF-RewardBench)]
[[🔮 MM-SafetyBench](https://github.com/yfzhang114/mmrlhf-eval)]
[[📈 Evaluation Suite](https://github.com/yfzhang114/mmrlhf-eval)]

# The Evaluation Suite of Large Multimodal Models
Welcome to the docs for `mmrlhf-eval`: the evaluation suite for the [MM-RLHF](https://github.com/yfzhang114/MM-RLHF) project.
---
## Announcement
- [2025-03] 📝📝 This project is built upon the lmms_eval framework. We have established a dedicated *"Hallucination and Safety Tasks"* category, incorporating three key benchmarks: *AMBER, MMHal-Bench, and ObjectHallusion.* **Additionally, we introduce our novel MM-RLHF-SafetyBench task, a comprehensive safety evaluation protocol specifically designed for MLLMs.** Detailed specifications of MM-RLHF-SafetyBench are documented in [current_tasks](docs/current_tasks.md); a quick way to confirm these tasks are registered after installation is sketched below.
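As a quick sanity check after installation, lmms-eval can print its registered tasks. The sketch below assumes the standard `python3 -m lmms_eval --tasks list` entry point from lmms-eval; the grep pattern uses illustrative task-name fragments, so consult [current_tasks](docs/current_tasks.md) for the canonical identifiers.

```bash
# List every task registered with lmms-eval and filter for the hallucination/safety
# benchmarks. The pattern below is illustrative; see docs/current_tasks.md for the
# exact task names, including the MM-RLHF-SafetyBench identifier.
python3 -m lmms_eval --tasks list | grep -iE "amber|mmhal|hallusion|safety"
```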
## Installation
For development, you can install the package by cloning the repository and running the following command:
```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
```

If you want to test LLaVA, you will have to clone their repo from [LLaVA](https://github.com/haotian-liu/LLaVA) and install it from source:
```bash
# for llava 1.5
# git clone https://github.com/haotian-liu/LLaVA
# cd LLaVA
# pip install -e .

# for llava-next (1.6)
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
```

## Evaluation and Safety Benchmark
### AMBER Dataset
To run evaluations for the **AMBER dataset**, you need to download the image data from the following link and place it in the `lmms_eval/tasks/amber` folder:
[AMBER dataset image download](https://drive.google.com/file/d/1MaCHgtupcZUjf007anNl4_MV0o4DjXvl/view?usp=sharing)
Once the image data is downloaded and placed in the correct folder, you can proceed with evaluating AMBER-based tasks.
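A minimal invocation sketch follows, assuming the benchmark is registered under the task name `amber` and using a LLaVA-1.5 checkpoint as an example; both the task name and the model arguments are illustrative and should be adapted to your setup.

```bash
# Illustrative evaluation run: the task name "amber" and the checkpoint below are
# assumptions; check `python3 -m lmms_eval --tasks list` for the exact identifier.
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks amber \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```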
### CHAIR Metric for Object Hallucination and AMBER
For benchmarks that require calculation of the **CHAIR metric** (such as **Object Hallucination** and **AMBER**), you'll need to install and configure the required Natural Language Toolkit (NLTK) resources. Run the following commands to download the necessary NLTK data:
```bash
# Sketch of the NLTK setup; the exact resources (punkt, wordnet) are assumptions,
# so adjust them to what your CHAIR evaluation actually requires.
python3 - <<'EOF'
import nltk
nltk.download('punkt')
nltk.download('wordnet')
EOF
```