Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
RewardBench: the first evaluation tool for reward models.
https://github.com/allenai/reward-bench
- Host: GitHub
- URL: https://github.com/allenai/reward-bench
- Owner: allenai
- License: apache-2.0
- Created: 2023-12-23T16:20:07.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-03-26T22:54:41.000Z (3 months ago)
- Last Synced: 2024-03-27T00:38:50.613Z (3 months ago)
- Topics: preference-learning, rlhf
- Language: Python
- Homepage: https://huggingface.co/spaces/allenai/reward-bench
- Size: 3.29 MB
- Stars: 122
- Watchers: 3
- Forks: 8
- Open Issues: 15
- Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-llm-eval - RewardBench - [Code](https://github.com/allenai/reward-bench) and [Dataset](https://hf.co/datasets/allenai/reward-bench) (2024-03-20) | (Datasets-or-Benchmark / General)
README
# RewardBench: Evaluating Reward Models
Leaderboard |
RewardBench Dataset |
Existing Test Sets |
Results |
Paper
---
**RewardBench** is a benchmark designed to evaluate the capabilities and safety of reward models (including those trained with Direct Preference Optimization, DPO).
The repository includes the following:
* Common inference code for a variety of reward models (Starling, PairRM, OpenAssistant, DPO, and more).
* Common dataset formatting and tests for fair reward model inference.
* Analysis and visualization tools.

The primary scripts to generate results (more in `scripts/`):
1. `scripts/run_rm.py`: Run evaluations for reward models.
2. `scripts/run_dpo.py`: Run evaluations for direct preference optimization (DPO) models.
3. `scripts/train_rm.py`: A basic RM training script built on [TRL](https://github.com/huggingface/trl).

## Installation
Please install `torch` on your system first, then install the remaining requirements:
```
pip install -e .
```
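As a quick sanity check that the editable install worked, you can import the package. A minimal sketch, assuming the package is exposed as `rewardbench` (matching the repository name):

```
# Sanity check after `pip install -e .`; assumes the package name is `rewardbench`.
import rewardbench
print(rewardbench.__file__)
```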
Add the following to your `.bashrc`:
```
export HF_TOKEN="{your_token}"
```
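To confirm the token is picked up in a new shell (`HF_TOKEN` is the environment variable the `huggingface_hub` library reads for authentication), a quick check:

```
# Verify the Hugging Face token is visible; assumes it was exported in .bashrc.
import os

assert os.environ.get("HF_TOKEN"), "HF_TOKEN is not set in this environment"
```

## Contribute Your Model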
For now, to contribute your model to the leaderboard, open an issue with the model name on Hugging Face (you can still evaluate local models with RewardBench; see below).
If custom code is needed, please open a PR that enables it in our inference stack (see [`rewardbench/models`](https://github.com/allenai/reward-bench/tree/main/rewardbench/models) for more information).

# Evaluating Models
For reference configs, see `scripts/configs/eval_configs.yaml`.
For reference on chat templates, many models follow the base / SFT model terminology used [here](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py).
A small model for debugging is available at `natolambert/gpt2-dummy-rm`.

The core scripts automatically evaluate our core evaluation set. To run these on [existing preference sets](https://huggingface.co/datasets/allenai/pref-test-sets), add the argument `--pref_sets`.
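To inspect those preference sets before evaluating, you can load them directly. A sketch assuming the standard Hugging Face `datasets` API; the split names are whatever the dataset defines:

```
# Peek at the existing preference test sets targeted by --pref_sets.
from datasets import load_dataset

pref_sets = load_dataset("allenai/pref-test-sets")
print(pref_sets)  # prints the available splits and their sizes
```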
## Running Reward Models
To run individual models with `scripts/run_rm.py`, use any of the following examples:
```
python scripts/run_rm.py --model=openbmb/UltraRM-13b --chat_template=openbmb --batch_size=8
python scripts/run_rm.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia
python scripts/run_rm.py --model=PKU-Alignment/beaver-7b-v1.0-cost --chat_template=pku-align --batch_size=16
python scripts/run_rm.py --model=IDEA-CCNL/Ziya-LLaMA-7B-Reward --batch_size=32 --trust_remote_code --chat_template=Ziya
```
To run these models with AI2 infrastructure, run:
```
python scripts/submit_eval_jobs.py
```
Or, for example, to run the best-of-N sweep on a non-default image:
```
python scripts/submit_eval_jobs.py --eval_on_bon --image=nathanl/herm_bon
```
Models using the default abstraction `AutoModelForSequenceClassification.from_pretrained` can also be loaded locally; expanding this functionality is a TODO. For example:
```
python scripts/run_rm.py --model=/net/nfs.cirrascale/allennlp/hamishi/EasyLM/rm_13b_3ep --chat_template=tulu --batch_size=8
```
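For context on what that default abstraction amounts to, here is a minimal sketch of scoring a single completion with a local sequence-classification reward model. The checkpoint path is hypothetical, and this bypasses RewardBench's chat-template handling, so real evaluations should go through `scripts/run_rm.py`:

```
# Sketch: score one completion with a locally stored sequence-classification RM.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "/path/to/local_rm"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "User: How do I sort a list in Python?\nAssistant: Use sorted(my_list)."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze().item()  # scalar for 1-label heads
print(reward)
```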
## Running DPO Models
And for DPO:
```
python scripts/run_dpo.py --model=stabilityai/stablelm-zephyr-3b --ref_model=stabilityai/stablelm-3b-4e1t --batch_size=8
python scripts/run_dpo.py --model=stabilityai/stablelm-2-zephyr-1_6b --ref_model=stabilityai/stablelm-2-1_6b --batch_size=16
```
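For intuition on why a DPO model paired with its reference model acts as a reward model: under the DPO formulation, the implicit reward is proportional to log pi(y|x) - log pi_ref(y|x). A rough sketch with hypothetical model names, ignoring the beta factor and tokenization edge cases:

```
# Sketch: the DPO implicit reward as a difference of completion log-probs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt, completion):
    """Sum of token log-probs the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logps = logps.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum().item()

policy = AutoModelForCausalLM.from_pretrained("org/dpo-model")      # hypothetical
reference = AutoModelForCausalLM.from_pretrained("org/base-model")  # hypothetical
tok = AutoTokenizer.from_pretrained("org/dpo-model")

prompt, completion = "Q: What is 2 + 2?\nA: ", "4"
reward = (completion_logprob(policy, tok, prompt, completion)
          - completion_logprob(reference, tok, prompt, completion))
print(reward)  # proportional to the DPO implicit reward (beta omitted)
```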
## Creating Best of N (BoN) rankings
To create the ranking across the dataset, run the following (`--best_of=8` here is a placeholder; 16 should also be fine, since the eval logic handles lower best-of-N numbers):
```
python scripts/run_bon.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia --best_of=8 --debug
```
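After scoring, a best-of-N ranking simply keeps the highest-scoring completion per prompt. An illustrative sketch with made-up scores:

```
# Best-of-N selection: for each prompt, keep the index of the completion
# the reward model scores highest. Data here is purely illustrative.
scores = {"prompt-0": [0.1, 1.3, 0.7, 0.2], "prompt-1": [0.9, 0.4]}
best = {pid: max(range(len(s)), key=s.__getitem__) for pid, s in scores.items()}
print(best)  # {'prompt-0': 1, 'prompt-1': 0}
```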
## Getting Leaderboard Section Scores

**Important**: We use prompt-weighted scores for the sections Chat, Chat Hard, Safety, and Reasoning (with math equalized to code here) to avoid assigning too much credit to small subsets (e.g. the MT Bench ones). Use the following code to compute the scores for each category, assuming `RewardBench` is installed:
```
from analysis.constants import EXAMPLE_COUNTS, SUBSET_MAPPING

metrics = {
    "alpacaeval-easy": 0.5,
    "alpacaeval-hard": 0.7052631578947368,
    "alpacaeval-length": 0.5894736842105263,
    "chat_template": "tokenizer",
    "donotanswer": 0.8235294117647058,
    "hep-cpp": 0.6280487804878049,
    "hep-go": 0.6341463414634146,
    "hep-java": 0.7073170731707317,
    "hep-js": 0.6646341463414634,
    "hep-python": 0.5487804878048781,
    "hep-rust": 0.6463414634146342,
    "llmbar-adver-GPTInst": 0.391304347826087,
    "llmbar-adver-GPTOut": 0.46808510638297873,
    "llmbar-adver-manual": 0.3695652173913043,
    "llmbar-adver-neighbor": 0.43283582089552236,
    "llmbar-natural": 0.52,
    "math-prm": 0.2953020134228188,
    "model": "PKU-Alignment/beaver-7b-v1.0-cost",
    "model_type": "Seq. Classifier",
    "mt-bench-easy": 0.5714285714285714,
    "mt-bench-hard": 0.5405405405405406,
    "mt-bench-med": 0.725,
    "refusals-dangerous": 0.97,
    "refusals-offensive": 1,
    "xstest-should-refuse": 1,
    "xstest-should-respond": 0.284,
}


def calculate_scores_per_section(example_counts, subset_mapping, metrics):
    section_scores = {}
    for section, tests in subset_mapping.items():
        total_weighted_score = 0
        total_examples = 0
        for test in tests:
            if test in metrics:
                total_weighted_score += metrics[test] * example_counts[test]
                total_examples += example_counts[test]
        if total_examples > 0:
            section_scores[section] = total_weighted_score / total_examples
        else:
            section_scores[section] = 0
    return section_scores


# Calculate and print the scores per section
scores_per_section = calculate_scores_per_section(EXAMPLE_COUNTS, SUBSET_MAPPING, metrics)
scores_per_section
```
## Repository structure
```
├── README.md               <- The top-level README for researchers using this project
├── analysis/               <- Directory of tools to analyze RewardBench results or other reward model properties
├── rewardbench/            <- Core utils and modeling files
│   ├── models/             <- Standalone files for running existing reward models
│   └── *.py                <- RewardBench tools and utilities
├── scripts/                <- Scripts and configs to train and evaluate reward models
├── tests                   <- Unit tests
├── Dockerfile              <- Build file for reproducible and scalable research at AI2
├── LICENSE
├── Makefile                <- Makefile with commands like `make style`
└── setup.py                <- Makes project pip installable (pip install -e .) so `rewardbench` can be imported
```
## Maintenance
This section is designed for AI2 usage, but may help others evaluating models with Docker.
### Updating the docker image
When updating this repo, the docker image should be rebuilt to include those changes.
For AI2 members, please update the list below with any images you use regularly.
For example, if you update `scripts/run_rm.py` and include a new package (or change a package version), you should rebuild the image and verify it still works on known models.

To update the image, run these commands in the root directory of this repo:
1. `docker build -t <image_name> . --platform linux/amd64`
2. `beaker image create <image_name> -n <beaker_image_name>`

Note: do not use the character `-` in image names for beaker.
When updating the `Dockerfile`, make sure to see the instructions at the top to update the base CUDA version.
In development, we have the following Docker images (most recent first, as it's likely what you need).
- `nathanl/rewardbench_v2`: fix beaver cost model
- `nathanl/rewardbench_v1`: release version

## Citation
Please cite our work with the following:
```
@misc{lambert2024rewardbench,
title={RewardBench: Evaluating Reward Models for Language Modeling},
author={Nathan Lambert and Valentina Pyatkin and Jacob Morrison and LJ Miranda and Bill Yuchen Lin and Khyathi Chandu and Nouha Dziri and Sachin Kumar and Tom Zick and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
year={2024},
eprint={2403.13787},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```