# Code for Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
## Setup
Install dependencies:
```
pip install -r requirements.txt
```

You can create the dataset that we use with `python relabel_with_rm.py --configs configs/relabel_rm.yml`, or just use the dataset available on the Hugging Face Hub.
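If you do build the dataset yourself, the relabeling idea is roughly: score each completion in a preference pair with a reward model and mark the higher-scoring one as chosen. A minimal sketch of that idea, where the model id and column names are placeholders rather than what `relabel_with_rm.py` actually uses:

```python
# Sketch of reward-model relabeling. The model id and column names below
# are placeholders; see relabel_with_rm.py and configs/relabel_rm.yml.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "placeholder/reward-model"  # hypothetical reward-model checkpoint
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

@torch.no_grad()
def score(prompt: str, completion: str) -> float:
    # Assumes a single-label (num_labels=1) reward model.
    inputs = tokenizer(prompt + completion, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0, 0].item()

def relabel(example: dict) -> dict:
    a, b = example["completion_a"], example["completion_b"]  # placeholder columns
    chosen_first = score(example["prompt"], a) >= score(example["prompt"], b)
    example["chosen"], example["rejected"] = (a, b) if chosen_first else (b, a)
    return example
```

With a `datasets.Dataset`, something like this could be applied via `dataset.map(relabel)`.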
## Train
Each algorithm is run as a command with a config file; command-line arguments override values in the config:
```
python online_dpo.py --config configs/onlinedpo_pythia410m_tldr.yml --override_arg override_value
```
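As a rough illustration of this pattern (not necessarily how these scripts actually parse arguments), a config can be loaded first and any remaining command-line pairs applied on top:

```python
# Illustrative sketch of "config file + command-line overrides";
# the repo's real argument parsing may differ.
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
args, overrides = parser.parse_known_args()

with open(args.config) as f:
    cfg = yaml.safe_load(f)

# Treat leftover args as `--key value` pairs that override config entries.
# (Values stay strings here; the `--key=value` form would need extra handling.)
for key, value in zip(overrides[::2], overrides[1::2]):
    cfg[key.lstrip("-")] = value

print(cfg)
```

Running this sketch with `--config configs/onlinedpo_pythia410m_tldr.yml --learning_rate 1e-5` would print the config with `learning_rate` overridden.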
## Gold Eval
To evaluate with the gold model, we first generate completions with the trained model:
```
python generate_for_eval.py --config configs/generate_tldr.yml --model_name_or_path PATH_TO_MODEL_CHECKPOINTS
```

Then we load the generated completions and evaluate them with our gold model:
```
python load_and_eval.py --config configs/evaluate_tldr.yml --model_name_or_path PATH_TO_MODEL_CHECKPOINTS
```

By default, generations are saved in `PATH_TO_MODEL_CHECKPOINTS/_generations`, but you can save them elsewhere with `--dataset_path`.
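One common way to summarize such a gold-model eval (hedged: `load_and_eval.py` may report something different) is a win rate, i.e. the fraction of prompts where the policy's completion outscores the reference completion under the gold model:

```python
# Illustrative win-rate computation over gold reward-model scores.
# The score lists are placeholders; see load_and_eval.py for what the
# script actually computes and reports.
def win_rate(policy_scores: list[float], reference_scores: list[float]) -> float:
    """Fraction of prompts where the policy's completion outscores the reference."""
    wins = sum(p > r for p, r in zip(policy_scores, reference_scores))
    return wins / len(policy_scores)

print(win_rate([1.2, 0.3, 2.0], [0.5, 0.9, 1.0]))  # 2 wins out of 3 -> ~0.67
```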
## Slurm scripts
To make things easier, I provide Slurm scripts for training, generation, and eval all-in-one:
```bash
./train_generate_and_eval.sh command
```

For example, to run online DPO:
```bash
./train_generate_and_eval.sh python online_dpo.py --config configs/onlinedpo_pythia410m_tldr.yml --override_arg=override_value
```

Note: to facilitate passing the output directory to the eval scripts, the train script creates a symlink to the output dir called `output_dir`.
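In Python terms, that symlink step amounts to roughly the following (paths are placeholders; the actual scripts do this in shell):

```python
# Illustrative sketch of the output_dir symlink; paths are placeholders.
import os

run_output_dir = "results/onlinedpo_pythia410m_tldr"  # the run's real output dir
link_name = "output_dir"

if os.path.islink(link_name):
    os.remove(link_name)             # replace a stale link from a previous run
os.symlink(run_output_dir, link_name)  # eval scripts can now follow output_dir
```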
## Multi-GPU training and eval
By default, the script is single-GPU. For multi-GPU, there is an annoying issue with the vllm inference used in `generate_for_eval.py`.
In order to do evaluation, I need to load and unload vllm models. This is complicated (see https://github.com/vllm-project/vllm/issues/1908). The main solution I've found is to use `vllm==0.4.0post1`, but this sometimes makes the environment dependencies angry.
So the workaround is to either use two separate environments (one for training, one for eval) or just run training and eval separately. I suggest the latter: run `train.sh`, then run `generate_and_eval.sh` on a single GPU.

Multi-GPU training with 4 GPUs (3 for training, 1 for generation with vllm):
```
./train.sh accelerate launch --config_file configs/deepspeed_zero2.yaml --mixed_precision bf16 --num_processes 3 online_dpo.py --config configs/onlinedpo_pythia2.8b_tldr_vllm_bf16_4gpu.yml --output_dir onlinedpo_pythia2.8b_multigpu
```

Single-GPU eval:
```
./generate_and_eval.sh --model_name_or_path results/onlinedpo_pythia2.8b_multigpu --torch_dtype bfloat16
```

## Configs
All hyperparameters and model names are in corresponding configs in the `configs/` folder:
- `python online_dpo.py --config configs/onlinedpo_*`
- `python ppo.py --config configs/ppo_*`
- `python rloo.py --config configs/rloo_*`
- `python rloo.py --config configs/bo2_*` for best-of-2 finetuning

If you want to create the SFT and reward models yourself instead of using mine from the Hugging Face Hub:
- `python sft.py --config configs/sft*`
- `python reward_modeling.py --config configs/rm*`

Notes:
- configs with `vllm` use vllm to generate; otherwise they use Hugging Face `generate`
- single-GPU `vllm` configs place an extra vllm model on the GPU for generation; this uses more memory but can be worth it
- 4-GPU vllm configs assume 3 GPUs for training and 1 for vllm generation; adjust batch sizes if you do things differently
- if you're using older GPUs without bf16, add args `--fp16 --bf16 False --torch_dtype float16`
- `--wandb_run_id` sets the wandb run ID, or, if set to `slurm` (i.e. `--wandb_run_id=slurm`), it will default to `parent folder / slurm_id / output_dir`

## Asynchronous Secret Sauce
The asynchronous learning relies on two pieces of code:
1. `src/vllm_utils.py`, which tricks vllm into running on only the GPUs we tell it to
2. `src/online_dpo_vllm_trainer.py`, which runs training and vllm inference in separate threads and passes data between them using queues (a toy sketch follows below)

The majority of the asynchronous parts of the code were written by [Costa](https://github.com/vwxyzjn), but commits were lost when copying and patching, so I want to give credit where it's due.
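A minimal, self-contained sketch of that thread-and-queue pattern, with dummy stand-ins for training and generation (illustrative only; the real implementation is in `src/online_dpo_vllm_trainer.py`):

```python
# Toy sketch of the async pattern: a generator thread keeps producing
# rollouts with the newest weights it has received, while the trainer
# thread consumes them, so training data can be one policy update stale.
# Illustrative only; the real code is in src/online_dpo_vllm_trainer.py.
import queue
import threading

param_q: queue.Queue = queue.Queue()             # trainer -> generator: weights
rollout_q: queue.Queue = queue.Queue(maxsize=1)  # generator -> trainer: batches

def generator_loop(num_batches: int) -> None:
    """Stand-in for the vllm inference thread."""
    weights = param_q.get()                   # initial policy
    for _ in range(num_batches):
        try:
            weights = param_q.get_nowait()    # adopt newer weights if published
        except queue.Empty:
            pass                              # otherwise keep the stale ones
        rollout_q.put(f"batch from policy v{weights}")

def trainer_loop(num_steps: int) -> None:
    """Stand-in for the training thread."""
    version = 0
    param_q.put(version)
    for step in range(num_steps):
        batch = rollout_q.get()               # may lag the trainer by one update
        print(f"train step {step}: policy v{version}, data = {batch}")
        version += 1                          # "gradient update"
        param_q.put(version)                  # publish new weights

gen = threading.Thread(target=generator_loop, args=(3,))
gen.start()
trainer_loop(3)
gen.join()
```

Because the generator keeps producing with whatever weights it has last received, the trainer ends up learning from data that is about one update stale, which is the off-policy aspect in the paper's title.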
## Citation
```
@misc{noukhovitch_asynchronous_2024,
  title = {Asynchronous {RLHF}: {Faster} and {More} {Efficient} {Off}-{Policy} {RL} for {Language} {Models}},
  shorttitle = {Asynchronous {RLHF}},
  url = {http://arxiv.org/abs/2410.18252},
  publisher = {arXiv},
  author = {Noukhovitch, Michael and Huang, Shengyi and Xhonneux, Sophie and Hosseini, Arian and Agarwal, Rishabh and Courville, Aaron},
  month = oct,
  year = {2024},
}
```