Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Code for the paper Fine-Tuning Language Models from Human Preferences
https://github.com/openai/lm-human-preferences
- Host: GitHub
- URL: https://github.com/openai/lm-human-preferences
- Owner: openai
- License: mit
- Created: 2019-09-14T06:15:53.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-07-25T16:48:57.000Z (over 1 year ago)
- Last Synced: 2024-09-27T06:21:14.317Z (about 2 months ago)
- Language: Python
- Homepage: https://openai.com/blog/fine-tuning-gpt-2/
- Size: 94.7 KB
- Stars: 1,207
- Watchers: 24
- Forks: 163
- Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE
README
**Status:** Archive (code is provided as-is, no updates expected)
**Status:** All references to `gs://lm-human-preferences/` were updated to `https://openaipublic.blob.core.windows.net/lm-human-preferences`, as we migrated from GCP to Azure. The code, provided as-is, may no longer work. Pull requests welcome.
# lm-human-preferences
This repository contains code for the paper [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/abs/1909.08593). See also our [blog post](https://openai.com/blog/fine-tuning-gpt-2/).
We provide code for:
- Training reward models from human labels
- Fine-tuning language models using those reward models

It does not contain code for generating labels. However, we have released the human labels collected for our experiments, at `gs://lm-human-preferences/labels`.
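To fetch those labels, one option is `gsutil` (installation is covered in the instructions below); this is only a sketch, and the local destination path is arbitrary:
```
# List the released label files
gsutil ls gs://lm-human-preferences/labels

# Copy them locally for inspection (destination path is an arbitrary example)
gsutil -m cp -r gs://lm-human-preferences/labels /tmp/lm-human-preferences-labels
```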
For those interested, the question and label schemas are simple and documented in [`label_types.py`](./lm_human_preferences/label_types.py).

The code has only been tested using the smallest GPT-2 model (124M parameters).
## Instructions
This code has only been tested using Python 3.7.3. Training has been tested on GCE machines with 8 V100s, running Ubuntu 16.04, but development also works on Mac OS X.
### Installation
- Install [pipenv](https://github.com/pypa/pipenv#installation).
- Install [tensorflow](https://www.tensorflow.org/install/gpu): Install CUDA 10.0 and cuDNN 7.6.2, then `pipenv install tensorflow-gpu==1.13.1`. The code may technically run with tensorflow on CPU but will be very slow.
- Install [`gsutil`](https://cloud.google.com/storage/docs/gsutil_install).
- Clone this repo. Then:
```
pipenv install
```
- (Recommended) Install [`horovod`](https://github.com/horovod/horovod#install) to speed up the code, or otherwise substitute some fast implementation in the `mpi_allreduce_sum` function of [`core.py`](./lm_human_preferences/utils/core.py). Make sure to use pipenv for the install, e.g. `pipenv install horovod==0.18.1`. A consolidated sketch of these steps appears after this list.
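Putting the installation steps together, a minimal setup might look like the following sketch (it assumes CUDA 10.0 and cuDNN 7.6.2 are already installed and that you are in the repo root):
```
# Pinned TensorFlow GPU build, installed into the project's pipenv environment
pipenv install tensorflow-gpu==1.13.1

# Remaining dependencies from the Pipfile
pipenv install

# Optional but recommended: horovod for fast multi-GPU allreduce
pipenv install horovod==0.18.1
```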
### Running
The following examples assume we are aiming to train a model to continue text in a physically descriptive way.
You can read [`launch.py`](./launch.py) to see how the `descriptiveness` experiments and others are defined.

Note that we provide pre-trained models, so you can skip directly to RL fine-tuning or even to sampling from a trained policy, if desired.
#### Training a reward model
To train a reward model, use a command such as
```
experiment=descriptiveness
reward_experiment_name=testdesc-$(date +%y%m%d%H%M)
pipenv run ./launch.py train_reward $experiment $reward_experiment_name
```

This will save outputs (and TensorBoard event files) to the directory `/tmp/save/train_reward/$reward_experiment_name`. The directory can be changed via the `--save_dir` flag.
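To monitor a run, one option is to point TensorBoard (which ships with TensorFlow 1.13) at that directory; a quick sketch:
```
pipenv run tensorboard --logdir /tmp/save/train_reward/$reward_experiment_name
```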
#### Finetuning a language model
Once you have trained a reward model, you can finetune against it.
First, set
```
trained_reward_model=/tmp/save/train_reward/$reward_experiment_name
```
or if using our pretrained model,
```
trained_reward_model=gs://lm-human-preferences/runs/descriptiveness/reward_model
```

Then,
```
experiment=descriptiveness
policy_experiment_name=testdesc-$(date +%y%m%d%H%M)
pipenv run ./launch.py train_policy $experiment $policy_experiment_name --rewards.trained_model $trained_reward_model --rewards.train_new_model 'off'
```

This will save outputs (and TensorBoard event files) to the directory `/tmp/save/train_policy/$policy_experiment_name`. The directory can be changed via the `--save_dir` flag.
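For example, to write fine-tuning outputs somewhere other than `/tmp/save`, a sketch using the `--save_dir` flag (the destination path here is only an illustration):
```
pipenv run ./launch.py train_policy $experiment $policy_experiment_name \
  --rewards.trained_model $trained_reward_model \
  --rewards.train_new_model 'off' \
  --save_dir /data/lm-human-preferences/save/train_policy
```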
#### Both steps at once
You can run a single command to train a reward model and then finetune against it:
```
experiment=descriptiveness
experiment_name=testdesc-$(date +%y%m%d%H%M)
pipenv run ./launch.py train_policy $experiment $experiment_name
```

In this case, outputs are in the directory `/tmp/save/train_policy/$experiment_name`, and the reward model is saved to the subdirectory `reward_model`. The directory can be changed via the `--save_dir` flag.
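Once the combined run finishes, you can confirm that the reward model checkpoint sits alongside the policy outputs (paths assume the default `--save_dir`):
```
ls /tmp/save/train_policy/$experiment_name
ls /tmp/save/train_policy/$experiment_name/reward_model
```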
#### Sampling from a trained policy
Specify the policy to load:
```
save_dir=/tmp/save/train_policy/$policy_experiment_name
```
or if using our pretrained model,
```
save_dir=gs://lm-human-preferences/runs/descriptiveness
```

Then run:
```
pipenv run ./sample.py sample --save_dir $save_dir --savescope policy
```

Note that this script can run on fewer than 8 GPUs. You can pass the flag `--mpi 1`, for example, if you only have one GPU.
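For example, to sample from the released descriptiveness policy on a single GPU, a sketch combining the options above:
```
save_dir=gs://lm-human-preferences/runs/descriptiveness
pipenv run ./sample.py sample --save_dir $save_dir --savescope policy --mpi 1
```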
## LICENSE
[MIT](./LICENSE)
## Citation
Please cite the paper with the following bibtex entry:
```
@article{ziegler2019finetuning,
title={Fine-Tuning Language Models from Human Preferences},
author={Ziegler, Daniel M. and Stiennon, Nisan and Wu, Jeffrey and Brown, Tom B. and Radford, Alec and Amodei, Dario and Christiano, Paul and Irving, Geoffrey},
journal={arXiv preprint arXiv:1909.08593},
url={https://arxiv.org/abs/1909.08593},
year={2019}
}
```