Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xrsrke/instructGOOSE
Implementation of Reinforcement Learning from Human Feedback (RLHF)
chatgpt human-feedback instructgpt reinforcement-learning rlhf
- Host: GitHub
- URL: https://github.com/xrsrke/instructGOOSE
- Owner: xrsrke
- License: mit
- Created: 2022-12-28T06:39:51.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-04-07T08:45:59.000Z (almost 2 years ago)
- Last Synced: 2024-04-26T02:22:29.546Z (9 months ago)
- Topics: chatgpt, human-feedback, instructgpt, reinforcement-learning, rlhf
- Language: Jupyter Notebook
- Homepage: https://xrsrke.github.io/instructGOOSE/
- Size: 3.31 MB
- Stars: 163
- Watchers: 4
- Forks: 20
- Open Issues: 3
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-human-in-the-loop - GitHub - xrsrke/instructGOOSE
- Awesome-Prompt-Engineering - GitHub
- awesome-prompt-engineering-zh-cn - GitHub
README
InstructGoose
================

[![tests](https://github.com/vwxyzjn/cleanrl/actions/workflows/tests.yaml/badge.svg)](https://github.com/xrsrke/instructGOOSE/actions/workflows/tests.yaml)
[![docs](https://img.shields.io/github/deployments/vwxyzjn/cleanrl/Production?label=docs&logo=vercel)](https://xrsrke.github.io/instructGOOSE/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HN1t75Jem9jXPzaOQdQO06OXMxinimNx?usp=sharing)

Paper: InstructGPT - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

![image.png](index_files/figure-commonmark/531cd4d9-1-image.png)
## Install
Install from PyPI
``` sh
pip install instruct-goose
```

Install directly from the source code
``` sh
git clone https://github.com/xrsrke/instructGOOSE.git
cd instructGOOSE
pip install -e .
```

### How to Train
**For the reward model**

Use 🤗 Accelerate to launch distributed training:
``` bash
accelerate config
accelerate launch scripts/train_reward.py
```
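The reward-model training script itself is not reproduced in this README. As a rough sketch of what a pairwise reward-model update typically looks like (the `chosen_texts`/`rejected_texts` inputs, the optimizer choice, and the ranking loss below are illustrative assumptions, not the contents of `scripts/train_reward.py`):

``` python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

from instruct_goose import RewardModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Assumed to return one scalar reward per sequence, as in the usage further below
reward_model = RewardModel("gpt2")
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def reward_step(chosen_texts, rejected_texts):
    # Tokenize the human-preferred and rejected completions for the same prompts
    chosen = tokenizer(chosen_texts, padding=True, truncation=True, return_tensors="pt")
    rejected = tokenizer(rejected_texts, padding=True, truncation=True, return_tensors="pt")

    r_chosen = reward_model(chosen["input_ids"])
    r_rejected = reward_model(rejected["input_ids"])

    # InstructGPT-style pairwise ranking loss: push chosen rewards above rejected ones
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```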
## Train the RL-based language model

``` python
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

import torch
from torch.utils.data import DataLoader, random_split
from torch import optim

from instruct_goose import Agent, RewardModel, RLHFTrainer, RLHFConfig, create_reference_model
```

**Step 1**: Load the dataset
``` python
dataset = load_dataset("imdb", split="train")
dataset, _ = random_split(dataset, lengths=[10, len(dataset) - 10]) # for demonstration purposes
train_dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
```

    Found cached dataset imdb (/Users/education/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
**Step 2**: Load the pre-trained model and tokenizer
``` python
model_base = AutoModelForCausalLM.from_pretrained("gpt2") # for demonstration purposes
reward_model = RewardModel("gpt2")

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
eos_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
```

**Step 3**: Create the RL-based language model agent and the reference model

``` python
model = Agent(model_base)
ref_model = create_reference_model(model)
```

**Step 4**: Train it
``` python
max_new_tokens = 20
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": max_new_tokens
}

config = RLHFConfig()
N_EPOCH = 1 # for demonstration purposes
trainer = RLHFTrainer(model, ref_model, config)
optimizer = optim.SGD(model.parameters(), lr=1e-3)
```

``` python
for epoch in range(N_EPOCH):
    for batch in train_dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt")
        response_ids = model.generate(
            inputs["input_ids"], attention_mask=inputs["attention_mask"],
            **generation_kwargs
        )

        # extract the generated text
        response_ids = response_ids[:, -max_new_tokens:]
        response_attention_mask = torch.ones_like(response_ids)

        # evaluate from the reward model
        with torch.no_grad():
            text_input_ids = torch.stack([torch.concat([q, r]) for q, r in zip(inputs["input_ids"], response_ids)], dim=0)
            rewards = reward_model(text_input_ids)

        # calculate PPO loss
        loss = trainer.compute_loss(
            query_ids=inputs["input_ids"],
            query_attention_mask=inputs["attention_mask"],
            response_ids=response_ids,
            response_attention_mask=response_attention_mask,
            rewards=rewards
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"loss={loss}")
```

    loss=-824.6560668945312
    loss=0.030958056449890137
    loss=4.284017562866211

## TODO
- Add support for custom reward functions
- Add support for custom value functions
- Add support for non-transformer models
- Write config class

✅ Distributed training using 🤗 Accelerate
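For the 🤗 Accelerate item above, here is a minimal sketch of how the single-process Step 4 loop could be wrapped for distributed training. It reuses the names defined earlier in this README, and `compute_ppo_loss_for` is a hypothetical stand-in for the generate / reward / `trainer.compute_loss` steps, not a function provided by `instruct_goose`:

``` python
from accelerate import Accelerator

accelerator = Accelerator()

# Let Accelerate handle device placement and gradient synchronization
# for the objects created in the Step 4 example above.
model, reward_model, optimizer, train_dataloader = accelerator.prepare(
    model, reward_model, optimizer, train_dataloader
)

for epoch in range(N_EPOCH):
    for batch in train_dataloader:
        # Hypothetical helper standing in for the Step 4 loop body
        loss = compute_ppo_loss_for(batch)

        optimizer.zero_grad()
        accelerator.backward(loss)  # use instead of loss.backward()
        optimizer.step()
```

Such a script would be launched the same way as the reward-model script, i.e. `accelerate config` followed by `accelerate launch`.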
## Resources
I implemented this using the following resources:
- Copied the
[`load_yaml`](https://xrsrke.github.io/instructGOOSE/utils.html#load_yaml)
function from https://github.com/Dahoas/reward-modeling
- How to build a dataset to train the reward model:
  https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2
- How to add value head in PPO agent: https://github.com/lvwerra/trl
- How to calculate the loss of the PPO agent (a rough sketch follows this list):
  https://github.com/lvwerra/trl/blob/main/trl/trainer/ppo_trainer.py
- How to use PPO to train an RLHF agent: https://github.com/voidful/TextRL
- How PPO works:
https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py
- Copied the computation of `advantages` and `returns` from `TRL`:
  https://github.com/lvwerra/trl/blob/d2e8bcf8373726fb92d2110c500f7df6d0bd566d/trl/trainer/ppo_trainer.py#L686
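For readers who want the gist of the clipped PPO objective referenced above, here is a minimal, generic sketch; it is not the actual `RLHFTrainer.compute_loss` implementation, and the function name and arguments are illustrative:

``` python
import torch

def clipped_ppo_policy_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Generic clipped PPO surrogate loss over per-token log-probabilities."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(logprobs - old_logprobs)

    # Unclipped and clipped surrogate objectives (negated, since we minimize)
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)

    # Take the pessimistic (larger) of the two losses, averaged over tokens
    return torch.max(unclipped, clipped).mean()
```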