https://github.com/openai/following-instructions-human-feedback

Host: GitHub
URL: https://github.com/openai/following-instructions-human-feedback
Owner: openai
Created: 2022-01-25T01:15:00.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-12-11T19:58:53.000Z (over 2 years ago)
Last Synced: 2025-05-09T21:46:50.493Z (about 2 months ago)
Size: 1020 KB
Stars: 1,220
Watchers: 136
Forks: 145
Open Issues: 3
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-ChatGPT-repositories - following-instructions-human-feedback - @rachel_l_woods this is my go-to paper for instructing gpt to not hallucinate (Others)

README

# InstructGPT: Training Language Models to Follow Instructions with Human Feedback

[Paper link][LINK_TO_PAPER]

> Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback (RLHF). We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
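
The ranking step described in the abstract can be made concrete with a small sketch. It is illustrative only: the function names, the output labels, and the scalar scores are hypothetical, and the pairwise logistic loss shown here is a common way to learn from comparisons, not something this README specifies.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of each pair
    is always the one the labeler ranked higher.
    """
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(scores, ranked_outputs):
    """Average -log(sigmoid(score_preferred - score_rejected)) over all pairs.

    `scores` maps each output to a scalar score from a hypothetical scoring
    model. The loss is small when the scores agree with the human ranking.
    """
    pairs = ranking_to_pairs(ranked_outputs)
    total = 0.0
    for preferred, rejected in pairs:
        diff = scores[preferred] - scores[rejected]
        # Numerically stable form of log(1 + exp(-diff)).
        if diff >= 0:
            total += math.log1p(math.exp(-diff))
        else:
            total += -diff + math.log1p(math.exp(diff))
    return total / len(pairs)


if __name__ == "__main__":
    ranking = ["output_a", "output_b", "output_c"]  # best to worst (made-up labels)
    agreeing = {"output_a": 2.0, "output_b": 1.0, "output_c": 0.0}
    reversed_scores = {"output_a": 0.0, "output_b": 1.0, "output_c": 2.0}
    print(pairwise_loss(agreeing, ranking))
    print(pairwise_loss(reversed_scores, ranking))
```

Scores that agree with the ranking give a lower loss than scores that reverse it, which is the signal a model trained on such comparisons would follow.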

## Contents
- [model-card.md](model-card.md) - InstructGPT model card
- [automatic-eval-samples](automatic-eval-samples/) - Samples from our models (both GPT-3 and InstructGPT) on public NLP benchmarks.
- [API distribution labeling instructions](https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit#) - Google doc of instructions given to contractors for final evaluations on our API prompt distribution.
- [Toxicity labeling instructions](https://docs.google.com/document/d/1d3n6AqNrd-SJEKm_etEo3rUwXxKG4evCbzfWExvcGxg/edit?usp=sharing) - Google doc of instructions given to contractors for labeling toxic outputs on the RealToxicityPrompts dataset.

[LINK_TO_PAPER]: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
