awesome-RLHF
A curated list of reinforcement learning with human feedback resources (continually updated)
https://github.com/opendilab/awesome-RLHF
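Most entries below revolve around the same preference-modeling core: fit a reward model to pairwise human comparisons, then optimize a policy against it. As a quick orientation (not drawn from any particular entry in this list), here is a minimal sketch of the pairwise Bradley-Terry reward-model loss; the toy `reward_model` network, feature shapes, and names are illustrative assumptions only.

```python
# Minimal illustrative sketch (not from any listed paper) of the pairwise
# Bradley-Terry reward-model loss at the heart of standard RLHF pipelines.
# The toy `reward_model` and the random feature tensors are assumptions
# used only to make the example runnable.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a real preference reward model (normally an LLM with a scalar head).
reward_model = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))

def preference_loss(chosen_feats: torch.Tensor, rejected_feats: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(chosen) - r(rejected)), averaged over the batch."""
    r_chosen = reward_model(chosen_feats).squeeze(-1)      # (batch,)
    r_rejected = reward_model(rejected_feats).squeeze(-1)  # (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random "features" standing in for encoded (prompt, response) pairs.
loss = preference_loss(torch.randn(8, 16), torch.randn(8, 16))
loss.backward()
```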
Papers
2024
- On Diversified Preferences of Large Language Model Alignment
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference (official code)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
- Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases
- RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
- Training Diffusion Models with Reinforcement Learning (official code)
- AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model (official code)
- Dense Reward for Free in Reinforcement Learning from Human Feedback
- Transforming and Combining Rewards for Aligning Large Language Models
- HybridFlow: A Flexible and Efficient RLHF Framework (official code)
- A Theoretical Analysis of Nash Learning from Human Feedback under General KL-Regularized Preference
- Mitigating the Alignment Tax of RLHF (official code)
- MaxMin-RLHF: Towards equitable alignment of large language models with diverse human preferences
- Dataset Reset Policy Optimization for RLHF (official code)
- Aligning Crowd Feedback via Distributional Preference Reward Modeling
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
- Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment
- ALaRM: Align Language Models via Hierarchical Rewards Modeling (official code)
- TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
- Aligning Large Multimodal Models with Factually Augmented RLHF (official code)
- Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation (official code)
- Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards (official code)
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint (official code)
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
- Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF (official code)
- A Minimaximalist Approach to Reinforcement Learning from Human Feedback (official code)
- RLHF Workflow: From Reward Modeling to Online RLHF (official code)
- Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback (official code)
- REvolve: Reward Evolution with Large Language Models using Human Feedback
- MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions (official code)
- Reward Modeling with Ordinal Feedback: Wisdom of the Crowd (official code)
- Aligning Few-Step Diffusion Models with Dense Reward Difference Learning (official code)
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
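Several of the 2024 entries above (multi-objective DPO, dense rewards for DPO, directional preference alignment) build on Direct Preference Optimization. For context, a minimal sketch of the basic DPO objective follows; the per-sequence log-probabilities, the `beta` value, and the toy tensors are illustrative assumptions rather than any specific paper's implementation.

```python
# Minimal illustrative sketch of the standard DPO loss:
# -log sigmoid(beta * [(log pi - log pi_ref)(chosen) - (log pi - log pi_ref)(rejected)]).
# Inputs are per-sequence log-probabilities (summed over response tokens),
# assumed to be precomputed; the values below are made up for the toy usage.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
```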
2023
- Adversarial Preference Optimization
- Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration
- Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing
- A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift
- Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language
- Let's Reinforce Step by Step
- Direct Preference-based Policy Optimization without Reward Modeling
- AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model
- Eureka: Human-Level Reward Design via Coding Large Language Models
- Safe RLHF: Safe Reinforcement Learning from Human Feedback
- Quality Diversity through Human Feedback
- ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- Tuning computer vision models with task rewards
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers
- Language Instructed Reinforcement Learning for Human-AI Coordination
- Aligning Language Models with Offline Reinforcement Learning from Human Feedback
- Preference Ranking Optimization for Human Alignment (official code)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
- GPT-4 Technical Report
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears
- Few-shot Preference Learning for Human-in-the-Loop RL
- Better Aligning Text-to-Image Models with Human Preference
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (dataset: COCO)
- Aligning Text-to-Image Models using Human Feedback
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Pretraining Language Models with Human Preferences
- Aligning Language Models with Preferences through f-divergence Minimization (f-DPG)
- Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
- The Capacity for Moral Self-Correction in Large Language Models
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
- Inverse Preference Learning: Preference-based RL without a Reward Function
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
- Preference-grounded Token-level Guidance for Language Model Fine-tuning
- Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-oriented Dialogue Systems
2022
- Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
- Scaling Laws for Reward Model Overoptimization
- Improving alignment of dialogue agents via targeted human judgements
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning
- Quark: Controllable Text Generation with Reinforced Unlearning
- Datasets: WRITINGPROMPTS, [SST-2](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), [WIKITEXT-103](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Datasets: TriviaQA, OpenBookQA, [LAMBADA](https://zenodo.org/record/2630551#.Y_KLJ-yZNhF), [HumanEval](https://github.com/openai/human-eval), [MMLU](https://github.com/hendrycks/test), [TruthfulQA](https://github.com/sylinrl/TruthfulQA)
- Teaching language models to support answers with verified quotes
- Training language models to follow instructions with human feedback
- Constitutional AI: Harmlessness from AI Feedback
- Discovering Language Model Behaviors with Model-Written Evaluations
- Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning
2021
2020 and before
- Learning to summarize from human feedback
- Fine-Tuning Language Models from Human Preferences
- Scalable agent alignment via reward modeling: a research direction
- Reward learning from human preferences and demonstrations in Atari
- Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
- Deep reinforcement learning from human preferences
- Interactive Learning from Policy-Dependent Human Feedback
Detailed Explanation
- Visit this link for an enhanced paper reading experience.
2025
Blogs
2020 and before
Dataset
Codebases
2020 and before
- enwik8
- DeepSpeed-Chat
- FG-RLHF
- Datasets: IMDB, CNN/DailyMail, [ToTTo](https://github.com/google-research-datasets/ToTTo), [WMT-16 (en-de)](https://www.statmt.org/wmt16/it-translation-task.html), [NarrativeQA](https://github.com/deepmind/narrativeqa), [DailyDialog](http://yanran.li/dailydialog)
- Datasets: TL;DR, CNN/DailyMail
Keywords
rlhf (5), large-language-models (3), multimodal (2), alignment (2), dense-reward-for-direct-preference-optimization (1), preference-alignment (1), text-to-image-generation (1), deep-learning (1), fine-tuning (1), self-play (1), diffusion-models (1), human-feedback (1), reinforcement-learning (1), stable-diffusion (1), text-to-image (1), scalable-oversight (1), ai-alignment (1), chatbot (1), gpt-4 (1), llama (1), multi-modality (1), rlhf-v (1), visual-language-learning (1), llama3 (1), llm (1), chameleon (1), dpo (1), vision-language-model (1)