Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-rlhf
An index of algorithms for reinforcement learning from human feedback (RLHF)
https://github.com/louieworth/awesome-rlhf
Last synced: 3 days ago
Papers
RLHF for LLMs: Theory / Methods
- Supervised Fine-Tuning as Inverse Reinforcement Learning
- Policy Optimization in RLHF: The Impact of Out-of-preference Data
- Statistical Rejection Sampling Improves Preference Optimization
- A Minimaximalist Approach to Reinforcement Learning from Human Feedback
- Preference as Reward, Maximum Preference Optimization with Importance Sampling
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles
- Aligning Large Language Models with Human Preferences through Representation Engineering
- The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback
- Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration
- On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models
- Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders
- SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
- Nash Learning from Human Feedback
- Adversarial Preference Optimization
- Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
- Fake Alignment: Are LLMs Really Aligned Well?
- Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment
- Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
- Is RLHF More Difficult than Standard RL?
- A General Theoretical Paradigm to Understand Learning from Human Preferences
- COPF: Continual Learning Human Preference through Optimal Policy Fitting
- SuperHF: Supervised Iterative Learning from Human Feedback
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis
- Entangled Preferences: The History and Risks of Reinforcement Learning and Human Feedback
- Group Preference Optimization: Few-Shot Alignment of Large Language Models
- Safe RLHF: Safe Reinforcement Learning from Human Feedback
- ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- Stabilizing RLHF through Advantage Model and Selective Rehearsal
- Beyond One-Preference-for-All: Multi-Objective Direct Preference Optimization for Language Models
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment
- Understanding the Effects of RLHF on LLM Generalisation and Diversity
- SALMON: Self-Alignment with Principle-Following Reward Models
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- Preference Ranking Optimization for Human Alignment
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears
- Reward Model Ensembles Help Mitigate Overoptimization
- Learning Optimal Advantage from Preferences and Mistaking it for Reward
- Enable Language Models to Implicitly Learn Self-Improvement From Data
- The Trickle-down Impact of Reward (In-)consistency on RLHF
- Aligning Language Models with Offline Reinforcement Learning from Human Feedback
- Human Feedback is not Gold Standard
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment
- Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons
- Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding
- Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF
- RAIN: Your Language Models Can Align Themselves without Finetuning
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
- Reinforced Self-Training (ReST) for Language Modeling
- Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
- Let Me Teach You: Pedagogical Foundations of Feedback for Language Models
- Generalized Knowledge Distillation for Auto-regressive Language Models
- Secrets of RLHF in Large Language Models Part I: PPO
- Learning to Generate Better Than Your LLM
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
- Continually Improving Extractive QA via Human Feedback
- SLiC-HF: Sequence Likelihood Calibration with Human Feedback
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers
- Direct Preference Optimization with an Offset
- Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation
- Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction
- Bayesian Preference Elicitation with Language Models
- Learn Your Reference Model for Real Good Alignment
- Dataset Reset Policy Optimization for RLHF
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- CPPO: Continual Learning for Reinforcement Learning with Human Feedback
- KTO: Model Alignment as Prospect Theoretic Optimization
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
- Human Alignment of Large Language Models through Online Preference Optimisation
- Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
- Self-Rewarding Language Models
- Improving Language Models with Advantage-based Offline Policy Gradients
Review/Survey
- AI Alignment: A Comprehensive Survey
- Aligning Large Language Models with Human: A Survey
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
RLHF for Other Domains
- Contrastive Preference Learning: Learning from Human Feedback without RL
- Beyond Reward: Offline Preference-guided Policy Optimization
- PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
- Shepherd: A Critic for Language Model Generation
- Reinforcement Learning with Human Feedback for Realistic Traffic Simulation
- Aligning Large Multimodal Models with Factually Augmented RLHF
- Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Datasets
Open Source Software/Implementations
Uncategorized
- Li Jiang
- RLHF papers
- Hugging Face blog - RLHF from OpendiLab.
Blogs/Talks/Reports
Blogs
Talks
Reports