https://github.com/rkinas/rlhf_thinking_model

This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.
Host: GitHub
URL: https://github.com/rkinas/rlhf_thinking_model
Owner: rkinas
Created: 2025-02-17T18:17:41.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-23T20:24:33.000Z (about 1 year ago)
Last Synced: 2025-04-02T06:11:09.522Z (about 1 year ago)
Topics: llm, rl, rlhf
Language: Python
Homepage:
Size: 233 KB
Stars: 91
Watchers: 8
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# **Thinking Model and RLHF Research Notes**

This repository serves as a collection of research notes and resources on **training large language models (LLMs)** and **Reinforcement Learning from Human Feedback (RLHF)**. It focuses on the latest research, methodologies, and techniques for fine-tuning language models.  

## **Repository Contents**  

### **Reinforcement Learning and RLHF Overview**  

A curated list of materials providing an introduction to RL and RLHF:  

- Research papers and books covering key concepts in reinforcement learning.  

- Video lectures explaining the fundamentals of RLHF.  

### **Methods for LLM Training**  

An extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:  

- Key techniques such as PPO, DPO, KTO, ORPO, and more.  

- The latest ArXiv publications and publicly available implementations.  

- Analysis of effectiveness across different optimization strategies.  

## **Purpose of this Repository**  

This repository is designed as a reference for researchers and engineers working on **reinforcement learning and large language models**. If you're interested in **model alignment**, **experiments with DPO and its variants**, or **alternative RL-based methods**, you will find valuable resources here.  

## RL overview

- [Reinforcement Learning: An Overview](https://arxiv.org/pdf/2412.05265)

- [A COMPREHENSIVE SURVEY OF LLM ALIGNMENT TECHNIQUES: RLHF, RLAIF, PPO, DPO AND MORE](https://arxiv.org/pdf/2407.16216)

- [Book-Mathematical-Foundation-of-Reinforcement-Learning](https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning)

- [The FASTEST introduction to Reinforcement Learning on the internet](https://www.youtube.com/watch?v=VnpRp7ZglfA)

- [rlhf-book](https://github.com/natolambert/rlhf-book)

- [Notes on reinforcement learning](https://newfacade.github.io/notes-on-reinforcement-learning/01-intro.html)

## Methods for LLM training

- [PPO - Proximal Policy Optimization Algorithm - OpenAI](https://arxiv.org/pdf/1707.06347)

- [DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Stanford](https://arxiv.org/pdf/2305.18290)

- [online DPO]()

- [KTO - KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/pdf/2402.01306)

- [SimPO - Simple Preference Optimization with a Reference-Free Reward - Princeton](https://arxiv.org/pdf/2405.14734v1)

- [ORPO - Monolithic Preference Optimization without Reference Model - KAIST AI](https://arxiv.org/pdf/2403.07691v2)

- [Sample Efficient Reinforcement Learning with REINFORCE](https://arxiv.org/pdf/2010.11364)

- [REINFORCE++](https://arxiv.org/pdf/2501.03262v1)

- [RPO - Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/pdf/2501.03262v1)

- [RLOO - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs](https://arxiv.org/pdf/2402.14740) 

- [GRPO](https://arxiv.org/pdf/2402.03300)

- [ReMax - A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models](https://arxiv.org/pdf/2310.10505)

- [DPOP - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive](https://arxiv.org/abs/2402.13228)

- [BCO - Binary Classifier Optimization for Large Language Model Alignment](https://arxiv.org/pdf/2404.04656v1)

## Minimal implementation

| Method |
|--------|
| [DPO](https://github.com/rkinas/rlhf_thinking_model/blob/main/minimal_implementation/dpo_trainer.py) |
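
For orientation, the objective from the DPO paper listed above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code in `dpo_trainer.py`: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example. It assumes each input is the summed log-probability of a whole response under the trained policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are hypothetical, chosen for illustration.
    """
    # Log-ratio of policy to reference for each response in the pair.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the chosen and rejected implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss equals log(2) when the policy matches the reference and falls
# as the policy shifts probability toward the chosen response.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-5.0, -12.0, -8.0, -9.0)
print(baseline, improved)
```

In practice the log-probabilities would come from batched model forward passes and the loss would be averaged over pairs; the scalar version is kept here only to make the formula easy to read.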

## Tutorials

Notes for learning RL: Value Iteration -> Q Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO

- [CS234: Reinforcement Learning Winter 2025](https://web.stanford.edu/class/cs234/)

- [CS285 Deep Reinforcement Learning](https://rail.eecs.berkeley.edu/deeprlcourse/)

- [Welcome to Spinning Up in Deep RL](https://spinningup.openai.com/en/latest/index.html)

- [deep-rl-course from Hugging Face](https://huggingface.co/learn/deep-rl-course/unit0/introduction)

- [RL Course by David Silver](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-)

## RLHF training techniques explained

- [Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.](https://www.youtube.com/watch?v=qGyFrqc34yc)

- [Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math](https://www.youtube.com/watch?v=hvGa5Mba4c8)

- [GRPO vs PPO](https://yugeten.github.io/posts/2025/01/ppogrpo/)

- [Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights](https://hijkzzz.notion.site/Unraveling-RLHF-and-Its-Variants-Progress-and-Practical-Engineering-Insights-147d9a33ecc980199dc5cb967c5e9374)

## Training frameworks

- [VERL](https://github.com/volcengine/verl)

- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)

- [TRL](https://huggingface.co/docs/trl/)

## RLHF methods implementation (only with detailed explanations)

- GRPO

  - [GRPO A.Burkov](https://github.com/aburkov/theLMbook/blob/main/GRPO.py)

  - [Minimal implementation by willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb)

  - [TinyZero](https://github.com/Jiayi-Pan/TinyZero)

  - [microGRPO](https://github.com/superlinear-ai/microGRPO)
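
The idea shared by the GRPO implementations above is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no learned value model is needed. Below is a minimal sketch of that group-relative advantage step. It is not taken from any of the linked repositories: the function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper: eps and the population standard deviation are
    illustrative choices, and real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions that beat the group mean get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by a rule-based 0/1 reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.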

## Articles

- [Reasoning LLMs](https://docs.google.com/document/d/1TW7wEUgo61FZnPckZMploGTdB0eNcemiDPDqdmzsCvA/edit?tab=t.0)

- [Process Reinforcement through Implicit Rewards](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)

- [DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2)

- [On the Emergence of Thinking in LLMs I: Searching for the Right Intuition](https://arxiv.org/pdf/2502.06773)

- [LIMR: Less is More for RL Scaling](https://arxiv.org/pdf/2502.11886)

- [LIMO: Less Is More for Reasoning](https://github.com/GAIR-NLP/LIMO)

- [s1: Simple test-time scaling](https://github.com/simplescaling/s1) and s1.1 

- [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)

- [Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead](https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4cf04b2f175) and [github](https://github.com/RLHFlow/Online-DPO-R1)

- [a reinforcement learning guide](https://naklecha.notion.site/a-reinforcement-learning-guide)

- [Approximating KL Divergence](http://joschu.net/blog/kl-approx.html)

- [How to align open LLMs in 2025 with DPO & synthetic data](https://www.philschmid.de/rl-with-llms-in-2025-dpo)

- DeepSeek-R1 -> [The Illustrated DeepSeek-R1](https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1), [DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs](https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1), [DeepSeek R1 and R1-Zero Explained](https://thelmbook.com/articles/#!./DeepSeek-R1.md)

- 2025.03.23

  - [Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t](https://arxiv.org/pdf/2503.16219)

  - [Understanding R1-zero](https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf)

- 2025.02.22

  - [Small Models Struggle to Learn from Strong Reasoners](https://arxiv.org/pdf/2502.12143v1)

  - [Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning](https://arxiv.org/pdf/2502.14768)

  - [LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization](https://www.arxiv.org/abs/2502.13922)

  - [Open Reasoner Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero) - An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

# Thinking process

## Repos

- [Awesome-System2-Reasoning-LLM](https://github.com/zzli2022/Awesome-System2-Reasoning-LLM)

## Articles

- ✨ [LLM Reasoning: Curated Insights](https://shangshangwang.notion.site/llm-reasoning)

- [LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!](https://arxiv.org/pdf/2502.07374)

- [LLM Post-Training: A Deep Dive into Reasoning Large Language Models](https://arxiv.org/pdf/2502.21321)

## Papers

- [SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models](https://arxiv.org/abs/2502.09604)

- [ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates](https://arxiv.org/abs/2502.06772)

- [A Minimalist Approach to Offline Reinforcement Learning](https://arxiv.org/abs/2106.06860)

- [Training Language Models to Reason Efficiently](https://arxiv.org/abs/2502.04463)

- [Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://arxiv.org/abs/2502.02508)

## Open-source project to reproduce DeepSeek R1

- [DeepScaleR - Democratizing Reinforcement Learning for LLMs](https://github.com/agentica-project/deepscaler)

## Datasets - thinking models

- [R1 - distill] [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)

- [R1 - distill] [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1)

- [R1 - distill] [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)

- [R1 - distill] [LIMO](https://huggingface.co/datasets/GAIR/LIMO)

- [R1 - distill] [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)

- [Llama-70B - distill] [natural_reasoning](https://huggingface.co/datasets/facebook/natural_reasoning) - licensed for non-commercial use

- [Open Reasoning Data](https://gr.inc/)

- [Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models](https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified)

# Evaluation and benchmarks

- [Open R1 - A fully open reproduction of DeepSeek-R1](https://github.com/huggingface/open-r1)

- [GMIL CM Benchmark - Math Reasoning as an 11-Year-Old](https://github.com/przadka/gmil-cm-benchmark?tab=readme-ov-file)