{"id":26487306,"url":"https://github.com/rkinas/rlhf_thinking_model","last_synced_at":"2025-04-09T09:07:28.932Z","repository":{"id":278514389,"uuid":"934376942","full_name":"rkinas/rlhf_thinking_model","owner":"rkinas","description":"This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.","archived":false,"fork":false,"pushed_at":"2025-03-23T20:24:33.000Z","size":239,"stargazers_count":91,"open_issues_count":0,"forks_count":6,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-02T06:11:09.522Z","etag":null,"topics":["llm","rl","rlhf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rkinas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-17T18:17:41.000Z","updated_at":"2025-03-28T15:07:26.000Z","dependencies_parsed_at":"2025-02-20T07:34:13.266Z","dependency_job_id":"752f43fd-f91d-4126-aa99-168370321a80","html_url":"https://github.com/rkinas/rlhf_thinking_model","commit_stats":null,"previous_names":["rkinas/rlhf_thinking_model"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkinas%2Frlhf_thinking_model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkinas%2Frlhf_thinking_model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/rkinas%2Frlhf_thinking_model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rkinas%2Frlhf_thinking_model/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rkinas","download_url":"https://codeload.github.com/rkinas/rlhf_thinking_model/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248008631,"owners_count":21032556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","rl","rlhf"],"created_at":"2025-03-20T06:38:04.218Z","updated_at":"2025-04-09T09:07:28.747Z","avatar_url":"https://github.com/rkinas.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Thinking Model and RLHF Research Notes**  \n\nThis repository serves as a collection of research notes and resources on **training large language models (LLMs)** and **Reinforcement Learning from Human Feedback (RLHF)**. It focuses on the latest research, methodologies, and techniques for fine-tuning language models.  \n\n## **Repository Contents**  \n\n### **Reinforcement Learning and RLHF Overview**  \nA curated list of materials providing an introduction to RL and RLHF:  \n- Research papers and books covering key concepts in reinforcement learning.  \n- Video lectures explaining the fundamentals of RLHF.  \n\n### **Methods for LLM Training**  \nAn extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:  \n- Key techniques such as PPO, DPO, KTO, ORPO, and more.  
\n- The latest ArXiv publications and publicly available implementations.  \n- Analysis of effectiveness across different optimization strategies.  \n\n## **Purpose of this Repository**  \nThis repository is designed as a reference for researchers and engineers working on **reinforcement learning and large language models**. If you're interested in **model alignment**, **experiments with DPO and its variants**, or **alternative RL-based methods**, you will find valuable resources here.  \n\n## RL overview\n- [Reinforcement Learning: An Overview](https://arxiv.org/pdf/2412.05265)\n- [A COMPREHENSIVE SURVEY OF LLM ALIGNMENT TECHNIQUES: RLHF, RLAIF, PPO, DPO AND MORE](https://arxiv.org/pdf/2407.16216)\n- [Book-Mathematical-Foundation-of-Reinforcement-Learning](https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning)\n- [The FASTEST introduction to Reinforcement Learning on the internet](https://www.youtube.com/watch?v=VnpRp7ZglfA)\n- [rlhf-book](https://github.com/natolambert/rlhf-book)\n- [Notes on reinforcement learning](https://newfacade.github.io/notes-on-reinforcement-learning/01-intro.html)\n\n## Methods for LLM training\n- [PPO - Proximal Policy Optimization Algorithm - OpenAI](https://arxiv.org/pdf/1707.06347)\n- [DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Stanford](https://arxiv.org/pdf/2305.18290)\n- [online DPO]()\n- [KTO - KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/pdf/2402.01306)\n- [SimPO - Simple Preference Optimization with a Reference-Free Reward - Princeton](https://arxiv.org/pdf/2405.14734v1)\n- [ORPO - Monolithic Preference Optimization without Reference Model - Kaist AI](https://arxiv.org/pdf/2403.07691v2)\n- [Sample Efficient Reinforcement Learning with REINFORCE](https://arxiv.org/pdf/2010.11364)\n- [REINFORCE++](https://arxiv.org/pdf/2501.03262v1)\n- [RPO - Reward-aware Preference Optimization: A Unified Mathematical Framework for Model 
Alignment](https://arxiv.org/pdf/2501.03262v1)\n- [RLOO - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs](https://arxiv.org/pdf/2402.14740) \n- [GRPO](https://arxiv.org/pdf/2402.03300)\n- [ReMax -  Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models](https://arxiv.org/pdf/2310.10505)\n- [DPOP - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive](https://arxiv.org/abs/2402.13228)\n- [BCO - Binary Classifier Optimization for Large Language Model Alignment](https://arxiv.org/pdf/2404.04656v1)\n\n## Minimal implementation\n|    Method                                                                                              |\n|--------------------------------------------------------------------------------------------------------|\n| [DPO](https://github.com/rkinas/rlhf_thinking_model/blob/main/minimal_implementation/dpo_trainer.py)   |   \n\n## Tutorials\nNotes for learning RL: Value Iteration -\u003e Q Learning -\u003e DQN -\u003e REINFORCE -\u003e Policy Gradient Theorem -\u003e TRPO -\u003e PPO\n- [CS234: Reinforcement Learning Winter 2025 ](https://web.stanford.edu/class/cs234/)\n- [CS285 Deep Reinforcement Learning](https://rail.eecs.berkeley.edu/deeprlcourse/)\n- [Welcome to Spinning Up in Deep RL](https://spinningup.openai.com/en/latest/index.html)\n- [deep-rl-course from Huggingface](https://huggingface.co/learn/deep-rl-course/unit0/introduction)\n- [RL Course by David Silver](https://www.youtube.com/watch?v=2pWv7GOvuf0\u0026list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-)\n\n\n## RLHF training techniques explained\n- [Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.](https://www.youtube.com/watch?v=qGyFrqc34yc)\n- [Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math](https://www.youtube.com/watch?v=hvGa5Mba4c8)\n- [GRPO vs 
PPO](https://yugeten.github.io/posts/2025/01/ppogrpo/)\n- [Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights](https://hijkzzz.notion.site/Unraveling-RLHF-and-Its-Variants-Progress-and-Practical-Engineering-Insights-147d9a33ecc980199dc5cb967c5e9374)\n\n## Training frameworks\n- [VERL](https://github.com/volcengine/verl)\n- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)\n- [TRL](https://huggingface.co/docs/trl/)\n\n## RLHF methods implementation (only with detailed explanations)\n- GRPO\n  - [GRPO A.Burkov](https://github.com/aburkov/theLMbook/blob/main/GRPO.py)\n  - [Minimal implementation by willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb)\n  - [TinyZero](https://github.com/Jiayi-Pan/TinyZero)\n  - [microGRPO](https://github.com/superlinear-ai/microGRPO)\n\n## Articles\n- [Reasoning LLMs](https://docs.google.com/document/d/1TW7wEUgo61FZnPckZMploGTdB0eNcemiDPDqdmzsCvA/edit?tab=t.0)\n- [Process Reinforcement through Implicit Rewards](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)\n- [DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2)\n- [On the Emergence of Thinking in LLMs I: Searching for the Right Intuition](https://arxiv.org/pdf/2502.06773)\n- [LIMR: Less is More for RL Scaling](https://arxiv.org/pdf/2502.11886)\n- [LIMO: Less Is More for Reasoning](https://github.com/GAIR-NLP/LIMO)\n- [s1: Simple test-time scaling](https://github.com/simplescaling/s1) and s1.1 \n- [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)\n- [Online-DPO-R1: Unlocking Effective Reasoning Without the PPO 
Overhead](https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4cf04b2f175) and [github](https://github.com/RLHFlow/Online-DPO-R1)\n- [a reinforcement learning guide](https://naklecha.notion.site/a-reinforcement-learning-guide)\n- [Approximating KL Divergence](http://joschu.net/blog/kl-approx.html)\n- [How to align open LLMs in 2025 with DPO \u0026 synthetic data](https://www.philschmid.de/rl-with-llms-in-2025-dpo)\n- DeepSeek-R1 -\u003e [The Illustrated DeepSeek-R1](https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1), [DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs](https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1), [DeepSeek R1 and R1-Zero Explained](https://thelmbook.com/articles/#!./DeepSeek-R1.md)\n\n- 2025.03.23\n  - [Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t](https://arxiv.org/pdf/2503.16219)\n  - [Understanding R1-zero](https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf)\n\n- 2025.02.22\n  - [Small Models Struggle to Learn from Strong Reasoners](https://arxiv.org/pdf/2502.12143v1)\n  - [Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning](https://arxiv.org/pdf/2502.14768)\n  - [LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization](https://www.arxiv.org/abs/2502.13922)\n  - [Open Reasoner Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero): An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model\n\n\n# Thinking process\n\n## Repos\n- [Awesome-System2-Reasoning-LLM](https://github.com/zzli2022/Awesome-System2-Reasoning-LLM)\n\n## Articles\n- ✨ [LLM Reasoning: Curated Insights](https://shangshangwang.notion.site/llm-reasoning)\n- [LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!](https://arxiv.org/pdf/2502.07374)\n- 
[LLM Post-Training: A Deep Dive into Reasoning Large Language Models](https://arxiv.org/pdf/2502.21321)\n\n## Papers\n- [SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models](https://arxiv.org/abs/2502.09604)\n- [ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates](https://arxiv.org/abs/2502.06772)\n- [A Minimalist Approach to Offline Reinforcement Learning](https://arxiv.org/abs/2106.06860)\n- [Training Language Models to Reason Efficiently](https://arxiv.org/abs/2502.04463)\n- [Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://arxiv.org/abs/2502.02508)\n\n\n## Open-source project to reproduce DeepSeek R1\n- [DeepScaleR - Democratizing Reinforcement Learning for LLMs](https://github.com/agentica-project/deepscaler)\n\n## Datasets - thinking models\n- [R1 - distill] [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)\n- [R1 - distill] [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1)\n- [R1 - distill] [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)\n- [R1 - distill] [LIMO](https://huggingface.co/datasets/GAIR/LIMO)\n- [R1 - distill] [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)\n- [Llama-70B - distill] [natural_reasoning](https://huggingface.co/datasets/facebook/natural_reasoning) - licensed for non-commercial use\n- [Open Reasoning Data](https://gr.inc/)\n- [Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models](https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified)\n\n# Evaluation and benchmarks\n- [Open R1 - A fully open reproduction of DeepSeek-R1](https://github.com/huggingface/open-r1)\n- [GMIL CM Benchmark - Math Reasoning as an 11-Year-Old](https://github.com/przadka/gmil-cm-benchmark?tab=readme-ov-file)\n  
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frkinas%2Frlhf_thinking_model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frkinas%2Frlhf_thinking_model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frkinas%2Frlhf_thinking_model/lists"}