{"id":24744803,"url":"https://github.com/hscspring/rl-llm-nlp","last_synced_at":"2025-07-06T19:40:13.148Z","repository":{"id":271475342,"uuid":"907159620","full_name":"hscspring/rl-llm-nlp","owner":"hscspring","description":"Reinforcement Learning in LLM and NLP.","archived":false,"fork":false,"pushed_at":"2025-03-16T00:10:51.000Z","size":23,"stargazers_count":16,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T01:19:52.202Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hscspring.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-23T00:58:22.000Z","updated_at":"2025-03-16T00:10:54.000Z","dependencies_parsed_at":"2025-01-08T01:35:35.525Z","dependency_job_id":"26dcb28b-c1cd-440d-b238-8069e4687a13","html_url":"https://github.com/hscspring/rl-llm-nlp","commit_stats":null,"previous_names":["hscspring/rl-llm-nlp"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscspring%2Frl-llm-nlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscspring%2Frl-llm-nlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscspring%2Frl-llm-nlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hscspring%2Frl-llm-nlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hscspring","download_url":"https:/
/codeload.github.com/hscspring/rl-llm-nlp/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245029298,"owners_count":20549684,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-28T02:20:06.193Z","updated_at":"2025-07-06T19:40:13.141Z","avatar_url":"https://github.com/hscspring.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# RL-LLM-NLP\nThis repository encompasses libraries and papers on Reinforcement Learning (RL) within Large Language Models (LLM) and Natural Language Processing (NLP).\n\nI consider RL to be a pivotal technology in the field of AI, and NLP (particularly LLM) to be a direction well worth exploring.\n\n## Library\n\n| GitHub                                                       | From              | Year | Desc                                                         |\n| ------------------------------------------------------------ | ----------------- | ---- | ------------------------------------------------------------ |\n| [prime-rl](https://github.com/PrimeIntellect-ai/prime-rl)    | PrimeIntellect-ai | 2025 | Prime-rl is a codebase for decentralized RL training at scale. 
|\n| [PRIME](https://github.com/PRIME-RL/PRIME)                   | PRIME-RL          | 2025 | Scalable RL solution for the advanced reasoning of language models |\n| [rStar](https://github.com/microsoft/rStar)                  | Microsoft         | 2025 |                                                              |\n| [veRL](https://github.com/volcengine/verl)                   | ByteDance         | 2024 | Volcano Engine Reinforcement Learning for LLM                |\n| [trl](https://github.com/huggingface/trl)                    | HuggingFace       | 2024 | Train LM with RL                                             |\n| [RL4LMs](https://github.com/allenai/RL4LMs)                  | Allen             | 2023 | RL library to fine-tune LM to human preferences              |\n| [alignment-handbook](https://github.com/huggingface/alignment-handbook) | HuggingFace       | 2023 | Robust recipes to align language models with human and AI preferences |\n\n## Paper\n\n| Cate         | Abbr                   | Title                                                        | From                    | Year | Link                                                         |\n| ------------ | ---------------------- | ------------------------------------------------------------ | ----------------------- | ---- | ------------------------------------------------------------ |\n| RL           | MRT                    | Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning | Carnegie Mellon         | 2025 | [paper](http://arxiv.org/abs/2503.07572), [GitHub](https://github.com/CMU-AIRe/MRT) |\n| RL           | L1, LCPO               | L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning | Carnegie Mellon         | 2025 | [paper](http://arxiv.org/abs/2503.04697), [GitHub](https://github.com/cmu-l3/l1) |\n| RL           | Online-DPO-R1          | Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead | Salesforce AI Research  | 2025 | 
[paper](https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4cf04b2f175), [GitHub](https://github.com/RLHFlow/Online-DPO-R1) |\n| RL           | orz                    | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | StepFun                 | 2025 | [paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf), [GitHub](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main) |\n| RL           | OREAL                  | Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning | InternLM                | 2025 | [paper](https://arxiv.org/abs/2502.06781), [GitHub](https://github.com/InternLM/OREAL) |\n| RL           | R1                     | DeepSeek-R1                                                  | DeepSeek                | 2025 | [paper](https://github.com/deepseek-ai/DeepSeek-R1), ①       |\n|              |                        |                                                              |                         |      |                                                              |\n|              |                        |                                                              |                         |      |                                                              |\n|              |                        |                                                              |                         |      |                                                              |\n| o1           | Sky-T1                 | Sky-T1: Train your own O1 preview model within $450          | NovaSky-AI              | 2025 | [GitHub](https://github.com/NovaSky-AI/SkyThought)           |\n| o1           | STILL                  | A series of technical report on Slow Thinking with LLM       | RUCAIBox                | 2025 | 
[GitHub](https://github.com/RUCAIBox/Slow_Thinking_with_LLMs) |\n|              |                        |                                                              |                         |      |                                                              |\n| RL Scaling   | RM                     | Inference-Time Scaling for Generalist Reward Modeling        | DeepSeek                | 2025 | [paper](https://arxiv.org/abs/2504.02495)                    |\n| RL Scaling   | LIMR                   | LIMR: Less is More for RL Scaling                            | GAIR-NLP                | 2025 | [paper](https://arxiv.org/abs/2502.11886), [GitHub](https://github.com/GAIR-NLP/LIMR) |\n| RL Scaling   | DeepScaleR             | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | Agentica                | 2025 | [paper](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2), [GitHub](https://github.com/agentica-project/deepscaler) |\n| RL Scaling   | ScalingLaw             | Value-Based Deep RL Scales Predictably                       | Berkeley                | 2025 | [paper](https://arxiv.org/abs/2502.04327)                    |\n|              |                        |                                                              |                         |      |                                                              |\n| SLM          | PRIME                  | Process Reinforcement through Implicit Rewards               | PRIME-RL                | 2025 | [paper](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f), [GitHub](https://github.com/PRIME-RL/PRIME) |\n| SLM          | rStar-Math             | rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Microsoft               | 2025 | [paper](https://arxiv.org/abs/2501.04519), 
[GitHub](https://github.com/microsoft/rStar) |\n| SLM          | rStar                  | rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Microsoft               | 2024 | [paper](https://arxiv.org/pdf/2408.06195), [GitHub](https://github.com/zhentingqi/rStar) |\n|              |                        |                                                              |                         |      |                                                              |\n| Unlearn      |                        | A Closer Look at Machine Unlearning for Large Language Models | Sea AI                  | 2024 | [paper](https://arxiv.org/abs/2410.08109v1), [GitHub](https://github.com/sail-sg/closer-look-LLM-unlearning) |\n| Unlearn      | Quark                  | Quark: Controllable Text Generation with Reinforced [Un]learning | Allen                   | 2022 | [paper](http://arxiv.org/abs/2205.13636), [GitHub](https://github.com/GXimingLu/Quark) |\n|              |                        |                                                              |                         |      |                                                              |\n| Align        | ReMax                  | ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models | CUHK                    | 2024 | [paper](https://arxiv.org/abs/2310.10505)                    |\n| Align        |                        | A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More | Salesforce              | 2024 | [paper](https://arxiv.org/abs/2407.16216)                    |\n| Align        |                        | Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback | Allen                   | 2024 | [paper](https://arxiv.org/abs/2406.09279), [GitHub](https://github.com/hamishivi/EasyLM) |\n| Align        |                        | Preference Tuning with Human Feedback on Language, 
Speech, and Vision Tasks: A Survey | Capital One             | 2024 | [paper](http://arxiv.org/abs/2409.11564)                     |\n| Align        | RLHF                   | Training language models to follow instructions with human feedback | OpenAI                  | 2022 | [paper](https://arxiv.org/abs/2203.02155)                    |\n| Align        | NLPO                   | Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization | Allen                   | 2022 | [paper](http://arxiv.org/abs/2210.01241), [GitHub](https://github.com/allenai/rl4lms) |\n| Align        |                        | Fine-Tuning Language Models from Human Preferences           | OpenAI                  | 2020 | [paper](http://arxiv.org/abs/1909.08593), [GitHub](https://github.com/openai/lm-human-preferences) |\n| Align        | RLOO                   | Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs | Cohere                  | 2024 | [paper](https://arxiv.org/abs/2402.14740)                    |\n|              |                        |                                                              |                         |      |                                                              |\n| Optimization | CISPO                  | MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention | MiniMax                 | 2025 | [paper](https://arxiv.org/abs/2506.13585), [GitHub](https://github.com/MiniMax-AI/MiniMax-M1) |\n| Optimization | VAPO                   | VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks | ByteDance Seed          | 2025 | [paper](https://arxiv.org/abs/2504.05118)                    |\n| Optimization | Dr. 
GRPO               | Understanding R1-Zero-Like Training: A Critical Perspective  | Sea AI Lab              | 2025 | [paper](http://arxiv.org/abs/2503.20783), [GitHub](https://github.com/sail-sg/understand-r1-zero) |\n| Optimization | DAPO                   | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | ByteDance Seed          | 2025 | [paper](https://arxiv.org/abs/2503.14476), [GitHub](https://github.com/BytedTsinghua-SIA/DAPO) |\n| Optimization | GRPO                   | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | DeepSeek                | 2024 | [paper](https://arxiv.org/abs/2402.03300)                    |\n| Optimization | DPO                    | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Stanford                | 2024 | [paper](https://arxiv.org/abs/2305.18290)                    |\n| Optimization |                        | Decision Transformer: Reinforcement Learning via Sequence Modeling | Berkeley                | 2021 | [paper](https://arxiv.org/abs/2106.01345), [GitHub](https://github.com/kzl/decision-transformer) |\n| Optimization | PPO                    | Proximal Policy Optimization Algorithms                      | OpenAI                  | 2017 | [paper](https://arxiv.org/abs/1707.06347)                    |\n| Optimization | REINFORCE multi-sample | Buy 4 Reinforce Samples, Get a Baseline for Free!            
| University of Amsterdam | 2019 | [paper](https://openreview.net/pdf?id=r1lgTGL5DE)            |\n| Optimization | REINFORCE              | Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning | Northeastern University | 1992 | [paper](https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) |\n\n## Appendix\n\n- ① DeepSeek-R1 reproductions:\n    - [Jiayi-Pan/TinyZero: Clean, minimal, accessible reproduction of DeepSeek R1-Zero](https://github.com/Jiayi-Pan/TinyZero)\n    - [huggingface/open-r1: Fully open reproduction of DeepSeek-R1](https://github.com/huggingface/open-r1)\n    - [hkust-nlp/simpleRL-reason: This is a replicate of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data](https://github.com/hkust-nlp/simpleRL-reason)\n    - [ZihanWang314/RAGEN: RAGEN is the first open-source reproduction of DeepSeek-R1 on AGENT training.](https://github.com/ZihanWang314/ragen)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhscspring%2Frl-llm-nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhscspring%2Frl-llm-nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhscspring%2Frl-llm-nlp/lists"}