# Awesome RL Reasoning Recipes ("Triple R")
[![Awesome](https://awesome.re/badge.svg)](https://github.com/sindresorhus/awesome)
A curated collection covering models, datasets, reward designs, optimization methods, hyperparameters, empirical findings, theoretical insights, and everything about reasoning with reinforcement learning.
## Contents
> ⚠️⚠️⚠️ The following table of contents highlights only a selection of projects that provide detailed configurations. For the most recent updates, please scroll to the bottom of the tables:
> - [Jump to Latest LLM Projects 🚀🚀🚀](#llm_latest)
> - [Jump to Latest VLM & Agent Projects 🚀🚀🚀](#vlm_latest)

- [Awesome RL Reasoning Recipes ("Triple R")](#awesome-rl-reasoning-recipes-triple-r)
- [Contents](#contents)
- [Overview](#overview)
- [Large Language Models](#large-language-models)
- [Multimodal and Agents](#multimodal-and-agents)
- [Projects](#projects)
- [Large Language Models](#large-language-models-1)
- [2025.0102, PRIME-RL](#20250102-prime-rl)
- [2025.0122, DeepSeek-R1](#20250122-deepseek-r1)
- [2025.0122, Kimi k1.5](#20250122-kimi-k15)
- [2025.0124, TinyZero](#20250124-tinyzero)
- [2025.0125, SimpleRL](#20250125-simplerl)
- [2025.0206, Demystify-long-CoT](#20250206-demystify-long-cot)
- [2025.0210, DeepScaler](#20250210-deepscaler)
- [2025.0210, Logic-RL](#20250210-logic-rl)
- [2025.0210, OREAL](#20250210-oreal)
- [2025.0217, LIMR](#20250217-limr)
- [2025.0217, Open-Reasoner-Zero](#20250217-open-reasoner-zero)
- [2025.0225, SWE-RL](#20250225-swe-rl)
- [2025.0303, VC-PPO](#20250303-vc-ppo)
- [2025.0306, LCPO-L1](#20250306-lcpo-l1)
- [2025.0310, MetaRL](#20250310-metarl)
- [2025.0318, TOPR](#20250318-topr)
- [2025.0318, DAPO](#20250318-dapo)
- [2025.0320, Open RS](#20250320-open-rs)
- [2025.0321, Oat-Zero](#20250321-oat-zero)
- [2025.0407, VAPO](#20250407-vapo)
- [Multimodal and Agents](#multimodal-and-agents-1)
- [2025.0128, open-r1-multimodal](#20250128-open-r1-multimodal)
- [2025.0202, R1-V](#20250202-r1-v)
- [2025.0215, VLM-R1](#20250215-vlm-r1)
- [2025.0303, Visual-RFT](#20250303-visual-rft)
- [2025.0306, r1-vlm](#20250306-r1-vlm)
- [2025.0310, VisualThinker-R1-Zero](#20250310-visualthinker-r1-zero)
- [2025.0310, MM-Eureka](#20250310-mm-eureka)
- [2025.0310, Curr\_ReFT](#20250310-curr_reft)
- [2025.0315, MetaSpatial](#20250315-metaspatial)
- [Contributing](#contributing)
- [202x.0x0x, Template](#202x0x0x-template)
- [Citation](#citation)

## Overview
**This collection covers recent progress in reinforcement learning for large language model reasoning, starting from 2025 in the timeline.**
### Large Language Models
| Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
| --------- | ------------------ | ---------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| 2025.0102 | PRIME-RL | THU & UIUC
Shang AILab | [Paper](https://arxiv.org/abs/2502.01456)
[GitHub](https://github.com/PRIME-RL/PRIME)
[More](#primerl) | [Eurus-2-7B-PRIME](https://huggingface.co/PRIME-RL/Eurus-2-7B-PRIME)
[Eurus-2-7B-PRIME-Zero](https://huggingface.co/PRIME-RL/Eurus-2-7B-PRIME-Zero) | [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) | ClickPRIME offers scalable Reinforcement Learning by using dense, token-level implicit rewards derived only from final outcomes. This bypasses costly step-by-step annotations, providing fine-grained feedback to improve sample efficiency and reasoning. |
| 2025.0122 | DeepSeek-R1 | DeepSeek | [Paper](https://arxiv.org/abs/2501.12948)
[GitHub](https://github.com/deepseek-ai/DeepSeek-R1/tree/main)
[More](#deepseek-r1) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
[DeepSeek-R1-Zero](https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero) | —— | ClickDeepSeek-R1's core contribution is demonstrating large-scale RL from scratch (600B+) without SFT, achieving emergent "aha moments" (self-reflective reasoning) and matching OpenAI o1's performance at 1/30 cost |
| 2025.0122 | Kimi k1.5 | Kimi | [Paper](https://arxiv.org/abs/2501.12599)
[GitHub](https://github.com/MoonshotAI/Kimi-k1.5)
[More](#kimi-k1.5) | —— | —— | ClickKimi 1.5 introduces a simplified RL framework that leverages long-context scaling (128k tokens) and improved policy optimization (e.g., online mirror descent) to enhance reasoning and multimodal performance. |
| 2025.0124 | TinyZero | Berkeley | [Twitter](https://x.com/jiayi_pirate/status/1882839370505621655)
[GitHub](https://github.com/Jiayi-Pan/TinyZero)
[More](#tinyzero) | —— | [Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | ClickTinyZero's core contribution is demonstrating that smaller language models (e.g., 1.5B-3B parameters) can develop complex reasoning, search, and self-verification abilities through Reinforcement Learning, replicating capabilities of larger models like DeepSeek R1-Zero at extremely low cost (<$30). |
| 2025.0124 | Open-R1 | Huggingface | [GitHub](https://github.com/huggingface/open-r1) | [OpenR1-Qwen-7B](https://huggingface.co/open-r1/OpenR1-Qwen-7B)
[OlympicCoder-7B](https://huggingface.co/open-r1/OlympicCoder-7B)
[OlympicCoder-32B](https://huggingface.co/open-r1/OlympicCoder-32B) | [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)
[codeforces](https://huggingface.co/datasets/open-r1/codeforces) | ClickOpen-R1's core contribution is providing the first fully open-source replication and release of the DeepSeek R1-Zero Reinforcement Learning training pipeline. Its main insight or goal is to democratize access to these advanced RL techniques for enhancing LLM reasoning and planning. |
| 2025.0125 | simpleRL-reason | HKUST | [Paper](https://hkust-nlp.notion.site/simplerl-reason)
[GitHub](https://github.com/hkust-nlp/simpleRL-reason)
[More](#simplerl) | [Qwen-2.5-Math-7B-SimpleRL-Zero](https://huggingface.co/hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero)
[Qwen-2.5-Math-7B-SimpleRL](https://huggingface.co/hkust-nlp/Qwen-2.5-Math-7B-SimpleRL) | [MATH](https://huggingface.co/datasets/EleutherAI/hendrycks_math) | ClickResearchers replicated the DeepSeek-R1-Zero and DeepSeek-R1 training using a 7B model with only 8K MATH examples, achieving strong results on complex mathematical reasoning. |
| 2025.0126 | RAGEN | RAGEN-AI | [GitHub](https://github.com/RAGEN-AI/RAGEN) | —— | —— | ClickRAGEN introduces a RL framework to train reasoning-capable LLM agents for interactive, stochastic environments. Its core contribution is the Reasoning-Interaction Chain Optimization (RICO) algorithm, which jointly optimizes reasoning and action strategies by reinforcing entire trajectories. |
| 2025.0203 | Verifiers | Independent | [GitHub](https://github.com/willccbb/verifiers) | —— | —— | This repository contains a set of tools for reinforcement learning with LLMs in verifiable environments, and can be used for LLM agent RL. |
| 2025.0205 | Demystify-long-cot | CMU | [Paper](https://arxiv.org/abs/2502.03373)
[GitHub](https://github.com/eddycmu/demystify-long-cot)
[More](#demystify) | —— | —— | ClickThe paper elucidates the role of RL in stabilizing and enhancing long CoT reasoning in LLMs, highlighting the necessity of reward shaping and verifiable reward signals for complex reasoning tasks. |
| 2025.0210 | DeepScaler | Agentica-Org | [Blog](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2)
[GitHub](https://github.com/agentica-project/deepscaler)
[More](#deepscaler) | [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | [DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) | ClickDeepScaleR's core contribution is demonstrating that a small 1.5B parameter model, fine-tuned using scaled Reinforcement Learning (RL) and an iterative context lengthening scheme, can surpass the reasoning performance of larger, state-of-the-art models like OpenAI's O1-Preview on complex benchmarks (e.g., AIME math problems). |
| 2025.0210 | Logic-RL | MSRA & Ubiquant | [Paper](https://arxiv.org/pdf/2502.14768)
[GitHub](https://github.com/Unakar/Logic-RL)
[More](#logicrl) | —— | [knights-and-knaves](https://huggingface.co/datasets/K-and-K/knights-and-knaves) [knights-and-knaves-ZH](https://huggingface.co/datasets/Trae1ounG/knights-and-knaves-ZH) | ClickThe paper introduces Logic-RL, a rule-based reinforcement learning framework that enables large language models to develop o3-mini-level reasoning skills through training on logic puzzles. The reasoning capabilities can also be transferred to other domains like math. |
| 2025.0210 | OREAL | Shanghai AI Lab
SJTU & CUHK | [Paper](https://arxiv.org/abs/2502.06781)
[GitHub](https://github.com/InternLM/OREAL)
[More](#oreal) | [OREAL-32B](https://huggingface.co/internlm/OREAL-32B) [OREAL-7B](https://huggingface.co/internlm/OREAL-7B)
[OREAL-DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B)
[OREAL-32B-SFT](https://huggingface.co/internlm/OREAL-32B-SFT)
[OREAL-7B-SFT](https://huggingface.co/internlm/OREAL-7B-SFT) | [OREAL-RL-Prompts](https://huggingface.co/datasets/internlm/OREAL-RL-Prompts) | ClickThe paper introduces OREAL, a reinforcement learning framework for mathematical reasoning with binary feedback. It proves that behavior cloning on positive samples is sufficient for optimal learning and proposes reward reshaping for negative samples. A token-level reward model addresses sparse rewards in long reasoning chains. OREAL achieves state-of-the-art results on math benchmarks. |
| 2025.0217 | LIMR | SJTU | [Paper](https://arxiv.org/pdf/2502.11886)
[GitHub](https://github.com/GAIR-NLP/LIMR)
[More](#limr) | [LIMR](https://huggingface.co/GAIR/LIMR) | [LIMR](https://huggingface.co/datasets/GAIR/LIMR) | ClickThe paper challenges the assumption that scaling up RL training data inherently improves performance in language models, instead finding that a strategically selected subset of 1,389 samples can outperform a full 8,523-sample dataset. |
| 2025.0218 | Open-Reasoner-Zero | StepFun & THU | [Paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf)
[GitHub](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/)
[More](#openreaon-zero) | [Open-Reasoner-Zero-7B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-7B)
[Open-Reasoner-Zero-32B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-32B) | [ORZ-Math-57k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data) | ClickThe Open-Reasoner-Zero model has achieved notable performance, with Open-Reasoner-Zero-32B outperforming DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark while requiring significantly fewer training steps. |
| 2025.0225 | SWE-RL | FAIR at Meta | [Paper](https://arxiv.org/abs/2502.18449)
[GitHub](https://github.com/facebookresearch/swe-rl)
[More](#swerl) | —— | —— | ClickSWE-RL enhances LLMs' code reasoning through RL using open-source software evolution data, achieving state-of-the-art results in software engineering tasks and demonstrating generalized reasoning capabilities beyond coding. |
| 2025.0303 | VC-PPO | Bytedance | [Paper](https://arxiv.org/abs/2503.01491)
[More](#vcppo) | —— | —— | ClickVC-PPO (Value-Calibrated PPO) diagnoses PPO's collapse in long CoT tasks as stemming from value function inaccuracies (initialization bias and reward signal decay in long sequences). Its core contribution is modifying PPO with value pretraining and decoupled GAE for actor and critic. |
| 2025.0306 | LCPO-L1 | CMU | [Paper](https://arxiv.org/abs/2503.04697)
[GitHub](https://github.com/cmu-l3/l1)
[More](#lcpol1) | [L1-Qwen-1.5B-Max](https://huggingface.co/l3lab/L1-Qwen-1.5B-Max)
[L1-Qwen-1.5B-Exact](https://huggingface.co/l3lab/L1-Qwen-1.5B-Exact) | —— | ClickL1 introduces Length Controlled Policy Optimization (LCPO), a RL method enabling precise control over a reasoning model's thinking time (output length) via prompt instructions. It shows that RL effectively controls reasoning duration and unexpectedly enhances even short-chain reasoning capabilities. |
| 2025.0310 | MRT | CMU | [Paper](https://arxiv.org/pdf/2503.07572)
[Project](https://cohenqu.github.io/mrt.github.io/)
[GitHub](https://github.com/CMU-AIRe/MRT) | —— | —— | MRT (Meta Reinforcement Fine-Tuning) casts optimizing test-time compute as a meta-RL problem. It uses a dense, progress-based reward so that each segment of the reasoning trace makes measurable progress, improving token efficiency and accuracy over pure outcome-reward RL. |
| 2025.0318 | TOPR | Mila & Reliant AI | [Paper](https://arxiv.org/abs/2503.14286v2)
[More](#topr) | —— | —— | ClickTOPR (Tapered Off-Policy REINFORCE) introduces a novel RL algorithm for fine-tuning LLMs. Its core contribution is using asymmetric, tapered importance sampling to modify REINFORCE, enabling stable and efficient off-policy learning. This allows reusing past data effectively without the instability often seen in other methods and without needing explicit KL regularization. |
| 2025.0318 | DAPO | Bytedance
THU | [Paper](https://arxiv.org/pdf/2503.14476)
[GitHub](https://github.com/BytedTsinghua-SIA/DAPO)
[More](#dapo) | —— | [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) | ClickDAPO algorithm introduces four key techniques (Clip-Higher, Dynamic Sampling, Token-Level Loss, Overlong Shaping) for stable and efficient long-chain-of-thought RL training, surpassing previous SoTA results efficiently. |
| 2025.0319 | SWEET-RL | Meta AI | [Paper](https://arxiv.org/abs/2503.15478)
[GitHub](https://github.com/facebookresearch/sweet_rl/tree/main) | —— | [collaborative_agent_bench](https://huggingface.co/datasets/facebook/collaborative_agent_bench) | ClickSweet-RL introduces a novel RL algorithm for multi-turn collaborative reasoning LLM agents. Its core contribution is improving credit assignment across long interactions by using an asymmetric actor-critic structure where the critic leverages additional training-time information for step-wise evaluation. |
| 2025.0320 | Open RS | VNU University of Science & Knovel Engineering Lab | [Paper](https://arxiv.org/pdf/2503.16219)
[GitHub](https://github.com/knoveleng/open-rs)
[More](#open-rs) | [Open-RS1](https://huggingface.co/knoveleng/Open-RS1)
[Open-RS2](https://huggingface.co/knoveleng/Open-RS2)
[Open-RS3](https://huggingface.co/knoveleng/Open-RS3) | [open-s1](https://huggingface.co/datasets/knoveleng/open-s1)
[open-deepscaler](https://huggingface.co/datasets/knoveleng/open-deepscaler)
[open-rs](https://huggingface.co/datasets/knoveleng/open-rs) | ClickThe study investigates the potential of RL to improve reasoning in small LLMs. The results demonstrate rapid reasoning gains, with accuracy improvements on mathematical reasoning benchmarks, and highlight the efficacy of RL-based fine-tuning for small LLMs as a cost-effective alternative to large-scale approaches, using high-quality training data. |
| 2025.0321 | Oat-Zero | Sail-Sg | [Paper](https://arxiv.org/abs/2503.20783)
[GitHub](https://github.com/sail-sg/understand-r1-zero)
[More](#oat-zero) | [Qwen2.5-Math-7B-Oat-Zero](https://huggingface.co/sail/Qwen2.5-Math-7B-Oat-Zero)
[Qwen2.5-Math-1.5B-Oat-Zero](https://huggingface.co/sail/Qwen2.5-Math-1.5B-Oat-Zero)
[Llama-3.2-3B-Oat-Zero](https://huggingface.co/sail/Llama-3.2-3B-Oat-Zero) | [MATH](https://huggingface.co/datasets/EleutherAI/hendrycks_math) | ClickThis work critically analyzes R1-Zero-like RL training. It reveals base model properties and GRPO algorithm biases (e.g., length bias) significantly impact outcomes. It contributes the efficient, unbiased Dr. GRPO algorithm and an open-source recipe/codebase for better understanding and reproduction. |
| 2025.0321 | FastCuRL | Tencent Hunyuan | [Paper](https://arxiv.org/abs/2503.17287)
[GitHub](https://github.com/nick7nlp/FastCuRL) | [FastCuRL-1.5B-Preview](https://huggingface.co/Nickyang/FastCuRL-1.5B-Preview) | [FastCuRL](https://huggingface.co/datasets/Nickyang/FastCuRL) | ClickFastCuRL introduces a simple, efficient Curriculum RL method for LLMs. Its core contribution uses target perplexity to dynamically scale the standard RL loss (like PPO), creating an effective curriculum without complex reward models or auxiliary components, enabling faster, more stable training. |
| 2025.0401 | Z1 | THU | [Paper](https://arxiv.org/abs/2504.00810)
[GitHub](https://github.com/efficientscaling/Z1) | [Z1-7B](https://huggingface.co/efficientscaling/Z1-7B) | [Z1-Code-Reasoning-107K](https://huggingface.co/datasets/efficientscaling/Z1-Code-Reasoning-107K) | ClickThis paper proposes training LLMs on code-related reasoning trajectories using a curated dataset and a "Shifted Thinking Window" technique. This allows models to reduce excessive thinking tokens, achieving efficient test-time scaling and generalizing reasoning abilities. |
| 2025.0401 | VAPO | ByteDance Seed | [Paper](https://arxiv.org/pdf/2504.05118)
| —— | —— | ClickVAPO offers an integrated solution that effectively alleviates value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signal. |
|2025.0407 | ConciseRL | Wand AI | [Paper](https://arxiv.org/pdf/2504.05185) | —— | —— | ClickThis work challenges the idea that longer reasoning chains in LLMs inherently mean better accuracy. It uses mathematical analysis of RL principles, particularly PPO, to show that lengthier responses often arise from the optimization process itself, not necessarily improved reasoning. |
| 2025.0409 | AdaRFT | USC LIME Lab | [Paper](https://arxiv.org/abs/2504.05520)
[GitHub](https://github.com/uscnlp-lime/verl) | —— | [DeepScaleR_Difficulty](https://huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty) | ClickAdaRFT proposes Adaptive Curriculum Reinforcement Finetuning to improve LLM reasoning training efficiency. It dynamically adjusts task difficulty based on recent reward signals, accelerating learning by keeping challenges optimally balanced. Experiments on competition math benchmarks show up to 2x fewer steps and improved accuracy, using standard PPO with minimal changes. |
|2025.0x0x| | | [Paper]()
[GitHub]() | [hf models]() | [hf datasets]() | insights and contributions about RL for reasoning within 30 words. |

### Multimodal and Agents
| Date | Project | Org | Intro | HF Model | HF Dataset | Takeaway Messages |
| --------- | --------------------- | ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| 2025.0128 | Open-R1-MultiModal | LLMs Lab | [GitHub](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal)
[More](#open-r1-mm) | [Qwen2-VL-2B-GRPO-8k](https://huggingface.co/lmms-lab/Qwen2-VL-2B-GRPO-8k)
[Qwen2-VL-7B-GRPO-8k](https://huggingface.co/lmms-lab/Qwen2-VL-7B-GRPO-8k) | [multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) | ClickOpen-R1-MultiModal provides an open-source replication of R1-Zero-like RL for Multimodal LLMs, aiming to enhance complex visual reasoning. It demonstrates the effectiveness of these RL techniques for boosting multimodal performance and promotes reproducibility in the field. |
| 2025.0202 | R1-V | Deep Agent | [Blog](https://deepagent.notion.site/rlvr-in-vlms)
[GitHub](https://github.com/Deep-Agent/R1-V)
[More](#r1v) | —— | [Clevr_CoGenT_TrainA_R1](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_R1) | ClickR1-V applies RL, specifically RLV-Instruct, to fine-tune VLMs. It enhances complex visual reasoning and instruction-following capabilities in VLMs beyond standard supervised fine-tuning. |
| 2025.0215 | VLM-R1 | OmAI Lab | [Blog](https://om-ai-lab.github.io/index.html)
[GitHub](https://github.com/om-ai-lab/VLM-R1)
[More](#vlmr1) | [OVD](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321)
[Math](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-Math-0305)
[REC](https://huggingface.co/omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps) | —— | ClickVLM-R1 applies R1-style RL to VLMs, improving stability and generalization on visual reasoning tasks. It shows that RL enhances VLM generalization beyond standard fine-tuning, achieving SOTA results, particularly on complex or out-of-domain benchmarks. |
| 2025.0227 | Med-RLVR | Microsoft Research | [Paper](https://arxiv.org/pdf/2502.19655)
[More](#medrlvr) | —— | —— | ClickThe Med-RLVR framework demonstrates emergent medical reasoning via RL, achieving performance parity with SFT on in-distribution tasks and improving out-of-distribution generalization, all without explicit reasoning supervision, showcasing RL's potential in medicine. |
| 2025.0303 | ReSearch | Agent-RL | [GitHub](https://github.com/Agent-RL/ReSearch)
[More](#research) | —— | —— | The project trains LLMs from scratch with RL (GRPO) to learn to reason via search operations, without relying on pre-existing reasoning frameworks or supervised data. |
| 2025.0303 | Visual-RFT | SJTU & Shanghai AI Lab & CUHK | [Paper](https://arxiv.org/pdf/2503.01785)
[GitHub](https://github.com/Liuziyu77/Visual-RFT)
[More](#research) | [Reasoning Grounding](https://huggingface.co/Zery/Qwen2-VL-7B_visual_rft_lisa_IoU_reward) | [COCO_base65](https://huggingface.co/datasets/laolao77/ViRFT_COCO_base65)
[COCO](https://huggingface.co/datasets/laolao77/ViRFT_COCO)
[COCO_8_classes_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_COCO_8_cate_4_shot)
[LVIS_few_shot](https://huggingface.co/datasets/laolao77/ViRFT_LVIS_few_shot)
[Flower_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_flower_4_shot)
[FGVC_Aircraft_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_fgvc_aircraft_4_shot)
[Car196_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_car196_4shot)
[Pets37_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_pets37_4shot) | ClickVisual-RFT introduces Visual Reinforcement Fine-tuning, which extends reinforcement learning with verified rewards on visual perception tasks that are effective with limited data for fine-tuning. |
| 2025.0306 | R1-VLM | GroundLight | [Blog](https://www.groundlight.ai/blog/visual-reasoning-models)
[GitHub](https://github.com/groundlight/r1_vlm)
[More](#r1-vlm) | —— | —— | ClickR1-VLM enhances VLMs using RL, contributing significantly improved performance on complex visual reasoning tasks (spatial, counting, logic) where standard models falter. It shows that RL effectively unlocks advanced, multi-step reasoning capabilities specifically for vision-language understanding. |
| 2025.0310 | VisualThinker-R1-Zero | TurningPoint | [Paper](https://arxiv.org/pdf/2503.05132)
[GitHub](https://github.com/turningpoint-ai/VisualThinker-R1-Zero)
[More](#visual-r1-zero) | [VisualThinker-R1-Zero](https://huggingface.co/turningpoint-ai/VisualThinker-R1-Zero) | —— | ClickVisualThinker-R1-Zero adapts the R1-Zero RL paradigm (no supervised fine-tuning) to VLMs, achieving SoTa visual reasoning. It shows that complex visual reasoning can be effectively cultivated directly via RL on a base VLM, bypassing supervised data needs. |
| 2025.0310 | MM-EUREKA | USTC & ZTE & NEU | [Paper](https://arxiv.org/pdf/2503.07365)
[Github](https://github.com/ModalMinds/MM-EUREKA)
[More](#mm-eureka) | [MM-Eureka-Qwen-7B](https://huggingface.co/FanqingM/MM-Eureka-Qwen-7B) | [MM-Eureka-Dataset](https://huggingface.co/datasets/FanqingM/MM-Eureka-Dataset) | ClickMM-EUREKA reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, which demonstrates that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. |
| 2025.0310 | Curr-ReFT | Shanghai AI Lab & SJTU & HKU | [Paper](https://arxiv.org/pdf/2503.07065)
[GitHub](https://github.com/ding523/Curr_REFT)
[More](#curr-reft) | [3B-Curr-ReFT](https://huggingface.co/ZTE-AIM/3B-Curr-ReFT)
[7B-Curr-ReFT](https://huggingface.co/ZTE-AIM/7B-Curr-ReFT) | [Curr-ReFT-data](https://huggingface.co/datasets/ZTE-AIM/Curr-ReFT-data) | ClickCurr-ReFT proposes a Curriculum Reinforcement Finetuning strategy to enhance the out-of-distribution generalization and reasoning abilities. The curriculum paradim ensures steady progression. Moreover, a rejected sampling-based self-improvement is proposed to maintain the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. |
| 2025.0311 | LLM-R1 | CUHK & Ant Group | [Paper](https://arxiv.org/pdf/2503.07536)
[GitHub](https://github.com/TideDra/lmm-r1) | —— | —— | ClickLLM-R1 contributes the RMAVO algorithm to stably enhance LLM reasoning using RL, preventing reward hacking and achieving SOTA results with smaller models via an open-source implementation. It shows that reward model assistance in value optimization is key for stable RL. |
| 2025.0311 | Vision-R1 | ECNU & Xiaohongshu | [Paper](https://arxiv.org/abs/2503.06749)
[GitHub](https://github.com/Osilly/Vision-R1) | —— | [Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold) | ClickVision-R1 adapts the R1-Zero RL paradigm for VLMs, training them on visual reasoning chains. Its contribution is significantly boosting complex multimodal reasoning performance. It shows that RL applied to explicit reasoning steps effectively enhances VLM capabilities. |
| 2025.0315 | MetaSpatial | Northwestern University | [Paper](https://arxiv.org/abs/2503.18470)
[Project](https://github.com/PzySeere/MetaSpatial)
[GitHub](https://github.com/PzySeere/MetaSpatial) | —— | [3D_Reasoning](https://huggingface.co/datasets/zhenyupan/3d_layout_reasoning) | ClickMetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, realistic, and adaptive scene generation for applications in the metaverse, AR/VR, and game development. |
| 2025.0318 | R1-Searcher | RUC | [Paper](https://arxiv.org/pdf/2503.05592)
[GitHub](https://github.com/RUCAIBox/R1-Searcher) | [Llama-3.1-8B-instruct-RAG-RL](https://huggingface.co/XXsongLALA/Llama-3.1-8B-instruct-RAG-RL)
[Qwen-2.5-7B-base-RAG-RL](https://huggingface.co/XXsongLALA/Qwen-2.5-7B-base-RAG-RL) | [RAG-RL-Hotpotqa](https://huggingface.co/datasets/XXsongLALA/RAG-RL-Hotpotqa-with-2wiki) | ClickR1-Searcher enhances LLM reasoning via RL by training the model to perform adaptive model-based search during generation. This integration enables flexible thinking depth, improving reasoning efficiency and performance compared to fixed-step methods like R1-Zero. |
| 2025.0404 | MAYE | SJTU & GAIR | [Paper](https://arxiv.org/pdf/2504.02587)
[GitHub](https://github.com/GAIR-NLP/MAYE) |—— | [ManTle/MAYE](https://huggingface.co/datasets/ManTle/MAYE) | ClickMAYE is a transparent, reproducible framework and a comprehensive evaluation scheme for applying reinforcement learning (RL) to vision-language models (VLMs). Its codebase is developed entirely from scratch without relying on any existing RL toolkits. |
|2025.0x0x| | | [Paper]()
[GitHub]() | [hf models]() | [hf datasets]() | insights and contributions about RL for reasoning within 30 words. |

## Projects

### Large Language Models

#### 2025.0102, PRIME-RL

| Project or Paper | [Process Reinforcement through Implicit Rewards](https://arxiv.org/abs/2502.01456) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [PRIME-RL/PRIME](https://github.com/PRIME-RL/PRIME) |
| Backbone Model | Qwen2.5-Math-7B-Base |
| RL Algorithm | PPO/REINFORCE++/RLOO/GRPO + Online PRM |
| Training Dataset | [PRIME-RL/Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data), 150K |
| Rollout Configuration | 256 prompts * 4 responses Online Prompt Filtering (Accuracy in [0.2,0.8]) |
| Reward Function | Rule-based Rewards + Implicit Process Rewards |
| Policy Optimization | PPO loss, without KL loss |
| Benchmark | GPT-4o level on AIME 2024, AMC, MATH-500, Minerva Math, OlympiadBench, LeetCode, and LiveCodeBench |
| Core Insights | Implicit PRM efficiently addresses reward sparsity, distribution shift, and scalability by directly learning token-level rewards within a language model framework, eliminating the need for separate value models or prior training. |
| Additional Notes | |
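> The row above describes dense, token-level implicit process rewards on top of a rule-based outcome reward, plus online prompt filtering by accuracy. Below is a minimal, illustrative sketch of those two ideas (not code from the PRIME repo); function names, the 0.05 coefficient, and the reward mix are assumptions.

```python
def implicit_token_rewards(prm_logprobs, ref_logprobs, beta=0.05):
    """Dense process rewards: a scaled log-probability ratio between the implicit
    PRM (trained only from outcome labels) and a frozen reference model."""
    return [beta * (lp - lr) for lp, lr in zip(prm_logprobs, ref_logprobs)]

def keep_prompt(rollout_scores, low=0.2, high=0.8):
    """Online prompt filtering: keep prompts whose rollout accuracy falls inside a
    band, so batches are neither trivial nor hopeless."""
    acc = sum(rollout_scores) / len(rollout_scores)
    return low <= acc <= high

def total_reward(outcome_correct, prm_logprobs, ref_logprobs):
    """Rule-based 0/1 outcome reward plus the summed implicit process rewards."""
    return float(outcome_correct) + sum(implicit_token_rewards(prm_logprobs, ref_logprobs))

print(keep_prompt([1, 0, 0, 1]), total_reward(True, [-0.9, -0.4], [-1.0, -1.0]))
```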
#### 2025.0122, DeepSeek-R1

| Project or Paper | [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/pdf/2501.12948) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [deepseek-ai/DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) |
| Backbone Model | DeepSeek-V3-Base |
| RL Algorithm | GRPO |
| Training Dataset | Unclear |
| Rollout Configuration | 4~64 samples for each prompt, temperature = 0.6 |
| Reward Function | Rule-based Rewards |
| Policy Optimization | vanilla GRPO Loss |
| Benchmark | OpenAI-o1 level on AIME 2024, Codeforces, GPQA Diamond, MATH-500, MMLU, SWE-bench Verified. |
| Core Insights | RL can boost LLMs' reasoning. The DeepSeek-R1 series proves effective, and distilling reasoning into small models works, while challenges remain for other methods. |
| Additional Notes | |
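> Since GRPO appears in most recipes below, here is a minimal sketch of its core step: advantages are rewards standardized within the group of rollouts sampled for the same prompt, with no learned value function. Purely illustrative; it is not DeepSeek's implementation.

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: standardize each rollout's scalar reward against
    the mean and std of all rollouts sampled for the same prompt."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# 8 rollouts for one prompt, scored 1/0 by a rule-based verifier.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```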
#### 2025.0122, Kimi k1.5

| Project or Paper | [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/pdf/2501.12599) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [MoonshotAI/Kimi-k1.5](https://github.com/MoonshotAI/Kimi-k1.5) |
| Backbone Model | Kimi k-series model (closed source) |
| RL Algorithm | Online Policy Mirror Descent / Length Penalty Reward / Curriculum Sampling / Prioritized Sampling / Chain-of-Thought RM / Long2short RL |
| Training Dataset | Code: 1,000 contest problems; Math: 800k in-context learning / 800k chain-of-thought data; Vision: unknown number of real-world / synthetic visual reasoning / text-rendered data |
| Rollout Configuration | None |
| Reward Function | Outcome Reward+Length Penalty Reward |
| Policy Optimization | Online Policy Mirror Descent |
| Benchmark | Matching OpenAI’s o1 on AIME/MATH500/Codeforces/MathVista |
| Core Insights | Effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results, outperforming existing short-CoT models. |
| Additional Notes | |

#### 2025.0124, TinyZero

| Project or Paper | not applicable |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [Tiny-Zero](https://github.com/Jiayi-Pan/TinyZero) |
| Backbone Model | Qwen2.5-3B |
| RL Algorithm | GRPO |
| Training Dataset | countdown |
| Rollout Configuration | 1024 |
| Reward Function | 0/1 reward |
| Policy Optimization | vanilla GRPO: (KL Loss; Length Penalty; Token-level loss) |
| Benchmark | test set of countdown |
| Core Insights | Aha moment can be reproducible on puzzle-style data. |
| Additional Notes | |
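> A hedged sketch of what a 0/1 rule-based reward for the Countdown task can look like: the expression inside an `<answer>` tag must use exactly the given numbers and evaluate to the target. The tag format and checks are illustrative, not TinyZero's exact verifier.

```python
import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()
    if not re.fullmatch(r"[\d+\-*/() ]+", expr):        # arithmetic characters only
        return 0.0
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):                          # must use each number exactly once
        return 0.0
    try:
        return 1.0 if abs(eval(expr) - target) < 1e-6 else 0.0
    except (ZeroDivisionError, SyntaxError):
        return 0.0

print(countdown_reward("<answer>(6+4)*10</answer>", [4, 6, 10], 100))  # 1.0
```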
#### 2025.0125, SimpleRL

| Project or Paper | [7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient](https://hkust-nlp.notion.site/simplerl-reason) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [hkust-nlp/simpleRL-reason](https://github.com/hkust-nlp/simpleRL-reason) |
| Backbone Model | Qwen2.5-Math-7B |
| RL Algorithm | PPO based on OpenRLHF |
| Training Dataset | [MATH](https://huggingface.co/datasets/EleutherAI/hendrycks_math), 8k Level3-Level5 |
| Rollout Configuration | 128 prompts * 8 responses; Temperature = 0.6 |
| Reward Function | Rule-based Rewards |
| Policy Optimization | PPO loss with 0.01 KL coefficient |
| Benchmark | GPT-4o level on AIME2024, AMC2023, College Math, Gaokao2023en, GSM8k, MATH500, Minerva Math, and OlympiadBench. |
| Core Insights | Long Chain-of-Thought (CoT) reasoning and self-reflection can emerge in a 7B model with only a few thousand high-quality examples and rule-based rewards alone. |
| Additional Notes | |
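> SimpleRL relies on rule-based rewards over MATH problems. A minimal sketch of such a verifier (extract the final `\boxed{}` answer and string-match it) is below; the actual repo uses a more robust math-equivalence check, so treat this as illustrative only.

```python
import re

def extract_boxed(text: str):
    """Return the last \\boxed{...} answer (simplified: no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(completion: str, gold: str) -> float:
    """Rule-based 0/1 reward: exact match of the final boxed answer."""
    pred = extract_boxed(completion)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

print(math_reward("... so the result is \\boxed{42}.", "42"))  # 1.0
```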
#### 2025.0206, Demystify-long-CoT

| Project or Paper | [Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [eddycmu/demystify-long-cot](https://github.com/eddycmu/demystify-long-cot) |
| Backbone Model | Qwen2.5-Math-7B / Llama-3.1-8B |
| RL Algorithm | PPO (OpenRLHF) |
| Training Dataset | 7,500 training samples from MATH |
| Rollout Configuration | 512 prompts * 8 responses; temperature=0.7, top-p=0.95; Context Length: Prompt=2048, Gen=14384 tokens |
| Reward Function | Rule-based Reward + Cosine Length Reward + Repetition Penalty Reward |
| Policy Optimization | KL coefficient=0.01, gamma=1, lambda=1 |
| Benchmark | MATH, AIME, TheoremQA, MMLU-Pro-1k |
| Core Insights | Reward shaping can be used to stabilize and control CoT length while improving accuracy. The cosine reward can be tuned to incentivize various length-scaling behaviors; length rewards get hacked with enough compute, but this can be mitigated with a repetition penalty. |
| Additional Notes | |
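> The cosine length reward and repetition penalty referenced above, sketched with illustrative values: the reward interpolates between a length-0 value and a max-length value on a cosine schedule, and repeated n-grams are penalized so that length shaping is not hacked. Constants and directions here are tunable assumptions, not the paper's config.

```python
import math

def cosine_length_reward(is_correct: bool, gen_len: int, max_len: int,
                         r_correct=(2.0, 1.0), r_wrong=(-10.0, 0.0)):
    """r_* = (reward at length 0, reward at max length); cosine interpolation in between."""
    r0, rL = r_correct if is_correct else r_wrong
    t = min(gen_len, max_len) / max_len
    return rL + 0.5 * (r0 - rL) * (1.0 + math.cos(math.pi * t))

def repetition_penalty(tokens, n=4, penalty=-0.05):
    """Penalize every repeated n-gram in the sampled response."""
    seen, repeats = set(), 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        repeats += gram in seen
        seen.add(gram)
    return penalty * repeats

print(cosine_length_reward(True, 2000, 14000), repetition_penalty(list("abababab"), n=2))
```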
#### 2025.0210, DeepScaler

| Project or Paper | [DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [deepscaler](https://github.com/agentica-project/deepscaler) |
| Backbone Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| RL Algorithm | GRPO |
| Training Dataset | [Omni-MATH](https://omni-math.github.io/) and [Still](https://github.com/RUCAIBox/Slow_Thinking_with_LLMs) |
| Rollout Configuration | 128 * 16 (bs*n);temperature=0.6; Context Length:8K->16K->24K |
| Reward Function | 0/1 reward |
| Policy Optimization | vanilla GRPO: (KL Loss; Length Penalty; Token-level loss) |
| Benchmark | AIME 2024/ MATH 500 /AMC 2023 / Minerva Math/ OlympiadBench |
| Core Insights | 1. RL scaling can manifest in small models as well. 2. Iterative lengthening enables more effective length scaling. |
| Additional Notes | Iteratively increasing the max context for RL training. |
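> The "Additional Notes" row above is the key trick: context length is raised in stages. A schematic training driver follows; step counts and function names are invented for illustration and this is not the project's actual script.

```python
# Illustrative stage schedule for iterative context lengthening: 8K -> 16K -> 24K.
STAGES = [
    {"max_response_tokens": 8192,  "train_steps": 1000},
    {"max_response_tokens": 16384, "train_steps": 500},
    {"max_response_tokens": 24576, "train_steps": 500},
]

def train(stages):
    for stage in stages:
        print(f"training {stage['train_steps']} steps with "
              f"max_response_tokens={stage['max_response_tokens']}")
        # Each stage: sample rollouts truncated at the current cap, run GRPO updates,
        # then resume the next stage from this stage's checkpoint.

train(STAGES)
```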
#### 2025.0210, Logic-RL

| Project or Paper | [Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning](https://arxiv.org/abs/2502.14768) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [Unakar/Logic-RL](https://github.com/Unakar/Logic-RL) |
| Backbone Model | Qwen2.5-Math-7B / Qwen2.5-7B-Instruct |
| RL Algorithm | REINFORCE++ |
| Training Dataset | [Knights and Knaves (K&K) puzzles](https://huggingface.co/datasets/K-and-K/knights-and-knaves), 6.2k |
| Rollout Configuration | 8 prompts * 8 responses; Temperature = 0.7 |
| Reward Function | Rule-based Rewards |
| Policy Optimization | REINFORCE++ loss with 0.001 unbiased KL coefficient. |
| Benchmark | o3-mini-high level on K&K logic puzzle |
| Core Insights | With simple REINFORCE++ plus a KL loss, a 7B model develops advanced reasoning skills that are absent from the logic corpus and generalizes to other tasks like math. |
| Additional Notes | |

#### 2025.0210, OREAL

| Project or Paper | [Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning](https://arxiv.org/abs/2502.06781) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [InternLM/OREAL](https://github.com/InternLM/OREAL) |
| Backbone Model | Qwen2.5-7B / Qwen2.5-32B |
| RL Algorithm | OREAL (outcome-reward RL with behavior cloning on positive samples and reshaped rewards for negative samples) |
| Training Dataset | [OREAL-RL-Prompts](https://huggingface.co/datasets/internlm/OREAL-RL-Prompts), 4k |
| Rollout Configuration | 64 prompts * 16 responses; Temperature = 1.0; Online Accuracy Filtering; retain only one correct and one wrong solution |
| Reward Function | Outcome Reward Signal by rule-based verifier and Qwen2.5-72B-Instruct + Token level Reward |
| Policy Optimization | OREAL loss with 0.01 KL coefficient |
| Benchmark | R1-level on MATH500, AIME2024, AIME2025-I, LiveMath, Olympiad |
| Core Insights | Behavior cloning on positive samples is sufficient for optimal learning, while reward reshaping for negative samples is needed for consistent gradient estimation. A token-level reward model can be trained to address sparse rewards in long reasoning chains. |
| Additional Notes | |

#### 2025.0217, LIMR

| Project or Paper | [LIMR: Less is More for RL Scaling](https://arxiv.org/abs/2502.11886) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [GAIR-NLP/LIMR](https://github.com/GAIR-NLP/LIMR) |
| Backbone Model | Qwen2.5-Math-7B |
| RL Algorithm | PPO based on OpenRLHF |
| Training Dataset | [LIMR](https://huggingface.co/datasets/GAIR/LIMR), 1.4k |
| Rollout Configuration | 1024 prompts * 8 responses; Temperature = 1.2 |
| Reward Function | Rule-based Rewards |
| Policy Optimization | PPO loss with 0.01 KL coefficient |
| Benchmark | GPT-4o level on AIME2024, MATH500, AMC2023 |
| Core Insights | Precisely selected data may be the key to unlocking the enhanced reasoning capabilities of LLMs. |
| Additional Notes | The author uses the trained model's average reward curve as a reference for measuring sample effectiveness. |

#### 2025.0217, Open-Reasoner-Zero

| Project or Paper | [Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [Open-Reasoner-Zero/Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero) |
| Backbone Model | Qwen2.5-7B, Qwen2.5-32B |
| RL Algorithm | PPO based on OpenRLHF |
| Training Dataset | [ORZ-MATH](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data), 57k |
| Rollout Configuration | 128 prompts * 64 responses; Temperature = 1.0 |
| Reward Function | Rule-based Rewards |
| Policy Optimization | PPO loss without KL loss |
| Benchmark | GPT-4o level on GPQA-Diamond, MATH500, AIME2024 |
| Core Insights | Vanilla PPO with GAE and rule-based rewards, without any KL loss, is sufficient to scale up response length and benchmark performance. |
| Additional Notes | |
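> Open-Reasoner-Zero reports that plain PPO with GAE and a sparse rule-based reward (and no KL term) is enough. Below is a minimal GAE sketch over one response, with the verifier's 0/1 signal on the final token; the gamma=1, lambda=1 defaults follow values quoted in these recipes, everything else is illustrative.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one response (token-level)."""
    advantages, running, next_value = [], 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages.append(running)
        next_value = values[t]
    return advantages[::-1]

# Sparse outcome reward: only the last token carries the verifier's 0/1 signal.
print(gae_advantages(rewards=[0.0, 0.0, 0.0, 1.0], values=[0.2, 0.3, 0.5, 0.8]))
```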
#### 2025.0225, SWE-RL

| Project or Paper | [SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution](https://arxiv.org/abs/2502.18449) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [facebookresearch/swe-rl](https://github.com/facebookresearch/swe-rl) |
| Backbone Model | Llama-3.3-70B-Instruct |
| RL Algorithm | GRPO |
| Training Dataset | Publicly available repositories |
| Rollout Configuration | 32 prompts * 16 rollouts |
| Reward Function | Similarity Function (`difflib.SequenceMatcher`) |
| Policy Optimization | Normalized Rewards for Advantage Calculation |
| Benchmark | GPT-4o level on SWE-bench Verified (41%) |
| Core Insights | SWE-RL enhances LLMs' code reasoning through RL using open-source software evolution data, achieving state-of-the-art results in software engineering tasks and demonstrating generalized reasoning capabilities beyond coding. |
| Additional Notes | |
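> The reward row above names `difflib.SequenceMatcher`. A minimal sketch of a similarity-based patch reward built on it; the paper's full scheme (format checks, penalties for unparsable outputs, etc.) may differ, so take this as illustrative.

```python
import difflib

def patch_similarity_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Continuous reward in [0, 1]: sequence similarity between the generated patch
    and the ground-truth patch."""
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()

pred = "-    return a + b\n+    return a - b\n"
gold = "-    return a + b\n+    return a - b  # fix sign\n"
print(round(patch_similarity_reward(pred, gold), 3))
```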
#### 2025.0303, VC-PPO

| Project or Paper | [What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the Secret](https://arxiv.org/pdf/2503.01491) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | N/A |
| Backbone Model | Qwen2.5-32B-Base |
| RL Algorithm | VC-PPO |
| Training Dataset | A compilation of questions from all past AIME competitions (excluding the last two years), supplemented with artificially constructed challenging mathematical problems. |
| Rollout Configuration | N/A |
| Reward Function | Rule-based Rewards |
| Policy Optimization | PPO loss, Value Estimate with Decoupled-GAE |
| Benchmark | AIME 2024, GPQA, CodeForces |
| Core Insights | VC-PPO addresses PPO’s failure in long CoT tasks by pretraining the value model to correct initialization bias and decoupling GAE between the actor and critic to mitigate reward signal decay. |
| Additional Notes | |
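> A sketch of the decoupled-GAE idea described above: the critic is trained toward low-bias targets (lambda near 1, so the sparse reward does not decay over very long sequences) while the actor uses a smaller lambda for lower variance. Lambda values and names here are illustrative, not the paper's configuration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    adv, running, next_v = [], 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv.append(running)
        next_v = values[t]
    return adv[::-1]

def decoupled_gae(rewards, values, lam_actor=0.95, lam_critic=1.0):
    """Different lambdas for the two roles: actor advantages vs. critic regression targets."""
    actor_adv = gae(rewards, values, lam=lam_actor)
    critic_targets = [a + v for a, v in zip(gae(rewards, values, lam=lam_critic), values)]
    return actor_adv, critic_targets

rewards = [0.0] * 5 + [1.0]
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
print(decoupled_gae(rewards, values))
```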
#### 2025.0306, LCPO-L1

| Project or Paper | [L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning](https://www.arxiv.org/pdf/2503.04697) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [cmu-l3/l1](https://github.com/cmu-l3/l1) |
| Backbone Model | DeepScaleR-1.5B-Preview |
| RL Algorithm | GRPO |
| Training Dataset | [DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset), 40k |
| Rollout Configuration | 128 prompts |
| Reward Function | correctness reward + length penalty |
| Policy Optimization | GRPO loss + length penalty loss |
| Benchmark | AIME 2025, MATH, AMC, Olympiad-Bench, GPQA, LSAT, MMLU |
| Core Insights | Addresses the uncontrolled reasoning-length issue in language models. LCPO uses reinforcement learning to optimize for both accuracy and adherence to user-specified length constraints. By training with a reward that combines correctness and length-related terms, it enables precise length control and a better trade-off between computational cost and accuracy across reasoning tasks. |
| Additional Notes | |
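> The "correctness reward + length penalty" row above can be pictured as below (L1-Exact style: penalize the absolute gap to the length requested in the prompt). The alpha value is illustrative, not the paper's.

```python
def lcpo_exact_reward(is_correct: bool, gen_len: int, target_len: int, alpha: float = 3e-4):
    """Correctness minus a penalty that grows with the distance between the generated
    length and the user-specified target length."""
    return float(is_correct) - alpha * abs(gen_len - target_len)

print(lcpo_exact_reward(True, gen_len=1800, target_len=2048))
```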
#### 2025.0310, MetaRL

| Project or Paper | [Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning](https://arxiv.org/pdf/2503.07572) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [CMU-AIRe/MRT](https://github.com/CMU-AIRe/MRT) |
| Backbone Model | DeepSeek-R1-Distill-Qwen-32B/7B/1.5B / DeepScaleR-1.5B-Preview / Llama-3.1-8B/3B-Instruct |
| RL Algorithm | MRT |
| Training Dataset | None |
| Rollout Configuration | 256 prompts * 4 responses, temperature = 0.7 |
| Reward Function | 0/1 reward + dense reward |
| Policy Optimization | SFT Loss + Dense Reward Bonus Loss |
| Benchmark | AIME 2024, AIME 2025, AMC 2023, MinervaMATH, MATH500 |
| Core Insights | Treats test-time computation of LLMs as a meta-RL problem. A progress-based reward function guides the model, and the policy is optimized to balance exploration and exploitation, improving the efficiency and performance of LLMs at test time. |
| Additional Notes | |

#### 2025.0318, TOPR

| Project or Paper | [Tapered Off-Policy REINFORCE: Stable and Efficient Reinforcement Learning for LLMs](https://arxiv.org/pdf/2503.14286v2) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | N/A |
| Backbone Model | Llama 3 8B/70B |
| RL Algorithm | Tapered Off-Policy REINFORCE (TOPR) |
| Training Dataset | GSM8K and MATH |
| Rollout Configuration | 64/8 solutions each question for GSM8k/MATH, respectively |
| Reward Function | Implicit reward as in DPO (contrastive learning with preference data) |
| Policy Optimization | Optimization with off-policy samples and without KL penalty |
| Benchmark | The performance of 8B language models can match with 70B-parameter model's on GSM8K and MATH |
| Core Insights | Properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the “wasted inference” that comes with discarding negative examples. |
| Additional Notes | This method may speed up learning while maintaining stable learning dynamics, without the use of KL regularization and online sampling. |

#### 2025.0318, DAPO

| Project or Paper | [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://arxiv.org/pdf/2503.14476) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [BytedTsinghua-SIA/DAPO](https://github.com/BytedTsinghua-SIA/DAPO) |
| Backbone Model | Qwen2.5-32B-Base |
| RL Algorithm | DAPO |
| Training Dataset | [BytedTsinghua-SIA/DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k), 17K |
| Rollout Configuration | 512 prompts * 16 responses Dynamic Sampling |
| Reward Function | Rule-based Rewards + Length-Aware Penalty Reward (Overlong Filtering strategy) |
| Policy Optimization | DAPO loss, without KL loss |
| Benchmark | DeepSeek-R1-Zero-Qwen-32B level on AIME 2024 |
| Core Insights | DAPO introduces four key techniques: Clip-Higher to promote diversity and prevent entropy collapse; Dynamic Sampling to enhance training efficiency and stability; Token-Level Policy Gradient Loss to refine long-chain reasoning; and Overlong Reward Shaping to reduce reward noise and stabilize training. |
| Additional Notes | |
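> Two of DAPO's four techniques are easy to picture in a few lines: Clip-Higher (an asymmetric, larger upper clip range so low-probability tokens can grow, countering entropy collapse) and a token-level loss (average over all tokens in the batch rather than per sequence). The sketch below is illustrative; the 0.2/0.28 epsilon values are assumptions, not read from the repo's config.

```python
import math

def dapo_policy_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with an asymmetric clip range, averaged over tokens,
    and no KL term."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)   # token-level mean across the whole batch

print(dapo_policy_loss([-0.7, -2.3, -0.1], [-0.9, -2.0, -0.2], [1.2, 1.2, -0.5]))
```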
#### 2025.0320, Open RS

| Project or Paper | [Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t](https://arxiv.org/pdf/2503.16219) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [knoveleng/open-rs](https://github.com/knoveleng/open-rs) |
| Backbone Model | DeepSeek-R1-Distill-Qwen-1.5B |
| RL Algorithm | GRPO |
| Training Dataset | [open-s1](https://huggingface.co/datasets/knoveleng/open-s1), 18.6k; [open-deepscaler](https://huggingface.co/datasets/knoveleng/open-deepscaler), 21k; [open-rs](https://huggingface.co/datasets/knoveleng/open-rs), 7k |
| Rollout Configuration | 24 prompts * 6 responses, Temperature = 0.7 |
| Reward Function | Rule-based Rewards with Cosine Reward assigning higher rewards to shorter but correct response. |
| Policy Optimization | GRPO |
| Benchmark | AIME2024, AMC, MATH500, Minerva Math and OlympiadBench |
| Core Insights | 1. High-quality data can boost performance compared with large amounts of low-quality data. 2. The difficulty of the training data influences the training results. 3. Cosine rewards can stabilize completion lengths, improving training consistency. |
| Additional Notes | |

#### 2025.0321, Oat-Zero

| Project or Paper | [Understanding R1-Zero-Like Training: A Critical Perspective](https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [sail-sg/understand-r1-zero](https://github.com/sail-sg/understand-r1-zero) |
| Backbone Model | Qwen2.5-1.5B |
| RL Algorithm | Dr. GRPO (fixes GRPO’s bias in optimization) |
| Training Dataset | Questions sampled from the MATH; |
| Rollout Configuration | N/A |
| Reward Function | Rule-based Rewards |
| Policy Optimization | Dr. GRPO Loss (remove two normalization terms in GRPO) |
| Benchmark | AIME2024, AMC, MATH500, Minerva Math and OlympiadBench |
| Core Insights | 1. The DeepSeek-V3-Base model exhibits significant reasoning capabilities, termed an “aha moment,” even prior to reinforcement learning fine-tuning. 2. GRPO introduces an optimization bias that artificially increases response length during training, particularly affecting incorrect outputs. |
| Additional Notes | |
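> The Dr. GRPO fix mentioned above removes two normalization terms from GRPO. A minimal side-by-side sketch (illustrative, not the Oat codebase): drop the per-group std division in the advantage, and drop the per-response length normalization in the loss aggregation.

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    mu, sd = statistics.mean(group_rewards), statistics.pstdev(group_rewards)
    return [(r - mu) / (sd + eps) for r in group_rewards]

def dr_grpo_advantages(group_rewards):
    """No std division: prompts with a tiny reward spread are no longer up-weighted."""
    mu = statistics.mean(group_rewards)
    return [r - mu for r in group_rewards]

def grpo_aggregate(per_token_losses):
    """Mean over tokens per response, then mean over responses (per-response length normalization)."""
    return statistics.mean(statistics.mean(toks) for toks in per_token_losses)

def dr_grpo_aggregate(per_token_losses):
    """Divide by a single global constant (here the total token count) instead of each
    response's own length, removing the bias that inflates long incorrect answers."""
    total_tokens = sum(len(toks) for toks in per_token_losses)
    return sum(sum(toks) for toks in per_token_losses) / total_tokens

print(dr_grpo_advantages([1, 0, 0, 1]))
```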
#### 2025.0407, VAPO

| Project or Paper | [VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks](https://arxiv.org/pdf/2504.05118) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | -- |
| Backbone Model | Qwen-32B |
| RL Algorithm | Value-based Augmented PPO |
| Training Dataset | N/A |
| Rollout Configuration | N/A |
| Reward Function | Rule-based Rewards |
| Policy Optimization | PPO |
| Benchmark | AIME 2024 |
| Core Insights | VAPO integrates clip-higher, token-level loss, value-pretraining, decoupled-GAE, self-imitation learning and group-sampling. |
| Additional Notes | First value-based RL training framework to significantly outperform value-free methods on long-CoT tasks. |

### Multimodal and Agents

#### 2025.0128, open-r1-multimodal

| Project or Paper | [EvolvingLMMs-Lab/open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [EvolvingLMMs-Lab/open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) |
| Backbone Model | Qwen2-VL 2B/7B |
| RL Algorithm | GRPO |
| Training Dataset | [multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified), 7.69K; |
| Rollout Configuration | 1 (prompts + images) * 8 responses; Temperature=0.9; |
| Reward Function | Rule-based Rewards (Choice, Format) |
| Policy Optimization | PPO Loss |
| Benchmark | MMMU:
2B: +4.02% vs w./ reasoning, -4.48% vs base;
7B: 7.5% vs w./ reasoning, -1.2% vs base;
Mathvista-mini:
2B: +0.8% vs w./ reasoning, -2.2% vs base;
7B: -0.3% vs w./ reasoning, +3.5% vs base; |
| Core Insights | |
| Additional Notes | |

#### 2025.0202, R1-V

| Project or Paper | [RLVR in Vision Language Models: Findings, Questions and Directions](https://deepagent.notion.site/rlvr-in-vlms) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [Deep-Agent/R1-V](https://github.com/Deep-Agent/R1-V) |
| Backbone Model | Visual Counting/Complex Visual Reasoning: Qwen2-VL-2B-Instruct;
Geometry Reasoning: Qwen2.5-VL-7B-Instruct; |
| RL Algorithm | GRPO |
| Training Dataset | Visual Counting: [Clevr CoGenT-A](https://huggingface.co/datasets/leonardPKU/clevr_cogen_a_train), 70K
Geometry Reasoning: [GeoQA-Train](https://huggingface.co/datasets/leonardPKU/GEOQA_R1V_Train_8K), 8K
Complex Visual Reasoning: [Clevr_CoGenT_TrainA_R1](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_R1), 37.8K |
| Rollout Configuration | 1 (prompts + images) * 8 responses; Temperature=1.0; |
| Reward Function | Rule-based Rewards (Accuracy: Number/bool, Format) |
| Policy Optimization | PPO Loss |
| Benchmark | Visual Counting (Acc.): Clevr CoGenT-B: 46%, comparable vs base and CoT+SFT models; SuperClevr: 40%, 11% higher than base and CoT+SFT models;
Geometry Reasoning (Acc., no CoT data): GeoQA-Test: 24%, 1% higher than base and SFT models;
Complex Visual Reasoning: SuperClevr: 53.48%, 49.28% higher than base and CoT+SFT models; |
| Core Insights | |
| Additional Notes | |

#### 2025.0215, VLM-R1

| Project or Paper | [VLM-R1: A stable and generalizable R1-style Large Vision-Language Model](https://om-ai-lab.github.io/index.html) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [om-ai-lab/VLM-R1](https://github.com/om-ai-lab/VLM-R1) |
| Backbone Model | Qwen2.5-VL-3B |
| RL Algorithm | GRPO |
| Training Dataset | [COCO](https://cocodataset.org/#download), 83K, no improvement for both SFT and RL models
[Description Detection Dataset](https://github.com/shikras/d-cube), 24K; |
| Rollout Configuration | 8 (prompts + images) * 8 responses; Temperature=0.9; |
| Reward Function | Rule-based Rewards (IoU, Format) |
| Policy Optimization | PPO Loss + KL Loss (default 0.04) |
| Benchmark | OVDEval: 6.55%, 4.51% higher than base and SFT models in NMS-AP;
COCO(filter out images with more than 10 bbox): 6.1%, 2.6% higher than base and SFT models in MAP; |
| Core Insights | |
| Additional Notes | |
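> The IoU-based reward listed above, sketched for a single predicted box; the weights and the format bonus are illustrative assumptions, not VLM-R1's exact reward.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_reward(pred_box, gt_box, follows_format: bool,
                     iou_weight=1.0, format_weight=1.0):
    """Rule-based reward for referring-expression grounding: IoU plus a format bonus."""
    return iou_weight * iou(pred_box, gt_box) + format_weight * float(follows_format)

print(grounding_reward([10, 10, 60, 60], [12, 8, 58, 62], True))
```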
#### 2025.0303, Visual-RFT

| Project or Paper | [Visual-RFT: Visual Reinforcement Fine-Tuning](https://arxiv.org/pdf/2503.01785) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [Liuziyu77/Visual-RFT](https://github.com/Liuziyu77/Visual-RFT) |
| Backbone Model | Qwen2-VL-2/7B |
| RL Algorithm | GRPO |
| Training Dataset | [ViRFT Datasets](https://huggingface.co/collections/laolao77/virft-datasets-67bc271b6f2833eccc0651df), 6K/6k/32/408/400/784/148; |
| Rollout Configuration | 2 * 8 responses; Temperature=1.0; |
| Reward Function | Rule-based Rewards (Accuracy, IoU, Format); |
| Policy Optimization | PPO Loss |
| Benchmark | Qwen2-VL-2B:
Fine-grained classification, Avg. Acc.(Flower102, Pets37, FGVC-Aircraft, Car196):
1-shot +24.3%, +28.6%, 2-shot +27.5%, +24.7, 4-shot +25.9% , +26.3%, 8-shot +29.1%, +24.8%, 16-shot +29.3%, +21.3%, compared to base and SFT;
Object Detection (COCO), mAP:
1-shot +14.0%, +14.1%, 2-shot +21.9%, +20.5%, 4-shot +21.0%, +15.4%, 8-shot +27.8%, 17.2%, 16-shot +27.2%, +15.5%, compared to base and SFT;
Rare Object Detection (LVIS), mAP:
10-shot, +15.4%, +9.4%, compared to base and SFT;
Open Vocabulary Object Detection, mAP:
COCO: +21.5%, +17.7%, compared to base and SFT;
LVIS: +18.0%, +13.1%, compared to base and SFT;
Reasoning Grounding (LISA), mIoU:
+10.7%, +9.3%, compared to base and SFT;
Qwen2-VL-7B:
Object Detection (COCO), mAP:
4-shot +11.3%, +10.2%, compared to base and SFT;
Rare Object Detection (LVIS), mAP:
10-shot, +18.4%, +6.2%, compared to base and SFT;
Open Vocabulary Object Detection, mAP:
COCO: +9.5%, +10.1%, compared to base and SFT;
LVIS: +14.7%, +6.4%, compared to base and SFT;
Reasoning Grounding (LISA), mIoU:
+3.5%, +4.8%, compared to base and SFT; |
| Core Insights | |
| Additional Notes | |

#### 2025.0306, r1-vlm

| Project or Paper | [GRPO for vision - Teaching an LLM to reason about images](https://www.groundlight.ai/blog/visual-reasoning-models) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [groundlight/r1_vlm](https://github.com/groundlight/r1_vlm) |
| Backbone Model | Qwen2.5-VL-3B-Instruct |
| RL Algorithm | GRPO |
| Training Dataset | Message Decoding: [message-decoding-words-and-sequences-r1](https://huggingface.co/datasets/sunildkumar/message-decoding-words-and-sequences-r1), 27K
Message Decoding-Single Word [message-decoding-words-r1](https://huggingface.co/datasets/sunildkumar/message-decoding-words-r1), 10K
Digit Recognition: [digit-recognition-r1 ](https://huggingface.co/datasets/sunildkumar/digit-recognition-r1), 2K |
| Rollout Configuration | 1 (prompts + images) * 9 responses; Temperature=1.0; |
| Reward Function | Rule-based Rewards (Decoding, Correctness, Format) |
| Policy Optimization | PPO Loss + KL Loss (default 0.01) |
| Benchmark | - |
| Core Insights | |
| Additional Notes | |

#### 2025.0310, VisualThinker-R1-Zero

| Project or Paper | [R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model](https://arxiv.org/pdf/2503.05132) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [turningpoint-ai/VisualThinker-R1-Zero](https://github.com/turningpoint-ai/VisualThinker-R1-Zero) |
| Backbone Model | Qwen2-VL-2B |
| RL Algorithm | GRPO |
| Training Dataset | [SAT](https://huggingface.co/datasets/array/SAT), 218K; |
| Rollout Configuration | 4 (prompts + images) * 8 responses; Temperature=1.0; |
| Reward Function | Rule-based Rewards (Accuracy-String Match, Format); |
| Policy Optimization | PPO Loss + KL Loss (default 0.04) |
| Benchmark | CV-Bench (Choice): +25% vs base, +10.83% vs SFT;
BLINK: +46.44 vs base, +0.75% vs SFT;
VSR: +62.32% vs base, +26.53% vs SFT; |
| Core Insights | |
| Additional Notes | |
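> Several of the VLM recipes above use the same pair of rule-based rewards: a format reward for the `<think>...</think><answer>...</answer>` template and a string-match accuracy reward on the answer tag. A small illustrative sketch (the regexes and case handling are assumptions):

```python
import re

THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the think/answer template, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """Case-insensitive string match on the content of the <answer> tag."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip().lower() == gold.strip().lower() else 0.0

out = "<think>the object left of the chair is a lamp</think><answer>lamp</answer>"
print(format_reward(out), accuracy_reward(out, "Lamp"))
```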
#### 2025.0310, MM-Eureka

| Project or Paper | [MM-Eureka: Exploring Visual Aha Moment with Rule-Based Large-Scale Reinforcement Learning](https://arxiv.org/pdf/2503.07365) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [ModalMinds/MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) |
| Backbone Model | InternVL2.5-Pretrained-38B |
| RL Algorithm | GRPO |
| Training Dataset | [MM-Eureka-Dataset](https://huggingface.co/datasets/FanqingM/MM-Eureka-Dataset), 55K; |
| Rollout Configuration | 128 (prompts + images) * 8 responses; Temperature=1.0; |
| Reward Function | Rule-based Rewards (Accuracy-String Match, Format); |
| Policy Optimization | PPO Loss |
| Benchmark | +9.2%, +4.7% compared with the base model on OlympiadBench and K12, respectively; |
| Core Insights | |
| Additional Notes | |

#### 2025.0310, Curr_ReFT

| Project or Paper | [Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning](https://arxiv.org/pdf/2503.07065) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [ding523/Curr_REFT](https://github.com/ding523/Curr_REFT) |
| Backbone Model | Qwen2.5-VL-3/7B |
| RL Algorithm | GRPO |
| Training Dataset | [RefCOCO](https://github.com/lichengunc/refer), 3K for training;
[Math360K](https://huggingface.co/datasets/Zhiqiang007/MathV360K), [Geo170K](https://huggingface.co/datasets/Luckyjhg/Geo170K), 3K for training; |
| Rollout Configuration | 1 * 4 responses; Temperature=1.0; |
| Reward Function | Difficulty-aware Rule-based Rewards (Accuracy, IoU, Format); |
| Policy Optimization | PPO Loss |
| Benchmark | ID (In-Distribution), compared to base and SFT:
Math (Math360K+Geo170K, 1K for testing): -3B +11.0%, +8.8%, -7B +11.2%, +4.4%;
Detection (RefCOCO, 1K for testing): -3B +58.0%, +14.6%, -7B +51.3%, +2.5%;
Classification: (RefCOCO, 1K for testing) -3B +31.9%, +21.3%, -7B +44.6%, +4.2%;
OOD(Out-of-Distribution), compared to base and SFT:
Math (CLEVER-70K, 0.5K for testing): -3B +55.9%, +42.9%, -7B +46.6%, +21.6%;
Detection (Refgta, 1K for testing): -3B +43.3%, +13.3% -7B +41.7%, +28.3%;
Classification: (Pascal-VOC, 1K for testing) -3B +15.4%, +18.0%, -7B +12.5%, +6.5%; |
| Core Insights | |
| Additional Notes | |

#### 2025.0315, MetaSpatial

| Project or Paper | [MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse](https://arxiv.org/abs/2503.18470) |
| --------------------- | ------------------------------------------------------------ |
| GitHub | [PzySeere/MetaSpatial](https://github.com/PzySeere/MetaSpatial) |
| Backbone Model | Qwen2.5-VL-7B |
| RL Algorithm | GRPO |
| Training Dataset | [3D-Reasoning-Dataset](https://huggingface.co/datasets/zhenyupan/3d_layout_reasoning), 50; |
| Rollout Configuration | 16 (prompts + images) * 4 responses; Temperature=1.0; |
| Reward Function | Rule-based Rewards (Format Reward, Physics Reward, Rendering-based Reward); |
| Policy Optimization | PPO Loss |
| Benchmark | +74% compared with base model on test-set of 3d-reasoning dataset; |
| Core Insights | Injects a physics-based reward and a GPT-4o-based rendering-evaluation reward. |
| Additional Notes | |

## Contributing
If you have any updates or improvements for this document, please feel free to submit a **Pull Request**. Thank you!
#### 202x.0x0x, Template

| Project or Paper | [Project name or Paper title]() |
| :-------------------- | :------------------------------------------------------- |
| GitHub | [Username/Project]() |
| Backbone Model | (Base / Instruct / Reasoning; HF Model) |
| RL Algorithm | (PPO / GRPO / RLOO / REINFORCE++; OpenRLHF / Verl / Trl) |
| Training Dataset | (Size / Source / HF Dataset) |
| Rollout Configuration | (Batch Size * N Samples ; Temperature; Dynamic Sampling) |
| Reward Function | (Outcome; Process; Repetition & Length) |
| Policy Optimization | (KL Loss; Length Penalty; Token-level loss) |
| Benchmark | (MATH/GPQA; R1 level; GPT-4o level) |
| Core Insights | (Empirical / Theoretical / Insightful Curves) |
| Additional Notes | (e.g., code snippet) |

## Citation
If you find our repository useful in your research, please star us ⭐ and consider citing:
```tex
@misc{zhang2025TripleR,
title={Awesome RL Recipes for Reasoning},
  author={Kaiyan Zhang and Yuchen Fan and Yuxin Zuo and Guoli Jia and Xingtai Lv and Xuekai Zhu and Ermo Hua and Ning Ding and Biqing Qi and Bowen Zhou},
  year={2025},
  howpublished={\url{https://github.com/TsinghuaC3I/awesome-rl-reasoning-recipes}},
  note={GitHub Repository},
}
```