
Awesome RL Reasoning Recipes ("Triple R")
https://github.com/tsinghuac3i/awesome-rl-reasoning-recipes


  • Overview

    • Large Language Models

      • Paper - l3/l1) · [More](#lcpol1) | HF Model: [L1-Qwen-1.5B-Max](https://huggingface.co/l3lab/L1-Qwen-1.5B-Max), [L1-Qwen-1.5B-Exact](https://huggingface.co/l3lab/L1-Qwen-1.5B-Exact) | HF Dataset: —— | TL;DR: L1 introduces Length Controlled Policy Optimization (LCPO), an RL method enabling precise control over a reasoning model's thinking time (output length) via prompt instructions. It shows that RL effectively controls reasoning duration and unexpectedly enhances even short-chain reasoning capabilities (a length-reward sketch follows this list).
      • Paper - Policy REINFORCE) introduces a novel RL algorithm for fine-tuning LLMs. Its core contribution is using asymmetric, tapered importance sampling to modify REINFORCE, enabling stable and efficient off-policy learning. This allows reusing past data effectively without the instability often seen in other methods and without needing explicit KL regularization.
      • Paper - SIA/DAPO) · [More](#dapo) | HF Model: —— | HF Dataset: [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) | TL;DR: The DAPO algorithm introduces four key techniques (Clip-Higher, Dynamic Sampling, Token-Level Loss, Overlong Shaping) for stable and efficient long-chain-of-thought RL training, surpassing previous SoTA results efficiently (see the Clip-Higher sketch after this list).
      • Paper - RL introduces a novel RL algorithm for multi-turn collaborative reasoning LLM agents. Its core contribution is improving credit assignment across long interactions by using an asymmetric actor-critic structure where the critic leverages additional training-time information for step-wise evaluation.
      • Paper - rs) · [More](#open-rs) | HF Model: [Open-RS1](https://huggingface.co/knoveleng/Open-RS1), [Open-RS2](https://huggingface.co/knoveleng/Open-RS2), [Open-RS3](https://huggingface.co/knoveleng/Open-RS3) | HF Dataset: [open-s1](https://huggingface.co/datasets/knoveleng/open-s1), [open-deepscaler](https://huggingface.co/datasets/knoveleng/open-deepscaler), [open-rs](https://huggingface.co/datasets/knoveleng/open-rs) | TL;DR: The study investigates the potential of RL to improve reasoning in small LLMs. Using high-quality training data, it demonstrates rapid reasoning gains, with accuracy improvements on mathematical reasoning benchmarks, and highlights RL-based fine-tuning of small LLMs as a cost-effective alternative to large-scale approaches.
      • Paper - sg/understand-r1-zero) · [More](#oat-zero) | HF Model: [Qwen2.5-Math-7B-Oat-Zero](https://huggingface.co/sail/Qwen2.5-Math-7B-Oat-Zero), [Qwen2.5-Math-1.5B-Oat-Zero](https://huggingface.co/sail/Qwen2.5-Math-1.5B-Oat-Zero), [Llama-3.2-3B-Oat-Zero](https://huggingface.co/sail/Llama-3.2-3B-Oat-Zero) | HF Dataset: [MATH](https://huggingface.co/datasets/EleutherAI/hendrycks_math) | TL;DR: This work critically analyzes R1-Zero-like RL training, showing that base-model properties and GRPO algorithm biases (e.g., length bias) significantly affect outcomes. It contributes the efficient, unbiased Dr. GRPO algorithm and an open-source recipe/codebase for better understanding and reproduction (see the Dr. GRPO advantage sketch after this list).
      • Paper | HF Model: [FastCuRL-1.5B-Preview](https://huggingface.co/Nickyang/FastCuRL-1.5B-Preview) | HF Dataset: [FastCuRL](https://huggingface.co/datasets/Nickyang/FastCuRL) | TL;DR: FastCuRL introduces a simple, efficient curriculum RL method for LLMs. Its core contribution uses target perplexity to dynamically scale the standard RL loss (like PPO), creating an effective curriculum without complex reward models or auxiliary components, enabling faster, more stable training.
      • Paper | HF Model: [Z1-7B](https://huggingface.co/efficientscaling/Z1-7B) | HF Dataset: [Z1-Code-Reasoning-107K](https://huggingface.co/datasets/efficientscaling/Z1-Code-Reasoning-107K) | TL;DR: This paper proposes training LLMs on code-related reasoning trajectories using a curated dataset and a "Shifted Thinking Window" technique. This allows models to reduce excessive thinking tokens, achieving efficient test-time scaling and generalizing reasoning abilities.
      • Paper - lime/verl) | HF Model: —— | HF Dataset: [DeepScaleR_Difficulty](https://huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty) | TL;DR: AdaRFT proposes Adaptive Curriculum Reinforcement Finetuning to improve the efficiency of LLM reasoning training. It dynamically adjusts task difficulty based on recent reward signals, accelerating learning by keeping challenges optimally balanced. Experiments on competition math benchmarks show up to 2x fewer steps and improved accuracy, using standard PPO with minimal changes (see the curriculum-sampling sketch after this list).
      • Paper - ai/DeepSeek-R1/tree/main) · [More](#deepseek-r1) | HF Model: [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), [DeepSeek-R1-Zero](https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero) | HF Dataset: —— | TL;DR: DeepSeek-R1's core contribution is demonstrating large-scale RL from scratch (600B+) without SFT, achieving emergent "aha moments" (self-reflective reasoning) and matching OpenAI o1's performance at roughly 1/30 of the cost.
      • Paper - k1.5) · [More](#kimi-k1.5) | HF Model: —— | HF Dataset: —— | TL;DR: Kimi k1.5 introduces a simplified RL framework that leverages long-context scaling (128k tokens) and improved policy optimization (e.g., online mirror descent) to enhance reasoning and multimodal performance.
      • Paper - RL/PRIME) · [More](#primerl) | HF Model: [Eurus-2-7B-PRIME](https://huggingface.co/PRIME-RL/Eurus-2-7B-PRIME), [Eurus-2-7B-PRIME-Zero](https://huggingface.co/PRIME-RL/Eurus-2-7B-PRIME-Zero) | HF Dataset: [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) | TL;DR: PRIME offers scalable reinforcement learning by using dense, token-level implicit rewards derived only from final outcomes. This bypasses costly step-by-step annotations, providing fine-grained feedback to improve sample efficiency and reasoning.
      • Twitter - Pan/TinyZero) · [More](#tinyzero) | HF Model: —— | HF Dataset: [Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | TL;DR: TinyZero's core contribution is demonstrating that smaller language models (e.g., 1.5B-3B parameters) can develop complex reasoning, search, and self-verification abilities through reinforcement learning, replicating capabilities of larger models like DeepSeek R1-Zero at extremely low cost (<$30).
      • GitHub | HF Model: [OpenR1-Qwen-7B](https://huggingface.co/open-r1/OpenR1-Qwen-7B), [OlympicCoder-7B](https://huggingface.co/open-r1/OlympicCoder-7B), [OlympicCoder-32B](https://huggingface.co/open-r1/OlympicCoder-32B) | HF Dataset: [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), [codeforces](https://huggingface.co/datasets/open-r1/codeforces) | TL;DR: Open-R1's core contribution is providing the first fully open-source replication and release of the DeepSeek R1-Zero reinforcement learning training pipeline. Its goal is to democratize access to these advanced RL techniques for enhancing LLM reasoning and planning.
      • GitHub - capable LLM agents for interactive, stochastic environments. Its core contribution is the Reasoning-Interaction Chain Optimization (RICO) algorithm, which jointly optimizes reasoning and action strategies by reinforcing entire trajectories.
      • Paper - long-cot) · [More](#demystify) | HF Model: —— | HF Dataset: —— | TL;DR: The paper elucidates the role of RL in stabilizing and enhancing long CoT reasoning in LLMs, highlighting the necessity of reward shaping and verifiable reward signals for complex reasoning tasks.
      • Blog - project/deepscaler) · [More](#deepscaler) | HF Model: [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | HF Dataset: [DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) | TL;DR: DeepScaleR's core contribution is demonstrating that a small 1.5B-parameter model, fine-tuned using scaled reinforcement learning and an iterative context-lengthening scheme, can surpass the reasoning performance of larger, state-of-the-art models like OpenAI's o1-preview on complex benchmarks (e.g., AIME math problems).
      • Paper - RL) · [More](#logicrl) | HF Model: —— | HF Dataset: [knights-and-knaves](https://huggingface.co/datasets/K-and-K/knights-and-knaves), [knights-and-knaves-ZH](https://huggingface.co/datasets/Trae1ounG/knights-and-knaves-ZH) | TL;DR: The paper introduces Logic-RL, a rule-based reinforcement learning framework that enables large language models to develop o3-mini-level reasoning skills through training on logic puzzles. The reasoning capabilities can also be transferred to other domains like math.
      • Paper | HF Model: [OREAL-32B](https://huggingface.co/internlm/OREAL-32B), [OREAL-7B](https://huggingface.co/internlm/OREAL-7B), [OREAL-DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B), [OREAL-32B-SFT](https://huggingface.co/internlm/OREAL-32B-SFT), [OREAL-7B-SFT](https://huggingface.co/internlm/OREAL-7B-SFT) | HF Dataset: [OREAL-RL-Prompts](https://huggingface.co/datasets/internlm/OREAL-RL-Prompts) | TL;DR: The paper introduces OREAL, a reinforcement learning framework for mathematical reasoning with binary feedback. It proves that behavior cloning on positive samples is sufficient for optimal learning and proposes reward reshaping for negative samples. A token-level reward model addresses sparse rewards in long reasoning chains. OREAL achieves state-of-the-art results on math benchmarks.
      • Paper - NLP/LIMR) · [More](#limr) | HF Model: [LIMR](https://huggingface.co/GAIR/LIMR) | HF Dataset: [LIMR](https://huggingface.co/datasets/GAIR/LIMR) | TL;DR: The paper challenges the assumption that scaling up RL training data inherently improves performance in language models, instead finding that a strategically selected subset of 1,389 samples can outperform a full 8,523-sample dataset.
      • Paper - PPO (Value-Calibrated PPO) diagnoses PPO's collapse in long CoT tasks as stemming from value function inaccuracies (initialization bias and reward signal decay in long sequences). Its core contribution is modifying PPO with value pretraining and decoupled GAE for actor and critic.
      • Paper - AIRe/MRT) | HF Model: —— | HF Dataset: —— | TL;DR: MRT (Mixed-Reality Trajectory Preferences) introduces a novel method for fine-tuning cooperative LLM agents. It effectively blends human preferences on real interaction trajectories with AI preferences on synthetic variations, improving data efficiency. This mixed-reality approach surpasses purely AI-driven feedback (RLAIF), especially for complex, multi-turn collaborative tasks.
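
The L1/LCPO entry above controls a model's thinking time through the reward signal. Below is a minimal sketch of what such a length-controlled reward could look like; the exact shaping and hyperparameters in the L1 paper may differ, and the names `alpha` and `target_len` are illustrative rather than the paper's notation.

```python
# A hypothetical LCPO-style reward: a correctness term minus a penalty on the
# gap between the generated length and the length requested in the prompt.
# `alpha` and `target_len` are illustrative names, not the paper's notation.

def lcpo_style_reward(is_correct: bool, gen_len: int, target_len: int,
                      alpha: float = 0.001) -> float:
    """Reward = correctness - alpha * |generated length - requested length|."""
    correctness = 1.0 if is_correct else 0.0
    return correctness - alpha * abs(gen_len - target_len)

# A correct 900-token answer when the prompt asked for roughly 1000 tokens.
print(lcpo_style_reward(True, gen_len=900, target_len=1000))  # 0.9
```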
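
The DAPO entry lists Clip-Higher and a token-level loss among its techniques. The sketch below shows one way to write a PPO-style surrogate with decoupled clipping ranges and token-level averaging; the (0.2, 0.28) values are the commonly reported setting and should be treated as illustrative.

```python
import numpy as np

def clip_higher_surrogate(ratio: np.ndarray, advantage: np.ndarray,
                          eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style surrogate with asymmetric (Clip-Higher) clipping, averaged
    over all tokens in the batch (token-level loss) rather than per sequence.

    ratio:     pi_theta(token) / pi_old(token), shape (num_tokens,)
    advantage: per-token advantage, shape (num_tokens,)
    """
    unclipped = ratio * advantage
    # A larger upper clip keeps low-probability "exploration" tokens from being
    # clipped away as aggressively as in symmetric PPO clipping.
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))

ratios = np.array([0.7, 1.0, 1.4])       # policy / old-policy probability ratios
advantages = np.array([1.0, -0.5, 1.0])  # positive = reinforce, negative = discourage
print(clip_higher_surrogate(ratios, advantages))
```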
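
The Dr. GRPO entry attributes bias to GRPO's per-group standard-deviation and length normalization. The sketch below contrasts a GRPO-style whitened advantage with a mean-only variant in the spirit of Dr. GRPO; the function names are ours, and the full algorithm also changes how the loss is aggregated over tokens.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style: whiten rewards within a group of sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr.-GRPO-style: subtract the group mean only; no std or length scaling."""
    return rewards - rewards.mean()

group_rewards = np.array([1.0, 0.0, 0.0, 1.0])  # e.g., 2 of 4 samples correct
print(grpo_advantages(group_rewards))     # rescaled by the group std
print(dr_grpo_advantages(group_rewards))  # [ 0.5 -0.5 -0.5  0.5]
```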
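
The AdaRFT entry adapts task difficulty from recent rewards. Below is a rough curriculum-sampling sketch under assumed hyperparameters (the 0.5 target success rate and step size are ours): the target difficulty drifts upward when the policy is beating the target and downward otherwise. It illustrates the idea, not the paper's implementation.

```python
import random  # kept for extensions such as stochastic sampling near the target

class AdaptiveCurriculum:
    """Toy adaptive curriculum: sample problems near a moving target difficulty."""

    def __init__(self, problems, beta: float = 0.5, step: float = 0.05):
        # problems: list of (problem, difficulty) pairs with difficulty in [0, 1]
        self.problems = problems
        self.target_difficulty = 0.3
        self.beta = beta    # desired average reward (success rate)
        self.step = step    # how fast the target difficulty moves

    def update(self, recent_avg_reward: float) -> None:
        # Harder problems when the policy does better than the target rate,
        # easier ones when it does worse.
        self.target_difficulty += self.step * (recent_avg_reward - self.beta)
        self.target_difficulty = min(1.0, max(0.0, self.target_difficulty))

    def sample_batch(self, k: int = 8):
        # Prefer problems whose difficulty is closest to the current target.
        ranked = sorted(self.problems,
                        key=lambda pd: abs(pd[1] - self.target_difficulty))
        return [p for p, _ in ranked[:k]]

curriculum = AdaptiveCurriculum([(f"problem-{i}", i / 100) for i in range(100)])
curriculum.update(recent_avg_reward=0.8)   # doing well -> target difficulty rises
print(curriculum.sample_batch(k=4))
```
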
    • Multimodal and Agents

      • Blog - Agent/R1-V) · [More](#r1v) | HF Model: —— | HF Dataset: [Clevr_CoGenT_TrainA_R1](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_R1) | TL;DR: R1-V applies RL, specifically RLV-Instruct, to fine-tune VLMs. It enhances complex visual reasoning and instruction-following capabilities in VLMs beyond standard supervised fine-tuning.
      • Paper - RLVR framework demonstrates emergent medical reasoning via RL, achieving performance parity with SFT on in-distribution tasks and improving out-of-distribution generalization, all without explicit reasoning supervision, showcasing RL's potential in medicine.
      • GitHub - …existing reasoning frameworks or supervised data.
      • Blog - ai-lab/VLM-R1) · [More](#vlmr1) | HF Model: [OVD](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321), [Math](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-Math-0305), [REC](https://huggingface.co/omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps) | HF Dataset: —— | TL;DR: VLM-R1 applies R1-style RL to VLMs, improving stability and generalization on visual reasoning tasks. It shows that RL enhances VLM generalization beyond standard fine-tuning, achieving SOTA results, particularly on complex or out-of-domain benchmarks.
      • Paper - RFT) · [More](#research) | HF Model: [Reasoning Grounding](https://huggingface.co/Zery/Qwen2-VL-7B_visual_rft_lisa_IoU_reward) | HF Dataset: [COCO_base65](https://huggingface.co/datasets/laolao77/ViRFT_COCO_base65), [COCO](https://huggingface.co/datasets/laolao77/ViRFT_COCO), [COCO_8_classes_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_COCO_8_cate_4_shot), [LVIS_few_shot](https://huggingface.co/datasets/laolao77/ViRFT_LVIS_few_shot), [Flower_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_flower_4_shot), [FGVC_Aircraft_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_fgvc_aircraft_4_shot), [Car196_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_car196_4shot), [Pets37_4_shot](https://huggingface.co/datasets/laolao77/ViRFT_CLS_pets37_4shot) | TL;DR: Visual-RFT introduces Visual Reinforcement Fine-tuning, which extends reinforcement learning with verified rewards to visual perception tasks and is effective for fine-tuning with limited data.
      • Blog - vlm) | HF Model: —— | HF Dataset: —— | TL;DR: R1-VLM enhances VLMs using RL, contributing significantly improved performance on complex visual reasoning tasks (spatial, counting, logic) where standard models falter. It shows that RL effectively unlocks advanced, multi-step reasoning capabilities specifically for vision-language understanding.
      • Paper - ai/VisualThinker-R1-Zero) · [More](#visual-r1-zero) | HF Model: [VisualThinker-R1-Zero](https://huggingface.co/turningpoint-ai/VisualThinker-R1-Zero) | HF Dataset: —— | TL;DR: VisualThinker-R1-Zero adapts the R1-Zero RL paradigm (no supervised fine-tuning) to VLMs, achieving SOTA visual reasoning. It shows that complex visual reasoning can be cultivated directly via RL on a base VLM, bypassing the need for supervised data.
      • Paper - EUREKA) · [More](#mm-eureka) | HF Model: [MM-Eureka-Qwen-7B](https://huggingface.co/FanqingM/MM-Eureka-Qwen-7B) | HF Dataset: [MM-Eureka-Dataset](https://huggingface.co/datasets/FanqingM/MM-Eureka-Dataset) | TL;DR: MM-EUREKA reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, demonstrating that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, with superior data efficiency compared to alternative approaches (see the rule-based reward sketch after this list).
      • Paper - reft) | HF Model: [3B-Curr-ReFT](https://huggingface.co/ZTE-AIM/3B-Curr-ReFT), [7B-Curr-ReFT](https://huggingface.co/ZTE-AIM/7B-Curr-ReFT) | HF Dataset: [Curr-ReFT-data](https://huggingface.co/datasets/ZTE-AIM/Curr-ReFT-data) | TL;DR: Curr-ReFT proposes a curriculum reinforcement fine-tuning strategy to enhance out-of-distribution generalization and reasoning abilities. The curriculum paradigm ensures steady progression, and a rejection-sampling-based self-improvement stage maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples.
      • Paper - R1) | HF Model: —— | HF Dataset: [Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold) | TL;DR: Vision-R1 adapts the R1-Zero RL paradigm for VLMs, training them on visual reasoning chains. Its contribution is significantly boosting complex multimodal reasoning performance. It shows that RL applied to explicit reasoning steps effectively enhances VLM capabilities.
      • Paper - r1) | HF Model: —— | HF Dataset: —— | TL;DR: LLM-R1 contributes the RMAVO algorithm to stably enhance LLM reasoning using RL, preventing reward hacking and achieving SOTA results with smaller models via an open-source implementation. It shows that reward model assistance in value optimization is key for stable RL.
      • Paper - …vision-language models (VLMs), enabling more structured, realistic, and adaptive scene generation for applications in the metaverse, AR/VR, and game development.
      • Paper - Searcher) | HF Model: [Llama-3.1-8B-instruct-RAG-RL](https://huggingface.co/XXsongLALA/Llama-3.1-8B-instruct-RAG-RL), [Qwen-2.5-7B-base-RAG-RL](https://huggingface.co/XXsongLALA/Qwen-2.5-7B-base-RAG-RL) | HF Dataset: [RAG-RL-Hotpotqa](https://huggingface.co/datasets/XXsongLALA/RAG-RL-Hotpotqa-with-2wiki) | TL;DR: R1-Searcher enhances LLM reasoning via RL by training the model to perform adaptive model-based search during generation. This integration enables flexible thinking depth, improving reasoning efficiency and performance compared to fixed-step methods like R1-Zero.
      • Paper - NLP/MAYE) | HF Model: —— | HF Dataset: [ManTle/MAYE](https://huggingface.co/datasets/ManTle/MAYE) | TL;DR: MAYE is a transparent, reproducible framework and a comprehensive evaluation scheme for applying reinforcement learning (RL) to vision-language models (VLMs). Its codebase is developed entirely from scratch without relying on any existing RL toolkits.
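
Several entries in this subsection (e.g., MM-Eureka, Visual-RFT) rely on rule-based, verifiable rewards rather than learned reward models. The sketch below shows a typical format-plus-accuracy reward of the kind used in R1-style recipes; the tag names, weights, and exact-match check are illustrative assumptions rather than any specific project's reward function.

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Format reward (response wrapped in <think>/<answer> tags) plus
    accuracy reward (extracted answer exactly matches the gold answer)."""
    format_ok = bool(re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>",
                                  completion.strip()))
    format_reward = 0.5 if format_ok else 0.0

    match = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    predicted = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if predicted == gold_answer.strip() else 0.0

    return format_reward + accuracy_reward

print(rule_based_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 1.5
```
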
  • Projects