{"id":29552038,"url":"https://github.com/ElliottYan/LUFFY","last_synced_at":"2025-07-18T05:02:30.869Z","repository":{"id":289191937,"uuid":"969637328","full_name":"ElliottYan/LUFFY","owner":"ElliottYan","description":"Official Repository of \"Learning to Reason under Off-Policy Guidance\"","archived":false,"fork":false,"pushed_at":"2025-07-16T06:37:18.000Z","size":15385,"stargazers_count":253,"open_issues_count":2,"forks_count":30,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-17T09:33:07.532Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/2504.14945","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ElliottYan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-20T15:51:39.000Z","updated_at":"2025-07-17T07:05:37.000Z","dependencies_parsed_at":"2025-05-21T12:31:08.082Z","dependency_job_id":"28659d44-48f9-4ed4-8625-8d8e19cce6a9","html_url":"https://github.com/ElliottYan/LUFFY","commit_stats":null,"previous_names":["elliottyan/luffy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ElliottYan/LUFFY","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ElliottYan%2FLUFFY","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ElliottYan%2FLUFFY/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ElliottYan%2FLUFFY/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ElliottYan%2FLUFFY/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ElliottYan","download_url":"https://codeload.github.com/ElliottYan/LUFFY/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ElliottYan%2FLUFFY/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265703046,"owners_count":23813914,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-18T05:01:10.472Z","updated_at":"2025-07-18T05:02:30.822Z","avatar_url":"https://github.com/ElliottYan.png","language":"Python","funding_links":[],"categories":["4. 算法","Papers","🤝 OPD-RL Hybrids — Inside-RL OPD"],"sub_categories":["4.2 Reinforcement Learning","2025","🔁 Iterative Self-Bootstrapping"],"readme":"\u003cdiv align=\"center\"\u003e\n\n\n\u003ch1 style=\"display: flex; justify-content: center; align-items: center; gap: 10px; margin: 0;\"\u003e\n  \u003cimg src=\"./figures/logo.png\" alt=\"LUFFY Icon\" width=\"50\"\u003e\n  LUFFY: Learning to Reason Under Off‑Policy Guidance\n\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eA general framework for off-policy learning in large reasoning models.\u003c/em\u003e\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/luffy_intro_new.jpg\" alt=\"overview\" style=\"width: 66%; height: auto;\"\u003e\n\u003c/div\u003e\n\n\n[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge\u0026logo=arxiv\u0026logoColor=white)](http://arxiv.org/abs/2504.14945) 
[![alphaXiv](https://img.shields.io/badge/discussion-A42C25?style=for-the-badge\u0026logo=arxiv\u0026logoColor=white\u0026color=blue)](https://www.alphaxiv.org/abs/2504.14945) [![Github](https://img.shields.io/badge/LUFFY-000000?style=for-the-badge\u0026logo=github\u0026logoColor=000\u0026logoColor=white)](https://github.com/ElliottYan/LUFFY)   [![Hugging Face Collection](https://img.shields.io/badge/LUFFY_Collection-fcd022?style=for-the-badge\u0026logo=huggingface\u0026logoColor=000)](https://huggingface.co/collections/Elliott/luffy-rl-6804e1f5d1ebe66ba8ac92f4) [![Twitter](https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge\u0026logo=twitter\u0026logoColor=white)](https://x.com/yafuly/status/1914559433549676962)\n\n\n\n\n\n\u003c/div\u003e\n\n---\n\n# 📚 Overview\n- 🎉 [News](#news)  \n- 📖 [Introduction](#introduction)  \n- ✨ [Getting Started](#getting-started)  \n- 🔧 [Usage](#usage)  \n- 📃 [Evaluation](#evaluation)  \n- 🎈 [Citation](#citation)  \n- 🌻 [Acknowledgement](#acknowledgement)  \n\u003c!-- - 📈 [Star History](#star-history) --\u003e\n\n\n---\n\n\n# 🎉News\n- **[2025/05/30]** We have integrated implementations and scripts for **other off-policy learning methods**, including SFT, SFT+RL, and RL w/ SFT Loss (multi-task learning).\n- **[2025/05/21]** We have released an updated [version](https://arxiv.org/abs/2504.14945) of the paper, which re-evaluates all models using a more accurate verifier and adds comparisons with other off-policy learning methods, including RL w/ SFT Loss and SFT+RL.\n- **[2025/04/23]** Our paper is now trending on [alphaXiv](https://www.alphaxiv.org/abs/2504.14945)! We welcome feedback and discussion.\n- **[2025/04/23]** 🎉 Ranked **#1** of the day on [Hugging Face Daily Papers](https://huggingface.co/papers/2504.14945).\n- **[2025/04/20]** The LUFFY paper is available on [arXiv](http://arxiv.org/abs/2504.14945). 
\n\n\u003c!-- - **[2025/04/20]** The models and datasets are released on [HuggingFace](https://huggingface.co/collections/Elliott/luffy-rl-6804e1f5d1ebe66ba8ac92f4).\n- **[2025/04/20]** LUFFY codebase is released along with evaluation scripts. Try it out! --\u003e\n\n---\n\n# 📖Introduction\n\nLUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces **policy shaping** via regularized importance sampling to emphasize low-probability yet crucial actions.\n\n![overview](./figures/luffy_performance.jpg)\n\n### Key Highlights:\n- **Off-Policy Guidance:** Seamlessly integrates external reasoning traces to bootstrap learning from stronger models.\n- **Dynamic Balance:** Learns when to imitate and when to explore, adapting over the course of training.\n- **Policy Shaping:** Emphasizes important actions often ignored in standard policy gradients, enabling better generalization.\n\n\n\n---\n\n# ✨Getting Started\n\n## Installation\n\nYou can install the LUFFY dependencies by running the following commands:\n```bash\nconda create -n luffy python=3.10\nconda activate luffy\ncd luffy\npip install -r requirements.txt\npip install -e .\ncd verl\npip install -e .\n```\n\nIf you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the \n[flash-attn releases page](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.3). For example, we use the following version: \n```bash\nwget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl\npip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl\n```\n\n## Repo Structure\n\nThis repository includes:\n\n- `luffy`: Code for training LUFFY using off-policy reasoning traces. 
Our main code changes are in `luffy/verl/verl/mix_src`.\n- `data`: Data and code for training and evaluating LUFFY. \n- `exp_scripts`: Example scripts for training LUFFY and the off-policy baselines.\n- `eval_scripts`: Evaluation scripts for math and out-of-distribution benchmarks.\n\nLUFFY is built on top of the GRPO framework and supports plug-and-play integration with off-policy traces from models such as DeepSeek-R1.\n\n---\n\n\n\n\n\n# 🔧Usage\n\n## Data Preparation\nFirst, run the data preparation script to produce the training data in Parquet format.\n```bash\ncd data\npython prepare_train.py\n```\n\n## Training\n\nWe provide an example script that trains LUFFY on our subset of OpenR1-Math-220k. Run it with the following commands:\n\n```bash\n  cd exp_scripts\n  bash train.sh\n```\n\n## Other Off-Policy Baselines\n### SFT\nFirst, clone the OpenRLHF repository and convert the data to SFT format. *(We plan to integrate the SFT pipeline directly into LUFFY in the near future.)*\n```bash\ngit clone https://github.com/OpenRLHF/OpenRLHF\ncd data\npython prepare_sft.py\n```\nThen run the SFT training command: 
\n```\nRESULT_DIR=\"Your result directory\"\nDATA_DIR=\"Your data directory\"\nWANDB_KEY=\"Your Wandb Key\"\n\nMODEL_PATH=Elliott/Qwen2.5-Math-7B-16k-think\nMASTER_ADDR=`scontrol show hostname $SLURM_JOB_NODELIST | head -n1`\nMASTER_PORT=$((RANDOM % 101 + 20000))\nDEVICES=\"0,1,2,3,4,5,6,7\"\ndeepspeed --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR --include localhost:$DEVICES --module openrlhf.cli.train_sft \\\n   --max_len 16384 \\\n   --dataset $DATA_DIR \\\n   --input_key prompt \\\n   --output_key target \\\n   --train_batch_size 64 \\\n   --apply_chat_template \\\n   --micro_train_batch_size 1 \\\n   --max_samples 500000 \\\n   --pretrain $MODEL_PATH \\\n   --save_path $RESULT_DIR \\\n   --logging_steps 1 \\\n   --eval_steps -1 \\\n   --zero_stage 2 \\\n   --max_epochs 3 \\\n   --adam_offload \\\n   --packing_samples \\\n   --bf16 \\\n   --flash_attn \\\n   --save_hf_ckpt \\\n   --learning_rate 5e-5 \\\n   --lr_warmup_ratio 0.1 \\\n   --wandb_project r1_sft_distill \\\n   --wandb_run_name qwen-7b-base-sft \\\n   --use_wandb $WANDB_KEY \\\n   --gradient_checkpointing\n```\n\n\n### RL w/ SFT Loss\n```bash\n  cd exp_scripts\n  bash train_rl_sft_loss.sh\n```\n\n### SFT + RL\nWe use heldout data for RL training, following previous works like PRIME.\n```bash\n  cd data\n  python prepare_train_sft_rl.py\n  cd ../exp_scripts\n  bash train_sft_rl.sh\n```\n\n## Inference\n\nHere’s an example of using LUFFY for inference:\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to view inference example\u003c/summary\u003e\n\n```python\nfrom transformers import AutoTokenizer\nfrom vllm import LLM, SamplingParams\n\nmodel_path=\"Elliott/LUFFY-Qwen-Math-7B-Zero\"\n\nquestion = \"which number is larger? 
9.11 or 9.9?\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_path)\nmessages = [{\"role\": \"user\", \"content\": question}]\nchat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n\nllm = LLM(model=model_path)\nparams = SamplingParams(temperature=0.6, max_tokens=8192)\noutputs = llm.generate([chat], params)\nprint(outputs[0].outputs[0].text)\n```\n\n\u003c/details\u003e\n\n\n## Models\n\n| **Model**                          | **Huggingface** |  **Base Model** |\n|-----------------------------------|------------------|------------------|\n| LUFFY-Qwen-Math-7B-Zero | https://huggingface.co/Elliott/LUFFY-Qwen-Math-7B-Zero |  Qwen2.5-Math-7B |\n| LUFFY-Qwen-Math-7B-SFT | https://huggingface.co/Elliott/Qwen2.5-Math-7B-SFT | Qwen2.5-Math-7B |\n| LUFFY-Qwen-Math-7B-SFT-RL | https://huggingface.co/Elliott/Qwen2.5-Math-7B-SFT-RL | Qwen2.5-Math-7B |\n| LUFFY-Qwen-Math-1.5B-Zero | https://huggingface.co/Elliott/LUFFY-Qwen-Math-1.5B-Zero | Qwen2.5-Math-1.5B |\n| LUFFY-Qwen-Instruct-7B | https://huggingface.co/Elliott/LUFFY-Qwen-Instruct-7B | Qwen2.5-7B-Instruct |\n\n---\n\n# 📃Evaluation\n\n## Reproducing the Results \nWe currently support automated evaluation on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). 
The platform provides specialized system prompts for a range of RL models, including LUFFY, SimpleRL, OpenReasoner, PRIME, and OAT.\n\nYou can reproduce our results by running the following commands:\n```bash\nROOT=YOUR_ROOT_PATH\nDATA=$ROOT/data/valid.all.parquet\n\nOUTPUT_DIR=./results/\nmkdir -p $OUTPUT_DIR\n\n# If you want to evaluate other models, you can change the model path and name.\nMODEL_PATH=Elliott/LUFFY-Qwen-Math-7B-Zero\nMODEL_NAME=luffy\n\nif [ $MODEL_NAME == \"eurus-2-7b-prime-zero\" ]; then\n  TEMPLATE=prime\nelif [ $MODEL_NAME == \"simple-rl-zero\" ]; then\n  TEMPLATE=qwen\nelse\n  TEMPLATE=own\nfi\n\nCUDA_VISIBLE_DEVICES=0,1,2,3 python eval_scripts/generate_vllm.py \\\n  --model_path $MODEL_PATH \\\n  --input_file $DATA \\\n  --remove_system True \\\n  --add_oat_evaluate True \\\n  --output_file $OUTPUT_DIR/$MODEL_NAME.jsonl \\\n  --template $TEMPLATE \u003e $OUTPUT_DIR/$MODEL_NAME.log\n```\n\n\n\n## LUFFY on Qwen2.5-Math-7B (zero-RL)\nLUFFY is evaluated on six competition-level benchmarks, achieving state-of-the-art results among all zero-RL methods. 
It surpasses both on-policy RL and imitation learning (SFT), especially in generalization:\n\n\n\n| **Model**                          | **AIME 2024** | **AIME 2025** | **AMC** | **MATH-500** | **Minerva** | **Olympiad** | **Avg.** |\n|-----------------------------------|-------------|-------------|---------|---------------|-------------|---------------|----------|\n| Qwen2.5-Math-7B                      |11.5 | 4.9 | 31.3 | 43.6 | 7.4 | 15.6 | 19.0 |\n| Qwen2.5-Math-7B-Instruct             |12.5  | 10.2 | 48.5 | 80.4 | 32.7 | 41.0 | 37.6   |\n| SimpleRL-Zero                     | 27.0 | 6.8  | 54.9 | 76.0 | 25.0 | 34.7 | 37.4     |\n| OpenReasoner-Zero                 | 16.5 | 15.0 | 52.1 | 82.4 | 33.1 | 47.1 | 41.0    |\n| PRIME-Zero                        | 17.0 | 12.8 | 54.0 | 81.4 | **39.0** | 40.3 | 40.7    |\n| Oat-Zero                          | **33.4**  | 11.9 | 61.2 | 78.0 | 34.6 | 43.4 | 43.7   |\n| **LUFFY-Qwen-Math-7B-Zero**                         | 29.4        | **23.1**        | **65.6**| **87.6**      | 37.5        | **57.2**      | **50.1** |\n\n---\n\n\n\nLUFFY also generalizes well to out-of-distribution tasks, improving the average score on ARC-c, GPQA-diamond, and MMLU-Pro by 6.2 points over the strongest baseline (57.8 vs. 51.6).\n\n\n| **Model**                         | **ARC-c** | **GPQA-diamond** | **MMLU-Pro** | **Avg.** |\n|----------------------------------|-----------|------------------|--------------|----------|\n| Qwen2.5-Math-7B             | 18.2 | 11.1 | 16.9 | 15.4  |\n| Qwen2.5-Math-7B-Instruct         | 70.3 | 24.7 | 34.1 | 43.0    |\n| SimpleRL-Zero                    | 30.2 | 23.2 | 34.5 | 29.3     |\n| OpenReasoner-Zero                       | 66.2 | 29.8 | **58.7** | 51.6     |\n| PRIME-Zero                         | 73.3 | 18.2 | 32.7 | 41.4   |\n| Oat-Zero                | 70.1 | 23.7 | 41.7 | 45.2    |\n| **LUFFY-Qwen-Math-7B-Zero**                        | **80.5** |  **39.9** | 53.0 | **57.8** |\n\n\nWe further compare LUFFY with alternative off-policy learning methods, including 
SFT, RL w/ SFT Loss, and SFT+RL (see our paper for details):\n\n| **Model**                          | **GPU Hours** | **Data Usage (On/Off)** | **AIME 2024** | **AIME 2025** | **AMC** | **MATH-500** | **Minerva** | **Olympiad** | **Avg.** |\n|-----------------------------------|-------------|-------------|-------------|-------------|---------|---------------|-------------|---------------|----------|\n| SFT                      | 24*8 | 0 / 64k | 22.2 | 22.3 | 52.8 | 82.6 | 40.8 | 43.7 | 44.1 |\n| RL w/ SFT Loss             |  133*8    | 64k*7 / 64k  |  19.5 | 16.4 | 49.7 | 80.4 | 34.9 | 39.4 | 40.1  |\n| SFT+RL                      | 130*8 |  64k*8/135k |  25.8 | **23.1** | 62.7 | 87.2 | 39.7 | 50.4 | 48.2 |\n| **LUFFY-Qwen-Math-7B-Zero**            | 77*8 | 64k*7 / 64k              | 29.4        | **23.1**        | 65.6 | **87.6**      | 37.5        | **57.2**      | 50.1 |\n| **LUFFY-Qwen-Math-7B-Zero-Extra**       |    130*8       |   110k*7 / 110k      | **30.7** | 22.5 | **66.2**|  86.8 | **41.2** | 55.3 | **50.4** |\n\n---\n\n## LUFFY on Qwen2.5-Math-1.5B\n| **Model**                          | **AIME 2024** | **AIME 2025** | **AMC** | **MATH-500** | **Minerva** | **Olympiad** | **Avg.** |\n|-----------------------------------|-------------|-------------|---------|---------------|-------------|---------------|----------|\n| Qwen2.5-Math-1.5B                  |   7.2 |  3.6 | 26.4 | 28.0 | 9.6 | 21.2 | 16.0 |\n| Qwen2.5-Math-1.5B-Instruct            |  12.1 | 8.9 | **48.1** | 77.4 | 28.7 | 39.1 | 35.7 |\n| **LUFFY-Qwen-Math-1.5B-Zero**             | **16.0** | **13.1** | 47.1 | **80.2** | **30.5** | **41.0** | **38.0** |\n\n\n\n## LUFFY on Qwen2.5-Instruct-7B \n| **Model**                          | **AIME 2024** | **AIME 2025** | **AMC** | **MATH-500** | **Minerva** | **Olympiad** | **Avg.** |\n|-----------------------------------|-------------|-------------|---------|---------------|-------------|---------------|----------|\n| Qwen2.5-7B-Instruct           | 
11.7 | 7.5 | 43.8 | 71.8 | 30.9 | 40.4|  34.4|\n| **LUFFY-Qwen-Instruct-7B**             | **17.7** |  **14.8** | **50.9** | **82.0** | **31.3** | **47.4** | **40.7** |\n\n\n# 🌻Acknowledgement\n\nLUFFY builds upon [veRL](https://github.com/volcengine/verl) and [deepscaler](https://github.com/agentica-project/rllm), and utilizes [vLLM](https://github.com/vllm-project/vllm) for inference. We utilize [Math-Verify](https://github.com/huggingface/Math-Verify) for math reasoning evaluation. We thank the open-source community for datasets and backbones, including [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math), and [DeepSeek-R1](https://github.com/deepseek-ai/deepseek-r1) model. \n\n# 📬 Contact\n\nFor questions, feedback, or collaboration opportunities, feel free to reach out:\n- Jianhao Yan: elliottyan37@gmail.com\n- Yafu Li: yafuly@gmail.com\n\n# Citation\nIf you find our model, data, or evaluation code useful, please kindly cite our paper:\n```bib\n@misc{luffy,\n      title={Learning to Reason under Off-Policy Guidance}, \n      author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},\n      year={2025},\n      eprint={2504.14945},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2504.14945}, \n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FElliottYan%2FLUFFY","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FElliottYan%2FLUFFY","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FElliottYan%2FLUFFY/lists"}