{"id":26660223,"url":"https://github.com/sail-sg/understand-r1-zero","last_synced_at":"2025-03-25T12:02:11.568Z","repository":{"id":283701250,"uuid":"951018704","full_name":"sail-sg/understand-r1-zero","owner":"sail-sg","description":"Understanding R1-Zero-Like Training: A Critical Perspective","archived":false,"fork":false,"pushed_at":"2025-03-21T17:23:46.000Z","size":13580,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-21T17:32:01.566Z","etag":null,"topics":["llm","r1-zero","reasoning","rl"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sail-sg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-19T03:22:52.000Z","updated_at":"2025-03-21T17:23:50.000Z","dependencies_parsed_at":"2025-03-21T17:43:57.877Z","dependency_job_id":null,"html_url":"https://github.com/sail-sg/understand-r1-zero","commit_stats":null,"previous_names":["sail-sg/understand-r1-zero"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Funderstand-r1-zero","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Funderstand-r1-zero/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Funderstand-r1-zero/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sail-sg%2Funderstand-r1-zero/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/sail-sg","download_url":"https://codeload.github.com/sail-sg/understand-r1-zero/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245458702,"owners_count":20618697,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","r1-zero","reasoning","rl"],"created_at":"2025-03-25T12:01:01.587Z","updated_at":"2025-03-25T12:02:11.538Z","avatar_url":"https://github.com/sail-sg.png","language":"Python","funding_links":[],"categories":["Projects","A01_文本生成_文本对话","Python","Open-source"],"sub_categories":["Large Language Models","大语言对话模型及数据","Codebase"],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Understanding R1-Zero-Like Training: A Critical Perspective\n\n[Zichen Liu*†](https://lkevinzc.github.io/), [Changyu Chen*](https://cameron-chen.github.io/), [Wenjun Li*](https://wenjunli-0.github.io/), [Penghui Qi*](https://scholar.google.com/citations?user=CLRsGEMAAAAJ\u0026hl=en)\n\n[Tianyu Pang](https://p2333.github.io/), [Chao Du](https://duchao0726.github.io/), [Wee Sun Lee](https://scholar.google.com/citations?user=8PCrLgwAAAAJ\u0026hl=en), [Min Lin](https://scholar.google.com.sg/citations?user=BGONmkIAAAAJ\u0026hl=en)\n\n*Core Contributors, †Project Lead\n\n[![Paper](https://img.shields.io/badge/Paper-8CA1AF?logo=readthedocs\u0026logoColor=white)](./understand-r1-zero.pdf)\n\n[![Github](https://img.shields.io/badge/Understand%20R1%20Zero-000000?style=for-the-badge\u0026logo=github\u0026logoColor=000\u0026logoColor=white)](https://github.com/sail-sg/understand-r1-zero)  [![Hugging Face 
Collection](https://img.shields.io/badge/Model_Collection-fcd022?style=for-the-badge\u0026logo=huggingface\u0026logoColor=000)](https://huggingface.co/collections/sail/oat-zero-understanding-r1-zero-like-training-67dcdb07b9f3eb05f1501c4a)\n\n\u003cdiv align=\"center\" style=\"font-family: Arial, sans-serif;\"\u003e\n  \u003cp\u003e\n    \u003ca href=\"#updates\" style=\"text-decoration: none; font-weight: bold;\"\u003e🎉 Updates\u003c/a\u003e •\n    \u003ca href=\"#links\" style=\"text-decoration: none; font-weight: bold;\"\u003e🔗 Links\u003c/a\u003e •\n    \u003ca href=\"#tldr\" style=\"text-decoration: none; font-weight: bold;\"\u003e📖 TL;DR\u003c/a\u003e\n  \u003c/p\u003e\n  \u003cp\u003e\n    \u003ca href=\"#usage\" style=\"text-decoration: none; font-weight: bold;\"\u003e💻 Usage \u003c/a\u003e •\n    \u003ca href=\"#citation\" style=\"text-decoration: none; font-weight: bold;\"\u003e🍊 Citation\u003c/a\u003e •\n    \u003ca href=\"#acknowledgement\" style=\"text-decoration: none; font-weight: bold;\"\u003e🌻 Acknowledgement\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/div\u003e\n\n\u003c/div\u003e\n\n## Updates\n\n* 21/03/2025: 🎉 We release our paper, models and codebase. 
Our R1-Zero training is implemented with 🌾 [Oat](https://github.com/sail-sg/oat), a highly modular, research-friendly and efficient LLM RL framework.\n\n## Links\n\n* **Understanding R1-Zero-Like Training**\n  * 📄 [Paper](./understand-r1-zero.pdf)\n  * 🤗 [Models](https://huggingface.co/collections/sail/oat-zero-understanding-r1-zero-like-training-67dcdb07b9f3eb05f1501c4a)\n\n* **There May Not Be Aha Moment in R1-Zero-like Training — A Pilot Study**\n  * 📄 [Blog](https://oatllm.notion.site/oat-zero)\n  * 💻 [Code](https://github.com/sail-sg/oat-zero)\n\n* **OAT: A research-friendly framework for LLM online alignment**\n  * 💻 [Codebase](https://github.com/sail-sg/oat)\n\n## TL;DR\nTo understand R1-Zero-like training, we critically examine two core components: **base models**\nand **reinforcement learning**. We highlight our findings below.\n\n### On base models:\n1. **DeepSeek-V3-Base already exhibits the \"Aha moment\"**.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/deepseek-base-aha.png\" width=70%/\u003e\n\u003c/p\u003e\n\n2. As the popular choice for R1-Zero-like training, Qwen2.5 base models demonstrate strong reasoning capabilities\neven **without** prompt templates: the average benchmark scores improve by **~60%** (compared to traditional 4-shot prompting)!\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/qwen-math-base-scores.png\" width=70%/\u003e\n\u003c/p\u003e\n\n### On reinforcement learning:\n\n3. GRPO leads to **biased** optimization! We propose a simple fix, termed Dr. GRPO (GRPO **D**one **R**ight), that improves token efficiency\nwhile maintaining reasoning performance.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/drgrpo.png\" width=80%/\u003e\n\u003c/p\u003e\n\n4. 
In R1-Zero-like training, the template and the question set perform a duet to affect the RL dynamics.\n   * (Left Plot) For Qwen2.5-Math-1.5B, a mismatched template (e.g., the R1 template) in fact **destroys the reasoning capabilities before RL reconstructs them**. This makes the improvement look impressive on the surface.\n   * (Middle Plot) However, if a template does not deviate too far from the pretraining distribution, even a small and completely o.o.d. question set (e.g., GSM8K) can induce the reasoning ability equally well, by reinforcing correct reasoning behaviors instead of infusing new knowledge.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/template-data-duet.png\" width=80%/\u003e\n\u003c/p\u003e\n\n5. Beyond Qwen, Llama can also be RL-tuned from base models. In this case, domain-specific pretraining improves the RL ceiling.\n   * (Right Plot) GRPO can even make Llama with math knowledge \"Aha\" by increasing the output length; however, this is likely due to GRPO's length bias, which can be removed by Dr. GRPO.\n \u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/llama-r1-zero.png\" width=70%/\u003e\n\u003c/p\u003e\n\n### Our minimalist R1-Zero recipe:\nOur analysis suggests a minimalist recipe for R1-Zero-like training: \n\nWe RL-tune Qwen2.5-Math-7B using the (unbiased) Dr. 
GRPO algorithm on MATH level 3-5 questions with the Qwen-Math template, and achieve state-of-the-art performance with only 27 hours of compute on 8× A100 GPUs.\n \u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/benchmark.png\" width=90%/\u003e\n\u003c/p\u003e\n\nIf you are interested in more details, please check out our [paper](./understand-r1-zero.pdf)!\n\n## Usage\n\n### Install\n\nWe recommend a clean `python==3.10` environment for development.\n\n```diff\n# Install vllm \u0026 oat, the LLM RL framework on which we developed R1-Zero training.\npip install vllm==0.7.2 \u0026\u0026 pip install oat-llm==0.0.9\n\n# Install this package locally to use the math grader.\ngit clone git@github.com:sail-sg/understand-r1-zero.git \u0026\u0026 cd understand-r1-zero\npip install -e .\n```\n\n### Training\n\nWe implement R1-Zero training by extending Oat's Learner and Actor components. Please see [train_zero_math.py](./train_zero_math.py) for a step-by-step guide.\n\n```diff\n# Patch LD_LIBRARY_PATH to avoid dependency errors:\nexport LD_LIBRARY_PATH=$(python -c \"import sysconfig; print(sysconfig.get_config_var('LIBDIR'))\"):$LD_LIBRARY_PATH\n\n# Run the experiment (tested on 8 x A100-40G) with Dr. 
GRPO:\n# (change to `--critic_type grpo` for running GRPO)\npython train_zero_math.py \\\n    --critic_type drgrpo \\\n    --gpus 8 \\\n    --enable_prefix_caching \\\n    --collocate \\\n    --vllm_sleep \\\n    --vllm_gpu_ratio 0.35 \\\n    --gradient-checkpointing \\\n    --flash-attn \\\n    --bf16 \\\n    --rnd-seed \\\n    --learning_rate 0.000001 \\\n    --lr_scheduler constant \\\n    --num_ppo_epochs 1 \\\n    --beta 0 \\\n    --oracle_type reward \\\n    --oracle math \\\n    --pretrain Qwen/Qwen2.5-Math-1.5B \\\n    --prompt_template r1 \\\n    --zero-stage 2 \\\n    --ref_offload \\\n    --prompt_data ./datasets/train/math_12k \\\n    --train_split train \\\n    --input_key problem \\\n    --output_key answer \\\n    --max-train 9999999 \\\n    --num_prompt_epoch 20 \\\n    --prompt_max_length 1024 \\\n    --num_samples 8 \\\n    --temperature 1 \\\n    --top_p 1 \\\n    --generate_max_length 3000 \\\n    --save_steps -1 \\\n    --train_batch_size 128 \\\n    --train_batch_size_per_device 1 \\\n    --mini_train_batch_size_per_device 1 \\\n    --rollout_batch_size 128 \\\n    --rollout_batch_size_per_device 16 \\\n    --pi_buffer_maxlen_per_device 128 \\\n    --eval_batch_size 200 \\\n    --eval_steps 16 \\\n    --eval_temperature 0 \\\n    --eval_generate_max_length 3000 \\\n    --eval_data ./datasets/evaluation_suite \\\n    --eval_input_key input \\\n    --use-wb \\\n    --wb-run-name qwen2.5-Math-1.5b-r1-zero \\\n    --wb_project oat-zero\n```\nPlease see [here](./examples/) for more example scripts.\n\n### Evaluation\n```diff\n# Evaluate our models:\npython evaluate_model.py --model_name sail/Qwen2.5-Math-7B-Oat-Zero\npython evaluate_model.py --model_name sail/Qwen2.5-Math-1.5B-Oat-Zero\npython evaluate_model.py --model_name sail/Llama-3.2-3B-Oat-Zero --template r1\n\n# Evaluate baseline models:\npython evaluate_model.py --model_name Qwen/Qwen2.5-Math-1.5B\npython evaluate_model.py --model_name Qwen/Qwen2.5-Math-7B\npython evaluate_model.py 
--model_name hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero\npython evaluate_model.py --model_name PRIME-RL/Eurus-2-7B-PRIME-Zero\npython evaluate_model.py --model_name Open-Reasoner-Zero/Open-Reasoner-Zero-7B\n```\n\n### Serving DeepSeek Models\n\nWe provide a script to serve DeepSeek-V3-Base and DeepSeek-R1-Zero on a k8s cluster.\n\n```diff\n# prerequisites:\n# 1. download the model weights\n# 2. start a k8s job with the sglang docker image \"lmsysorg/sglang:v0.4.3.post2-cu125\"\n\n# start the server:\nbash deploy_dpsk/serving.sh \u003cmodel_name\u003e \u003cnum_nodes\u003e\n```\n\nExample of API call: \n```python\nimport os\n\nfrom openai import OpenAI\n\n# MASTER_ADDR is the environment variable set by the k8s job\napi_base = f\"http://{os.environ['MASTER_ADDR']}:30000/v1\"\napi_key = \"EMPTY\"\n\nclient = OpenAI(\n    api_key=api_key,\n    base_url=api_base,\n)\n\n# send requests to the server ...\n```\n\nNotes:\n- Your k8s container should have the environment variables `MASTER_ADDR` and `MASTER_PORT` set.\n- Hardware requirements: `2 x 8 x H100/800/20` for FP8 and `4 x 8 x A100/A800` for BF16.\n- Please refer to sglang's [official tutorial](https://docs.sglang.ai/references/deepseek.html) for more details.\n\n## Citation\n\nIf you find our work useful for your research, please consider citing:\n\n- This paper:\n```bibtex\n@misc{liu2025understanding,\n  title={Understanding R1-Zero-Like Training: A Critical Perspective},\n  author={Zichen Liu and Changyu Chen and Wenjun Li and Penghui Qi and Tianyu Pang and Chao Du and Wee Sun Lee and Min Lin},\n  year={2025},\n  howpublished={\\url{https://github.com/sail-sg/understand-r1-zero}},\n}\n```\n\n- Our blog that conducted the first investigation on the \"Aha moment\":\n```bibtex\n@misc{liu2025there,\n  title={There May Not be Aha Moment in R1-Zero-like Training — A Pilot Study},\n  author={Zichen Liu and Changyu Chen and Wenjun Li and Tianyu Pang and Chao Du and Min Lin},\n  year={2025},\n  howpublished={\\url{https://oatllm.notion.site/oat-zero}},\n  
note={Notion Blog},\n}\n```\n\n- The training framework:\n```bibtex\n@misc{liu2025oat,\n  title={OAT: A research-friendly framework for LLM online alignment},\n  author={Zichen Liu and Changyu Chen and Chao Du and Wee Sun Lee and Min Lin},\n  year={2025},\n  howpublished={\\url{https://github.com/sail-sg/oat}},\n}\n```\n\n## Acknowledgement\n* This work is supported by [Sea AI Lab](https://sail.sea.com/) for computing resources.\n* The training code is built on [Oat](https://github.com/sail-sg/oat), which employs [vLLM](https://github.com/vllm-project/vllm), [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [launchpad](https://github.com/google-deepmind/launchpad). We serve DeepSeek models using [SGLang](https://github.com/sgl-project/sglang).\n* The base models are from [Qwen2.5-Math](https://huggingface.co/Qwen/Qwen2.5-Math-7B), [Llama](https://huggingface.co/meta-llama/Llama-3.2-3B), and [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base).\n* We thank [Qingfeng Lan](https://lancelqf.github.io/about/) for his time in thoroughly reviewing our code.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Funderstand-r1-zero","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsail-sg%2Funderstand-r1-zero","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsail-sg%2Funderstand-r1-zero/lists"}