{"id":27151218,"url":"https://github.com/lsdefine/simple_GRPO","last_synced_at":"2025-04-08T14:25:06.856Z","repository":{"id":276710386,"uuid":"930045546","full_name":"lsdefine/simple_GRPO","owner":"lsdefine","description":"A very simple  GRPO implement for reproducing r1-like LLM thinking.","archived":false,"fork":false,"pushed_at":"2025-04-03T06:46:04.000Z","size":399,"stargazers_count":847,"open_issues_count":24,"forks_count":73,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-07T14:40:41.087Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lsdefine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-10T00:56:39.000Z","updated_at":"2025-04-07T14:31:11.000Z","dependencies_parsed_at":"2025-03-24T05:24:28.415Z","dependency_job_id":"c1919c3e-b5da-4a63-a387-81fd41ce0af3","html_url":"https://github.com/lsdefine/simple_GRPO","commit_stats":null,"previous_names":["lsdefine/simple_grpo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsdefine%2Fsimple_GRPO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsdefine%2Fsimple_GRPO/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsdefine%2Fsimple_GRPO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsdefine%2Fsimple_GRPO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lsdefine","d
ownload_url":"https://codeload.github.com/lsdefine/simple_GRPO/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247858197,"owners_count":21007878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-08T14:25:04.843Z","updated_at":"2025-04-08T14:25:06.829Z","avatar_url":"https://github.com/lsdefine.png","language":"Python","readme":"# 🚀🚀🚀 simple_GRPO 🚀🚀🚀\nA very simple GRPO implementation for reproducing r1-like LLM thinking.\nThis is a simple open-source implementation whose core loss calculation formula is adapted from Hugging Face's trl. \nWe keep the codebase minimal to support: \n- Saving GPU memory to make training feasible and efficient. \n- Quickly understanding RL processes such as GRPO from a teaching perspective. \n- Quickly trying out ideas, such as improved multi-answer generation, regrouping, KL penalties, and parameter tuning.\n- Observing the \"Aha moment\" during the early stages of model training.\n\n## ✨NEW\n- 2025/02/19: Added a Triton implementation of the loss, which gives a small speedup; using it is optional. See the *simple_grpo_v1* folder.\n- 2025/02/19: Added a regrouping version that samples generated data on the ref_server. See the *regroup_ver* folder.\n- 2025/02/27: Added the vllm package to accelerate inference.\n- 2025/03/24: Added the Reinforce++ algorithm. Usage is the same as before.\n\n## 🌟 Features\n### 💡 Simplicity\nThe project code is simple, with only about 200 lines of code spread across 2 files. 
It depends only on common libraries such as _deepspeed_ and _torch_, without requiring dependencies like ray. It is designed to allow for more complex interventions.\n\n### 🤖 Split Reference Model\nThe reference model is decoupled, so it can run on different GPUs (even on a different machine with a 4090). This avoids placing the reference model and the training model on the same GPU, prevents the multiple copies created by torch’s multiprocessing, and enables training a 7B model on an 80G A800.\n\n### 💃 Performance\nTraining completed in under 1 hour on a single A800 GPU. Both Qwen2.5-7B and Qwen2.5-3B exhibited an \"Aha moment\" within the first 30 optimization steps.\n\n### 🥳 Core Loss Calculation\nThe loss calculation formula is based on Hugging Face's trl. We extend our gratitude to Hugging Face for their contribution.\n\n## 🙌 Environment\nThe runtime dependencies are listed in requirements.txt, so you can install them with:\n``` bash\npip install -r requirements.txt\n```\nAt least two GPUs are needed.\n\n## Usage\n### If you have three GPUs or more, you have an even better option!\nRun the following command:\n``` bash\nCUDA_VISIBLE_DEVICES=7 python ref_server.py\n```\nThis uses a single GPU to run the reference model and collect its outputs.\n\nIn *grpo_vllm_one.py*, set the generation device index relative to the devices made visible in the next step:\n``` python\ngen_device = 1\n```\nThen, open another shell:\n``` bash\nCUDA_VISIBLE_DEVICES=2,3,4,5,6 deepspeed grpo_vllm_one.py\n```\n## ✨ Experimental Results\n\n1. Runtime Environment\n- Hardware Setup: 2×A800 (80GB) GPUs\n- Configuration:\n  - Training: 1 GPU with ZeRO Stage 2 optimization\n  - Inference: 1 dedicated GPU (3090/4090 compatible)\n\n2. 
Training Performance\n   \n| Model        | Steps | Time       |\n|--------------|-------|------------|\n| Qwen2.5-3B   | 60    | 12m 34s    |\n| Qwen2.5-7B   | 60    | 16m 40s    |\n\n2.1 Qwen2.5-3B\n\n\u003cimg src=\"https://github.com/lsdefine/simple_GRPO/blob/main/images/Qwen2dot5-3B-res.jpg\" alt=\"Qwen2.5-3B Accuracy\" width=\"70%\" style=\"display: inline-block; margin-right: 5px;\"\u003e  \n\n- Accuracy:\n  - Stabilizes above 60% after 5 optimization steps\n  - Peaks at ~70%\n- Format Compliance:\n  - Reaches ≈100% after 30 steps\n\n2.2 Qwen2.5-7B\n\n\u003cimg src=\"https://github.com/lsdefine/simple_GRPO/blob/main/images/Qwen2dot5-7B-res.jpg\" alt=\"Qwen2.5-7B Accuracy\" width=\"70%\" style=\"display: inline-block; margin-right: 5px;\"\u003e  \n\n- Accuracy:\n  - Maintains \u003e90% throughout training\n- Format Compliance:\n  - Achieves 100% within 30 steps\n- Convergence: Both metrics show rapid convergence\n\n## Aha Moment\n1. Qwen2.5-3B (Step 20)\n\u003e `\u003cthink\u003e`\n\u003e ...\n\u003e To find out how many more eggs Camden painted than Arnold, we subtract the number of eggs painted by Arnold from the number of eggs painted by Camden. So, 21 - 28 = -7. **However, this result doesn't make sense in the context of the problem,\n\u003e as Camden cannot have painted fewer eggs than Arnold.\n\u003e Let's reconsider the initial solution steps:\n\u003e we should verify if the initial assumption about the relationship between Camden and Sarah's eggs is correct.**\n\u003e ...\n\u003e `\u003c/think\u003e`\n\u003e `\u003canswer\u003e`-7`\u003c/answer\u003e`\n\n2. Qwen2.5-7B (Step 20)\n\n\u003e `\u003cthink\u003e`\n\u003e ...\n\u003e Therefore, Joanne gathered 350 apples from the average trees.\n\u003e **However, this doesn't seem right because the total should be 500 and we've already accounted for 150,\n\u003e leaving room only for 350 from the average trees, which contradicts the total. 
Let's reassess.**\n\u003e ...\n\u003e `\u003c/think\u003e`\n\u003e `\u003canswer\u003e`350`\u003c/answer\u003e`\n \n## 😊 TODO\n- A generated group can carry no learning signal when it contains all wrong answers or all correct answers. We need group reorganization and better answer generation.\n- GPU memory is still tight when generating long CoTs. We have to split the groups to make the batches smaller.\n\nWe have implemented and are testing these features. They will be available soon.\n\n## 🎉🎉🎉 Project Members\n\nThis project is led by Dr. Jiaqing Liang and Professor Yanghua Xiao from the KnowledgeWorks Lab, Fudan University. The core development team includes Ph.D. candidate Jinyi Han, Master's student Xinyi Wang, and other contributors. We gratefully acknowledge their dedication to this work.\n\n## 👏👏👏 Citation\n\nIf you find the code in our project useful, please consider citing our work as follows:\n\n```\n@misc{KW-R1,\n  author = {Jiaqing Liang and Jinyi Han and Xinyi Wang and Zishang Jiang and Chengyuan Xiong and Boyu Zhu and Jie Shi and Weijia Li and Tingyun Li and Yanghua Xiao},\n  title = {KW-R1: A Simple Implementation of the GRPO Algorithm},\n  year = {2025},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/lsdefine/simple_GRPO}},\n}\n```\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=lsdefine/simple_GRPO\u0026type=Date)](https://star-history.com/#lsdefine/simple_GRPO\u0026Date)\n","funding_links":[],"categories":["A01_文本生成_文本对话","🧪 Demos \u0026 Projects"],"sub_categories":["大语言对话模型及数据","RL-based LLM tuning"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flsdefine%2Fsimple_GRPO","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flsdefine%2Fsimple_GRPO","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flsdefine%2Fsimple_GRPO/lists"}