{"id":18183160,"url":"https://github.com/volcengine/veRL","last_synced_at":"2025-04-01T21:31:03.165Z","repository":{"id":260409098,"uuid":"881221486","full_name":"volcengine/verl","owner":"volcengine","description":"verl: Volcano Engine Reinforcement Learning for LLMs","archived":false,"fork":false,"pushed_at":"2025-03-30T16:00:32.000Z","size":3266,"stargazers_count":5905,"open_issues_count":273,"forks_count":589,"subscribers_count":42,"default_branch":"main","last_synced_at":"2025-03-30T16:00:36.708Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://verl.readthedocs.io/en/latest/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/volcengine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-31T06:11:15.000Z","updated_at":"2025-03-30T15:44:40.000Z","dependencies_parsed_at":"2024-10-31T06:25:48.605Z","dependency_job_id":"b05a43e6-c82d-492a-bba3-7e2b6b046a8f","html_url":"https://github.com/volcengine/verl","commit_stats":null,"previous_names":["volcengine/verl"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/volcengine%2Fverl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/volcengine%2Fverl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/volcengine%2Fverl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/volcengine%2Fverl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/volcengine","download_url":"https://codeload.github.com/volcengine/verl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246712991,"owners_count":20821828,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-02T20:00:39.359Z","updated_at":"2025-04-01T21:31:03.150Z","avatar_url":"https://github.com/volcengine.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Industry Strength Reinforcement Learning"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003ch1 style=\"text-align: center;\"\u003everl: Volcano Engine Reinforcement Learning for LLM\u003c/h1\u003e\n\n[![GitHub Repo stars](https://img.shields.io/github/stars/volcengine/verl)](https://github.com/volcengine/verl/stargazers)\n![GitHub forks](https://img.shields.io/github/forks/volcengine/verl)\n[![Twitter](https://img.shields.io/twitter/follow/verl_project)](https://twitter.com/verl_project)\n\u003ca href=\"https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA\"\u003e\u003cimg src=\"https://img.shields.io/badge/Slack-verl-blueviolet?logo=slack\u0026amp\"\u003e\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/pdf/2409.19256\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=EuroSys\u0026message=Paper\u0026color=red\"\u003e\u003c/a\u003e\n![GitHub contributors](https://img.shields.io/github/contributors/volcengine/verl)\n[![Documentation](https://img.shields.io/badge/documentation-blue)](https://verl.readthedocs.io/en/latest/)\n\u003ca 
href=\"https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG\"\u003e\u003cimg src=\"https://img.shields.io/badge/微信-green?logo=wechat\u0026amp\"\u003e\u003c/a\u003e\n\n\nverl is a flexible, efficient and production-ready RL training library for large language models (LLMs).\n\nverl is the open-source version of **[HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256v2)** paper.\n\nverl is flexible and easy to use with:\n\n- **Easy extension of diverse RL algorithms**: The hybrid-controller programming model enables flexible representation and efficient execution of complex Post-Training dataflows. Build RL dataflows such as GRPO, PPO in a few lines of code.\n\n- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as FSDP, Megatron-LM, vLLM, SGLang, etc\n\n- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.\n\n- Ready integration with popular HuggingFace models\n\n\nverl is fast with:\n\n- **State-of-the-art throughput**: SOTA LLM training and inference engine integrations and SOTA RL throughput.\n\n- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.\n\n\u003c/p\u003e\n\n## News\n- [2025/03] [DAPO](https://dapo-sia.github.io/) is the open-sourced SOTA RL algorithm that achieves 50 points on AIME 2024 based on the Qwen2.5-32B pre-trained model, surpassing the previous SOTA achieved by DeepSeek's GRPO (DeepSeek-R1-Zero-Qwen-32B). 
DAPO's training is fully powered by verl and the reproduction code is [publicly available](https://github.com/volcengine/verl/tree/gm-tyx/puffin/main/recipe/dapo) now.\n- [2025/03] We will present verl(HybridFlow) at EuroSys 2025. See you in Rotterdam!\n- [2025/03] We introduced the programming model of verl at the [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg) and [verl intro and updates](https://github.com/eric-haibin-lin/verl-community/blob/main/slides/verl-lmsys-meetup.pdf) at the [LMSys Meetup](https://lu.ma/ntjrr7ig) in Sunnyvale mid March.\n- [2025/02] verl v0.2.0.post2 is released! See [release note](https://github.com/volcengine/verl/releases/) for details.\n- [2025/01] [Doubao-1.5-pro](https://team.doubao.com/zh/special/doubao_1_5_pro) is released with SOTA-level performance on LLM \u0026 VLM. The RL scaling preview model is trained using verl, reaching OpenAI O1-level performance on math benchmarks (70.0 pass@1 on AIME).\n\u003cdetails\u003e\u003csummary\u003e more... \u003c/summary\u003e\n\u003cul\u003e\n  \u003cli\u003e[2025/02] We presented verl in the \u003ca href=\"https://lu.ma/ji7atxux\"\u003eBytedance/NVIDIA/Anyscale Ray Meetup\u003c/a\u003e. See you in San Jose!\u003c/li\u003e\n  \u003cli\u003e[2024/12] verl is presented at Ray Forward 2024. Slides available \u003ca href=\"https://github.com/eric-haibin-lin/verl-community/blob/main/slides/Ray_Forward_2024_%E5%B7%AB%E9%94%A1%E6%96%8C.pdf\"\u003ehere\u003c/a\u003e\u003c/li\u003e\n  \u003cli\u003e[2024/10] verl is presented at Ray Summit. \u003ca href=\"https://www.youtube.com/watch?v=MrhMcXkXvJU\u0026list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U\u0026index=37\"\u003eYoutube video\u003c/a\u003e available.\u003c/li\u003e\n  \u003cli\u003e[2024/12] The team presented \u003ca href=\"https://neurips.cc/Expo/Conferences/2024/workshop/100677\"\u003ePost-training LLMs: From Algorithms to Infrastructure\u003c/a\u003e at NeurIPS 2024. 
\u003ca href=\"https://github.com/eric-haibin-lin/verl-data/tree/neurips\"\u003eSlides\u003c/a\u003e and \u003ca href=\"https://neurips.cc/Expo/Conferences/2024/workshop/100677\"\u003evideo\u003c/a\u003e available.\u003c/li\u003e\n  \u003cli\u003e[2024/08] HybridFlow (verl) is accepted to EuroSys 2025.\u003c/li\u003e\n\u003c/ul\u003e   \n\u003c/details\u003e\n\n## Key Features\n\n- **FSDP** and **Megatron-LM** for training.\n- **vLLM**, **SGLang**(experimental) and **HF Transformers** for rollout generation.\n- Compatible with Hugging Face Transformers and Modelscope Hub: Qwen-2.5, Llama3.1, Gemma2, DeepSeek-LLM, etc\n- Supervised fine-tuning.\n- Reinforcement learning with [PPO](examples/ppo_trainer/), [GRPO](examples/grpo_trainer/), [ReMax](examples/remax_trainer/), [REINFORCE++](https://verl.readthedocs.io/en/latest/examples/config.html#algorithm), [RLOO](examples/rloo_trainer/), [PRIME](recipe/prime/), etc.\n  - Support model-based reward and function-based reward (verifiable reward)\n  - Support vision-language models (VLMs) and [multi-modal RL](examples/grpo_trainer/run_qwen2_5_vl-7b.sh)\n- Flash attention 2, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [sequence parallelism](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh).\n- Scales up to 70B models and hundreds of GPUs.\n- Experiment tracking with wandb, swanlab, mlflow and tensorboard.\n\n## Upcoming Features\n- DeepSeek 671b optimizations with Megatron v0.11\n- Multi-turn rollout optimizations\n\n## Getting Started\n\n\u003ca href=\"https://verl.readthedocs.io/en/latest/index.html\"\u003e\u003cb\u003eDocumentation\u003c/b\u003e\u003c/a\u003e\n\n**Quickstart:**\n- [Installation](https://verl.readthedocs.io/en/latest/start/install.html)\n- [Quickstart](https://verl.readthedocs.io/en/latest/start/quickstart.html)\n- [Programming 
Guide](https://verl.readthedocs.io/en/latest/hybrid_flow.html)\n\n**Running a PPO example step-by-step:**\n- Data and Reward Preparation\n  - [Prepare Data for Post-Training](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)\n  - [Implement Reward Function for Dataset](https://verl.readthedocs.io/en/latest/preparation/reward_function.html)\n- Understanding the PPO Example\n  - [PPO Example Architecture](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html)\n  - [Config Explanation](https://verl.readthedocs.io/en/latest/examples/config.html)\n  - [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)\n\n**Reproducible algorithm baselines:**\n- [PPO, GRPO, ReMax](https://verl.readthedocs.io/en/latest/experiment/ppo.html)\n\n**For code explanation and advanced usage (extension):**\n- PPO Trainer and Workers\n  - [PPO Ray Trainer](https://verl.readthedocs.io/en/latest/workers/ray_trainer.html)\n  - [PyTorch FSDP Backend](https://verl.readthedocs.io/en/latest/workers/fsdp_workers.html)\n  - [Megatron-LM Backend](https://verl.readthedocs.io/en/latest/index.html)\n- Advanced Usage and Extension\n  - [Ray API design tutorial](https://verl.readthedocs.io/en/latest/advance/placement.html)\n  - [Extend to Other RL(HF) algorithms](https://verl.readthedocs.io/en/latest/advance/dpo_extension.html)\n  - [Add Models with the FSDP Backend](https://verl.readthedocs.io/en/latest/advance/fsdp_extension.html)\n  - [Add Models with the Megatron-LM Backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html)\n  - [Deployment using Separate GPU Resources](https://github.com/volcengine/verl/tree/main/examples/split_placement)\n\n**Blogs from the community**\n- [使用verl进行GRPO分布式强化学习训练最佳实践](https://www.volcengine.com/docs/6459/1463942)\n- [HybridFlow veRL 原文浅析](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/readme.md)\n- [最高提升20倍吞吐量！豆包大模型团队发布全新 RLHF 
框架，现已开源！](https://team.doubao.com/en/blog/%E6%9C%80%E9%AB%98%E6%8F%90%E5%8D%8720%E5%80%8D%E5%90%9E%E5%90%90%E9%87%8F-%E8%B1%86%E5%8C%85%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%9B%A2%E9%98%9F%E5%8F%91%E5%B8%83%E5%85%A8%E6%96%B0-rlhf-%E6%A1%86%E6%9E%B6-%E7%8E%B0%E5%B7%B2%E5%BC%80%E6%BA%90)\n\n\n## Performance Tuning Guide\nPerformance is essential for on-policy RL algorithms. We have written a detailed [performance tuning guide](https://verl.readthedocs.io/en/latest/perf/perf_tuning.html) to help you optimize performance.\n\n## Use vLLM v0.8\nverl now supports vLLM\u003e=0.8.0 when using FSDP as the training backend. Please refer to [this document](https://github.com/volcengine/verl/blob/main/docs/README_vllm0.8.md) for the installation guide and more information.\n\n## Citation and acknowledgement\n\nIf you find the project helpful, please cite:\n- [HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256v2)\n- [A Framework for Training Large Language Models for Code Generation via Proximal Policy Optimization](https://i.cs.hku.hk/~cwu/papers/gmsheng-NL2Code24.pdf)\n\n```bibtex\n@article{sheng2024hybridflow,\n  title   = {HybridFlow: A Flexible and Efficient RLHF Framework},\n  author  = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},\n  year    = {2024},\n  journal = {arXiv preprint arXiv: 2409.19256}\n}\n```\n\nverl is inspired by the design of Nemo-Aligner, Deepspeed-chat and OpenRLHF. 
The project is adopted and supported by Anyscale, Bytedance, LMSys.org, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, University of Hong Kong, and many more.\n\n## Awesome work using verl\n- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): a reproduction of **DeepSeek R1 Zero** recipe for reasoning tasks ![GitHub Repo stars](https://img.shields.io/github/stars/Jiayi-Pan/TinyZero)\n- [DAPO](https://dapo-sia.github.io/): the fully open source SOTA RL algorithm that beats DeepSeek-R1-zero-32B ![GitHub Repo stars](https://img.shields.io/github/stars/volcengine/verl)\n- [SkyThought](https://github.com/NovaSky-AI/SkyThought): RL training for Sky-T1-7B by NovaSky AI team. ![GitHub Repo stars](https://img.shields.io/github/stars/NovaSky-AI/SkyThought)\n- [simpleRL-reason](https://github.com/hkust-nlp/simpleRL-reason): SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild ![GitHub Repo stars](https://img.shields.io/github/stars/hkust-nlp/simpleRL-reason)\n- [Easy-R1](https://github.com/hiyouga/EasyR1): **Multi-modal** RL training framework ![GitHub Repo stars](https://img.shields.io/github/stars/hiyouga/EasyR1)\n- [OpenManus-RL](https://github.com/OpenManus/OpenManus-RL): LLM Agents RL tuning framework for multiple agent environments. 
![GitHub Repo stars](https://img.shields.io/github/stars/OpenManus/OpenManus-RL)\n- [deepscaler](https://github.com/agentica-project/deepscaler): iterative context scaling with GRPO ![GitHub Repo stars](https://img.shields.io/github/stars/agentica-project/deepscaler)\n- [PRIME](https://github.com/PRIME-RL/PRIME): Process reinforcement through implicit rewards ![GitHub Repo stars](https://img.shields.io/github/stars/PRIME-RL/PRIME)\n- [RAGEN](https://github.com/ZihanWang314/ragen): a general-purpose reasoning **agent** training framework ![GitHub Repo stars](https://img.shields.io/github/stars/ZihanWang314/ragen)\n- [Logic-RL](https://github.com/Unakar/Logic-RL): a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset. ![GitHub Repo stars](https://img.shields.io/github/stars/Unakar/Logic-RL)\n- [Search-R1](https://github.com/PeterGriffinJin/Search-R1): RL with reasoning and **searching (tool-call)** interleaved LLMs ![GitHub Repo stars](https://img.shields.io/github/stars/PeterGriffinJin/Search-R1)\n- [ReSearch](https://github.com/Agent-RL/ReSearch): Learning to **Re**ason with **Search** for LLMs via Reinforcement Learning ![GitHub Repo stars](https://img.shields.io/github/stars/Agent-RL/ReSearch)\n- [DeepRetrieval](https://github.com/pat-jj/DeepRetrieval): Hacking **Real Search Engines** and **retrievers** with LLMs via RL for **information retrieval** ![GitHub Repo stars](https://img.shields.io/github/stars/pat-jj/DeepRetrieval)\n- [cognitive-behaviors](https://github.com/kanishkg/cognitive-behaviors): Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs ![GitHub Repo stars](https://img.shields.io/github/stars/kanishkg/cognitive-behaviors)\n- [PURE](https://github.com/CJReinforce/PURE): **Credit assignment** is the key to successful reinforcement fine-tuning using **process reward model** ![GitHub Repo stars](https://img.shields.io/github/stars/CJReinforce/PURE)\n- 
[MetaSpatial](https://github.com/PzySeere/MetaSpatial): Reinforcing **3D Spatial Reasoning** in **VLMs** for the **Metaverse** ![GitHub Repo stars](https://img.shields.io/github/stars/PzySeere/MetaSpatial)\n- [DeepEnlighten](https://github.com/DolbyUUU/DeepEnlighten): Reproduce R1 with **social reasoning** tasks and analyze key findings ![GitHub Repo stars](https://img.shields.io/github/stars/DolbyUUU/DeepEnlighten)\n- [Code-R1](https://github.com/ganler/code-r1): Reproducing R1 for **Code** with Reliable Rewards ![GitHub Repo stars](https://img.shields.io/github/stars/ganler/code-r1)\n- [self-rewarding-reasoning-LLM](https://arxiv.org/pdf/2502.19613): self-rewarding and correction with **generative reward models** ![GitHub Repo stars](https://img.shields.io/github/stars/RLHFlow/Self-rewarding-reasoning-LLM)\n- [critic-rl](https://github.com/HKUNLP/critic-rl): LLM critics for code generation ![GitHub Repo stars](https://img.shields.io/github/stars/HKUNLP/critic-rl)\n- [DQO](https://arxiv.org/abs/2410.09302): Enhancing multi-Step reasoning abilities of language models through direct Q-function optimization\n- [FIRE](https://arxiv.org/abs/2410.21236): Flaming-hot initiation with regular execution sampling for large language models\n\n## Contribution Guide\nContributions from the community are welcome! Please check out our [project roadmap](https://github.com/volcengine/verl/issues/22) and [release plan](https://github.com/volcengine/verl/issues/354) to see where you can contribute.\n\n### Code formatting\nWe use yapf (Google style) to enforce strict code formatting when reviewing PRs. To reformat your code locally, make sure you have installed the **latest** version of `yapf`\n```bash\npip3 install yapf --upgrade\n```\nThen, make sure you are at top level of verl repo and run\n```bash\nbash scripts/format.sh\n```\nWe are HIRING! 
Send us an [email](mailto:haibin.lin@bytedance.com) if you are interested in internship/FTE opportunities in MLSys/LLM reasoning/multimodal alignment.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvolcengine%2FveRL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvolcengine%2FveRL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvolcengine%2FveRL/lists"}