https://github.com/0russwest0/Agent-R1
- Host: GitHub
- URL: https://github.com/0russwest0/Agent-R1
- Owner: 0russwest0
- License: MIT
- Created: 2025-03-04T02:29:05.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-12T11:33:27.000Z (about 1 year ago)
- Last Synced: 2025-03-12T12:29:49.460Z (about 1 year ago)
- Language: Python
- Size: 853 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ai-papers - \[[Agent-R1](https://github.com/0russwest0/Agent-R1)\]\[[RAGEN](https://github.com/RAGEN-AI/RAGEN)\]\[[VAGEN](https://github.com/RAGEN-AI/VAGEN)\]\[[OpenManus-RL](https://github.com/OpenManus/OpenManus-RL)\]\[[SWEET-RL](https://arxiv.org/abs/2503.15478)\]\[[APIGen-MT](https://arxiv.org/abs/2504.03601)\] (NLP / 3. Pretraining)
- Awesome-Long-Chain-of-Thought-Reasoning - Agent-R1
- StarryDivineSky - 0russwest0/Agent-R1 - Agent-R1 is an open-source project focused on training powerful language model agents (LLM agents) through end-to-end reinforcement learning. Its core goal is to develop AI agents that can autonomously complete complex tasks, learning to optimize their behavior policies through interaction with an environment, trial and error, and reward feedback. Using reinforcement learning, the agent adjusts its decisions according to reward signals from the environment, for example generating replies that better match user needs in text tasks, or executing instructions more efficiently in simulated environments. Key innovations include a custom reward model and reward shaping: the reward model, trained on human-feedback data, helps the agent understand which behaviors deserve reward, while reward shaping adjusts the reward function to speed up training and improve performance (in a game environment, for instance, an agent might learn to beat opponents by being rewarded for higher scores). The project is built on a reinforcement learning framework in which the agent interacts with an environment (text tasks, games, or real-world scenarios), receives feedback, and updates its policy; training uses algorithms such as PPO (Proximal Policy Optimization) or DDPG (Deep Deterministic Policy Gradient) to adjust behavior according to cumulative reward. Agent-R1's modular design lets users swap out components such as the reward model, environment simulator, or algorithm, making it suitable for many applications: agents can be trained for automated customer service, game AI, or robot control, and the project supports environments ranging from text generation and game scenarios to physical robots. Highlights: 1) a complete end-to-end reinforcement learning pipeline with no manual intervention; 2) a modular architecture that is easy to extend and customize; 3) multi-environment support for broad applicability; 4) compatibility with mainstream large language models (such as GPT and LLaMA), so existing models can be integrated directly. Through these designs, Agent-R1 gives researchers and developers a flexible, efficient tool for exploring the potential of language model agents on complex tasks. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
- awesome-rl-for-agents - Agent-R1
README
# Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

**2025.3.18 Update:** We have added support for **process rewards**! You can now assign rewards for each tool call based on its effectiveness. To balance process rewards with outcome rewards, we implemented reward normalization inspired by [PRIME](https://github.com/PRIME-RL/PRIME).
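
As a rough sketch of how process and outcome rewards can be balanced, the snippet below standardizes the per-call process rewards before mixing them with the trajectory-level outcome reward. The helper name, the weighting, and the standardization scheme are illustrative assumptions, not the normalization Agent-R1 actually ships; see the algorithm doc and PRIME for the real approach.

```python
import numpy as np

def blend_rewards(process_rewards, outcome_reward, alpha=0.5):
    """Hypothetical helper: mix per-tool-call process rewards with a
    trajectory-level outcome reward. Standardizing the process stream
    keeps either signal from dominating training.
    (Illustrative only; not Agent-R1's actual normalization.)"""
    process = np.asarray(process_rewards, dtype=np.float64)
    if process.size and process.std() > 1e-8:
        # Zero-mean, unit-variance process rewards within the trajectory.
        process = (process - process.mean()) / process.std()
    # Credit the outcome reward at the final step, as is common for
    # episodic tasks, then add the scaled process signal per step.
    per_step = np.zeros_like(process)
    if per_step.size:
        per_step[-1] = outcome_reward
    return per_step + alpha * process
```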
## Overview
**Agent-R1** is an open-source framework designed to accelerate research and development at the critical intersection of large language models and agent-oriented reinforcement learning. The framework employs **end-to-end** reinforcement learning to train agents in specific environments: developers need only define domain-specific tools and reward functions to extend Agent-R1 to their own use cases, with no complex workflow engineering required. We hope this modest contribution benefits the open-source community, making it easier for researchers and developers to create and explore agents in their own domains and collectively advancing the development of autonomous agents. For more details on the algorithm, see the [algorithm doc](https://github.com/0russwest0/Agent-R1/blob/main/docs/algorithm/algorithm.md).

## Key Features
- **Multi-turn Tool Calling**: End-to-end reinforcement learning on complete interaction trajectories, allowing agents to learn from sequences of actions (see the rollout sketch after this list)
- **Custom Tools and Environments**: Compatible with mainstream LLM tool calling formats, making it easy to extend with your own tools and scenarios
- **Multiple RL Algorithms**: Supports diverse reinforcement learning approaches including PPO, GRPO, and REINFORCE++
- **Reasoning before Action**: Jointly optimize reasoning and action strategies over entire trajectories
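
To make the multi-turn picture concrete, here is a minimal sketch of what one end-to-end rollout could look like. The `<tool_call>` tag follows Qwen's tool-calling convention, and `policy.generate` / `env.execute` are hypothetical placeholders rather than Agent-R1's actual API.

```python
import json
import re

def parse_tool_call(reply: str):
    """Extract a JSON tool call such as {"name": "search", "arguments": {...}}
    from a <tool_call>...</tool_call> block; None means a final answer."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", reply, re.DOTALL)
    return json.loads(match.group(1)) if match else None

def rollout(policy, env, question, max_turns=4):
    """Collect one complete interaction trajectory; in end-to-end RL the
    whole trajectory, not any single turn, is scored and trained on."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = policy.generate(messages)          # reason, then maybe act
        messages.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:                           # model produced a final answer
            break
        observation = env.execute(call)            # e.g. search results
        messages.append({"role": "tool", "content": observation})
    return messages
```

Trajectories collected this way can then be scored and optimized by any of the supported algorithms (PPO, GRPO, REINFORCE++).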
## Upcoming Features
- **Immediate Action Rewards**: Per-action reward mechanisms to complement trajectory-level reinforcement
- **Expanded Model Support**: Integration with more foundation models beyond the currently supported Qwen
- **Additional Use Cases**: More example implementations across diverse scenarios and domains
## Get Started
- [Environment Setup](https://github.com/0russwest0/Agent-R1/blob/main/docs/getting_started/installation.md)
- [Quick Start: Try Default Search Tool on HotpotQA](https://github.com/0russwest0/Agent-R1/blob/main/docs/getting_started/quickstart.md)
### Results on HotpotQA
*(Training-curve figures for PPO, REINFORCE++, and GRPO are omitted here; see the repository README.)*
We can see that the model (Qwen2.5-1.5B-Instruct) effectively learns to think and then invoke the tool across multiple rounds when faced with challenging multi-hop questions, ultimately achieving improved EM (exact match) results. The effectiveness of different reinforcement learning algorithms varies, but the general trend is the same.
Notably, our experiments reveal a striking correlation: EM scores, number of tool calls (turns), and final response length all display consistent trends across training. This demonstrates a novel dimension of scaling laws—one that relates to the frequency of agent-environment interactions. As the agent learns to interact more effectively with its environment through multiple tool calls, performance improves proportionally, suggesting that the ability to engage in multiple rounds of environment interaction may be as crucial to agent performance as traditional scaling factors.
## Extending Agent-R1 with Your Own Tools and Environments
**Extending Agent-R1** is straightforward: create **custom tools** by extending the `Tool` base class, implement **data preprocessing** scripts to format your dataset, and define **reward functions** for task-specific evaluation. Register these components in their respective directories, and configure a training script to adapt Agent-R1 to your use case.
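
As an illustration, a hypothetical custom tool might look like the sketch below. The import path and the `Tool` interface shown here (name, description, parameters, execute) are assumptions modeled on common LLM tool-calling schemas; consult the bundled tools listed below for the real base-class signatures.

```python
from agent_r1.tool.base import Tool  # assumed import path; see the repo for the real one

class WeatherTool(Tool):
    """Hypothetical tool; attribute and method names mirror common
    tool-calling schemas but may differ from Agent-R1's base class."""
    name = "weather"
    description = "Look up the current weather for a city."
    parameters = {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }

    def execute(self, args: dict) -> str:
        # A real tool would call an external API here; a stub keeps
        # the sketch self-contained.
        return f"Sunny, 22°C in {args['city']}"
```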
For detailed implementation guidance, examine the existing code:
- Tools: `agent_r1/tool/tools/calculator_tool.py`, `search_tool.py`
- Data processing: `examples/data_preprocess/hotpotqa.py`
- Reward functions: `verl/utils/reward_score/qa_em_and_format.py`
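
For a sense of what such a reward function involves, here is a simplified, self-contained approximation of the exact-match-plus-format idea suggested by the file name above. The tag format, weights, and function signature are illustrative guesses, not the file's actual contents.

```python
import re
import string

def _normalize(text: str) -> str:
    """Lowercase, strip whitespace, and drop punctuation before comparing."""
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def compute_score(solution: str, ground_truth: str) -> float:
    """Outcome reward = exact match on the extracted answer, plus a small
    bonus for following the expected <answer>...</answer> format."""
    match = re.search(r"<answer>(.*?)</answer>", solution, re.DOTALL)
    format_score = 0.1 if match else 0.0
    answer = match.group(1) if match else solution
    em_score = 1.0 if _normalize(answer) == _normalize(ground_truth) else 0.0
    return em_score + format_score
```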
See the [extending doc](https://github.com/0russwest0/Agent-R1/blob/main/docs/extend/extending.md) for details.
## Feedback
We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.
## Contributors
[**Jie Ouyang**\*](https://github.com/0russwest0), [**Ruiran Yan**\*](https://github.com/RuiranYan), [**Yucong Luo**\*](https://github.com/GodFire66666), Zirui Liu, Shuo Yu
## Acknowledgements
We extend our gratitude to [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1) for providing the DeepSeek-R1 model and inspiring ideas. We are also thankful to the [veRL](https://github.com/volcengine/verl) team for their robust infrastructure support. Additionally, we acknowledge the [RAGEN](https://github.com/ZihanWang314/ragen) team for their groundbreaking discoveries, which significantly influenced our early exploration. Lastly, we deeply appreciate the insightful discussions and contributions from Jie Ouyang, Ruiran Yan, and Yucong Luo.
## Citation
```bibtex
@misc{Agent-R1,
  author = {Jie Ouyang and Ruiran Yan and Yucong Luo and Zirui Liu and Shuo Yu},
  title = {Training Powerful LLM Agents with End-to-End Reinforcement Learning},
  year = {2025},
  organization = {GitHub},
  url = {https://github.com/0russwest0/Agent-R1},
}
```