https://github.com/the-swarm-corporation/agentgym
A framework that makes it effortless to convert any LLM into a reasoning agent like o1 or DeepSeek's R1
- Host: GitHub
- URL: https://github.com/the-swarm-corporation/agentgym
- Owner: The-Swarm-Corporation
- License: mit
- Created: 2025-01-29T16:33:46.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-06-27T12:41:15.000Z (4 months ago)
- Last Synced: 2025-06-29T13:44:50.260Z (4 months ago)
- Topics: agents, ai, alibaba, deepseek, llms, o1, qwen, r1, rl
- Language: Python
- Homepage: https://swarms.ai
- Size: 2.39 MB
- Stars: 21
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# Agent Gym
[Discord](https://discord.gg/swarms) [YouTube](https://www.youtube.com/@kyegomez3242) [LinkedIn](https://www.linkedin.com/in/kye-g-38759a207/) [X](https://x.com/kyegomezb)

Convert any model into an R1-like, hyper-intelligent reasoning agent. Leverages TRL, Hugging Face, and various other libraries. This is a work in progress. Our goal is to make it easy to train any model into a reasoning agent.
- Sources:
- [Open R1 Blog](https://huggingface.co/blog/open-r1)
- [GRPO Documentation from trl](https://huggingface.co/docs/trl/main/en/grpo_trainer)
- [Huggingface Docs](https://huggingface.co/docs/transformers/main/en/index)
- [GRPO Docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)

## Installation
```bash
pip3 install -U agentgym
```

## Usage
```python
from agentgym.r1_pipeline import R1Pipeline, SFTConfig

# Configure the pipeline with the base model, tokenizer, and SFT dataset.
r1_pipeline = R1Pipeline(
    sft_model="Qwen/Qwen2-0.5B-Instruct",
    tokenizer_name="Qwen/Qwen2-0.5B-Instruct",
    sft_dataset="trl-lib/tldr",
    sft_args=SFTConfig(output_dir="/tmp"),
    only_grpo=True,
    model_name="Qwen/Qwen2-0.5B-Instruct",
)

# Start training.
r1_pipeline.run()
```
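
GRPO training in TRL scores each sampled completion with one or more reward functions. Whether `R1Pipeline` accepts a custom reward function is not stated here, so the following is only a minimal sketch of the kind of callable TRL's `GRPOTrainer` takes through its `reward_funcs` argument. The name `format_reward`, the `<think>` tag convention, and the 1.0/0.0 scoring are illustrative assumptions, and the sketch handles plain-string completions only (not conversational message lists).

```python
import re

# Assumed format: reasoning inside <think>...</think>, followed by a non-empty answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S.*$", re.DOTALL)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that follows the assumed format, else 0.0."""
    return [1.0 if THINK_PATTERN.match(text.strip()) else 0.0 for text in completions]


if __name__ == "__main__":
    samples = [
        "<think>2 + 2 is 4.</think> The answer is 4.",  # reasoning, then answer
        "The answer is 4.",                              # no reasoning block
    ]
    print(format_reward(samples))
```

A format-only reward like this encourages the model to show its reasoning but says nothing about correctness; in practice it would likely be combined with a task-specific reward.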
## Architecture
The pipeline trains a model in two stages:
- SFT: Supervised Fine-Tuning
- GRPO: Group Relative Policy Optimization

The overall flow is `model -> sft -> grpo -> model`.
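
As the TRL GRPO documentation linked above describes, the GRPO stage samples several completions per prompt and scores each one relative to the others in its group. The sketch below illustrates that group-relative normalization in plain Python. It is not agentgym's or TRL's implementation: the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against the group.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    that beat the group average get a positive value and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for the same prompt, scored by some reward function.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
```

Because the baseline is the group mean, no separate value model is needed to estimate advantages, which is the main design difference from PPO-style training.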
```mermaid
graph TD;
A[model] --> B[sft]
B --> C[grpo]
C --> D[reasoning model]
```

# License
MIT