{"id":50510407,"url":"https://github.com/Gen-Verse/Open-AgentRL","last_synced_at":"2026-06-19T14:00:36.106Z","repository":{"id":336134283,"uuid":"1075117476","full_name":"Gen-Verse/Open-AgentRL","owner":"Gen-Verse","description":"Open-source Agentic RL for LLMs — RLAnything \u0026 DemyAgent","archived":false,"fork":false,"pushed_at":"2026-02-03T03:39:41.000Z","size":9594,"stargazers_count":186,"open_issues_count":1,"forks_count":25,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-02-03T12:43:51.049Z","etag":null,"topics":["agent-rl","coding-agent","entropy-method","gui-agent","llm-agent","llm-reasoning","multi-agent-reinforcement-learning","reinforcement-learning"],"latest_commit_sha":null,"homepage":"https://huggingface.co/collections/Gen-Verse/open-agentrl","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gen-Verse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"Notice.txt","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-13T04:08:07.000Z","updated_at":"2026-02-03T12:19:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Gen-Verse/Open-AgentRL","commit_stats":null,"previous_names":["gen-verse/open-agentrl"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Gen-Verse/Open-AgentRL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gen-Verse%2FOpen-AgentRL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/Gen-Verse%2FOpen-AgentRL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gen-Verse%2FOpen-AgentRL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gen-Verse%2FOpen-AgentRL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gen-Verse","download_url":"https://codeload.github.com/Gen-Verse/Open-AgentRL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gen-Verse%2FOpen-AgentRL/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34534278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-rl","coding-agent","entropy-method","gui-agent","llm-agent","llm-reasoning","multi-agent-reinforcement-learning","reinforcement-learning"],"created_at":"2026-06-02T20:00:26.252Z","updated_at":"2026-06-19T14:00:36.092Z","avatar_url":"https://github.com/Gen-Verse.png","language":"Python","funding_links":[],"categories":["🤝 OPD-RL Hybrids — Inside-RL OPD"],"sub_categories":["🔁 Iterative Self-Bootstrapping"],"readme":"\u003cdiv align=\"center\"\u003e\r\n\u003cimg src=\"figs/image2.png\" width=\"330\"\u003e\r\n\r\n\r\n\u003c/div\u003e\r\n\r\n\r\n\r\n### RLAnything (ICML 2026) \u0026 AutoTool (ICML 2026), 
DemyAgent: Open-Source RL for LLMs and Agentic Scenarios\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\u003cdetails\u003e\r\n  \u003csummary\u003e\r\n    \u003cb\u003eRLAnything\u003c/b\u003e\r\n    \u003ca href=\"https://arxiv.org/abs/2602.02488\"\u003e\r\n      \u003cimg src=\"https://img.shields.io/badge/Paper-Arxiv%202602.02488-red?logo=arxiv\u0026logoColor=red\" alt=\"Paper\" height=\"18\" /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"https://huggingface.co/collections/Gen-Verse/open-agentrl\"\u003e\r\n      \u003cimg src=\"https://img.shields.io/badge/Models-Policy%20\u0026%20Reward-FFCC00?logo=huggingface\u0026logoColor=yellow\" alt=\"Model\" height=\"18\" /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"https://yinjjiew.github.io/projects/rlanything/\"\u003e\r\n      \u003cimg src=\"https://img.shields.io/badge/Blog-RLAnything-blue?logo=rss\u0026logoColor=white\" alt=\"Blog\" height=\"18\" /\u003e\r\n    \u003c/a\u003e\r\n    \u003cb\u003e(click to expand)\u003c/b\u003e\r\n  \u003c/summary\u003e\r\n\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003ch3\u003e\r\n    RLAnything: Forge Environment, Policy, and Reward Model\u003cbr\u003e\r\n    in Completely Dynamic RL System\r\n  \u003c/h3\u003e\r\n\u003c/div\u003e\r\n\r\n\r\n\r\n\u003ctable class=\"center\"\u003e     \u003ctr\u003e     \u003ctd width=100% style=\"border: none\"\u003e\u003cimg src=\"figs/rlanythingoverview.png\" style=\"width:100%\"\u003e\u003c/td\u003e     \u003c/tr\u003e     \u003ctr\u003e     \u003ctd width=\"100%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003eAn overview of our research on RLAnything. 
\u003c/td\u003e   \u003c/tr\u003e \u003c/table\u003e\r\n\r\nIn this work, we propose RLAnything, a reinforcement learning framework that dynamically optimizes each component through closed-loop optimization, amplifying learning signals and strengthening the overall system:\r\n* The policy is trained with integrated feedback that combines outcome signals with step-wise signals from the reward model, which works better than using outcome signals alone.\r\n* The reward model is jointly optimized via consistency feedback, which in turn further improves policy training.\r\n* Our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience.\r\n* Through extensive experiments, we demonstrate that each added component consistently improves the overall system.\r\n* We show that step-wise signals from an optimized reward model outperform outcome signals that rely on human labels.\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"figs/rlanythingpaperoverview.png\" alt=\"Figure 1\" width=\"600\"\u003e\r\n\u003c/p\u003e\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\u003cdetails\u003e\r\n  \u003csummary\u003e\r\n    \u003cb\u003eDemyAgent\u003c/b\u003e\r\n    \u003ca href=\"https://arxiv.org/abs/2510.11701\"\u003e\r\n      \u003cimg\r\n        src=\"https://img.shields.io/badge/Paper-Arxiv%202510.11701-red?logo=arxiv\u0026logoColor=red\"\r\n        alt=\"Paper\"\r\n        height=\"18\"\r\n        style=\"vertical-align: middle;\"\r\n      /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"https://huggingface.co/collections/Gen-Verse/open-agentrl-68eda4c05755ca5a8c663656\"\u003e\r\n      \u003cimg\r\n        src=\"https://img.shields.io/badge/Datasets-Agent%20RL%20Datasets-orange?logo=huggingface\u0026logoColor=yellow\"\r\n        alt=\"Data\"\r\n        height=\"18\"\r\n        style=\"vertical-align: middle;\"\r\n      /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca 
href=\"https://huggingface.co/Gen-Verse/DemyAgent-4B\"\u003e\r\n      \u003cimg\r\n        src=\"https://img.shields.io/badge/DemyAgent%204B-DemyAgent%204B%20Model-FFCC00?logo=huggingface\u0026logoColor=yellow\"\r\n        alt=\"Model\"\r\n        height=\"18\"\r\n        style=\"vertical-align: middle;\"\r\n      /\u003e\r\n    \u003c/a\u003e\r\n    \u003cb\u003e(click to expand)\u003c/b\u003e\r\n  \u003c/summary\u003e\r\n\r\n\u003cdiv\u003e\r\n\u003ch3\u003eDemystifying Reinforcement Learning in Agentic Reasoning\u003c/h3\u003e\u003c/div\u003e\r\n\r\n\r\n\r\n\r\n    \r\n\u003ctable class=\"center\"\u003e     \u003ctr\u003e     \u003ctd width=100% style=\"border: none\"\u003e\u003cimg src=\"figs/overview.png\" style=\"width:100%\"\u003e\u003c/td\u003e     \u003c/tr\u003e     \u003ctr\u003e     \u003ctd width=\"100%\" style=\"border: none; text-align: center; word-wrap: break-word\"\u003eAn overview of our research on agentic RL. \u003c/td\u003e   \u003c/tr\u003e \u003c/table\u003e\r\n\r\nIn this work, we systematically investigate three dimensions of agentic RL: **data, algorithms, and reasoning modes**. Our findings reveal: \r\n* Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives; \r\n* Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency; \r\n* Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning. 
\r\n\r\nWe also contribute [high-quality SFT and RL datasets](https://huggingface.co/collections/Gen-Verse/open-agentrl-68eda4c05755ca5a8c663656), demonstrating that **simple recipes enable even [4B models](https://huggingface.co/Gen-Verse/DemyAgent-4B) to outperform 32B models** on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6.\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\u003cdetails\u003e\r\n  \u003csummary\u003e\r\n    \u003cb\u003eAutoTool\u003c/b\u003e\r\n    \u003ca href=\"https://arxiv.org/abs/2512.13278\"\u003e\r\n      \u003cimg\r\n        src=\"https://img.shields.io/badge/Paper-Arxiv%202512.13278-red?logo=arxiv\u0026logoColor=red\"\r\n        alt=\"Paper\"\r\n        height=\"18\"\r\n        style=\"vertical-align: middle;\"\r\n      /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"autotool/\"\u003e\r\n      \u003cimg\r\n        src=\"https://img.shields.io/badge/Code-AutoTool-181717?logo=github\u0026logoColor=white\"\r\n        alt=\"Code\"\r\n        height=\"18\"\r\n        style=\"vertical-align: middle;\"\r\n      /\u003e\r\n    \u003c/a\u003e\r\n    \u003ca href=\"https://icml.cc/\"\u003e\r\n      \u003cimg\r\n        src=\"https://img.shields.io/badge/ICML-2026-blue\"\r\n        alt=\"ICML 2026\"\r\n        height=\"18\"\r\n        style=\"vertical-align: middle;\"\r\n      /\u003e\r\n    \u003c/a\u003e\r\n    \u003cb\u003e(click to expand)\u003c/b\u003e\r\n  \u003c/summary\u003e\r\n\r\n\u003cdiv\u003e\r\n\u003ch3\u003eAutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning\u003c/h3\u003e\u003c/div\u003e\r\n\r\n\r\n\r\n\r\n    \r\n\u003ctable class=\"center\"\u003e     \u003ctr\u003e     \u003ctd width=100% style=\"border: none\"\u003e\u003cimg src=\"autotool/assets/intro_illus.png\" style=\"width:100%\"\u003e\u003c/td\u003e     \u003c/tr\u003e     \u003ctr\u003e     \u003ctd width=\"100%\" style=\"border: none; text-align: center; word-wrap: 
break-word\"\u003eAn overview of our research on AutoTool. \u003c/td\u003e   \u003c/tr\u003e \u003c/table\u003e\r\n\r\nIn this work, we move beyond the fixed-toolset assumption in agentic RL and train LLM agents with **dynamic tool-selection** capabilities over large, evolving toolsets:\r\n* We construct a **200K tool-use trajectory dataset** with explicit tool-selection rationales, covering **1,346 tools and 120 task types** across math, science, search QA, code generation, and multimodal reasoning;\r\n* AutoTool employs a **dual-phase optimization pipeline**: Phase I stabilizes tool-integrated reasoning trajectories with SFT and RL, and Phase II refines multi-step tool selection with a **KL-regularized Plackett–Luce ranking** objective;\r\n* Trained on only 460 seen tools, AutoTool **generalizes to a full 1,346-tool library (including 886 unseen tools)** at inference time, and consistently outperforms advanced LLM agents and tool-integration methods across ten benchmarks, with average gains of **+6.4%** (math \u0026 science), **+4.5%** (search QA), **+7.7%** (code), and **+6.9%** (multimodal).\r\n\r\nSee [`autotool/`](autotool/) for the full training framework and instructions.\r\n\r\n\u003c/details\u003e\r\n\r\n| | **RLAnything** | **DemyAgent** | **AutoTool** |\r\n|---|---|---|---|\r\n| **Focus** | Closed-loop RL optimization | Agentic reasoning | Dynamic tool selection |\r\n| **Core Idea** | Joint optimization of policy, reward model \u0026 environment | Real trajectories + exploration-friendly techniques + deliberative reasoning | Tool-selection rationales + PL-ranking refinement over evolving toolsets |\r\n| **Release** | LLM/GUI/Coding Policy \u0026 Reward Model | 3K SFT + 30K RL Data, SOTA-level DemyAgent-4B | Dual-phase training code, 200K tool-use data (coming soon) |\r\n\r\n\r\n## 🚩 New Updates\r\n\r\n- **[2026.5]** 🎉 **AutoTool is accepted by ICML 2026!** We open-source our work [**AutoTool**](https://arxiv.org/abs/2512.13278) under 
[`autotool/`](autotool/), including:\r\n  - Dual-phase training code: Phase-I trajectory stabilization (SFT + RL) and Phase-II tool-selection refinement with KL-regularized Plackett–Luce ranking\r\n  - Example data format of our 200K tool-selection trajectory dataset (1,346 tools, 120 task types) — full training data and toolset release coming soon\r\n\r\n- **[2026.5]** 🎉 **RLAnything is accepted by ICML 2026!** Check out our [paper](https://arxiv.org/abs/2602.02488), [models](https://huggingface.co/collections/Gen-Verse/open-agentrl), and [blog](https://yinjjiew.github.io/projects/rlanything/).\r\n\r\n- **[2026.2]** 🦞 We release [**OpenClaw-RL**](https://github.com/Gen-Verse/OpenClaw-RL), a new fully asynchronous RL framework built on top of Open-AgentRL, targeting **personalized agentic AI** trained from live conversation feedback. OpenClaw-RL introduces:\r\n  - **Binary RL (GRPO):** PRM-based scalar reward from next-state feedback for policy optimization\r\n  - **On-Policy Distillation (OPD):** Token-level directional learning from hindsight hints — richer than any scalar signal\r\n  - **Zero API keys \u0026 fully self-hosted:** conversation data never leaves your infrastructure\r\n\r\n- **[2026.2]** We fully open-source our work [**RLAnything**](https://arxiv.org/abs/2602.02488), including:\r\n  - Training code across GUI Agent, LLM Agent, and Coding LLM settings.\r\n  - Model checkpoints: both the policy models ([RLAnything-7B/8B](https://huggingface.co/collections/Gen-Verse/open-agentrl)) and reward models ([RLAnything-Reward-8B/14B](https://huggingface.co/collections/Gen-Verse/open-agentrl)) across these settings.\r\n  - Evaluation Scripts for our models \r\n\r\n\r\n- **[2025.10]** We fully open-source our work [**DemyAgent**](https://arxiv.org/abs/2510.11701), including:\r\n  - Training code for both SFT and RL stages\r\n  - High-quality SFT dataset (3K samples) and RL dataset (30K samples)\r\n  - Model checkpoints: SFT models (Qwen2.5-7B-RA-SFT, 
Qwen3-4B-RA-SFT) and RL-trained model ([DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B))\r\n  - Evaluation Scripts for our models\r\n\r\n\r\n\r\n\r\n## 🧭 Navigation\r\n\r\n\r\n    \r\n- **DemyAgent**:\r\n  - [Get Started](#demyagent-get-started)\r\n  - [Training](#demyagent-train)\r\n    - [Cold-Start SFT](#demyagent-cold-sft) \r\n    - [Agentic RL](#demyagent-agent-rl) \r\n  - [Evaluation](#demyagent-eval)\r\n  - [Results](#demyagent-result)\r\n\r\n- **RLAnything**:\r\n  - [Get Started](#rlanything-get-started)\r\n  - [Training](#rlanything-train)\r\n    - [Computer Control](#rlanything-computer-control) \r\n    - [Text-based Game](#rlanything-text-game) \r\n    - [RLVR Coding](#rlanything-coding) \r\n  - [Evaluation](#rlanything-eval)\r\n  - [Results](#rlanything-result)\r\n\r\n- **AutoTool**: see [`autotool/README.md`](autotool/README.md) for installation, training (Phase I \u0026 II), and results\r\n\r\n## 🚀 Get Started\r\n\r\n\u003ca id=\"demyagent-get-started\"\u003e\u003c/a\u003e\r\n### DemyAgent\r\n\r\n```bash\r\ngit clone https://github.com/Gen-Verse/Open-AgentRL.git\r\nconda create -n OpenAgentRL python=3.11 \r\nconda activate OpenAgentRL\r\ncd Open-AgentRL\r\nbash scripts/install_vllm_sglang_mcore.sh\r\npip install -e .[vllm]\r\n```\r\n\r\n\u003ca id=\"rlanything-get-started\"\u003e\u003c/a\u003e\r\n### RLAnything\r\n\r\n```bash\r\nconda create --name rlanything python=3.10\r\nsource activate rlanything\r\npip install -r requirements_rlanything.txt\r\n```\r\n\r\n\u003ca id=\"demyagent-train\"\u003e\u003c/a\u003e\r\n## 🔧 DemyAgent Training\r\n\r\n\r\n\r\n\u003ca id=\"demyagent-cold-sft\"\u003e\u003c/a\u003e\r\n### Cold-Start SFT\r\n\r\nBefore you start SFT, make sure you have downloaded the [3K Agentic SFT Data](https://huggingface.co/datasets/Gen-Verse/Open-AgentRL-SFT-3K) and the corresponding base models like [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and 
[Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507). Configure [qwen3_4b_sft.sh](recipe/demystify/qwen3_4b_sft.sh) and [qwen2_7b_sft.sh](recipe/demystify/qwen2_7b_sft.sh), and set the absolute paths to your model and the `.parquet` data files.\r\n\r\n- [🤗 3K Agentic SFT Data](https://huggingface.co/datasets/Gen-Verse/Open-AgentRL-SFT-3K)\r\n\r\n- **TRAIN_DATA**: Path to the `.parquet` file of the SFT dataset\r\n- **EVAL_DATA**: Path to the evaluation data (can be set to the same as **TRAIN_DATA**)\r\n- **MODEL_PATH**: Path to your base models like [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) or [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)\r\n- **SAVE_PATH**: Directory to save the SFT model checkpoints\r\n\r\nAfter all configurations are set, simply run the code below to finetune Qwen3-4B-Instruct-2507:\r\n\r\n```bash\r\nbash recipe/demystify/qwen3_4b_sft.sh\r\n```\r\n\r\nAfter obtaining the SFT models, utilize the following command to merge the model:\r\n\r\n```bash\r\npython3 -m verl.model_merger merge --backend fsdp --local_dir xxx/global_step_465 --target_dir xxx/global_step_465/huggingface\r\n```\r\n\u003ca id=\"demyagent-agent-rl\"\u003e\u003c/a\u003e\r\n### Agentic RL\r\n\r\n- [🤗 30K Agentic RL Data](https://huggingface.co/datasets/Gen-Verse/Open-AgentRL-30K)\r\n\r\nAfter obtaining the SFT models (you can also directly use our provided checkpoints [Qwen2.5-7B-RA-SFT](https://huggingface.co/Gen-Verse/Qwen2.5-7B-RA-SFT) and [Qwen3-4B-RA-SFT](https://huggingface.co/Gen-Verse/Qwen3-4B-RA-SFT)), you can start Agentic RL with our GRPO-TCR recipe.\r\n\r\nFirst, download our [30K Agentic RL Data](https://huggingface.co/datasets/Gen-Verse/Open-AgentRL-30K) and the [evaluation datasets](https://huggingface.co/datasets/Gen-Verse/Open-AgentRL-Eval).\r\n\r\nThen, configure the [SandboxFusion](https://github.com/bytedance/SandboxFusion) environment for code execution.\r\n\r\nThere are two 
ways to create a sandbox:\r\n\r\n1. **Local Deployment**: Deploy SandboxFusion locally by referring to [the SandboxFusion deployment documentation](https://bytedance.github.io/SandboxFusion/docs/docs/get-started#local-deployment)\r\n2. **Cloud Service**: Use the Volcano Engine Cloud FaaS service by referring to [Volcano Engine Code Sandbox](https://www.volcengine.com/docs/6662/1539235)\r\n\r\nUsing either method, obtain an API endpoint (something like `https://\u003cip-address-or-domain-name\u003e/run_code`), and configure it in **`recipe/demystify/sandbox_fusion_tool_config.yaml`** and in **the function `check_correctness` in `verl/utils/reward_score/livecodebench/code_math.py`**.\r\n\r\nNext, configure the Agentic RL scripts [grpo_tcr_qwen2_7b.sh](recipe/demystify/grpo_tcr_qwen2_7b.sh) and [grpo_tcr_qwen3_4b.sh](recipe/demystify/grpo_tcr_qwen3_4b.sh):\r\n\r\n- **open_agent_rl**: Path to the `.parquet` file of the agentic RL dataset\r\n- **model_path**: Path to the SFT models\r\n- **aime2024/aime2025**: Benchmark datasets evaluated every 10 training steps. Set the absolute paths to the `.parquet` files of the benchmarks. 
You can also add more benchmarks like GPQA-Diamond in **test_files**\r\n- **default_local_dir**: Directory to save your RL checkpoints\r\n\r\n**Training Resources**: We conducted our training on one $8\\times \\text{Tesla-A100}$ node with a batch size of 64.\r\n\r\nAfter finishing the configurations, run the code below to conduct Agentic RL with the GRPO-TCR recipe:\r\n\r\n```bash\r\nbash recipe/demystify/grpo_tcr_qwen3_4b.sh\r\n```\r\n\r\nYou can observe the training dynamics and evaluation results in Weights \u0026 Biases (wandb).\r\n\r\n\r\n\r\n\u003ca id=\"rlanything-train\"\u003e\u003c/a\u003e\r\n## 🔧 RLAnything Training\r\n\r\n\r\n\u003ca id=\"rlanything-computer-control\"\u003e\u003c/a\u003e\r\n### Computer Control (OSWorld)\r\n\r\nOur reinforcement learning and evaluation pipeline for OSWorld is built on a pool of virtual machines running in parallel on cloud instances. Specifically, we use Volcengine Cloud in our experiments. Before training, you need to set up the security group and VM image on Volcengine following [these instructions](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/volcengine/VOLCENGINE_GUIDELINE_CN.md).\r\n\r\nYou also need to configure `configs/osworld_rl.yaml` before training. Detailed instructions are inside the file. 
To start RLAnything training, run:\r\n```bash\r\npython osworld_rl.py config=configs/osworld_rl.yaml\r\n# you need to set num_node in osworld_rl.yaml to 1 if you only use one node.\r\n```\r\nIn our experiments, we train with multiple nodes:\r\n```bash\r\nif [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then   \r\n    python osworld_rl.py config=configs/osworld_rl.yaml\r\nelse\r\n    exec tail -f /dev/null\r\nfi\r\n# directly submit this to head machine\r\n```\r\n\r\n\r\n\r\n\u003ca id=\"rlanything-text-game\"\u003e\u003c/a\u003e\r\n### Text-based Game (AlfWorld)\r\n\r\nTo conduct reinforcement learning or evaluation on AlfWorld, first download the AlfWorld data with the following command (after you have pip-installed the rlanything environment):\r\n```bash\r\nalfworld-download\r\n```\r\nYou will then have a directory that contains at least `detectors`, `json_2.1.1`, and `logic`. This is the directory where AlfWorld environment files are saved. Our adapted environment files will be generated under `alfworld_file_path/json_2.1.1/alfworld_rl/syn_train`. `syn_train` stores accepted environment files, while `temp_train` stores generated files that are still awaiting validation (not yet accepted).\r\n\r\nBefore training, you need to configure `configs/alfworld_rl.yaml`. Detailed instructions are inside the file. To start RLAnything training, run:\r\n```bash\r\npython alfworld_rl.py config=configs/alfworld_rl.yaml\r\n# you need to set num_node in alfworld_rl.yaml to 1 if you only use one node.\r\n```\r\nIn our experiments, we train with multiple nodes:\r\n```bash\r\nif [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then   \r\n    python alfworld_rl.py config=configs/alfworld_rl.yaml\r\nelse\r\n    exec tail -f /dev/null\r\nfi\r\n# directly submit this to head machine\r\n```\r\n\r\n\r\n\r\n\u003ca id=\"rlanything-coding\"\u003e\u003c/a\u003e\r\n### Coding\r\n\r\nFirst download the training and evaluation datasets by opening `./data` and following the instructions there. 
Then you can start the training with\r\n```bash\r\npython coding_rl.py config=configs/coding_rl.yaml\r\n# you need to set num_node in coding_rl.yaml to 1 if you only use one node.\r\n```\r\nIn our experiments, we train on 4 nodes\r\n```bash\r\nif [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then   \r\n    python coding_rl.py config=configs/coding_rl.yaml\r\nelse\r\n    exec tail -f /dev/null\r\nfi\r\n```\r\n\r\n\r\n\r\n\r\n\r\n\r\n\u003ca id=\"demyagent-eval\"\u003e\u003c/a\u003e\r\n## 📊 DemyAgent Evaluation\r\n\r\n\r\n\r\nIf you have already trained a model, you can refer to the following process for agentic reasoning capability evaluation. Alternatively, you can download our checkpoint from [🤗 DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B) for direct testing.\r\n\r\n### AIME2024/2025 and GPQA-Diamond\r\n\r\nConfigure the scripts [eval_qwen2_7b_aime_gpqa.sh](recipe/demystify/eval/eval_qwen2_7b_aime_gpqa.sh) and [eval_qwen3_4b_aime_gpqa.sh](recipe/demystify/eval/eval_qwen3_4b_aime_gpqa.sh). The configuration process is similar to the training setup—set the paths to your models and `.parquet` files of the benchmarks.\r\n\r\nSimply run the code below to evaluate performance on AIME2024/2025 and GPQA-Diamond:\r\n\r\n```bash\r\nbash recipe/demystify/eval/eval_qwen3_4b_aime_gpqa.sh\r\n```\r\n\r\nYou can observe the average@32/pass@32/maj@32 metrics from your wandb project.\r\n\r\n### LiveCodeBench-v6\r\n\r\nFirst, run inference for the benchmark:\r\n\r\n```bash\r\nbash recipe/demystify/eval/eval_qwen3_4b_livecodebench.sh\r\n```\r\n\r\nSpecifically, we save the validation rollout paths in **VAL_SAVE_PATH**. 
After obtaining the validation rollouts, follow the official [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) evaluation process to compute the results locally.\r\n\r\n\r\n\u003ca id=\"rlanything-eval\"\u003e\u003c/a\u003e\r\n## 📊 RLAnything Evaluation\r\n\r\n### Computer Control (OSWorld)\r\n\r\nTo evaluate the model on OSWorld, run:\r\n```bash\r\npython osworld_eval.py config=configs/osworld_eval.yaml\r\n```\r\nYou can also evaluate on multiple nodes to speed up evaluation:\r\n```bash\r\nif [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then   \r\n    python osworld_eval.py config=configs/osworld_eval.yaml\r\nelse\r\n    exec tail -f /dev/null\r\nfi\r\n```\r\n\r\n\r\n\r\n### Text-based Game (AlfWorld)\r\n\r\nTo evaluate the model on AlfWorld, run:\r\n```bash\r\npython alfworld_eval.py config=configs/alfworld_eval.yaml\r\n```\r\n\r\n### Coding\r\n\r\nTo evaluate the coding model, run:\r\n```bash\r\npython coding_eval.py config=configs/coding_eval.yaml\r\n```\r\n\r\n\r\n\r\n\u003ca id=\"demyagent-result\"\u003e\u003c/a\u003e\r\n## 📈 DemyAgent Results\r\n\r\n\r\nWe provide the evaluation results of the agentic reasoning abilities of our models on challenging benchmarks including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6.\r\n\r\n|                            | **MATH**     |              | **Science**      | **Code**             |\r\n| -------------------------- | ------------ | ------------ | ---------------- | -------------------- |\r\n| **Method**                 | **AIME2024** | **AIME2025** | **GPQA-Diamond** | **LiveCodeBench-v6** |\r\n| *Self-Contained Reasoning* |              |              |                  |                      |\r\n| Qwen2.5-7B-Instruct        | 16.7         | 10.0         | 31.3             | 15.2                 |\r\n| Qwen3-4B-Instruct-2507     | 63.3         | 47.4         | 52.0             | **35.1**             |\r\n| Qwen2.5-72B-Instruct       | 18.9         | 15.0         | 49.0             | -                    |\r\n| DeepSeek-V3                | 39.2  
       | 28.8         | 59.1             | 16.1                 |\r\n| DeepSeek-R1-Distill-32B    | 70.0         | 46.7         | 59.6             | -                    |\r\n| DeepSeek-R1-Zero (671B)    | 71.0         | 53.5         | 59.6             | -                    |\r\n| *Agentic Reasoning*        |              |              |                  |                      |\r\n| Qwen2.5-7B-Instruct        | 4.8          | 5.6          | 25.5             | 12.2                 |\r\n| Qwen3-4B-Instruct-2507     | 17.9         | 16.3         | 44.3             | 23.0                 |\r\n| ToRL-7B                    | 43.3         | 30.0         | -                | -                    |\r\n| ReTool-32B                 | 72.5         | 54.3         | -                | -                    |\r\n| Tool-Star-3B               | 20.0         | 16.7         | -                | -                    |\r\n| ARPO-7B                    | 30.0         | 30.0         | 53.0             | 18.3                 |\r\n| rStar2-Agent-14B           | **80.6**     | \u003cu\u003e69.8\u003c/u\u003e  | **60.9**         | -                    |\r\n| **DemyAgent-4B (Ours)**    | \u003cu\u003e72.6\u003c/u\u003e  | **70.0**     | \u003cu\u003e58.5\u003c/u\u003e      | \u003cu\u003e26.8\u003c/u\u003e          |\r\n\r\nAs demonstrated in the table above, despite having only 4B parameters, **DemyAgent-4B** matches or even outperforms much larger models (14B/32B) across challenging benchmarks. 
Notably, **DemyAgent-4B achieves state-of-the-art agentic reasoning performance**, surpassing [ReTool-32B](https://arxiv.org/pdf/2504.11536) on both AIME benchmarks and [rStar2-Agent-14B](https://arxiv.org/pdf/2508.20722) on AIME2025, and even outperforming long-CoT models like DeepSeek-R1-Zero on AIME2025, which further validates the insights of our research.\r\n\r\n\r\n\u003ca id=\"rlanything-result\"\u003e\u003c/a\u003e\r\n## 📈 RLAnything Results\r\n\r\nIn the following table, we demonstrate the effectiveness of the **RLAnything** framework. Each added dynamic component contributes to improvements in the overall system.\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"figs/rlanythingmaintable.png\" alt=\"Figure 1\" width=\"800\"\u003e\r\n\u003c/p\u003e\r\n\r\nWe further scale the optimization for the GUI agent and achieve state-of-the-art results:\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"figs/rlanythingscaleosworld.png\" alt=\"Figure 1\" width=\"600\"\u003e\r\n\u003c/p\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"figs/rlanythingosworldbench.png\" alt=\"Figure 1\" width=\"800\"\u003e\r\n\u003c/p\u003e\r\n\r\n\r\n## 📝 Citation\r\n\r\n```bibtex\r\n@article{yu2025demystify,\r\n  title={Demystifying Reinforcement Learning in Agentic Reasoning},\r\n  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},\r\n  journal={arXiv preprint arXiv:2510.11701},\r\n  year={2025}\r\n}\r\n\r\n@article{wang2026rlanything,\r\n  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},\r\n  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},\r\n  journal={Forty-third International Conference on Machine Learning},\r\n  year={2026}\r\n}\r\n\r\n@inproceedings{zou2026autotool,\r\n  title={AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning},\r\n  author={Jiaru Zou and Ling Yang and Yunzhe Qi and Sirui Chen and Mengting Ai and Ke Shen and Jingrui He and Mengdi 
Wang},\r\n  booktitle={Forty-third International Conference on Machine Learning},\r\n  year={2026}\r\n}\r\n```\r\n\r\n## 🙏 Acknowledgements\r\n\r\nThis work aims to explore more efficient paradigms for Agentic RL. Our implementation builds upon the excellent codebases of [VeRL](https://github.com/volcengine/verl) and [ReTool](https://github.com/ReTool-RL/ReTool). We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGen-Verse%2FOpen-AgentRL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGen-Verse%2FOpen-AgentRL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGen-Verse%2FOpen-AgentRL/lists"}