{"id":20215879,"url":"https://github.com/thudm/webrl","last_synced_at":"2025-05-16T12:08:43.708Z","repository":{"id":261168761,"uuid":"881763976","full_name":"THUDM/WebRL","owner":"THUDM","description":"Building Open LLM Web Agents with Self-Evolving Online Curriculum RL","archived":false,"fork":false,"pushed_at":"2025-04-30T03:35:33.000Z","size":31381,"stargazers_count":368,"open_issues_count":1,"forks_count":26,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-04-30T04:30:07.645Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-01T07:08:54.000Z","updated_at":"2025-04-30T03:35:37.000Z","dependencies_parsed_at":"2025-04-12T09:42:39.556Z","dependency_job_id":null,"html_url":"https://github.com/THUDM/WebRL","commit_stats":null,"previous_names":["thudm/webrl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FWebRL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FWebRL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FWebRL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FWebRL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/WebRL/tar.gz/refs/heads/main","host":{"name":"GitHub
","url":"https://github.com","kind":"github","repositories_count":254527087,"owners_count":22085918,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T06:25:14.639Z","updated_at":"2025-05-16T12:08:43.654Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning\n\n\u003c/div\u003e\n\n![image](./assets/webrl.png)\n\n*Technique adopted in [AutoGLM](https://xiao9905.github.io/AutoGLM/), a series of Phone Use and Web Browser Use Foundation Agents*\n\n\u003cp align=\"center\"\u003e\n   📃 \u003ca href=\"https://arxiv.org/abs/2411.02337\" target=\"_blank\"\u003e Paper \u003c/a\u003e | 🤗 \u003ca href=\"https://huggingface.co/THUDM/webrl-glm-4-9b\" target=\"_blank\"\u003e WebRL-GLM-4-9B \u003c/a\u003e | \u003ca href=\"https://huggingface.co/THUDM/webrl-llama-3.1-8b\" target=\"_blank\"\u003e WebRL-LLaMA-3.1-8B \u003c/a\u003e | \u003ca href=\"https://www.modelscope.cn/collections/WebRL-77a3e54a2dde4b\" target=\"_blank\"\u003e ModelScope \u003c/a\u003e\n\u003c/p\u003e\n\n***\n\nWebRL, a self-evolving online curriculum learning framework designed for training web agents, targeting the WebArena environment. 
\n\n## 🚀 Quick Start\n\n### Dependencies\n\nFirst, create a conda environment and install all pip package requirements.\n\n```bash\nconda create -n webrl python==3.10\nconda activate webrl\n\ncd WebRL\npip install -e .\n```\n\n### Model checkpoints\n\n#### Actor checkpoints\n\nThe released actor checkpoints are available at the links below:\n\n- [WebRL-GLM-4-9B checkpoint](https://huggingface.co/THUDM/webrl-glm-4-9b)\n- [WebRL-Llama-3.1-8B checkpoint](https://huggingface.co/THUDM/webrl-llama-3.1-8b)\n- [WebRL-Llama-3.1-70B checkpoint](https://huggingface.co/THUDM/webrl-llama-3.1-70b)\n\n#### ORM checkpoint\n\nThe checkpoint for the Outcome-supervised Reward Model (ORM) is as follows:\n\n- [ORM-Llama-3.1-8B checkpoint](https://huggingface.co/THUDM/webrl-orm-llama-3.1-8b/tree/main)\n\n\n\n### ✈️ Train SFT model\n\nWe use LLaMA-Factory to train the SFT baseline, which is the starting model for WebRL. We release the code and data used for training. You can train the SFT baseline with the following commands:\n\n```bash\ncd LLaMA-Factory\nbash run.sh examples/train_full/llama3_full_policy_web.yaml\n```\n\n### ✈️ Train WebRL\n\nAfter training the SFT baseline, use it to initialize both the actor and the critic.  
You can train WebRL with the following command:\n\n```bash\nbash run_multinode.sh\n```\n\nThis command trains the actor and critic in each phase.\n\n### 💡 Generating New Instructions\n\nYou can generate new instructions with the following command:\n\n```bash\npython scripts/gen_task.py\n```\n\n### 🛜 Interaction and Evaluation\n\nThe instructions and scripts for interacting with WebArena are provided in [VAB-WebArena-Lite](https://github.com/THUDM/VisualAgentBench/tree/main/VAB-WebArena-Lite).\nYou can implement the interaction process of WebRL according to the [``Evaluating in WebRL Setting (Text Modal)``](https://github.com/THUDM/VisualAgentBench/tree/main/VAB-WebArena-Lite#-evaluating-in-webrl-setting-text-modal) section of VAB-WebArena-Lite.\n\n\nTo enable interaction with WebArena, you need to configure each task in the same format as the sample test case provided in the ``test_webarena_lite.raw.json`` file in VAB-WebArena-Lite. Below is the template for a task configuration:\n\n```python\n{\n  \"sites\": [\n    \u003csite\u003e # possible choices: \"shopping_admin\", \"map\", \"shopping\", \"reddit\", \"gitlab\"\n  ],\n  \"task_id\": \u003cYour task id\u003e,\n  \"require_login\": true,\n  \"storage_state\": \"./.auth/shopping_admin_state.json\",\n  \"start_url\": \u003cstart url of site\u003e, # possible choices: \"__SHOPPING_ADMIN__\", \"__SHOPPING__\", \"__GITLAB__\", \"__MAP__\", \"__REDDIT__\"\n  \"geolocation\": null,\n  \"intent_template\": \"\",\n  \"instantiation_dict\": {},\n  \"intent\": \u003cTask\u003e,\n  \"require_reset\": false,\n  \"eval\": {\n    \"eval_types\": [\n      \"string_match\"\n    ],\n    \"reference_answers\": {\n      \"exact_match\": \"N/A\"\n    },\n    \"reference_url\": \"\",\n    \"program_html\": [],\n    \"string_note\": \"\",\n    \"reference_answer_raw_annotation\": \"\"\n  },\n  \"intent_template_id\": 0\n}\n```\n\nAfter configuring the tasks, use the script ``scripts/generate_test_data.py`` to generate the 
configuration files. Make sure to modify the data path in the script to point to the JSON file containing your configured interaction cases.\n\nAfter the interaction finishes, run ``scripts/process_data.py`` to process the interaction trajectories.\n\n```bash\npython scripts/process_data.py \\\n  --stage 1 2 \\\n  --add_reward \\\n  --rollout_path \u003cdirectory_of_interaction_trajectories\u003e \\\n  --experience_paths \"path1\" \"path2\" \\\n  --orm_path \u003cpath_to_ORM_model\u003e \\\n  --actor_path \u003cpath_to_actor_model_for_computing_perplexity\u003e \\\n  --output_path \u003cpath_to_output_file\u003e\n```\n- `stage`: Specifies the processing method for the data.\n  - 1: Convert rollout trajectories into the required format.\n  - 2: Incorporate historical experiences filtered by perplexity.\n- `add_reward`: Apply the ORM to label each trajectory.\n- `output_path`: The file containing processed interaction trajectories, ready for direct use in training.\n  - stage 1: Processed interaction trajectories are saved in this file. It contains data without historical experiences.\n  - stage 2: An additional file, output_path + '_filter', is also generated.\n    - output_path: Contains data without historical experiences.\n    - output_path + '_filter': Contains data with historical experiences.\n- `rollout_path`: Path to the `traces` subfolder containing initial interaction trajectories, typically generated after running WebArena-Lite.\n- `experience_paths`: List of file paths to processed interaction data (`output_path`) from previous phases. 
We provide the SFT data with the modified format that can be used as experience data, in `/scripts/webarena_lite_sft.pt`.\n\nBoth output_path and output_path + '_filter' are formatted for direct use in subsequent training.\n\n\n## Citation\n```\n@article{qi2024webrl,\n  title={WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning},\n  author={Qi, Zehan and Liu, Xiao and Iong, Iat Long and Lai, Hanyu and Sun, Xueqiao and Yang, Xinyue and Sun, Jiadai and Yang, Yu and Yao, Shuntian and Zhang, Tianjie and others},\n  journal={arXiv preprint arXiv:2411.02337},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fwebrl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fwebrl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fwebrl/lists"}