{"id":20215889,"url":"https://github.com/thudm/visualagentbench","last_synced_at":"2025-10-07T04:18:47.239Z","repository":{"id":252877507,"uuid":"839916063","full_name":"THUDM/VisualAgentBench","owner":"THUDM","description":"Towards Large Multimodal Models as Visual Foundation Agents","archived":false,"fork":false,"pushed_at":"2025-04-24T14:59:48.000Z","size":5831,"stargazers_count":214,"open_issues_count":15,"forks_count":6,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-05-24T12:04:49.739Z","etag":null,"topics":["gpt","llm-agent","multimodal-large-language-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-08-08T15:26:11.000Z","updated_at":"2025-05-24T03:31:36.000Z","dependencies_parsed_at":"2024-08-23T04:26:30.159Z","dependency_job_id":"9141f162-03ed-4283-b514-990ae2368405","html_url":"https://github.com/THUDM/VisualAgentBench","commit_stats":null,"previous_names":["thudm/visualagentbench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/THUDM/VisualAgentBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FVisualAgentBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FVisualAgentBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FVisualAgentBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/THUDM%2FVisualAgentBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/VisualAgentBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FVisualAgentBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278718227,"owners_count":26033717,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt","llm-agent","multimodal-large-language-models"],"created_at":"2024-11-14T06:25:19.811Z","updated_at":"2025-10-07T04:18:47.233Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VisualAgentBench (VAB)\n\n![](./assets/cover.png)\n\n\u003cp align=\"center\"\u003e\n   \u003ca href=\"\" target=\"_blank\"\u003e🌐 Website\u003c/a\u003e | \u003ca href=\"https://arxiv.org/abs/2408.06327\" target=\"_blank\"\u003e📃 Paper \u003c/a\u003e | \u003ca href=\"https://www.modelscope.cn/datasets/shawliu9/VAB-Training\" target=\"_blank\"\u003e 🗂️ VAB Training (ModelScope)\n\u003c/p\u003e\n\n# VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents\n\n**VisualAgentBench (VAB)** is the first benchmark designed to systematically evaluate 
and develop large multimodal models (LMMs) as visual foundation agents. It comprises 5 distinct environments across 3 types of representative visual agent tasks (Embodied, GUI, and Visual Design).\n\nhttps://github.com/user-attachments/assets/4a1a5980-48f9-4a70-a900-e5f58ded69b4\n\n- VAB-OmniGibson (Embodied)\n- VAB-Minecraft (Embodied)\n- VAB-Mobile (GUI)\n- VAB-WebArena-Lite (GUI, based on [WebArena](https://github.com/web-arena-x/webarena) and [VisualWebArena](https://github.com/web-arena-x/visualwebarena))\n- VAB-CSS (Visual Design)\n\nCompared to its predecessor [AgentBench](https://github.com/THUDM/AgentBench), VAB emphasizes visual inputs and enables the development of **Foundation Agent** capabilities by training open LLMs/LMMs on trajectories.\n\n![](./assets/visualagentbench.png)\n\n![](./assets/intro.png)\n\n## Table of Contents\n\n-   [Quick Start](#quick-start)\n-   [Dataset Summary](#dataset-summary)\n-   [Leaderboard](#leaderboard)\n-   [Acknowledgement](#acknowledgement)\n-   [Citation](#citation)\n\n## Quick Start\n\nThis section first gives you an overview of the use and architecture of VAB.\nNext, it guides you through using `gpt-4o-2024-05-13` as an exemplar agent to launch 4 concurrent `VAB-Minecraft` tasks.\n\n### Overview on VAB Framework\n\nTo allow fast evaluation over agent tasks, we leverage AgentBench's framework as the backbone (currently for VAB-OmniGibson, VAB-Minecraft, and VAB-CSS).\nIf you are interested in its detailed implementation, please refer to AgentBench's [Framework Introduction](https://github.com/THUDM/AgentBench/blob/main/docs/Introduction_en.md) (reading it is optional).\nBasically, the framework calls all LLMs/LMMs through API interfaces via `Agent-Controller`, and accesses environments via `Task-Controller`.\nThe `Assigner` automatically assigns evaluation tasks by pairing `Agent-Controller` and `Task-Controller` to optimize the overall evaluation speed.\nFor more detailed 
configuration and launch methods, please check [Configuration Guide](docs/Config_en.md)\nand [Program Entrance Guide](docs/Entrance_en.md).\n\n![](./assets/framework.png)\n\n### Step 1. Prerequisites for All Environments\n\nClone this repo and install the dependencies.\n\n```bash\n# Clone the repository, then enter it\ngit clone https://github.com/THUDM/VisualAgentBench.git\ncd VisualAgentBench\nconda create -n vab python=3.9\nconda activate vab\npip install -r requirements.txt\n```\n\nEnsure that [Docker](https://www.docker.com/) is properly installed.\n\n```bash\ndocker ps\n```\n\nEach environment has additional prerequisites; please refer to its setup guide below.\nVAB-WebArena-Lite is based on [WebArena](https://github.com/web-arena-x/webarena) with some modifications, so please read its separate setup guide carefully.\n\n* [VAB-OmniGibson Setup](docs/detailed_setups/VAB-OmniGibson.md)\n* [VAB-Minecraft Setup](docs/detailed_setups/VAB-Minecraft.md)\n* VAB-Mobile: Ongoing\n* [VAB-WebArena-Lite Setup](VAB-WebArena-Lite) (Separate installation and evaluation method)\n* [VAB-CSS](docs/detailed_setups/VAB-CSS.md)\n\n### Step 2. Configure the Agent\n\nFill in your OpenAI API Key at the correct location in `configs/agents/openai-chat.yaml`.\n\nYou can run `python -m src.client.agent_test` to check whether your agent is configured correctly.\n\n\n### Step 3. Start the task server\n\nTask workers are tied to specific tasks, and starting them by hand can be cumbersome, so we provide an automated script.\n\nThis step assumes that ports 5000 to 5015 are available. On macOS, you may need to follow [this answer](https://stackoverflow.com/questions/69955686/why-cant-i-run-the-project-on-port-5000) to free port 5000.\n\n```bash\npython -m src.start_task -a\n```\n\nThis will launch 4 task_workers for `VAB-Minecraft` tasks and automatically connect them to the controller on port 5000. **After executing this command, please allow approximately 1 minute for the task setup to complete.** If the terminal shows \".... 
200 OK\", you can open another terminal and follow step 4.\n\n### Step 4. Start the assigner\n\nThis step actually starts the tasks.\n\nIf everything is correctly configured so far, you can now initiate the task tests.\n\n```bash\npython -m src.assigner --auto-retry\n```\n\n### Next Steps\n\nIf you wish to launch more tasks or use other models, refer to the [Configuration Guide](docs/Config_en.md) and [Program Entrance Guide](docs/Entrance_en.md).\n\nFor instance, if you want to launch VAB-OmniGibson tasks, in step 3:\n\n```bash\npython -m src.start_task -a -s omnigibson 2\n```\n\nIn step 4:\n\n```bash\npython -m src.assigner --auto-retry --config configs/assignments/omnigibson.yaml\n```\n\nYou can modify the config files to launch other tasks or change task concurrency.\n\n## Dataset Summary\n\nWe offer two splits for each dataset: Testing and Training. Unlike its predecessor [AgentBench](https://github.com/THUDM/AgentBench), VAB is accompanied by a trajectory training set for behavior cloning (BC) training, which allows the development of more potent visual foundation agents with emerging open LMMs.\n\n![](./assets/statistics.png)\n\n## Leaderboard\n\nHere are the test set results of VAB. All metrics are task Success Rate (SR). 
Note that proprietary LMMs are tested with **Prompting** only, while open LMMs are tested after **Multitask Finetuning** on the VAB training set, as they usually fail to follow complicated agent task instructions.\n\n![](./assets/leaderboard.png)\n\n\n## Acknowledgement\nThis project is heavily built upon the following repositories (to be updated):\n\n* [AgentBench](https://github.com/THUDM/AgentBench): which serves as the backbone framework of this project for efficient and reliable parallel agent evaluation.\n* [WebArena](https://github.com/web-arena-x/webarena) and [VisualWebArena](https://github.com/web-arena-x/visualwebarena): which serve as the testing framework and data source for VAB-WebArena-Lite dataset.\n* [OmniGibson](https://github.com/StanfordVL/OmniGibson): which serves as the environment for VAB-OmniGibson.\n* [JARVIS-1](https://github.com/CraftJarvis/JARVIS-1): VAB-Minecraft's framework is adapted from JARVIS-1's pipeline.\n* [STEVE-1](https://github.com/Shalev-Lifshitz/STEVE-1): which serves as the action executor for VAB-Minecraft.\n\n## Citation\n\n```\n@article{liu2024visualagentbench,\n  title={VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents},\n  author={Liu, Xiao and Zhang, Tianjie and Gu, Yu and Iong, Iat Long and Xu, Yifan and Song, Xixuan and Zhang, Shudan and Lai, Hanyu and Liu, Xinyi and Zhao, Hanlin and others},\n  journal={arXiv preprint arXiv:2408.06327},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fvisualagentbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fvisualagentbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fvisualagentbench/lists"}