{"id":13486432,"url":"https://github.com/THUDM/AgentBench","last_synced_at":"2025-03-27T20:33:19.842Z","repository":{"id":186621967,"uuid":"671759870","full_name":"THUDM/AgentBench","owner":"THUDM","description":"A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)","archived":false,"fork":false,"pushed_at":"2025-01-30T17:10:34.000Z","size":23971,"stargazers_count":2438,"open_issues_count":58,"forks_count":184,"subscribers_count":25,"default_branch":"main","last_synced_at":"2025-03-20T18:10:10.152Z","etag":null,"topics":["chatgpt","gpt-4","llm","llm-agent"],"latest_commit_sha":null,"homepage":"https://llmbench.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-28T04:32:06.000Z","updated_at":"2025-03-20T12:29:34.000Z","dependencies_parsed_at":"2023-10-22T11:21:13.222Z","dependency_job_id":"6ded4df0-5751-4c18-a066-5ad6e23a4ab0","html_url":"https://github.com/THUDM/AgentBench","commit_stats":null,"previous_names":["thudm/agentbench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FAgentBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FAgentBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FAgentBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FAgentBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/AgentBench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245916698,"owners_count":20693396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","gpt-4","llm","llm-agent"],"created_at":"2024-07-31T18:00:45.784Z","updated_at":"2025-03-27T20:33:19.837Z","avatar_url":"https://github.com/THUDM.png","language":"Python","readme":"# AgentBench\n\n![](./assets/cover.jpg)\n\n\u003cp align=\"center\"\u003e\n   \u003ca href=\"https://llmbench.ai\" target=\"_blank\"\u003e🌐 Website\u003c/a\u003e | \u003ca href=\"https://twitter.com/thukeg\" target=\"_blank\"\u003e🐦 Twitter\u003c/a\u003e | \u003ca href=\"mailto:agentbench@googlegroups.com\"\u003e✉️ Google Group\u003c/a\u003e | \u003ca href=\"https://arxiv.org/abs/2308.03688\" target=\"_blank\"\u003e📃 Paper \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n👋 Join our \u003ca href=\"https://join.slack.com/t/agentbenchcol-huw1944/shared_invite/zt-20ixabcuv-31cFLBAkqGQxQkJqrWVEVg\" target=\"_blank\"\u003eSlack\u003c/a\u003e  for \u003ci\u003eQ \u0026 A\u003c/i\u003e or \u003ci\u003e\u003cb\u003ecollaboration\u003c/b\u003e on next version of AgentBench\u003c/i\u003e!\n\u003c/p\u003e\n\n## 🔥[2024.08.13] Introducing [VisualAgentBench](https://github.com/THUDM/VisualAgentBench)\n\nVisualAgentBench is designed for evaluating and training visual foundation agents based on large multimodel models (LMMs). We introduce 5 distinct environments spanning \n\n* Embodied: VAB-OmniGibson, VAB-Minecraft\n* GUI: VAB-Mobile, VAB-WebArena-Lite\n* Visual Design: VAB-CSS\n\nto systematically benchmark 17 LMMs (proprietary \u0026 open LMMs). We also provide the trajectory dataset for behavior cloning training on open LMMs for you to develop your own visual foundation agents!\n\n## 📌Introducing AgentBench v0.2🎉\n\nYou are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1).\n\nBased on [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1), we:\n\n-   Updated the framework architecture for easier use and extension\n-   Adjusted some task settings\n-   Added test results for more models\n-   Released the full data for the Dev and Test sets\n\n# AgentBench: Evaluating LLMs as Agents\n\nhttps://github.com/THUDM/AgentBench/assets/129033897/656eed6e-d9d9-4d07-b568-f43f5a451f04\n\n**AgentBench** is the first benchmark designed to evaluate **LLM-as-Agent** across a diverse spectrum of different\nenvironments. It encompasses 8 distinct environments to provide a more comprehensive evaluation of the LLMs' ability to\noperate as autonomous agents in various scenarios. These environments include 5 freshly created domains, namely\n\n-   Operating System (OS)\n-   Database (DB)\n-   Knowledge Graph (KG)\n-   Digital Card Game (DCG)\n-   Lateral Thinking Puzzles (LTP)\n\nas well as 3 recompiled from published datasets:\n\n-   House-Holding (HH) ([ALFWorld](https://github.com/alfworld/alfworld))\n-   Web Shopping (WS) ([WebShop](https://github.com/princeton-nlp/webshop))\n-   Web Browsing (WB) ([Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web))\n\n![](./assets/agentbench.png)\n\n## Table of Contents\n\n-   [Dataset Summary](#dataset-summary)\n-   [Leaderboard](#leaderboard)\n-   [Quick Start](#quick-start)\n-   [Next Steps](#next-steps)\n-   [Citation](#citation)\n\n## Dataset Summary\n\nWe offer two splits for each dataset: Dev and Test. The multi-turn interaction requires an LLMs to generate around 4k\nand 13k times respectively.\n\n![](./assets/statistics.png)\n\n## Leaderboard\n\nHere is the scores on test set (standard) results of AgentBench.\n\n![](./assets/leaderboard.png)\n\nWhile LLMs begin to manifest their proficiency in LLM-as-Agent, gaps between models and the distance towards practical\nusability are significant.\n\n![](./assets/intro.png)\n\n## Quick Start\n\nThis section will guide you on how to quickly use gpt-3.5-turbo-0613 as an agent to launch the `dbbench-std` and `os-std` tasks.\nFor the specific framework structure, please refer to [Framework Introduction](docs/Introduction_en.md).\nFor more detailed configuration and launch methods, please check [Configuration Guide](docs/Config_en.md)\nand [Program Entrance Guide](docs/Entrance_en.md).\n\n### Step 1. Prerequisites\n\nClone this repo and install the dependencies.\n\n```bash\ncd AgentBench\nconda create -n agent-bench python=3.9\nconda activate agent-bench\npip install -r requirements.txt\n```\n\nEnsure that [Docker](https://www.docker.com/) is properly installed.\n\n```bash\ndocker ps\n```\n\nBuild required images for `dbbench-std` and `os-std`.\n\n```bash\ndocker pull mysql\ndocker pull ubuntu\ndocker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default\ndocker build -f data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles --tag local-os/packages\ndocker build -f data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles --tag local-os/ubuntu\n```\n\n### Step 2. Configure the Agent\n\nFill in your OpenAI API Key at the correct location in `configs/agents/openai-chat.yaml`. (e.g. `gpt-3.5-turbo-0613`)\n\nYou can try using `python -m src.client.agent_test` to check if your agent is configured correctly.\n\nBy default, `gpt-3.5-turbo-0613` will be started. You can replace it with other agents by modifying the parameters:\n\n```bash\npython -m src.client.agent_test --config configs/agents/api_agents.yaml --agent gpt-3.5-turbo-0613\n```\n\n### Step 3. Start the task server\n\nStarting the task worker involves specific tasks. Manual starting might be cumbersome; hence, we provide an automated\nscript.\n\nThe assumption for this step is that ports from 5000 to 5015 are available. For Mac OS system, you may want to follow [here](https://stackoverflow.com/questions/69955686/why-cant-i-run-the-project-on-port-5000) to free port 5000 to use.\n\n```bash\npython -m src.start_task -a\n```\n\nThis will launch five task_workers each for `dbbench-std` and `os-std` tasks and automatically connect them\nto the controller on port 5000. **After executing this command, please allow approximately 1 minute for the task setup to complete.** If the terminal shows \".... 200 OK\", you can open another terminal and follow step 4.\n\n### Step 4. Start the assigner\n\nThis step is to actually start the tasks.\n\nIf everything is correctly configured so far, you can now initiate the task tests.\n\n```bash\npython -m src.assigner\n```\n\n## Next Steps\n\nIf you wish to launch more tasks or use other models, you can refer to the content\nin [Configuration Guide](docs/Config_en.md) and [Program Entrance Guide](docs/Entrance_en.md).\n\nFor the environment of the remaining five tasks, you will need to download the Docker images we provide.\n\n```\nlonginyu/agentbench-ltp\nlonginyu/agentbench-webshop\nlonginyu/agentbench-mind2web\nlonginyu/agentbench-card_game\nlonginyu/agentbench-alfworld\n```\n\nThe resource consumption of a single task_worker for the eight tasks is roughly as follows; consider this when\nlaunching:\n\n| Task Name | Start-up Speed | Memory Consumption |\n| --------- | -------------- | ------------------ |\n| webshop   | ~3min          | ~15G               |\n| mind2web  | ~5min          | ~1G                |\n| db        | ~20s           | \u003c 500M             |\n| alfworld  | ~10s           | \u003c 500M             |\n| card_game | ~5s            | \u003c 500M             |\n| ltp       | ~5s            | \u003c 500M             |\n| os        | ~5s            | \u003c 500M             |\n| kg        | ~5s            | \u003c 500M             |\n\n\n### Deploy the KnowledgeGraph service loacally\nthe KnowledgeGraph task depends on an online service which now is not stable, if you want to deploy the service locally, you can follow steps below:\n\n**step1.** \u003cbr /\u003e\ndownload the database and setup the service [freebase-setup](https://github.com/dki-lab/Freebase-Setup).\n\n\n**step2.** \u003cbr /\u003e\nchange this line `sparql_url: \"http://164.107.116.56:3093/sparql\"` to `sparql_url: \"\u003cyour service api of sparql\u003e\"` in `/configs/tasks/kg.yaml`.\n\n**P.S.** you should start your KG service before you start the agent tasks services.\n\n## Extending AgentBench\n\nIf you wish to add new tasks to AgentBench, you may refer to [Extension Guide](docs/Extension_en.md).\n\n## References\n\nAvalon task is merged from [AvalonBench](https://github.com/jonathanmli/Avalon-LLM/), which implements a multi-agent framework.\n\n## Citation\n\n```\n@article{liu2023agentbench,\n  title   = {AgentBench: Evaluating LLMs as Agents},\n  author  = {Xiao Liu and Hao Yu and Hanchen Zhang and Yifan Xu and Xuanyu Lei and Hanyu Lai and Yu Gu and Hangliang Ding and Kaiwen Men and Kejuan Yang and Shudan Zhang and Xiang Deng and Aohan Zeng and Zhengxiao Du and Chenhui Zhang and Sheng Shen and Tianjun Zhang and Yu Su and Huan Sun and Minlie Huang and Yuxiao Dong and Jie Tang},\n  year    = {2023},\n  journal = {arXiv preprint arXiv: 2308.03688}\n}\n```\n","funding_links":[],"categories":["Evaluation-Benchmarks","🤖 LLM \u0026 Chatbot Testing","Uncategorized","chatgpt","Python","Datasets-or-Benchmark","Projects","A01_文本生成_文本对话","🔬 Autonomous Research \u0026 Self-Improving Agents","Benchmark/Evaluator","SDK, Libraries, Frameworks","Research and Papers","📚 Educational Resources","🔍 Observability and Evaluation","Training","Building","Testing Frameworks","⚙️ Agent Operations","Agentic security","评测基准","Benchmarks \u0026 Datasets","📋 Contents","Simulation \u0026 Benchmarking Environments","Benchmarks","Agent Harnessing and Evaluation","Evaluation \u0026 Testing","Evaluation Benchmarks"],"sub_categories":["Uncategorized","Agent能力","Benchmarks","大语言对话模型及数据","Evaluation \u0026 Benchmarks","Advanced Components","Python library, sdk or frameworks","Research \u0026 Papers","Benchmark","Category-Specific Testing Tools","📊 Evaluation","3. ** [Effective harnesses for long-running agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)**","Task-Specific Benchmarks","📈 9. Evaluation, Benchmarks \u0026 Datasets","Multimodal Model Benchmarks","Agent","Benchmark Reality Check (real-world tool use)","Sandboxing \u0026 Execution"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTHUDM%2FAgentBench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTHUDM%2FAgentBench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTHUDM%2FAgentBench/lists"}