{"id":21164190,"url":"https://github.com/open-compass/GTA","last_synced_at":"2025-07-09T16:33:11.883Z","repository":{"id":248050002,"uuid":"811330900","full_name":"open-compass/GTA","owner":"open-compass","description":"[NeurIPS 2024 D\u0026B Track] GTA: A Benchmark for General Tool Agents","archived":false,"fork":false,"pushed_at":"2025-03-28T16:01:33.000Z","size":10466,"stargazers_count":94,"open_issues_count":1,"forks_count":8,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-06-09T05:46:25.093Z","etag":null,"topics":["llm-agent","llm-evaluation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/open-compass.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-06T11:46:09.000Z","updated_at":"2025-05-23T02:39:46.000Z","dependencies_parsed_at":"2025-02-24T03:12:06.049Z","dependency_job_id":"10679610-7cdf-4c97-b078-3aa3547f5a11","html_url":"https://github.com/open-compass/GTA","commit_stats":null,"previous_names":["open-compass/gta"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/open-compass/GTA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FGTA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FGTA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FGTA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FGTA/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/open-compass","download_url":"https://codeload.github.com/open-compass/GTA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FGTA/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264494968,"owners_count":23617474,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-agent","llm-evaluation"],"created_at":"2024-11-20T14:01:15.057Z","updated_at":"2025-07-09T16:33:11.840Z","avatar_url":"https://github.com/open-compass.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Benchmarks"],"sub_categories":["大语言对话模型及数据"],"readme":"# GTA: A Benchmark for General Tool Agents\n\n\u003cdiv align=\"center\"\u003e\n\n[⬇️ [Dataset](https://github.com/open-compass/GTA/releases/download/v0.1.0/gta_dataset.zip)]\n[📃 [Paper](https://arxiv.org/abs/2407.08713)]\n[🌐 [Project Page](https://open-compass.github.io/GTA/)]\n[🤗 [Hugging Face](https://huggingface.co/datasets/Jize1/GTA)]\n[🏆 [Newest Leaderboard](#-leaderboard-mar-2025)]\n\u003c/div\u003e\n\n## 🌟 Introduction\n\n\u003eIn developing general-purpose agents, significant focus has been placed on integrating large language models (LLMs) with various tools. This poses a challenge to the tool-use capabilities of LLMs. However, there are evident gaps between existing tool evaluations and real-world scenarios. 
Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only inputs, which fail to reveal the agents' real-world problem-solving abilities effectively. \n\nGTA is a benchmark to evaluate the tool-use capability of LLM-based agents in real-world scenarios. It features three main aspects:\n- **Real user queries.** The benchmark contains 229 human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about which tools are suitable and plan the solution steps. \n- **Real deployed tools.** GTA provides an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance.\n- **Real multimodal inputs.** Each query is accompanied by authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, which serve as the query context and closely match real-world scenarios.\n\u003cdiv align=\"center\"\u003e\n \u003cimg src=\"figs/dataset.jpg\" width=\"800\"/\u003e\n\u003c/div\u003e\n\nThe comparison of GTA queries with AI-generated queries is shown in the table below. The steps and tool types for queries in ToolBench and m\\\u0026m's are explicitly stated, as marked in red and blue. The queries in APIBench are simple, containing only one step. GTA's queries are both step-implicit and tool-implicit.\n\u003cdiv align=\"center\"\u003e\n \u003cimg src=\"figs/implicit.jpg\" width=\"800\"/\u003e\n\u003c/div\u003e\n\n## 📣 What's New\n\n- **[2025.3.25]** Updated the 🏆 Leaderboard (Mar. 2025) with new models such as Deepseek-R1, Deepseek-V3, Qwen-QwQ, and the Qwen-2.5-max series.\n- **[2024.9.26]** GTA is accepted to the NeurIPS 2024 Datasets and Benchmarks Track! 🎉🎉🎉\n- **[2024.7.11]** Paper available on arXiv. ✨✨✨\n- **[2024.7.3]** Released the evaluation and tool deployment code of GTA. 🔥🔥🔥\n- **[2024.7.1]** Released the GTA dataset on Hugging Face. 
🎉🎉🎉\n\n## 📚 Dataset Statistics\nGTA comprises a total of 229 questions. Basic dataset statistics are presented below. The number of tools involved in each question ranges from 1 to 4, and the number of steps needed to resolve a question ranges from 2 to 8.\n\n\u003cdiv align=\"center\"\u003e\n \u003cimg src=\"figs/statistics.jpg\" width=\"800\"/\u003e\n\u003c/div\u003e\n\nDetailed information on the 14 tools is shown in the table below.\n\n\u003cdiv align=\"center\"\u003e\n \u003cimg src=\"figs/tools.jpg\" width=\"800\"/\u003e\n\u003c/div\u003e\n\n## 🏆 Leaderboard, July 2024\n\nWe evaluate the language models in two modes:\n- **Step-by-step mode.** It is designed to evaluate the model's fine-grained tool-use capabilities. In this mode, the model is given the first $n$ steps of the reference tool chain as prompts and is expected to predict the action in step $n+1$. Four metrics are devised under step-by-step mode: ***InstAcc*** (instruction following accuracy), ***ToolAcc*** (tool selection accuracy), ***ArgAcc*** (argument prediction accuracy), and ***SummAcc*** (answer summarizing accuracy).\n\n- **End-to-end mode.** It is designed to reflect the tool agent's actual task execution performance. In this mode, the model actually calls the tools and solves the problem by itself. We use ***AnsAcc*** (final answer accuracy) to measure the accuracy of the execution result. In addition, we calculate four ***F1 scores of tool selection: P, O, L, C*** in the perception, operation, logic, and creativity categories, to measure the tool selection capability. \n\nHere is the performance of various LLMs on GTA. Inst, Tool, Arg, Summ, and Ans denote InstAcc, ToolAcc, ArgAcc, SummAcc, and AnsAcc, respectively. P, O, L, C denote the F1 scores of tool selection in the Perception, Operation, Logic, and Creativity categories. ***Bold*** denotes the best score among all models. \u003cins\u003e*Underline*\u003c/ins\u003e denotes the best score under the same model scale. 
***AnsAcc*** reflects the overall performance.\n\n**Models** | **Inst** | **Tool** | **Arg** | **Summ** | **P** | **O** | **L** | **C** | **Ans** | **Ans+I**\n---|---|---|---|---|---|---|---|---|---|---\n💛 ***API-based*** | | | | | | | | | |\ngpt-4-1106-preview | 85.19 | 61.4 | \u003cins\u003e**37.88**\u003c/ins\u003e | \u003cins\u003e**75**\u003c/ins\u003e | 67.61 | 64.61 | 74.73 |89.55 |  \u003cins\u003e**46.59**\u003c/ins\u003e | \u003cins\u003e**44.9**\u003c/ins\u003e\ngpt-4o | \u003cins\u003e**86.42**\u003c/ins\u003e | \u003cins\u003e**70.38**\u003c/ins\u003e | 35.19 | 72.77 | \u003cins\u003e**75.56**\u003c/ins\u003e | \u003cins\u003e**80**\u003c/ins\u003e | \u003cins\u003e**78.75**\u003c/ins\u003e | 82.35 | 41.52 | 40.05\ngpt-3.5-turbo | 67.63 | 42.91 | 20.83 | 60.24 | 58.99 | 62.5 | 59.85 | \u003cins\u003e**97.3**\u003c/ins\u003e | 23.62 | 21.18\nclaude3-opus |64.75 | 54.4 | 17.59 | 73.81 | 41.69 | 63.23 | 46.41 | 42.1 | 23.44 | 14.47\nmistral-large | 58.98 | 38.42 | 11.13 | 68.03 | 19.17 | 30.05 | 26.85 | 38.89 | 17.06 | 11.94\n💚 ***Open-source*** | | | | | | | | | |\nqwen1.5-72b-chat | \u003cins\u003e48.83\u003c/ins\u003e | 24.96 | \u003cins\u003e7.9\u003c/ins\u003e | 68.7 | 12.41 | 11.76 | 21.16 | 5.13 | \u003cins\u003e13.32\u003c/ins\u003e | \u003cins\u003e10.22\u003c/ins\u003e\nqwen1.5-14b-chat | 42.25 | 18.85 | 6.28 | 60.06 | 19.93 | 23.4 | \u003cins\u003e39.83\u003c/ins\u003e | 25.45 | 12.42 | 9.33\nqwen1.5-7b-chat | 29.77 | 7.36 | 0.18 | 49.38 | 0 | 13.95 | 16.22 | 36 | 10.56 | 7.93\nmixtral-8x7b-instruct | 28.67 | 12.03 | 0.36 | 54.21 | 2.19 | \u003cins\u003e34.69\u003c/ins\u003e | 37.68 | 42.55 | 9.77 | 9.33\ndeepseek-llm-67b-chat | 9.05 | 23.34 | 0.18 | 11.51 | 14.72 | 23.19 | 22.22 | 27.42 | 9.51 | 7.93\nllama3-70b-instruct | 47.6 | \u003cins\u003e36.8\u003c/ins\u003e | 4.31 | \u003cins\u003e69.06\u003c/ins\u003e | \u003cins\u003e32.37\u003c/ins\u003e | 22.37 | 36.48 | 31.86 | 8.32 | 6.25\nmistral-7b-instruct | 26.75 | 10.05 | 0 | 51.06 | 13.75 | 
33.66 | 35.58 | 31.11 | 7.37 | 5.54\ndeepseek-llm-7b-chat | 10.56 | 16.16 | 0.18 | 18.27 | 20.81 | 15.22 | 31.3 | 37.29 | 4 | 3.01\nyi-34b-chat | 23.23 | 10.77 | 0 | 34.99 | 11.6 | 11.76 | 12.97 | 5.13 | 3.21 | 2.41\nllama3-8b-instruct | 45.95 | 11.31 | 0 | 36.88 | 19.07 | 23.23 | 29.83 | \u003cins\u003e42.86\u003c/ins\u003e | 3.1 | 2.74\nyi-6b-chat | 21.26 | 14.72 | 0 | 32.54 | 1.47 | 0 | 1.18 | 0 | 0.58 | 0.44\n\n\n## 🏆 Leaderboard, Mar. 2025\n\n**Models** | **Inst** | **Tool** | **Arg** | **Summ** | **P** | **O** | **L** | **C** | **Ans** | **Ans+I**\n---|---|---|---|---|---|---|---|---|---|---\n💛 ***API-based*** | | | | | | | | | |\ndeepseek-v3 | 68.31 | 40.57 | 25.49 | 66.05 | 70.81 | 73.28 | 77.70 | 86.15 | \u003cins\u003e**44.78**\u003c/ins\u003e | \u003cins\u003e**49.67**\u003c/ins\u003e\nqwen-max-2.5 | 83.54 | 58.35 | 29.62 | 72.87 | 69.86 | 76.92 | 74.55 | \u003cins\u003e**89.55**\u003c/ins\u003e | 41.73 | 45.91\ngpt-4o | \u003cins\u003e**86.42**\u003c/ins\u003e | \u003cins\u003e**70.38**\u003c/ins\u003e | \u003cins\u003e**35.19**\u003c/ins\u003e | 72.77 | \u003cins\u003e**75.56**\u003c/ins\u003e | \u003cins\u003e**80**\u003c/ins\u003e | \u003cins\u003e**78.75**\u003c/ins\u003e | 82.35 | 41.52 | 40.05\n💚 ***Open-source*** | | | | | | | | | | |\nqwq-32b | 27.02 | 13.82 | 0 | 47.5 | \u003cins\u003e59.12\u003c/ins\u003e | \u003cins\u003e54.7\u003c/ins\u003e | 44.35 | 45.61 | \u003cins\u003e27.31\u003c/ins\u003e | \u003cins\u003e22.36\u003c/ins\u003e\ndeepseek-r1-distill-llama-70b | 30.73 | 7.72 | 0.36 | 48.46 | 34.03 | 42.37 | 27.23 | 37.5 | 13.09 | 10.21\ndeepseek-r1-distill-llama-8b | 27.3 | 14.72 | 0 | 52.6 | 22.29 | 38.78 | 23.59 | 39.13 | 11.10 | 9.45\nqwen2.5-7b-instruct | \u003cins\u003e56.38\u003c/ins\u003e | 32.85 | 5.57 | \u003cins\u003e65.75\u003c/ins\u003e | 20.67 | 29.17 | 20.83 | 45.83 | 9.06 | 8.95\nllama-3.1-8b-instruct | 41.15 | 24.24 | 1.08 | 64.71 | 36.32 | 43.69 | 47.3 | 21.59 | 8.78 | 8.08\nministral-8b-instruct-2410 | 42.39 | 22.08 | 
2.15 | 61.4 | 19.28 | 42.96 | \u003cins\u003e49.59\u003c/ins\u003e | \u003cins\u003e58.06\u003c/ins\u003e | 6.46 | 7.4\nmistral-large-instruct-2411 | 50.89 | \u003cins\u003e40.75\u003c/ins\u003e | \u003cins\u003e15.44\u003c/ins\u003e | 60.74 | 22.41 | 30.77 | 33.77 | 30.77 | 7.35 | 6.99\nllama-3.1-nemotron-70b-instruct-hf | 27.43 | 18.31 | 0 | 51.24 | 20.18 | 35.59 | 23.89 | 23.73 | 8.6 | 6.46\n\n\n## 🚀 Evaluate on GTA\n\n### Prepare GTA Dataset\n1. Clone this repo.\n```shell\ngit clone https://github.com/open-compass/GTA.git\ncd GTA\n```\n2. Download the dataset from [release file](https://github.com/open-compass/GTA/releases/download/v0.1.0/gta_dataset.zip).\n```shell\nmkdir ./opencompass/data\n```\nPut it under the folder ```./opencompass/data/```. The structure of files should be:\n```\nGTA/\n├── agentlego\n├── opencompass\n│   ├── data\n│   │   ├── gta_dataset\n│   ├── ...\n├── ...\n```\n\n### Prepare Your Model\n1. Download the model weights.\n```shell\npip install -U huggingface_hub\n# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False\nhuggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False\n```\n2. Install [LMDeploy](https://github.com/InternLM/lmdeploy).\n```shell\nconda create -n lmdeploy python=3.10\nconda activate lmdeploy\n```\nFor CUDA 12:\n```shell\npip install lmdeploy\n```\nFor CUDA 11+:\n```shell\nexport LMDEPLOY_VERSION=0.4.0\nexport PYTHON_VERSION=310\npip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118\n```\n3. 
Launch a model service.\n```shell\n# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]\nlmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat\n```\n### Deploy Tools\n1. Install [AgentLego](https://github.com/InternLM/agentlego).\n```shell\nconda create -n agentlego python=3.11.9\nconda activate agentlego\ncd agentlego\npip install -r requirements_all.txt\npip install agentlego\npip install -e .\nmim install mmengine\nmim install mmcv==2.1.0\n```\nOpen ```~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py```, then change ```_supports_sdpa = False``` to ```_supports_sdpa = True``` on line 1279.\n\n2. Deploy tools for GTA benchmark.\n\nTo use the GoogleSearch and MathOCR tools, first get a Serper API key from https://serper.dev and a Mathpix API key from https://mathpix.com/. Then export these keys as environment variables.\n\n```shell\nexport SERPER_API_KEY='your_serper_key_for_google_search_tool'\nexport MATHPIX_APP_ID='your_mathpix_key_for_mathocr_tool'\nexport MATHPIX_APP_KEY='your_mathpix_key_for_mathocr_tool'\n```\n\nStart the tool server.\n\n```shell\nagentlego-server start --port 16181 --extra ./benchmark.py  `cat benchmark_toollist.txt` --host 0.0.0.0\n```\n### Start Evaluation\n1. Install [OpenCompass](https://github.com/open-compass/opencompass).\n```shell\nconda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y\nconda activate opencompass\ncd agentlego\npip install -e .\ncd ../opencompass\npip install -e .\n```\nThe following package versions are specified for this environment: ```huggingface_hub==0.25.2``` (\u003c0.26.0) and ```transformers==4.40.1```.\n\n2. 
Modify the config file at ```configs/eval_gta_bench.py``` as below.\n\nThe ip and port number of **openai_api_base** is the ip of your model service and the port number you specified when using lmdeploy.\n\nThe ip and port number of **tool_server** is the ip of your tool service and the port number you specified when using agentlego.\n\n```python\nmodels = [\n  dict(\n        abbr='qwen1.5-7b-chat',\n        type=LagentAgent,\n        agent_type=ReAct,\n        max_turn=10,\n        llm=dict(\n            type=OpenAI,\n            path='qwen1.5-7b-chat',\n            key='EMPTY',\n            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',\n            query_per_second=1,\n            max_seq_len=4096,\n            stop='\u003c|im_end|\u003e',\n        ),\n        tool_server='http://10.140.0.138:16181',\n        tool_meta='data/gta_dataset/toolmeta.json',\n        batch_size=8,\n    ),\n]\n```\n\nIf you infer and evaluate in **step-by-step** mode, you should comment out **tool_server** and enable **tool_meta** in ```configs/eval_gta_bench.py```, and set infer mode and eval mode to **every_with_gt** in ```configs/datasets/gta_bench.py```:\n```python\nmodels = [\n  dict(\n        abbr='qwen1.5-7b-chat',\n        type=LagentAgent,\n        agent_type=ReAct,\n        max_turn=10,\n        llm=dict(\n            type=OpenAI,\n            path='qwen1.5-7b-chat',\n            key='EMPTY',\n            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',\n            query_per_second=1,\n            max_seq_len=4096,\n            stop='\u003c|im_end|\u003e',\n        ),\n        # tool_server='http://10.140.0.138:16181',\n        tool_meta='data/gta_dataset/toolmeta.json',\n        batch_size=8,\n    ),\n]\n```\n```python\ngta_bench_infer_cfg = dict(\n    prompt_template=dict(\n        type=PromptTemplate,\n        template=\"\"\"{questions}\"\"\",\n    ),\n    retriever=dict(type=ZeroRetriever),\n    inferencer=dict(type=AgentInferencer, 
infer_mode='every_with_gt'),\n)\ngta_bench_eval_cfg = dict(evaluator=dict(type=GTABenchEvaluator, mode='every_with_gt'))\n```\n\nIf you infer and evaluate in **end-to-end** mode, you should comment out **tool_meta** and enable **tool_server** in ```configs/eval_gta_bench.py```, and set infer mode and eval mode to **every** in ```configs/datasets/gta_bench.py```:\n```python\nmodels = [\n  dict(\n        abbr='qwen1.5-7b-chat',\n        type=LagentAgent,\n        agent_type=ReAct,\n        max_turn=10,\n        llm=dict(\n            type=OpenAI,\n            path='qwen1.5-7b-chat',\n            key='EMPTY',\n            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',\n            query_per_second=1,\n            max_seq_len=4096,\n            stop='\u003c|im_end|\u003e',\n        ),\n        tool_server='http://10.140.0.138:16181',\n        # tool_meta='data/gta_dataset/toolmeta.json',\n        batch_size=8,\n    ),\n]\n```\n```python\ngta_bench_infer_cfg = dict(\n    prompt_template=dict(\n        type=PromptTemplate,\n        template=\"\"\"{questions}\"\"\",\n    ),\n    retriever=dict(type=ZeroRetriever),\n    inferencer=dict(type=AgentInferencer, infer_mode='every'),\n)\ngta_bench_eval_cfg = dict(evaluator=dict(type=GTABenchEvaluator, mode='every'))\n```\n\n3. 
Infer and evaluate with OpenCompass.\n```shell\n# infer only\npython run.py configs/eval_gta_bench.py --max-num-workers 32 --debug --mode infer\n```\n```shell\n# evaluate only\n# srun -p llmit -q auto python run.py configs/eval_gta_bench.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval\nsrun -p llmit -q auto python run.py configs/eval_gta_bench.py --max-num-workers 32 --debug --reuse 20240628_115514 --mode eval\n```\n```shell\n# infer and evaluate\npython run.py configs/eval_gta_bench.py -p llmit -q auto --max-num-workers 32 --debug\n```\n\n\n# 📝 Citation\nIf you use GTA in your research, please cite the following paper:\n```bibtex\n@misc{wang2024gtabenchmarkgeneraltool,\n      title={GTA: A Benchmark for General Tool Agents}, \n      author={Jize Wang and Zerun Ma and Yining Li and Songyang Zhang and Cailian Chen and Kai Chen and Xinyi Le},\n      year={2024},\n      eprint={2407.08713},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2407.08713}, \n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-compass%2FGTA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-compass%2FGTA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-compass%2FGTA/lists"}