{"id":13754126,"url":"https://github.com/open-compass/T-Eval","last_synced_at":"2025-05-09T22:30:50.284Z","repository":{"id":213629217,"uuid":"729712342","full_name":"open-compass/T-Eval","owner":"open-compass","description":"[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step","archived":false,"fork":false,"pushed_at":"2024-04-03T21:05:37.000Z","size":2830,"stargazers_count":230,"open_issues_count":36,"forks_count":14,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-11-16T06:05:03.629Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://open-compass.github.io/T-Eval/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/open-compass.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-10T05:18:46.000Z","updated_at":"2024-11-11T03:10:35.000Z","dependencies_parsed_at":"2024-01-23T04:21:52.022Z","dependency_job_id":"e9cbc639-3942-4cad-9fd2-8aac84792799","html_url":"https://github.com/open-compass/T-Eval","commit_stats":null,"previous_names":["open-compass/t-eval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FT-Eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FT-Eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FT-Eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/open-compass%2FT-Eval/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/open-compass","download_url":"https://codeload.github.com/open-compass/T-Eval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224884612,"owners_count":17386121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:41.214Z","updated_at":"2024-11-16T06:31:38.562Z","avatar_url":"https://github.com/open-compass.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Benchmark"],"sub_categories":["大语言对话模型及数据","Tool Use"],"readme":"# T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step\n\n[![arXiv](https://img.shields.io/badge/arXiv-2312.14033-b31b1b.svg)](https://arxiv.org/abs/2312.14033)\n[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)\n\n## ✨ Introduction  \n\nThis is an evaluation harness for the benchmark described in [T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step](https://arxiv.org/abs/2312.14033). \n\n[[Paper](https://arxiv.org/abs/2312.14033)]\n[[Project Page](https://open-compass.github.io/T-Eval/)]\n[[LeaderBoard](https://open-compass.github.io/T-Eval/leaderboard.html)]\n[[HuggingFace](https://huggingface.co/datasets/lovesnowbest/T-Eval)]\n\n\u003e Large language models (LLM) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool utilization capability of LLMs is still under-explored. 
In contrast to previous works that evaluate models holistically, we comprehensively decompose the tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we further introduce T-Eval to evaluate the tool-utilization capability step by step. T-Eval disentangles the tool utilization evaluation into several sub-domains along model capabilities, facilitating the inner understanding of both holistic and isolated competency of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, providing a new perspective in LLM evaluation on tool-utilization ability.\n\n\u003cdiv\u003e\n\u003ccenter\u003e\n\u003cimg src=\"figs/teaser.png\"\u003e\n\u003c/div\u003e\n\n## 🚀 What's New\n\n- **[2024.02.22]** Released new [data](https://drive.google.com/file/d/1AqFOV7mVnVMy7gr3DyryHtIAPVarIITw/view?usp=sharing) and a [1/5 subset](https://drive.google.com/file/d/1DgCMjquEIJ2v14Xu6uB6w3UEzaYXZbUL/view?usp=sharing) (both Chinese and English), plus code for faster inference! 🚀🚀🚀 The leaderboard will be updated soon! We also provide template examples for reference.\n- **[2024.01.08]** Released the [ZH Leaderboard](https://open-compass.github.io/T-Eval/leaderboard_zh.html) and ~~[ZH data](https://drive.google.com/file/d/1z25duwZAnBrPN5jYu9-8RMvfqnwPByKV/view?usp=sharing)~~, where the questions and answer formats are in Chinese. (The Chinese evaluation dataset and leaderboard are now public.) ✨✨✨\n- **[2023.12.22]** Paper available on [ArXiv](https://arxiv.org/abs/2312.14033). 🔥🔥🔥\n- **[2023.12.21]** Released the test scripts and data for T-Eval. 🎉🎉🎉\n\n## 🧾 TODO\n\n- [x] Support Batch Inference. 
NOTE: Some models (ChatGLM, Qwen, InternV1) do not support batch inference.\n- [x] Change the role of function response from `system` to `function`.\n- [x] Merge consecutive same-role conversations.\n- [x] Provide template configs for open-sourced models.\n- [x] Provide dev set for T-Eval, reducing the evaluation time.\n- [x] Optimize the inference pipeline of HuggingFace models provided by Lagent, which will be 3x faster. **(Please upgrade Lagent to v0.2)**\n- [ ] Support inference on OpenCompass.\n\n~~NOTE: These TODOs will be started after 2024.2.1~~ Thanks for your patience!\n\n## 🛠️ Preparations\n\n```bash\n$ git clone https://github.com/open-compass/T-Eval.git\n$ cd T-Eval\n$ pip install -r requirements.txt\n$ git clone https://github.com/InternLM/lagent.git\n$ cd lagent \u0026\u0026 pip install -e .\n```\n\n##  🛫️ Get Started\n\nWe support both API-based models and HuggingFace models via [Lagent](https://github.com/InternLM/lagent).\n\n### 💾 Test Data\n\nThe test data can be downloaded from either Google Drive or HuggingFace Datasets:\n\n1. Google Drive\n\n~~[[EN data](https://drive.google.com/file/d/1ebR6WCCbS9-u2x7mWpWy8wV_Gb6ltgpi/view?usp=sharing)] (English format) [[ZH data](https://drive.google.com/file/d/1z25duwZAnBrPN5jYu9-8RMvfqnwPByKV/view?usp=sharing)] (Chinese format)~~\n[T-Eval Data](https://drive.google.com/file/d/1nQ0pn26qd0FGU8UkfSTxNdu6uWI0QXTY/view?usp=sharing)\n\n2. HuggingFace Datasets\n\nYou can also access the dataset on HuggingFace via this [link](https://huggingface.co/datasets/lovesnowbest/T-Eval).\n\n```python\nfrom datasets import load_dataset\ndataset = load_dataset(\"lovesnowbest/T-Eval\")\n```\n\nAfter downloading, please put the data in the `data` folder directly:\n```\n- data/\n  - instruct_v2.json\n  - plan_json_v2.json\n  ...\n```\n\n### 🤖 API Models\n\n1. Set your OpenAI API key in your environment.\n```bash\nexport OPENAI_API_KEY=xxxxxxxxx\n```\n2. 
Run the model with the following scripts:\n```bash\n# test all data at once\nsh test_all_en.sh api gpt-4-1106-preview gpt4\n# test ZH dataset\nsh test_all_zh.sh api gpt-4-1106-preview gpt4\n# test for Instruct only\npython test.py --model_type api --model_path gpt-4-1106-preview --resume --out_name instruct_gpt4.json --out_dir work_dirs/gpt4/ --dataset_path data/instruct_v2.json --eval instruct --prompt_type json\n```\n\n### 🤗 HuggingFace Models\n\n1. Download the HuggingFace model to your local path.\n2. Modify the `meta_template` json according to your tested model.\n3. Run the model with the following scripts:\n```bash\n# test all data at once\nsh test_all_en.sh hf $HF_PATH $HF_MODEL_NAME $META_TEMPLATE\n# test ZH dataset\nsh test_all_zh.sh hf $HF_PATH $HF_MODEL_NAME $META_TEMPLATE\n# test for Instruct only\npython test.py --model_type hf --model_path $HF_PATH --resume --out_name instruct_$HF_MODEL_NAME.json --out_dir data/work_dirs/ --dataset_path data/instruct_v1.json --eval instruct --prompt_type json --model_display_name $HF_MODEL_NAME --meta_template $META_TEMPLATE\n```\n\n### 💫 Final Results\nOnce all test samples have been processed, detailed evaluation results are logged at `$out_dir/$model_display_name/$model_display_name_-1.json` (for the ZH dataset, there is a `_zh` suffix). To obtain your final score, please run the following command:\n```bash\npython teval/utils/convert_results.py --result_path $out_dir/$model_display_name/$model_display_name_-1.json\n```\n\n## 🔌 Protocols\n\nT-Eval adopts a multi-conversation style of evaluation to gauge the model. 
The format of our saved prompt is as follows:\n```python\n[\n    {\n        \"role\": \"system\",\n        \"content\": \"You have access to the following API:\\n{'name': 'AirbnbSearch.search_property_by_place', 'description': 'This function takes various parameters to search properties on Airbnb.', 'required_parameters': [{'name': 'place', 'type': 'STRING', 'description': 'The name of the destination.'}], 'optional_parameters': [], 'return_data': [{'name': 'property', 'description': 'a list of at most 3 properties, containing id, name, and address.'}]}\\nPlease generate the response in the following format:\\ngoal: goal to call this action\\n\\nname: api name to call\\n\\nargs: JSON format api args in ONLY one line\\n\"\n    },\n    {\n        \"role\": \"user\",\n        \"content\": \"Call the function AirbnbSearch.search_property_by_place with the parameter as follows: 'place' is 'Berlin'.\"\n    }\n]\n```\nwhere `role` is one of ['system', 'user', 'assistant'], and `content` must be a string. Before running inference with an LLM, the prompt must be converted into a raw string via `meta_template`. `meta_template` examples are provided at [meta_template.py](teval/utils/meta_template.py):\n```python\n[\n    dict(role='system', begin='\u003c|System|\u003e:', end='\\n'),\n    dict(role='user', begin='\u003c|User|\u003e:', end='\\n'),\n    dict(\n        role='assistant',\n        begin='\u003c|Bot|\u003e:',\n        end='\u003ceoa\u003e\\n',\n        generate=True)\n]\n```\nSpecify the `begin` and `end` tokens for your tested HuggingFace model in [meta_template.py](teval/utils/meta_template.py), then pass the `--meta_template` argument to `test.py` using the same name you set in `meta_template.py`. 
For OpenAI models, this is handled for you.\n\n\n## 📊 Benchmark Results\n\nFor more detailed and comprehensive benchmark results, refer to the 🏆 [T-Eval official leaderboard](https://open-compass.github.io/T-Eval/leaderboard.html)!\n\n\u003cdiv\u003e\n\u003ccenter\u003e\n\u003cimg src=\"figs/teval_results.png\"\u003e\n\u003c/div\u003e\n\n### ✉️ Submit Your Results\n\nYou can submit your inference results (produced by running test.py) to this [email](mailto:lovesnow@mail.ustc.edu.cn). We will evaluate your predictions and update the results in our leaderboard. Please also provide the scale of your tested model. Your submission should be structured as follows:\n```\n$model_display_name/\n    instruct_$model_display_name/\n        query_0_1_0.json\n        query_0_1_1.json\n        ...\n    plan_json_$model_display_name/\n    plan_str_$model_display_name/\n    ...\n```\n\n## ❤️ Acknowledgements\n\nT-Eval is built with [Lagent](https://github.com/InternLM/lagent) and [OpenCompass](https://github.com/open-compass/opencompass). Thanks for their awesome work!\n\n## 🖊️ Citation\n\nIf you find this project useful in your research, please consider citing it:\n```\n@article{chen2023t,\n  title={T-Eval: Evaluating the Tool Utilization Capability Step by Step},\n  author={Chen, Zehui and Du, Weihua and Zhang, Wenwei and Liu, Kuikun and Liu, Jiangning and Zheng, Miao and Zhuo, Jingming and Zhang, Songyang and Lin, Dahua and Chen, Kai and others},\n  journal={arXiv preprint arXiv:2312.14033},\n  year={2023}\n}\n```\n\n## 💳 License\n\nThis project is released under the Apache 2.0 [license](./LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-compass%2FT-Eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-compass%2FT-Eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-compass%2FT-Eval/lists"}