{"id":13407441,"url":"https://github.com/OpenBMB/ToolBench","last_synced_at":"2025-03-14T12:31:04.591Z","repository":{"id":170209277,"uuid":"646333922","full_name":"OpenBMB/ToolBench","owner":"OpenBMB","description":"[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language model for tool learning.","archived":false,"fork":false,"pushed_at":"2024-11-18T09:29:52.000Z","size":60333,"stargazers_count":4906,"open_issues_count":126,"forks_count":432,"subscribers_count":49,"default_branch":"master","last_synced_at":"2025-03-06T03:53:55.629Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://openbmb.github.io/ToolBench/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenBMB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-28T03:46:17.000Z","updated_at":"2025-03-05T15:31:44.000Z","dependencies_parsed_at":"2023-10-12T13:21:49.820Z","dependency_job_id":"ebb67427-0329-4a5d-a7a9-0bdd7f50ffcf","html_url":"https://github.com/OpenBMB/ToolBench","commit_stats":{"total_commits":158,"total_committers":20,"mean_commits":7.9,"dds":0.5063291139240507,"last_synced_commit":"0aaf368734fdc72284b31b4c85467f0153489e84"},"previous_names":["openbmb/toolbench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FToolBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FToolBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/OpenBMB%2FToolBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FToolBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenBMB","download_url":"https://codeload.github.com/OpenBMB/ToolBench/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243577757,"owners_count":20313696,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T20:00:40.359Z","updated_at":"2025-03-14T12:31:04.574Z","avatar_url":"https://github.com/OpenBMB.png","language":"Python","readme":"\u003cdiv align= \"center\"\u003e\n    \u003ch1\u003e 🛠️ToolBench🤖\u003c/h1\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n![Dialogues](https://img.shields.io/badge/Tool\\_Num-3451-red?style=flat-square)\n![Dialogues](https://img.shields.io/badge/API\\_Num-16464-red?style=flat-square)\n![Dialogues](https://img.shields.io/badge/Current\\_Dataset\\_Size-126K-red?style=flat-square)\n![Dialogues](https://img.shields.io/badge/Total\\_API\\_Call-469K-red?style=flat-square)\n![Dialogues](https://img.shields.io/badge/Average\\_Reasoning\\_Traces-4.0-red?style=flat-square)\n![Dialogues](https://img.shields.io/badge/Tool\\_LLaMA-Released-green?style=flat-square)\n\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#model\"\u003eModel\u003c/a\u003e •\n  \u003ca href=\"#data\"\u003eData Release\u003c/a\u003e •\n  \u003ca href=\"#web-ui\"\u003eWeb Demo\u003c/a\u003e •\n  \u003ca href=\"#tooleval\"\u003eTool Eval\u003c/a\u003e 
•\n  \u003ca href=\"https://arxiv.org/pdf/2307.16789.pdf\"\u003ePaper\u003c/a\u003e •\n  \u003ca href=\"#citation\"\u003eCitation\u003c/a\u003e\n\n\u003c/p\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/ToolLLaMA-logo.png\" width=\"350px\"\u003e\n\u003c/div\u003e\n\n🔨This project (ToolLLM) aims to construct **open-source, large-scale, high-quality** instruction tuning SFT data to facilitate the construction of powerful LLMs with general **tool-use** capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced [function call](https://openai.com/blog/function-calling-and-other-api-updates) capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.\n\n**2024.8 Update** We have updated the RapidAPI server with a new IP, please make sure you get the latest code. You can also build it locally using codes [here](https://drive.google.com/file/d/1JdbHkL2D8as1docfHyfLWhrhlSP9rZhf/view?usp=sharing).\n\n**💁‍♂️💁💁‍♀️ Join Us on [Discord](https://discord.gg/NScFnpMuRQ)!**\n\n*Read this in [中文](README_ZH.md).*\n\n## What's New\n- **[2024/3/17]** Welcome to **[StableToolBench](https://github.com/zhichengg/StableToolBench)**:\nA **stable and reliable** local toolbench server based on API response simulation. Dive deeper into the tech behind StableToolBench with [paper here](https://arxiv.org/pdf/2403.07714.pdf) and explore more on the [project homepage](https://zhichengg.github.io/stb.github.io/). Codes are available [here](https://github.com/zhichengg/StableToolBench).\n\n- **[2023/9/29]** A new version ToolEval which is more stable and covers more models including GPT4! 
Please refer to [**ToolEval**](https://github.com/OpenBMB/ToolBench/tree/master/toolbench/tooleval) for more details. In addition, [**ToolLLaMA-2-7b-v2**](https://huggingface.co/ToolBench/ToolLLaMA-2-7b-v2) is released with stronger tool-use capabilities. Please use the ToolLLaMA-2-7b-v2 model to reproduce our latest experimental results with the new version of ToolEval.\n\n- **[2023/8/30]** Data update, with more than **120,000** solution path annotations and **intact reasoning thoughts**! Please find `data.zip` on [Google Drive](https://drive.google.com/drive/folders/1yBUQ732mPu-KclJnuQELEhtKakdXFc3J).\n\n- **[2023/8/8]** No more hallucination! [**ToolLLaMA-2-7b-v1**](https://huggingface.co/ToolBench/ToolLLaMA-2-7b-v1) (fine-tuned from LLaMA-2-7b) is released with lower API hallucination than ChatGPT.\n\n- **[2023/8/4]** We provide a **RapidAPI backend service** to free you from using your own RapidAPI key and subscribing to the APIs. Please fill out our [form](https://forms.gle/S4hqVLtnqeXcNTCJA). We will review it as soon as possible and send you the ToolBench key to get started! 
\n\n- **[2023/8/1]** Our [**paper**](https://arxiv.org/abs/2307.16789) is released.\n\n- **[2023/7/27]** **New version** ToolBench is released.\n\n✨Here is an overview of the dataset construction, training, and evaluation.\n\n\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/overview.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\n✨✨Features:\n - **API Collection**: we gather **16464** representational state transfer (REST) APIs from [RapidAPI](https://rapidapi.com/hub), a platform that hosts massive real-world APIs provided by developers.\n - **Instruction Generation**: we curate instructions that involve both **single-tool** and **multi-tool** scenarios.\n - **Answer Annotation**: we develop a novel **depth-first search based decision tree** (DFSDT) to bolster the planning and reasoning ability of LLMs, which significantly improves the annotation efficiency and successfully annotates those complex instructions that cannot be answered with CoT or ReACT. We provide responses that not only include the final answer but also incorporate the model's **reasoning process, tool execution, and tool execution results**. 
\n - **API Retriever**: we incorporate API retrieval to equip ToolLLaMA with open-domain tool-using abilities.\n - All the data is automatically generated by the OpenAI API and filtered by us, so the whole data creation process is easy to scale up.\n\n\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/comparison.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\nWe also provide **a demo of using ToolLLaMA**:\n\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/OpenBMB/ToolBench/assets/25274507/f1151d85-747b-4fac-92ff-6c790d8d9a31\n\n\u003c/div\u003e\n\nCurrently, our ToolLLaMA has reached the performance of ChatGPT (turbo-16k) in tool use. In the future, *we will continually improve the data quality and increase the coverage of real-world tools.*\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/performance.png\" width=\"300px\"\u003e\n\u003c/div\u003e\n\nHere is the *[Old version](https://github.com/OpenBMB/ToolBench/tree/legacy)* of ToolBench.\n\n## Data\n\n👐ToolBench is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under Apache License 2.0. Below are the statistics of the data:\n\n| Tool Nums | API Nums | Instance Nums | Real API Call | Reasoning Traces |\n|-----------|----------|---------------|---------------|------------------|\n| 3451      | 16464    | 126486        | 469585        | 4.0              |\n\nWe crawl 16000+ real-world APIs from [RapidAPI](https://rapidapi.com/hub), and curate realistic human instructions that involve them. Below we present a hierarchy of RapidAPI and our instruction generation process.\n\n\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/instructiongeneration.png\" width=\"800px\"\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\nToolBench contains both single-tool and multi-tool scenarios. 
The multi-tool scenarios can be further categorized into intra-category multi-tool and intra-collection multi-tool. We utilize DFSDT method for all scenarios to our data creation. Here is an illustration for the data creation process using DFSDT method:\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"assets/answer_anno.png\" width=\"800px\"\u003e\n\n\u003c/div\u003e\n\n### Data Release\n\n Please download our dataset using the following link: [Google Drive](https://drive.google.com/drive/folders/1yBUQ732mPu-KclJnuQELEhtKakdXFc3J) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/c9e50625743b40bfbe10/). *Notice that `data_0801` is the old version data.*\nThe file structure is as follows:\n```\n├── /data/\n│  ├── /instruction/\n│  ├── /answer/\n│  ├── /toolenv/\n│  ├── /retrieval/\n│  ├── /test_instruction/\n│  ├── /test_query_ids/\n│  ├── /retrieval_test_query_ids/\n│  ├── toolllama_G123_dfs_train.json\n│  └── toolllama_G123_dfs_eval.json\n├── /reproduction_data/\n│  ├── /chatgpt_cot/\n│  ├── /chatgpt_dfs/\n│  ├── ...\n│  └── /toolllama_dfs/\n```\nHere are some descriptions for the `data` directory:\n- `instruction` and `answer`: The instruction data and solution path annotation data. `G1`,`G2`, `G3` refers to single-tool, intra-category multi-tool and intra-collection multi-tool data respectively. We also have an [Atlas Explorer](https://atlas.nomic.ai/map/58aca169-c29a-447a-8f01-0d418fc4d341/030ddad7-5305-461c-ba86-27e1ca79d899) for visualization.\n- `toolenv`: The tool environment related data, containing API jsons, API codes and API example responses.\n- `retrieval`: The data used for tool retrieval is included in this directory.\n- `test_instruction` and `test_query_ids`: We sample 200 instances from every test set. 
The `test_instruction` directory contains test queries for each test set, and the `test_query_ids` contains query ids of the test instances in each test set.\n- `retrieval_test_query_ids`: This directory contains query ids of the test instances for retriever.\n- `toolllama_G123_dfs_train.json` and `toolllama_G123_dfs_eval.json`: Preprocessed data that can be used to train toolllama directly and reproduce our results. For preprocessing details, we split the G1, G2 and G3 data into train, eval and test parts respectively and combine the train data for training in our main experiments.\n\n*Please make sure you have downloaded the necessary data and put the directory (e.g. `data/`) under `ToolBench/`, so that the following bash scripts can navigate to the related data.*\n\n## 🤖Model\n\nWe release the [ToolLLaMA-2-7b-v2](https://huggingface.co/ToolBench/ToolLLaMA-2-7b-v2) which is trained on the latest version data, and [ToolLLaMA-7b-v1](https://huggingface.co/ToolBench/ToolLLaMA-7b-v1), [ToolLLaMA-7b-LoRA-v1](https://huggingface.co/ToolBench/ToolLLaMA-7b-LoRA-v1) which are trained on the 0801 version data. All models are trained on the released dataset in a multi-task fashion. 
We also release the [tool retriever](https://huggingface.co/ToolBench/ToolBench_IR_bert_based_uncased) trained under our experimental setting.\n\n## 🚀Fine-tuning\n### Install\nClone this repository and navigate to the ToolBench folder.\n```bash\ngit clone git@github.com:OpenBMB/ToolBench.git\ncd ToolBench\n```\nInstall Package (python\u003e=3.9)\n```bash\npip install -r requirements.txt\n```\nor for ToolEval only\n```bash\npip install -r toolbench/tooleval/requirements.txt\n```\n\nPrepare the data and tool environment:\n```bash\nwget --no-check-certificate 'https://drive.google.com/uc?export=download\u0026id=1XFjDxVZdUY7TXYF2yvzx3pJlS2fy78jk\u0026confirm=yes' -O data.zip\nunzip data.zip\n```\nhttps://drive.google.com/file/d/1XFjDxVZdUY7TXYF2yvzx3pJlS2fy78jk/view?usp=drive_link\n\n### Training Retriever\n- Data preprocessing:\n```bash\nexport PYTHONPATH=./\npython preprocess/preprocess_retriever_data.py \\\n    --query_file data/instruction/G1_query.json \\\n    --index_file data/test_query_ids/G1_instruction_test_query_ids.json \\\n    --dataset_name G1 \\\n    --output_dir data/retrieval/G1\n```\n- Then run the following command to train the tool retriever:\n```bash\nexport PYTHONPATH=./\npython toolbench/retrieval/train.py \\\n    --data_path data/retrieval/G1/ \\\n    --model_name bert-base-uncased \\\n    --output_path retrieval_model \\\n    --num_epochs 5 \\\n    --train_batch_size 32 \\\n    --learning_rate 2e-5 \\\n    --warmup_steps 500 \\\n    --max_seq_length 256\n```\n\n### Training ToolLLaMA\n- Data preprocessing, for G1_answer as an example:\n```bash\nexport PYTHONPATH=./\npython preprocess/preprocess_toolllama_data.py \\\n    --tool_data_dir data/answer/G1_answer \\\n    --method DFS_woFilter_w2 \\\n    --output_file data/answer/toolllama_G1_dfs.json\n```\n- Our training code is based on [FastChat](https://github.com/lm-sys/FastChat). 
You can use the following command to train ToolLLaMA-7b with 2 x A100 (80GB), with our preprocessed data `data/toolllama_G123_dfs_train.json`. For preprocessing details, we split the G1, G2 and G3 data into train, eval and test parts respectively and combine the train data for training in our main experiments:\n```bash\nexport PYTHONPATH=./\ntorchrun --nproc_per_node=2 --master_port=20001 toolbench/train/train_mem.py \\\n    --model_name_or_path huggyllama/llama-7b  \\\n    --data_path  data/toolllama_G123_dfs_train.json \\\n    --eval_data_path  data/toolllama_G123_dfs_eval.json \\\n    --conv_template tool-llama-single-round \\\n    --bf16 True \\\n    --output_dir toolllama \\\n    --num_train_epochs 2 \\\n    --per_device_train_batch_size 2 \\\n    --per_device_eval_batch_size 2 \\\n    --gradient_accumulation_steps 8 \\\n    --evaluation_strategy \"epoch\" \\\n    --prediction_loss_only \\\n    --save_strategy \"epoch\" \\\n    --save_total_limit 8 \\\n    --learning_rate 5e-5 \\\n    --weight_decay 0. 
\\\n    --warmup_ratio 0.04 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --fsdp \"full_shard auto_wrap\" \\\n    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \\\n    --tf32 True \\\n    --source_model_max_length 2048 \\\n    --model_max_length 8192 \\\n    --gradient_checkpointing True \\\n    --lazy_preprocess True \\\n    --report_to none\n```\n\nTo train lora version:\n```bash\nexport PYTHONPATH=./\ndeepspeed --master_port=20001 toolbench/train/train_lora.py \\\n    --model_name_or_path huggyllama/llama-7b  \\\n    --data_path  data/toolllama_G123_dfs_train.json \\\n    --eval_data_path  data/toolllama_G123_dfs_eval.json \\\n    --conv_template tool-llama-single-round \\\n    --bf16 True \\\n    --output_dir toolllama_lora \\\n    --num_train_epochs 5 \\\n    --per_device_train_batch_size 4 \\\n    --per_device_eval_batch_size 2 \\\n    --gradient_accumulation_steps 2 \\\n    --evaluation_strategy \"epoch\" \\\n    --prediction_loss_only \\\n    --save_strategy \"epoch\" \\\n    --save_total_limit 8 \\\n    --learning_rate 5e-5 \\\n    --weight_decay 0. \\\n    --warmup_ratio 0.04 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --source_model_max_length 2048 \\\n    --model_max_length 8192 \\\n    --gradient_checkpointing True \\\n    --lazy_preprocess True \\\n    --deepspeed ds_configs/stage2.json \\\n    --report_to none\n```\n\n\n## Inference With Our RapidAPI Server\nPlease fill out the [form](https://forms.gle/S4hqVLtnqeXcNTCJA) first and after reviewing we will send you the toolbench key. 
Then prepare your toolbench key by:\n```bash\nexport TOOLBENCH_KEY=\"your_toolbench_key\"\n```\n\n### For ToolLLaMA\n\nTo inference with ToolLLaMA, run the following commands:\n```bash\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --backbone_model toolllama \\\n    --model_path ToolBench/ToolLLaMA-7b \\\n    --max_observation_length 1024 \\\n    --observ_compress_method truncate \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file toolllama_dfs_inference_result \\\n    --toolbench_key $TOOLBENCH_KEY\n```\n\nFor **ToolLLaMA-LoRA**:\n```bash\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --backbone_model toolllama \\\n    --model_path huggyllama/llama-7b \\\n    --lora \\\n    --lora_path /path/to/your/downloaded/ToolLLaMA-7b-LoRA \\\n    --max_observation_length 1024 \\\n    --observ_compress_method truncate \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file toolllama_lora_dfs_inference_result \\\n    --toolbench_key $TOOLBENCH_KEY\n```\n\nFor ToolLLaMA-LoRA under **open-domain** setting, run:\n```bash\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline_open_domain.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --corpus_tsv_path data/retrieval/G1/corpus.tsv \\\n    --retrieval_model_path /path/to/your/retrival_model \\\n    --retrieved_api_nums 5 \\\n    --backbone_model toolllama \\\n    --model_path huggyllama/llama-7b \\\n    --lora \\\n    --lora_path /path/to/your/toolllama_lora \\\n    --max_observation_length 1024 \\\n    --observ_compress_method truncate \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file toolllama_lora_dfs_open_domain_inference_result \\\n    
--toolbench_key $TOOLBENCH_KEY\n```\n\n### For OpenAI Models\nTo use ChatGPT, run:\n```bash\nexport TOOLBENCH_KEY=\"\"\nexport OPENAI_KEY=\"\"\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --backbone_model chatgpt_function \\\n    --openai_key $OPENAI_KEY \\\n    --max_observation_length 1024 \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file chatgpt_dfs_inference_result \\\n    --toolbench_key $TOOLBENCH_KEY\n```\n\nTo use Text-Davinci-003, run:\n```bash\nexport TOOLBENCH_KEY=\"\"\nexport OPENAI_KEY=\"\"\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --backbone_model davinci \\\n    --openai_key $OPENAI_KEY \\\n    --max_observation_length 1024 \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file davinci_dfs_inference_result \\\n    --toolbench_key $TOOLBENCH_KEY\n```\n\n## Inference With Your Own RapidAPI Account\nTo do inference with customized RapidAPI account, pass your **rapidapi key** through `rapidapi_key` and specify the `use_rapidapi_key` argument in the script:\n```bash\nexport RAPIDAPI_KEY=\"\"\nexport OPENAI_KEY=\"\"\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --backbone_model chatgpt_function \\\n    --openai_key $OPENAI_KEY \\\n    --max_observation_length 1024 \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file chatgpt_dfs_inference_result \\\n    --rapidapi_key $RAPIDAPI_KEY \\\n    --use_rapidapi_key\n```\n\n## API Customization\nTo do inference with customized API(s), you should prepare the API documentation and code, then modify your query. 
For example, to add an API **hello_world** which returns a \"hello world\" string:\n- API documentation: First generate the API documentation `hello_world.json`, which should follow this format:\n```\n{\n    \"tool_description\": \"Return hello world.\",\n    \"tool_name\": \"hello world\",\n    \"title\": \"hello world\",\n    \"api_list\": [\n        {\n            \"name\": \"get_hello_world\",\n            \"url\": \"\",\n            \"description\": \"To get 'hello world'.\",\n            \"method\": \"GET\",\n            \"required_parameters\": [],\n            \"optional_parameters\": []\n        }\n    ],\n    \"standardized_name\": \"hello_world\"\n}\n```\nThen put it under a specific category in `data/toolenv/tools/`, either one of the 49 existing categories or a new category, e.g. `Customized`. \n- API code: Create a directory naming the `hello_world` under `Customized` directory. Then write a code `api.py` to realize the function of the API and put it under `Customized/hello_world/`. 
The API code can be written in this format:\n```python\ndef get_hello_world():\n    \"\"\"\n    To get hello world \n    \"\"\"\n    observation = \"hello world\"\n    return observation\n```\nNow the file structure under `data/toolenv/` should be:\n```\n├── /tools/\n│  ├── /Sports/\n│  │  ├── basketball.json\n│  │  ├── /basketball/\n│  │  │  └── api.py\n│  │  └── ...\n│  ├── ...\n│  ├── /Customized/\n│  │  ├── hello_world.json\n│  │  ├── /hello_world/\n│  │  │  └── api.py\n└── response_examples\n```\n- Modify your query file, and the query file should follow the following format:\n```\n[\n    {\n        \"query\": \"I want to get a 'hello world' string.\",\n        \"query_id\": 200001,\n        \"api_list\": [\n            {\n                \"category_name\": \"Customized\",\n                \"tool_name\": \"hello world\",\n                \"api_name\": \"get_hello_world\"\n            }\n        ]\n    }\n]\n```\n- Finally we are free to inference with the **hello_world** API by running the following commands:\n```bash\nexport PYTHONPATH=./\npython toolbench/inference/qa_pipeline.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --backbone_model toolllama \\\n    --model_path ToolBench/ToolLLaMA-7b \\\n    --max_observation_length 1024 \\\n    --observ_compress_method truncate \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file /path/to/your/query/file \\\n    --output_answer_file /path/to/your/output/file \\\n    --api_customization\n```\n*Currently we only support customized API usage under close-domain setting. We plan to support open-domain soon.*\n\n\n## Setting up and running the interface\nToolBench contains a Web UI based on [Chatbot UI](https://github.com/mckaywrigley/chatbot-ui), forked to include the use of tools in the interface. It comes in two parts: the backend server, and [chatbot-ui-toolllama](https://github.com/lilbillybiscuit/chatbot-ui-toolllama). 
Here is a [video demo](assets/toolbench-demo.mp4).\n\n\n### Web UI\n```bash\ngit clone https://github.com/lilbillybiscuit/chatbot-ui-toolllama\ncd chatbot-ui-toolllama\nnpm install\nnpm run dev\n```\n\nThe app will be available on `http://localhost:3000/`\n\n### Backend server\n```bash\nexport PYTHONPATH=./\npython toolbench/inference/toolbench_server.py \\\n    --tool_root_dir data/toolenv/tools/ \\\n    --corpus_tsv_path data/retrieval/G1/corpus.tsv \\\n    --retrieval_model_path /path/to/your/retrival_model \\\n    --retrieved_api_nums 5 \\\n    --backbone_model toolllama \\\n    --model_path huggyllama/llama-7b \\\n    --lora \\\n    --lora_path /path/to/your/toolllama_lora \\\n    --max_observation_length 1024 \\\n    --method DFS_woFilter_w2 \\\n    --input_query_file data/test_instruction/G1_instruction.json \\\n    --output_answer_file toolllama_lora_dfs_open_domain_result \\\n    --rapidapi_key $RAPIDAPIKEY\n```\n\nThis server will be available on `http://localhost:5000/`. To start a request, call `http://localhost:5000/stream` with a GET or POST request containing a JSON object with the following fields:\n```json\n{\n    \"text\": \"What is the weather in New York today?\",\n    \"top_k\": 5,\n    \"method\": \"DFS_woFilter_w2\"\n}\n```\n\n## ToolEval\n\nBy fine-tuning LLaMA on ToolBench, we obtain **ToolLLaMA**. Considering that human evaluation can be time-consuming, we follow [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) to develop an efficient machine evaluator **ToolEval**, which incorporates two evaluation metrics:\n - **Pass Rate**: Calculates the proportion of successfully completing an instruction within limited OpenAI API calls. \n - **Preference**: Measured by comparing two answers (action sequences) for a given instruction. We pre-define a set of criteria for a better answer, which are organized as prompts for ChatGPT. We provide the test instruction and two candidate answers to the evaluator and obtain its preference. 
We evaluate each answer pair multiple times to improve the reliability of our system. Then we calculate the **Win Rate** (percentage of being preferred by the evaluator). More details can be found in our paper.\n\nTo validate the reliability of ChatGPT evaluator in both pass rate and win rate, we sample among four different methods (ChatGPT+ReACT, ChatGPT+DFSDT, ToolLLaMA+DFSDT and GPT4+DFSDT) to obtain solution pairs for 300 test instructions for each method. Then we engage humans to annotate the pass rate for ChatGPT+DFSDT, ToolLLaMA+DFSDT and GPT4+DFSDT, and the win rate among ChatGPT+ReACT and ChatGPT+DFSDT.\nOur ChatGPT evaluator demonstrates a high agreement of **87.1%** in pass rate and **80.3%** in win rate with human annotators. This result shows that our evaluator generates highly similar evaluation results to humans and can be viewed as a credible evaluator who simulates human evaluation on pass rate and win rate.\n\nMore details about ToolEval can be found in our paper.\n\n### Evaluation with ToolEval\n#### Install\nInstall Package (python\u003e=3.9)\n```bash\npip install -r requirements.txt\n```\n\n#### Evaluation\n*If you want to reproduce the official results, download the reproduction data `reproduction_data.zip` through [Google Drive](https://drive.google.com/drive/folders/1yBUQ732mPu-KclJnuQELEhtKakdXFc3J), unzip it and put the `reproduction_data` under `ToolBench/data/`, and skip the data preparation process.*\n- Data preparation. To evaluate your own model and method using ToolEval, first you need to prepare all the model predictions for the six test subsets. Create a directory naming with your model and method, e.g. `chatgpt_cot` then put each test set's predictions under the directory. 
The file structure of the directory should be:\n```\n├── /chatgpt_cot/\n│  ├── /G1_instruction/\n│  │  ├── /10160_CoT@1.json\n│  │  └── ...\n│  ├── /G1_tool/\n│  │  ├── /10221_CoT@1.json\n│  │  └── ...\n│  ├── ...\n│  ├── /G3_instruction/\n│  │  ├── /10221_CoT@1.json\n│  │  └── ...\n```\n\nThen preprocess the predictions by running the following commands:\n```bash\nexport RAW_ANSWER_PATH=../../data/reproduction_data/model_predictions/\nexport CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/\nexport MODEL_NAME=chatgpt_cot\nexport METHOD=CoT\nmkdir -p ${CONVERTED_ANSWER_PATH}/${MODEL_NAME}\nfor test_set in G1_instruction G1_category G1_tool G2_category G2_instruction G3_instruction\ndo\n    answer_dir=${RAW_ANSWER_PATH}/${MODEL_NAME}/${test_set}\n    output_file=${CONVERTED_ANSWER_PATH}/${MODEL_NAME}/${test_set}.json\n    python convert_to_answer_format.py \\\n        --answer_dir ${answer_dir} \\\n        --method ${METHOD} \\\n        --output ${output_file}\ndone\n```\nAfter that, check whether the preprocessed JSON files for the test sets exist under `${CONVERTED_ANSWER_PATH}/${MODEL_NAME}`. If so, you're ready to run the following evaluation process. If not, check whether there is anything wrong with the model's predictions.\n\n- OpenAI Key. Prepare your OpenAI key to use our evaluator. The key(s) should be stored in a JSON file, e.g. 
`path/to/your/openai_key_json_file.json`:\n```bash\n[\n    {\n        \"username\": \"your_user_name\",\n        \"passwd\": \"your_password\",\n        \"api_key\": \"your_openai_key\",\n        \"organization\": \"your_organization\"\n    },\n    ...\n]\n```\n\n- Pass rate:\n```bash\nexport CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/\nexport SAVE_PATH=pass_rate_results\nexport CANDIDATE_MODEL=chatgpt_cot\nexport API_POOL_FILE=path/to/your/openai_key_json_file.json\n\npython eval_pass_rate.py \\\n    --converted_answer_path ${CONVERTED_ANSWER_PATH} \\\n    --save_path ${SAVE_PATH} \\\n    --reference_model ${CANDIDATE_MODEL} \\\n    --test_ids ../../data/test_ids/ \\\n    --max_eval_threads 20 \\\n    --evaluate_times 7\n\n```\nThe result files will be stored under `${SAVE_PATH}`.\n\n- Win rate. The example below takes ChatGPT-ReACT as the reference model and GPT4-ReACT as the candidate model. Note that you need both models' pass rate results first; then run the following commands to evaluate the preference result of GPT4-ReACT:\n```bash\nexport CONVERTED_ANSWER_PATH=../../data/reproduction_data/model_predictions_converted/\nexport SAVE_PATH=preference_results\nexport PASS_RATE_PATH=pass_rate_results\nexport REFERENCE_MODEL=chatgpt_cot\nexport CANDIDATE_MODEL=gpt-4-0613_cot\nexport API_POOL_FILE=path/to/your/openai_key_json_file.json\n\npython eval_preference.py \\\n    --converted_answer_path ${CONVERTED_ANSWER_PATH} \\\n    --reference_model ${REFERENCE_MODEL} \\\n    --output_model ${CANDIDATE_MODEL} \\\n    --test_ids ../../data/test_ids/ \\\n    --save_path ${SAVE_PATH} \\\n    --pass_rate_result_path ${PASS_RATE_PATH} \\\n    --max_eval_threads 20 \\\n    --use_pass_rate true \\\n    --evaluate_times 7\n```\nThe result files will be stored under `${SAVE_PATH}`.\n\nPlease refer to [ToolEval](https://github.com/OpenBMB/ToolBench/tree/master/toolbench/tooleval) for more details.\n\n### 📊 Model Experiments Results\n\n\nIn 
our main experiments, ToolLLaMA(v2) demonstrates a compelling capability to handle both single-tool and complex multi-tool instructions, which is on a par with ChatGPT.\nBelow are the main results. Win rate for each model is compared with ChatGPT-ReACT.\n\n\n**Pass Rate:**\n| Method | Model               | I1-Inst. | I1-Tool | I1-Cate. | I2-Inst. | I2-Cate. | I3-Inst. | Average |\n|--------|---------------------|----------|---------|----------|----------|----------|----------|---------|\n| ReACT  | Claude-2            | 5.5      | 3.5     | 5.5      | 6        | 6        | 14       | 6.8     |\n|        | Text-Davinci-003    | 12       | 20      | 20       | 8.5      | 14.5     | 24       | 16.5    |\n|        | ChatGPT             | 41.5     | 44      | 44.5     | 42.5     | 46.5     | 22       | 40.2    |\n|        | ToolLLaMA           | 25       | 29      | 33       | 30.5     | 31.5     | 25       | 29      |\n|        | GPT4                | 53.5       | 50.0    | 53.5       | 67.0     | 72.0     | 47.0       | 57.2    |\n| DFSDT  | Claude-2            | 20.5     | 31      | 18.5     | 17       | 20.5     | 28       | 22.6    |\n|        | Text-Davinci-003    | 43.5     | 44      | 46       | 37       | 42       | 46       | 43.1    |\n|        | ChatGPT             | 54.5     | 65      | 60.5     | 75       | 71.5     | 62       | 64.8    |\n|        | ToolLLaMA           | 57       | 61      | 62       | 77       | 77       | 66       | 66.7    |\n|        | ToolLLaMA-Retriever | **64**       | 64      | 60.5     | **81.5**     | 68.5     | 65       | 67.3    |\n|        | GPT4                | 60       | **71.5**    | **67**       | 79.5     | **77.5**     | **71**       | **71.1**    |\n\n\n**Win Rate:** (Reference model: ChatGPT-ReACT)\n| Method | Model               | I1-Inst. | I1-Tool | I1-Cate. | I2-Inst. | I2-Cate. | I3-Inst. 
| Average |\n|--------|---------------------|----------|---------|----------|----------|----------|----------|---------|\n| ReACT  | Claude-2            | 31       | 27.8    | 33.8     | 35       | 31.5     | 47.5     | 34.4    |\n|        | Text-Davinci-003    | 28.5     | 35.3    | 31       | 29.8     | 29.8     | 45       | 33.2    |\n|        | ToolLLaMA           | 45       | 42      | 47.5     | 50.8     | 41.8     | 55       | 47      |\n|        | GPT4                | 60       | 58.8    | 63.5     | 65.8     | 60.3     | 78       | 64.4    |\n| DFSDT  | Claude-2            | 38       | 44.3    | 43.3     | 36.8     | 33.5     | 65       | 43.5    |\n|        | Text-Davinci-003    | 40.3     | 43.8    | 46.8     | 40.5     | 43.3     | 63       | 46.3    |\n|        | ChatGPT             | 60.5     | 62      | 57.3     | 72       | **64.8** | 69       | 64.3    |\n|        | ToolLLaMA           | 55       | 55.3    | 54.5     | 68.5     | 58       | 69       | 60      |\n|        | ToolLLaMA-Retriever | 62.3     | 59      | 55       | 68.5     | 60.8     | 73       | 63.1    |\n|        | GPT4                | **67.5** | **67.8** | **66.5** | **73.3** | 63.3    | **84**   | **70.4** |\n\n\n## TODO\n- [ ] ToolLLaMA will reach GPT-4's tool-use capability.\n\n## Resources of Tool Learning\n\nWith the powerful capabilities of foundation models, we are eager to see their applications in manipulating various tools. For more resources, please refer to the following:\n\n- **BMTools**. [[Project](https://github.com/OpenBMB/BMTools)]\n\n- **Tool Learning Survey**. [[Paper](https://arxiv.org/abs/2304.08354)]\n\n- **Tool Learning Paper List**. [[Project](https://github.com/thunlp/ToolLearningPapers)]\n\n- **WebCPM**. 
[[Paper](https://github.com/thunlp/WebCPM)]\n\n\n## Citation\nFeel free to cite us if you like ToolBench.\n```bibtex\n@misc{qin2023toolllm,\n      title={ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs}, \n      author={Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun},\n      year={2023},\n      eprint={2307.16789},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI}\n}\n```\n\n```bibtex\n@misc{qin2023tool,\n      title={Tool Learning with Foundation Models}, \n      author={Yujia Qin and Shengding Hu and Yankai Lin and Weize Chen and Ning Ding and Ganqu Cui and Zheni Zeng and Yufei Huang and Chaojun Xiao and Chi Han and Yi Ren Fung and Yusheng Su and Huadong Wang and Cheng Qian and Runchu Tian and Kunlun Zhu and Shihao Liang and Xingyu Shen and Bokai Xu and Zhen Zhang and Yining Ye and Bowen Li and Ziwei Tang and Jing Yi and Yuzhang Zhu and Zhenning Dai and Lan Yan and Xin Cong and Yaxi Lu and Weilin Zhao and Yuxiang Huang and Junxi Yan and Xu Han and Xian Sun and Dahai Li and Jason Phang and Cheng Yang and Tongshuang Wu and Heng Ji and Zhiyuan Liu and Maosong Sun},\n      year={2023},\n      eprint={2304.08354},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n```bibtex\n@misc{guo2024stabletoolbench,\n      title={StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models},\n      author={Guo, Zhicheng and Cheng, Sijie and Wang, Hao and Liang, Shihao and Qin, Yujia and Li, Peng and Liu, Zhiyuan and Sun, Maosong and Liu, Yang},\n      year={2024},\n      eprint={2403.07714},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","funding_links":[],"categories":["For specific usage Model/ Finetuned model","📚 Research \u0026 
Benchmarks","Python","A01_文本生成_文本对话","Datasets-or-Benchmark","Projects","Benchmark/Evaluator","others","Applications","Industry Strength Natural Language Processing","Repos","Testing Frameworks","Tool Integration","Benchmarks \u0026 Datasets","Agent Integration \u0026 Deployment Tools","\u003ca name=\"Python\"\u003e\u003c/a\u003ePython","Evaluation \u0026 Testing"],"sub_categories":["📊 Benchmarks","大语言对话模型及数据","Agent能力","Benchmarks","Advanced Components","提示语（魔法）","Category-Specific Testing Tools","LangManus","Task-Specific Benchmarks","Stateful Serverless Frameworks","Sandboxing \u0026 Execution"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenBMB%2FToolBench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenBMB%2FToolBench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenBMB%2FToolBench/lists"}