{"id":28385532,"url":"https://github.com/x-plug/multi-llm-agent","last_synced_at":"2025-08-18T05:11:29.953Z","repository":{"id":220254160,"uuid":"751165611","full_name":"X-PLUG/Multi-LLM-Agent","owner":"X-PLUG","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-23T07:43:32.000Z","size":14451,"stargazers_count":223,"open_issues_count":5,"forks_count":26,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-26T06:38:08.559Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/X-PLUG.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-01T03:53:15.000Z","updated_at":"2025-06-25T22:58:04.000Z","dependencies_parsed_at":"2024-04-23T08:05:36.476Z","dependency_job_id":"19d2a54f-7cd1-4450-b753-dd1ec9f945ea","html_url":"https://github.com/X-PLUG/Multi-LLM-Agent","commit_stats":null,"previous_names":["x-plug/multi-llm-agent"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/X-PLUG/Multi-LLM-Agent","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FMulti-LLM-Agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FMulti-LLM-Agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FMulti-LLM-Agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FMulti-LLM-Agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/X-PLU
G","download_url":"https://codeload.github.com/X-PLUG/Multi-LLM-Agent/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FMulti-LLM-Agent/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270946068,"owners_count":24672890,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-30T10:40:33.811Z","updated_at":"2025-08-18T05:11:29.916Z","avatar_url":"https://github.com/X-PLUG.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ✨α-UMi: Small LLMs Are Weak Tool Learners: A Multi-LLM Agent\n\u003cdiv align=\"center\"\u003e\nWeizhou Shen\u003csup\u003e1\u003c/sup\u003e, Chenliang Li\u003csup\u003e2\u003c/sup\u003e, Hongzhan Chen\u003csup\u003e1\u003c/sup\u003e, Ming Yan\u003csup\u003e2*\u003c/sup\u003e, Xiaojun Quan\u003csup\u003e1*\u003c/sup\u003e, Hehong Chen\u003csup\u003e2\u003c/sup\u003e, Ji Zhang\u003csup\u003e2\u003c/sup\u003e, Fei Huang\u003csup\u003e2\u003c/sup\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\nshenwzh3@mail2.sysu.edu.cn, quanxj3@mail.sysu.edu.cn, ym119608@alibaba-inc.com\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n\u003csup\u003e1\u003c/sup\u003eSun Yat-sen University \u003csup\u003e2\u003c/sup\u003eAlibaba 
Group\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n*Corresponding authors\n\u003c/div\u003e\n\n\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://github.com/modelscope/modelscope-agent/tree/alpha_umi\"\u003e\u003cimg src=\"assets/Demo-ModelScope-brightgreen.svg\" alt=\"Demo ModelScope\"\u003e\u003c/a\u003e\n    \u003c!-- \u003ca href=\"https://replicate.com/joehoover/mplug-owl\"\u003e\u003cimg src=\"https://replicate.com/replicate/mplug-owl/badge\" alt=\"Run with Replicate\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/X-PLUG/mPLUG-Owl/blob/main/LICENSE\"\u003e\u003cimg src=\"assets/LICENSE-Apache%20License-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e --\u003e\n    \u003ca href=\"https://arxiv.org/pdf/2401.07324.pdf\"\u003e\u003cimg src=\"assets/Paper-Arxiv-orange.svg\" \u003e\u003c/a\u003e\n    \u003ca href=\"https://hits.seeyoufarm.com\"\u003e\u003cimg src=\"https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FX-PLUG%2FMulti-LLM-Agent\u0026count_bg=%2379C83D\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=hits\u0026edge_flat=false\"/\u003e\u003c/a\u003e\n    \u003c!-- \u003ca href=\"https://twitter.com/xuhaiya2483846/status/1654640739010351106\"\u003e\u003cimg src='assets/-twitter-blue.svg'\u003e\u003c/a\u003e --\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003ca href=\"README.md\"\u003eEnglish\u003c/a\u003e | \u003ca href=\"README_zh.md\"\u003e简体中文\u003c/a\u003e\n\u003chr\u003e\n\u003c/div\u003e\n\u003c!--\nEnglish | [简体中文](README_zh.md)\n\u003chr\u003e\n--\u003e\n\n\n\n\u003cdiv align=\"center\"\u003e\n\n\n\u003cimg src=\"assets/concept.png\"  width=\"70%\"\u003e\n\nA conceptual comparison of  traditional single-LLM agent framework (top) and  alpha-UMi (bottom). \n\n\u003c/div\u003e\n\nα-UMi is a Multi-LLM collaborated agent for tool learning. It decomposes the capabilities of a single LLM into three components, namely planner,\ncaller, and summarizer. 
At each step of agent execution, the planner generates a rationale for the current step based on the state of the system and selects either the caller or the summarizer to generate the downstream output. The caller is directed by the rationale and is responsible for invoking the specific tools. The summarizer is guided by the planner to craft the final answer for the user based on the execution trajectory.\n\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/case_1.png\"  width=\"95%\"\u003e \n\nAn illustration of how α-UMi works to complete a task.\n\u003c/div\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/case_2.png\"  width=\"95%\"\u003e \n\nAn illustration of how α-UMi works to complete a task with reflection.\n\u003c/div\u003e\n\n\n## Spotlight\n* Enables small LLMs to collaborate and outperform strong closed-source large LLMs in tool learning.\n* More flexible prompt design than a single-LLM agent system.\n* Two-stage Global-to-Local Progressive Fine-tuning (GLPFT) for successfully training the multi-LLM agent.\n\n\n\n## News\n* [04.23] We have now uploaded the processed data to [modelscope](https://modelscope.cn/datasets/shenweizhou/alpha-umi-toolbench-processed-json_format/summary)! 
You can download the data directly and use it without any preprocessing.\n* [01.30] We released the code of ✨α-UMi with its pre-trained and instruction-tuned checkpoints.\n\n## Checkpoints\n\n| Model | 7b | 13b |\n|-------|----|----|\n| backbone (GLPFT stage 1 checkpoint) | -/[modelscope](https://www.modelscope.cn/models/iic/alpha-umi-backbone-7b) | -/[modelscope](https://www.modelscope.cn/models/iic/alpha-umi-backbone-13b)|\n| planner | [huggingface](https://huggingface.co/shenwzh3/alpha-umi-planner-7b)  / [modelscope](https://www.modelscope.cn/models/iic/alpha-umi-planner-7b) | [huggingface](https://huggingface.co/shenwzh3/alpha-umi-planner-13b)  / [modelscope](https://www.modelscope.cn/models/iic/alpha-umi-planner-13b) |\n| caller | [huggingface](https://huggingface.co/shenwzh3/alpha-umi-caller-7b)  / [modelscope](https://www.modelscope.cn/models/iic/alpha-umi-caller-7b) | [huggingface](https://huggingface.co/shenwzh3/alpha-umi-caller-13b)  / [modelscope](https://www.modelscope.cn/models/iic/alpha-umi-caller-13b) |\n| summarizer | [huggingface](https://huggingface.co/shenwzh3/alpha-umi-summarizer-7b)  / [modelscope](https://www.modelscope.cn/models/iic/alpha-umi-summarizer-7b) | [huggingface](https://huggingface.co/shenwzh3/alpha-umi-summarizer-13b)  / [modelscope](https://www.modelscope.cn/models/iic/alpha-umi-summarizer-13b) |\n\n\n\n\n## Usage\n### Install Requirements\n1. Create conda environment\n```bash\nconda create -n multi_llm_agent python=3.10\nconda activate multi_llm_agent\n```\n\n2. Install PyTorch\n\n```\npip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2\n```\n\n3. Install other dependencies\n```bash\npip install -r requirements.txt\n```\n\n### Data Preparation\n\n**NOTE:** We have now uploaded the processed data to [modelscope](https://modelscope.cn/datasets/shenweizhou/alpha-umi-toolbench-processed-json_format/summary)! You can download the data directly and use it without any preprocessing.\n\n#### ToolBench\n1. 
First download the original ToolBench dataset from [Google Drive](https://drive.google.com/drive/folders/1yBUQ732mPu-KclJnuQELEhtKakdXFc3J) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/c9e50625743b40bfbe10/), and put the data in the ```./data``` folder.\n\n2. Preprocess data for training\n```\ncd ./GLPFT\n\nORI_DATA_DIR=\"../data/toolbench/data\" # path where the raw ToolBench data is saved\nRAW_DATA_OUT_DIR=\"dataset/toolbench/train/raw_data\"\nTRAIN_DATA_OUT_DIR=\"dataset/toolbench/train\"\nexport PYTHONPATH=./\n\n\npython process_data/toolbench/prepro_raw_stage_1.py \\\n --data_dir $ORI_DATA_DIR \\\n --output_path $RAW_DATA_OUT_DIR\n\n\npython process_data/toolbench/prepro_raw_stage_2.py \\\n --input_path $RAW_DATA_OUT_DIR/raw_data_stage_1.json \\\n --output_path $RAW_DATA_OUT_DIR\n\n\n\nfor MODE in 'backbone' 'planner' 'caller' 'summarizer'\ndo\n    python process_data/toolbench/prepro_$MODE.py \\\n        --input_path $RAW_DATA_OUT_DIR/raw_data_stage_2.json \\\n        --output_path $TRAIN_DATA_OUT_DIR/train_$MODE.json \\\n        --prompt_type toolbench_$MODE\ndone\n```\n\nRunning the above script creates the ToolBench training data for GLPFT, which is stored in ```./GLPFT/dataset/toolbench/train```.\n\n### GLPFT Training\n\nα-UMi adopts the two-stage GLPFT strategy, which first warms up a backbone LLM and then fine-tunes the planner, caller, and summarizer separately.\n\n1. 
First, we fine-tune an LLM for the whole tool learning agent task.\n\n```\ncd ./GLPFT\n\nLLAMA_PATH=\"\" # your path for initial LLM checkpoint\nNNODE=8\nPORT=12345\nBSZ=6\nGA=1\n\nEXP_NAME=/toolbench/backbone  # path to save model\nexport PYTHONPATH=./\ntorchrun --nproc_per_node=$NNODE --master_port=$PORT train_mem.py \\\n    --model_name_or_path $LLAMA_PATH  \\\n    --data_path dataset/toolbench/train/train_backbone.json\\\n    --output_dir saved_models/$EXP_NAME \\\n    --num_train_epochs 2 \\\n    --per_device_train_batch_size $BSZ \\\n    --per_device_eval_batch_size $BSZ \\\n    --gradient_accumulation_steps $GA \\\n    --evaluation_strategy \"no\" \\\n    --eval_steps 0 \\\n    --save_strategy \"steps\" \\\n    --save_steps 500 \\\n    --save_total_limit 8 \\\n    --learning_rate 5e-5 \\\n    --warmup_ratio 0.4 \\\n    --lr_scheduler_type \"cosine\" \\\n    --gradient_checkpointing True \\\n    --deepspeed ds_configs/stage3-a100.json \\\n    --bf16 \\\n    --logging_steps 2 \\\n    --model_max_length 4096 \\\n    --report_to none \\\n    --lazy_preprocess True \n\n```\n\n\n2. 
After obtaining the backbone, we begin to fine-tune planner, caller and summarizer:\n\n```\ncd ./GLPFT\n\nNNODE=8\nPORT=12345\nBSZ=6\nGA=1\n\n\nBB_PATH=\"saved_models/toolbench/backbone\"\n\n\nEXP_NAME=/toolbench/planner\nexport PYTHONPATH=./\ntorchrun --nproc_per_node=$NNODE --master_port=$PORT train_mem.py \\\n    --model_name_or_path $BB_PATH  \\\n    --data_path dataset/toolbench/train/train_planner.json \\\n    --output_dir saved_models/$EXP_NAME \\\n    --num_train_epochs 1 \\\n    --per_device_train_batch_size $BSZ \\\n    --per_device_eval_batch_size $BSZ \\\n    --gradient_accumulation_steps $GA \\\n    --evaluation_strategy \"no\" \\\n    --eval_steps 0 \\\n    --save_strategy \"steps\" \\\n    --save_steps 500 \\\n    --save_total_limit 8 \\\n    --learning_rate 1e-5 \\\n    --weight_decay 0.01 \\\n    --warmup_ratio 0.2 \\\n    --lr_scheduler_type \"cosine\" \\\n    --gradient_checkpointing True \\\n    --bf16 \\\n    --logging_steps 2 \\\n    --model_max_length 4096 \\\n    --report_to none \\\n    --lazy_preprocess True\n\n\n\nEXP_NAME=/toolbench/caller\nexport PYTHONPATH=./\ntorchrun --nproc_per_node=$NNODE --master_port=$PORT train_mem.py \\\n    --model_name_or_path $BB_PATH  \\\n    --data_path dataset/toolbench/train/train_caller.json \\\n    --output_dir saved_models/$EXP_NAME \\\n    --num_train_epochs 1 \\\n    --per_device_train_batch_size $BSZ \\\n    --per_device_eval_batch_size $BSZ \\\n    --gradient_accumulation_steps $GA \\\n    --evaluation_strategy \"no\" \\\n    --eval_steps 0 \\\n    --save_strategy \"steps\" \\\n    --save_steps 500 \\\n    --save_total_limit 8 \\\n    --learning_rate 1e-5 \\\n    --weight_decay 0.01 \\\n    --warmup_ratio 0.2 \\\n    --lr_scheduler_type \"cosine\" \\\n    --gradient_checkpointing True \\\n    --bf16 \\\n    --logging_steps 2 \\\n    --model_max_length 4096 \\\n    --report_to none \\\n    --lazy_preprocess True\n\n\nEXP_NAME=/toolbench/summarizer\nexport PYTHONPATH=./\ntorchrun 
--nproc_per_node=$NNODE --master_port=$PORT train_mem.py \\\n    --model_name_or_path $BB_PATH  \\\n    --data_path dataset/toolbench/train/train_summarizer.json \\\n    --output_dir saved_models/$EXP_NAME \\\n    --num_train_epochs 2 \\\n    --per_device_train_batch_size $BSZ \\\n    --per_device_eval_batch_size $BSZ \\\n    --gradient_accumulation_steps $GA \\\n    --evaluation_strategy \"no\" \\\n    --eval_steps 0 \\\n    --save_strategy \"steps\" \\\n    --save_steps 500 \\\n    --save_total_limit 8 \\\n    --learning_rate 1e-5 \\\n    --weight_decay 0.01 \\\n    --warmup_ratio 0.4 \\\n    --lr_scheduler_type \"cosine\" \\\n    --gradient_checkpointing True \\\n    --bf16 \\\n    --logging_steps 2 \\\n    --model_max_length 4096 \\\n    --report_to none \\\n    --lazy_preprocess True\n\n\n```\n\n\n### Inference and Evaluation\n\nWe provide the static test data for the experiments in Section 4.1 of our paper in ```./GLPFT/dataset/toolbench/test```. To run inference and evaluate the α-UMi system as in Section 4.1, run the following script:\n```\ncd ./GLPFT\n\nNNODE=8\nPORT=12345\n\nPLAN_PATH=\"saved_models/planner\"\nCAL_PATH=\"saved_models/caller\"\nSUM_PATH=\"saved_models/summarizer\"\n\n\nLAB_DIR=output_res/toolbench\nP_TYPE_PLAN=toolbench_planner\nP_TYPE_CAL=toolbench_caller\nP_TYPE_SUM=toolbench_summarizer\n\n\nfor DOMAIN in 'in_domain' 'out_of_domain'\ndo\n    export PYTHONPATH=./\n    torchrun --nproc_per_node=$NNODE --master_port=$PORT inference_utils/toolbench/infer_pipeline.py \\\n        --planner_model_name_or_path $PLAN_PATH  \\\n        --planner_use_lora False \\\n        --caller_model_name_or_path $CAL_PATH  \\\n        --caller_use_lora False \\\n        --summarizer_model_name_or_path $SUM_PATH  \\\n        --summarizer_use_lora False \\\n        --per_device_eval_batch_size 1 \\\n        --data_path dataset/toolbench/test/$DOMAIN.json \\\n        --bf16_full_eval \\\n        --assistant_prompt_type $P_TYPE_PLAN \\\n        
--caller_prompt_type $P_TYPE_CAL \\\n        --conclusion_prompt_type $P_TYPE_SUM \\\n        --max_input_length 3750 \\\n        --output_dir $LAB_DIR/$DOMAIN \n\n    python inference_utils/toolbench/evaluate-multi_agent.py \\\n    --input_path $LAB_DIR/$DOMAIN/predictions.json \\\n    --output_path $LAB_DIR/$DOMAIN/metrics.json \n\ndone\n```\n\n## α-UMi with RapidAPI Simulator\n\nWe support using α-UMi with the RapidAPI simulator implemented by the ToolBench team ([github](https://github.com/OpenBMB/ToolBench)); the code is in ```./ToolBench-multiLLM```. To do so, first fill out the [form](https://forms.gle/oCHHc8DQzhGfiT9r6) to request a ToolBench key from the ToolBench team. Then you can run the simulator with the trained planner, caller, and summarizer:\n\n```\ncd ToolBench-multiLLM\n\nDATA_DIR=\"../data/toolbench/data\"\nPLAN_PATH=\"../GLPFT/saved_models/planner\"\nCAL_PATH=\"../GLPFT/saved_models/caller\"\nSUM_PATH=\"../GLPFT/saved_models/summarizer\"\nEXP_NAME=\"multi-llm-agent\"\nTBKEY=\"\" # your toolbench key\n\n\n\nfor TEST_SET in 'G1_category' 'G1_instruction' 'G1_tool' 'G2_category' 'G2_instruction' 'G3_instruction'\ndo\n    export PYTHONPATH=./\n    python toolbench/inference/qa_pipeline.py \\\n        --backbone_model collab_agent_v3 \\\n        --tool_root_dir $DATA_DIR/toolenv/tools/ \\\n        --user_agent_collab True \\\n        --planner_model_path $PLAN_PATH \\\n        --planner_use_lora False \\\n        --caller_model_path $CAL_PATH \\\n        --caller_use_lora False \\\n        --summarizer_model_path $SUM_PATH \\\n        --summarizer_use_lora False \\\n        --use_multi_gpu True \\\n        --max_observation_length 1024 \\\n        --observ_compress_method truncate \\\n        --method DFS_woFilter_w2 \\\n        --input_query_file $DATA_DIR/test_instructions/$TEST_SET.json \\\n        --output_answer_file output_res/$EXP_NAME/$TEST_SET \\\n        --toolbench_key $TBKEY\ndone\n```\n\nWe also support computing the 
pass_rate and win_rate metrics as in ToolBench.\n\nTo compute pass rate:\n```\nexport PYTHONPATH=./\nexport ORI_ANSWER_PATH=output_res/multi-llm-agent\nexport CONVERTED_ANSWER_PATH=output_res/converted/multi-llm-agent\n\nmkdir -p ${CONVERTED_ANSWER_PATH}\nfor test_set in \"G1_instruction\" \"G1_category\" \"G1_tool\" \"G2_category\" \"G2_instruction\" \"G3_instruction\"\ndo\n    answer_dir=$ORI_ANSWER_PATH/$test_set\n    output_file=${CONVERTED_ANSWER_PATH}/${test_set}.json\n    python toolbench/tooleval/convert_to_answer_format.py\\\n        --answer_dir ${answer_dir} \\\n        --method DFS_woFilter_w2 \\\n        --output ${output_file}\ndone\n\n\nexport SAVE_PATH=pass_rate_results/multi-llm-agent\nexport CANDIDATE_MODEL=multi-llm-agent\nexport DATA_DIR=\"data/toolbench\"\nexport API_POOL_FILE=path/to/your/openai_key_json_file.json\nexport PYTHONPATH=./\npython toolbench/tooleval/eval_pass_rate.py \\\n    --converted_answer_path ${CONVERTED_ANSWER_PATH} \\\n    --save_path ${SAVE_PATH} \\\n    --reference_model ${CANDIDATE_MODEL} \\\n    --test_ids $DATA_DIR/test_query_ids \\\n    --max_eval_threads 1 \\\n    --evaluate_times 7\n```\n\n\nTo compute win_rate, we choose chatgpt_cot as the reference model, so we first need to convert the chatgpt_cot results and compute its pass rate:\n\n```\n# to evaluate win rate, we need to first convert the chatgpt_cot results and compute its pass rate\n\nexport REF_ANSWER_PATH=data/toolbench/reproduction_data/model_predictions/chatgpt_cot\nexport REF_CONVERTED_ANSWER_PATH=data/toolbench/reproduction_data/model_predictions_converted/chatgpt_cot\nfor test_set in \"G1_instruction\" \"G1_category\" \"G1_tool\" \"G2_category\" \"G2_instruction\" \"G3_instruction\"\ndo\n    answer_dir=$REF_ANSWER_PATH/$test_set\n    output_file=${REF_CONVERTED_ANSWER_PATH}/${test_set}.json\n    python toolbench/tooleval/convert_to_answer_format.py\\\n        --answer_dir ${answer_dir} \\\n        --method DFS_woFilter_w2 \\\n        --output 
${output_file}\ndone\n\nexport SAVE_PATH=pass_rate_results/chatgpt_cot\nexport CANDIDATE_MODEL=chatgpt_cot\nexport DATA_DIR=\"data/toolbench/data\"\nexport API_POOL_FILE=path/to/your/openai_key_json_file.json\nexport PYTHONPATH=./\npython toolbench/tooleval/eval_pass_rate.py \\\n    --converted_answer_path ${REF_CONVERTED_ANSWER_PATH} \\\n    --save_path ${SAVE_PATH} \\\n    --reference_model ${CANDIDATE_MODEL} \\\n    --test_ids $DATA_DIR/test_query_ids \\\n    --max_eval_threads 1 \\\n    --evaluate_times 7\n```\n\nThen we begin the evaluation:\n```\nexport OUTPUT_CONVERTED_ANSWER_PATH=output_res/converted/multi-llm-agent\nexport SAVE_PATH=win_rate_results\nexport REF_PASS_TARE_PATH=pass_rate_results/chatgpt_cot\nexport OUTPUT_PASS_TARE_PATH=pass_rate_results/v9/multi-llm-agent\nexport REFERENCE_MODEL=chatgpt_cot\nexport CANDIDATE_MODEL=multi-llm-agent\n# export API_POOL_FILE=path/to/your/openai_key_json_file.json\n\n\nexport PYTHONPATH=./\npython toolbench/tooleval/eval_preference.py \\\n    --ref_converted_answer_path ${REF_CONVERTED_ANSWER_PATH} \\\n    --output_converted_answer_path ${OUTPUT_CONVERTED_ANSWER_PATH} \\\n    --reference_model ${REFERENCE_MODEL} \\\n    --output_model ${CANDIDATE_MODEL} \\\n    --test_ids data/test_query_ids/ \\\n    --save_path ${SAVE_PATH} \\\n    --ref_pass_rate_result_path ${REF_PASS_TARE_PATH} \\\n    --output_pass_rate_result_path ${OUTPUT_PASS_TARE_PATH} \\\n    --max_eval_threads 1 \\\n    --use_pass_rate true \\\n    --evaluate_times 7\n```\n\n## Experimental Results\n\nResults of the static evaluation (step-level comparison with annotated reference)\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/result_static.png\"  width=\"95%\"\u003e \n\u003c/div\u003e\n\nResults of the real-time evaluation (calling real APIs to solve the user task)\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/result_real.png\"  width=\"95%\"\u003e \n\u003c/div\u003e\n\n## To do\n\n- [ ] Release our model and code for 
ToolAlpaca.\n- [ ] Release our model and code for MATH and GSM8K, and our training data (collected with TORA (Gou et al., 2023))\n- [ ] Make α-UMi generalized to more agent tasks!\n\n## Citation\n\n```\n@misc{shen2024small,\n      title={Small LLMs Are Weak Tool Learners: A Multi-LLM Agent}, \n      author={Weizhou Shen and Chenliang Li and Hongzhan Chen and Ming Yan and Xiaojun Quan and Hehong Chen and Ji Zhang and Fei Huang},\n      year={2024},\n      eprint={2401.07324},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-plug%2Fmulti-llm-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-plug%2Fmulti-llm-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-plug%2Fmulti-llm-agent/lists"}