{"id":28676524,"url":"https://github.com/zjunlp/worfbench","last_synced_at":"2025-06-13T23:04:59.853Z","repository":{"id":259256040,"uuid":"864950113","full_name":"zjunlp/WorfBench","owner":"zjunlp","description":"[ICLR 2025] Benchmarking Agentic Workflow Generation","archived":false,"fork":false,"pushed_at":"2025-02-19T07:58:04.000Z","size":1166,"stargazers_count":46,"open_issues_count":0,"forks_count":3,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-19T08:32:35.825Z","etag":null,"topics":["agent","agent-planning","agentic","agentic-workflow","agents","artificial-intelligence","benchmark","iclr2025","large-language-models","natural-language-processing","planning","workflow"],"latest_commit_sha":null,"homepage":"https://zjunlp.github.io/project/WorFBench/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zjunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-29T15:47:23.000Z","updated_at":"2025-02-19T07:58:07.000Z","dependencies_parsed_at":"2024-10-23T23:13:06.946Z","dependency_job_id":"bf986bda-71e4-4a2c-baf2-50c9d06b6c2b","html_url":"https://github.com/zjunlp/WorfBench","commit_stats":null,"previous_names":["zjunlp/worfbench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zjunlp/WorfBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FWorfBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FWorfBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/zjunlp%2FWorfBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FWorfBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zjunlp","download_url":"https://codeload.github.com/zjunlp/WorfBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FWorfBench/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259732771,"owners_count":22903087,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","agent-planning","agentic","agentic-workflow","agents","artificial-intelligence","benchmark","iclr2025","large-language-models","natural-language-processing","planning","workflow"],"created_at":"2025-06-13T23:04:58.964Z","updated_at":"2025-06-13T23:04:59.831Z","avatar_url":"https://github.com/zjunlp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e WorfBench \u003c/h1\u003e\n\u003ch3 align=\"center\"\u003e Benchmarking Agentic Workflow Generation \u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2410.07869\" target=\"_blank\"\u003e📄arXiv\u003c/a\u003e •\n  \u003ca href=\"https://huggingface.co/papers/2410.07869\" target=\"_blank\"\u003e🤗HFPaper\u003c/a\u003e •\n  \u003ca href=\"https://zjunlp.github.io/project/WorFBench/\" target=\"_blank\"\u003e🌐Web\u003c/a\u003e •\n  \u003ca href=\"https://huggingface.co/collections/zjunlp/worfbench-66fc28b8ac1c8e2672192ea1\" 
target=\"_blank\"\u003e📊Dataset\u003c/a\u003e •\n  \u003ca href=\"https://notebooklm.google.com/notebook/a4c13fd7-29da-462c-a47e-69a26c0d326e/audio\" target=\"_blank\"\u003e🎧NotebookLM Audio\u003c/a\u003e\n\u003c/p\u003e\n\n[![Awesome](https://awesome.re/badge.svg)](https://github.com/zjunlp/WorFBench) \n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n![](https://img.shields.io/github/last-commit/zjunlp/WorFBench?color=green) \n\n## Table of Contents\n\n- 🌻[Acknowledgement](#acknowledgement)\n- 🌟[Overview](#overview)\n- 🔧[Installation](#installation)\n- ✏️[Model-Inference](#model-inference)\n- 📝[Workflow-Generation](#workflow-generation)\n- 🤔[Workflow-Evaluation](#workflow-evaluation)\n- 🚩[Citation](#citation)\n\u003c!-- - 🎉[Contributors](#🎉contributors) --\u003e\n\n---\n\n## 🌻Acknowledgement\n\nOur training module is adapted from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). The dataset is collected from [ToolBench](https://github.com/openbmb/toolbench?tab=readme-ov-file), [ToolAlpaca](https://github.com/tangqiaoyu/ToolAlpaca), [Lumos](https://github.com/allenai/lumos?tab=readme-ov-file), [WikiHow](https://github.com/mahnazkoupaee/WikiHow-Dataset), [Seal-Tools](https://github.com/fairyshine/seal-tools), [Alfworld](https://github.com/alfworld/alfworld), [Webshop](https://github.com/princeton-nlp/WebShop), and [IntercodeSql](https://github.com/princeton-nlp/intercode). Our end-to-end evaluation module is based on [IPR](https://github.com/WeiminXiong/IPR) and [Stable ToolBench](https://github.com/THUNLP-MT/StableToolBench). Thanks for their great contributions!\n\n\n\n## 🌟Overview\n\nLarge Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. 
Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. 
You can download our dataset from [Hugging Face](https://huggingface.co/collections/zjunlp/worfbench-66fc28b8ac1c8e2672192ea1)!\n\n![](./assets/main_results.jpg)\n\n\n## 🔧Installation\n\n```bash\ngit clone https://github.com/zjunlp/WorFBench\ncd WorFBench\npip install -r requirements.txt\n```\n\n\n\n## ✏️Model-Inference\n\nWe use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to deploy a local model behind an OpenAI-style API:\n```bash\ngit clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git\ncd LLaMA-Factory\npip install -e \".[torch,metrics]\"\nAPI_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml\n```\n\n\n\n\n## 📝Workflow-Generation\nGenerate workflows with the local LLM API:\n```bash\ntasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)\nmodel_name=your_model_name\nfor task in \"${tasks[@]}\"; do\n    python node_eval.py \\\n        --task gen_workflow \\\n        --model_name ${model_name} \\\n        --gold_path ./gold_traj/${task}/graph_eval.json \\\n        --pred_path ./pred_traj/${task}/${model_name}/graph_eval_two_shot.json \\\n        --task_type ${task} \\\n        --few_shot\ndone\n```\n\n\n\n## 🤔Workflow-Evaluation\n\nEvaluate the generated workflows in *node* or *graph* mode:\n```bash\ntasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)\nmodel_name=your_model_name\nfor task in \"${tasks[@]}\"; do\n    python node_eval.py \\\n        --task eval_workflow \\\n        --model_name ${model_name} \\\n        --gold_path ./gold_traj/${task}/graph_eval.json \\\n        --pred_path ./pred_traj/${task}/${model_name}/graph_eval_two_shot.json \\\n        --eval_model all-mpnet-base-v2 \\\n        --eval_output ./eval_result/${model_name}_${task}_graph_eval_two_shot.json \\\n        --eval_type node \\\n        --task_type ${task}\ndone\n```\n\n\n\n## 🚩Citation\n\nIf this work is helpful, please kindly cite as:\n\n```bibtex\n@article{qiao2024benchmarking,\n  title={Benchmarking Agentic Workflow Generation},\n  
author={Qiao, Shuofei and Fang, Runnan and Qiu, Zhisong and Wang, Xiaobin and Zhang, Ningyu and Jiang, Yong and Xie, Pengjun and Huang, Fei and Chen, Huajun},\n  journal={arXiv preprint arXiv:2410.07869},\n  year={2024}\n}\n```\n\n\n\n\u003c!-- ## 🎉Contributors\n\n\u003ca href=\"https://github.com/zjunlp/WorFBench/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=zjunlp/WorFBench\" /\u003e\u003c/a\u003e\n\nWe will offer long-term maintenance to fix bugs and solve issues. So if you have any problems, please put issues to us. --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fworfbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjunlp%2Fworfbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fworfbench/lists"}