{"id":13704211,"url":"https://github.com/sierra-research/tau-bench","last_synced_at":"2025-05-05T09:33:29.817Z","repository":{"id":245002617,"uuid":"811477986","full_name":"sierra-research/tau-bench","owner":"sierra-research","description":"Code and Data for Tau-Bench","archived":false,"fork":false,"pushed_at":"2024-10-17T17:22:07.000Z","size":949,"stargazers_count":112,"open_issues_count":5,"forks_count":13,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-10-20T01:58:25.201Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sierra-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-06T17:11:36.000Z","updated_at":"2024-10-17T17:22:11.000Z","dependencies_parsed_at":"2024-07-28T19:30:39.860Z","dependency_job_id":"6fef4f4f-f0ac-4697-9730-b668582f7de2","html_url":"https://github.com/sierra-research/tau-bench","commit_stats":null,"previous_names":["sierra-research/tau-bench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sierra-research%2Ftau-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sierra-research%2Ftau-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sierra-research%2Ftau-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sierra-research%2Ftau-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/sierra-research","download_url":"https://codeload.github.com/sierra-research/tau-bench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224439809,"owners_count":17311531,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T21:01:05.693Z","updated_at":"2025-05-05T09:33:29.793Z","avatar_url":"https://github.com/sierra-research.png","language":"Python","funding_links":[],"categories":["Evaluation and Benchmarks","Tools","🧪 Benchmarks \u0026 Leaderboards","Benchmarks","Evals \u0026 Verification","Papers","Benchmark","others","Testing Frameworks","Benchmarks and Datasets","Agent Harnessing and Evaluation","Long-Term Coherence and Agentic","GUI \u0026 Computer-Use Agents","3）参考实现与开源工具（GitHub）","Tool-Use \u0026 Multi-Tool Environments","The index"],"sub_categories":["Benchmarks","T10 · Tool Use \u0026 Function Calling","Adjacent Collections","Benchmark","Math","Category-Specific Testing Tools","Benchmark Reality Check (real-world tool use)","TAU-bench","Social \u0026 Human-Robot Interaction","评测框架与 Agent Benchmarks","Tier 1 — frontier model-card standard  (27)"],"readme":"# τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains\n\n**Paper**: [https://arxiv.org/abs/2406.12045](https://arxiv.org/abs/2406.12045)\n\n## Leaderboard\n\n### Airline\n\n| Strategy       | Pass^1 | Pass^2 | Pass^3 | Pass^4 |\n| -------------- | ------ | ------ | ------ | ------ |\n| [TC (claude-3-5-sonnet-20241022)](https://www.anthropic.com/news/3-5-models-and-computer-use)      | **0.460**     | 
**0.326**     | **0.263**     | **0.225**     |\n| [TC (gpt-4o)](https://platform.openai.com/docs/guides/function-calling)     | 0.420     | 0.273     | 0.220     | 0.200     |\n| [TC (claude-3-5-sonnet-20240620)](https://docs.anthropic.com/en/docs/build-with-claude/tool-use)      | 0.360     | 0.224     | 0.169     | 0.139     |\n| [TC (mistral-large-2407)](https://docs.mistral.ai/capabilities/function_calling/)     | ??     | ??     | ??     | ??     |\n| [TC (gpt-4o-mini)](https://platform.openai.com/docs/guides/function-calling)     | 0.225     | 0.140     | 0.110     | 0.100     |\n| [Act](https://arxiv.org/abs/2210.03629) (gpt-4o)     | 0.365 | 0.217 | 0.160 | 0.140     |\n| [ReAct](https://arxiv.org/abs/2210.03629) (gpt-4o)     | 0.325 | 0.233 | 0.185 | 0.160     |\n\n### Retail\n\n| Strategy       | Pass^1 | Pass^2 | Pass^3 | Pass^4 |\n| -------------- | ------ | ------ | ------ | ------ |\n| [TC (claude-3-5-sonnet-20241022)](https://www.anthropic.com/news/3-5-models-and-computer-use)      | **0.692**     | **0.576**     | **0.509**     | **0.462**     |\n| [TC (gpt-4o)](https://platform.openai.com/docs/guides/function-calling)     | 0.604     | 0.491     | 0.430     | 0.383     |\n| [TC (claude-3-5-sonnet-20240620)](https://docs.anthropic.com/en/docs/build-with-claude/tool-use)      | 0.626     | 0.506     | 0.435     | 0.387     |\n| [TC (mistral-large-2407)](https://docs.mistral.ai/capabilities/function_calling/)     | ??     | ??     | ??     | ??     |\n| [TC (gpt-4o-mini)](https://platform.openai.com/docs/guides/function-calling)     | ??     | ??     | ??     | ??     |\n| [Act](https://arxiv.org/abs/2210.03629) (gpt-4o)     | ??     | ??     | ??     | ??     |\n| [ReAct](https://arxiv.org/abs/2210.03629) (gpt-4o)     | ??     | ??     | ??     | ??     |\n\n*TC = `tool-calling` strategy (the function-calling strategy reported in the paper)\n\n## Setup\n\n1. 
Clone this repository:\n\n```bash\ngit clone https://github.com/sierra-research/tau-bench \u0026\u0026 cd ./tau-bench\n```\n\n2. Install from source (which also installs required packages):\n\n```bash\npip install -e .\n```\n\n3. Set up your OpenAI / Anthropic / Google / Mistral / AnyScale API keys as environment variables.\n\n```bash\nOPENAI_API_KEY=...\nANTHROPIC_API_KEY=...\nGOOGLE_API_KEY=...\nMISTRAL_API_KEY=...\n```\n\n## Run\n\nRun a tool-calling agent on the τ-retail environment:\n\n```bash\npython run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10\n```\n\nSet max concurrency according to your API limit(s).\n\nTo run specific tasks, use the `--task-ids` flag. For example:\n\n```bash\npython run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10 --task-ids 2 4 6\n```\n\nThis command will run only the tasks with IDs 2, 4, and 6.\n\n## User simulators\n\nBy default, we use `gpt-4o` as the user simulator with strategy `llm`. You can use other models by setting the `--user-model` flag, or other strategies by setting the `--user-strategy` flag. 
For example, run a tool-calling agent with a Claude user simulator:\n\n```bash\npython run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model claude-3-5-sonnet-20240620 --user-model-provider anthropic --user-strategy llm\n```\n\nOther strategies:\n\nTo run the `react` user simulator:\n\n```bash\npython run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model gpt-4o --user-model-provider openai --user-strategy react\n```\n\nExample of a `react` user response:\n\n```md\nThought:\nI should provide my name and zip code as I wasn't given an email address to use.\n\nUser Response:\nSure, my name is Yusuf Rossi, and my zip code is 19122.\n```\n\nTo run the `verify` user simulator:\n\n```bash\npython run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model gpt-4o --user-model-provider openai --user-strategy verify\n```\n\nThis strategy uses a subsequent LLM verification step to check if the user simulator's response is satisfactory. If not, the user simulator will be prompted to generate a new response.\n\nTo run the `reflection` user simulator:\n\n```bash\npython run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --max-concurrency 10 --user-model gpt-4o --user-model-provider openai --user-strategy reflection\n```\n\nThis strategy uses a subsequent LLM verification step to check if the user simulator's response is satisfactory. If not, the user simulator will be prompted to reflect on its response and generate a new response.\n\n## Auto error identification\n\nManually identifying specific error locations in trajectories is often difficult and time-consuming, as trajectories can be long and the constraints can be complex. We have provided an auto error identification tool that can do the following:\n\n1. 
Fault assignment: determine the entity that is responsible for the fault (user, agent, environment)\n2. Fault type classification: classify the type of fault (goal_partially_completed, used_wrong_tool, used_wrong_tool_argument, took_unintended_action)\n\nBoth labels are accompanied by a description.\n\nTo run the auto error identification, run:\n\n```bash\npython auto_error_identification.py --env \u003cairline/retail\u003e --platform openai --results-path \u003cthe path to your results file here\u003e --max-concurrency 16 --output-path test-auto-error-identification --max-num-failed-results 10\n```\n\nPlease note that this feature relies on an LLM, so its error identifications may be inaccurate.\n\n*Notice: If an error is raised due to the structure of your results file, you may have to rerun the benchmark to produce a new results file. We have recently [rewritten](https://github.com/sierra-research/tau-bench/commit/043b544371757ebb3762b3d02a6675dfe0c41798) the benchmark to be more type-safe and extensible.\n\n## Historical trajectories\n\nτ-bench might be expensive to run. 
We have provided a set of historical trajectories for the airline and retail environments in `./historical_trajectories`.\n\nIf you would like to contribute your historical trajectories to this benchmark, please submit a PR!\n\n## License\n\nSee `./LICENSE`.\n\n## Contact\n\nPlease submit issues or pull requests if you find problems with the benchmark.\n\n## Citation\n\n```bibtex\n@misc{yao2024tau,\n      title={$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains}, \n      author={Shunyu Yao and Noah Shinn and Pedram Razavi and Karthik Narasimhan},\n      year={2024},\n      eprint={2406.12045},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI},\n      url={https://arxiv.org/abs/2406.12045}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsierra-research%2Ftau-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsierra-research%2Ftau-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsierra-research%2Ftau-bench/lists"}