https://github.com/sierra-research/tau2-bench
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
- Host: GitHub
- URL: https://github.com/sierra-research/tau2-bench
- Owner: sierra-research
- License: MIT
- Created: 2025-06-09T23:46:17.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-03T21:44:57.000Z (7 months ago)
- Last Synced: 2025-12-05T20:46:41.783Z (7 months ago)
- Topics: ai, benchmark, conversational-agents, language-model-agent, llm
- Language: Python
- Homepage: https://arxiv.org/abs/2506.07982
- Size: 53.7 MB
- Stars: 504
- Watchers: 7
- Forks: 108
- Open Issues: 51
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
Awesome Lists containing this project
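- awesome-ai-agents-2026 - τ²-Bench (Tau-Bench) - 🆕 Sierra Research's tool-agent-user interaction benchmark (retail / airline domains). Measures multi-turn tool use, database operations, and policy compliance. As of April 2026, Claude Mythos Preview holds the top spot among 38 evaluated models at 89.2%. MIT. (📊 Benchmarks and Leaderboards / Autonomous Driving)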
- awesome-harness-engineering - tau2-bench - A benchmark for realistic, multi-step agent tasks where success depends on tool use and execution quality rather than a single-shot answer. (Benchmarks)
- awesome-agent-experience - τ-Bench - Benchmark for evaluating AI agents in tool-agent-user interaction across retail, airline, telecom, and banking domains; includes τ-Voice for full-duplex real-time audio evaluation with major LLM providers. (Tools / Benchmarking & Testing)
- AwesomeResponsibleAI - τ²-bench: Evaluating Conversational Agents in a Dual-Control Environment
- awesome-claude-multi-agent - τ²-Bench - Dual-control conversational agent benchmark; successor to τ-bench from Sierra Research. (Evaluation and Benchmarks)
- awesome-openclaw-skills - sierra-research/tau2-bench - τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains | 912 | (AI Models & Inference)
- awesome-agent-rl-environments - τ²-bench - bench with a banking domain, voice evaluation modality, and fixes to airline / retail tasks. Current recommended version. (Tool-Use & Multi-Tool Environments)
- awesome-harness-engineering - sierra-research/tau2-bench