https://github.com/sierra-research/tau2-bench

τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Host: GitHub
URL: https://github.com/sierra-research/tau2-bench
Owner: sierra-research
License: mit
Created: 2025-06-09T23:46:17.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-12-03T21:44:57.000Z (7 months ago)
Last Synced: 2025-12-05T20:46:41.783Z (7 months ago)
Topics: ai, benchmark, conversational-agents, language-model-agent, llm
Language: Python
Homepage: https://arxiv.org/abs/2506.07982
Size: 53.7 MB
Stars: 504
Watchers: 7
Forks: 108
Open Issues: 51
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS

Awesome Lists containing this project

awesome-ai-agents-2026 - τ²-Bench (Tau-Bench) - 🆕 Sierra Research's tool-agent-user interaction benchmark (retail / airline domains). Measures multi-turn tool use, database operations, and policy compliance. As of April 2026, the top score among 38 evaluated models is 89.2%, held by Claude Mythos Preview. MIT. ![GitHub stars](https://img.shields.io/badge/dynamic/json?label=Stars&query=%24.stargazers_count&url=https%3A%2F%2Fapi.github.com%2Frepos%2Fsierra-research%2Ftau2-bench&color=yellow&logo=github&logoColor=white&style=flat&cacheSeconds=300) (📊 Benchmarks and Leaderboards / Autonomous Driving)
awesome-harness-engineering - tau2-bench - A benchmark for realistic, multi-step agent tasks where success depends on tool use and execution quality rather than a single-shot answer. (Benchmarks)
awesome-agent-experience - τ-Bench - Benchmark for evaluating AI agents in tool-agent-user interaction across retail, airline, telecom, and banking domains; includes τ-Voice for full-duplex real-time audio evaluation with major LLM providers. (Tools / Benchmarking & Testing)
AwesomeResponsibleAI - τ²-bench: Evaluating Conversational Agents in a Dual-Control Environment
awesome-claude-multi-agent - τ²-Bench - Dual-control conversational agent benchmark; successor to τ-bench from Sierra Research. (Evaluation and Benchmarks)
awesome-openclaw-skills - sierra-research/tau2-bench - τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains | 912 | (AI Models & Inference)
awesome-agent-rl-environments - τ²-bench - bench with a banking domain, voice evaluation modality, and fixes to airline / retail tasks. Current recommended version. (Tool-Use & Multi-Tool Environments)
awesome-harness-engineering - sierra-research/tau2-bench
