https://github.com/youdotcom-oss/web-search-agent-evals
Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.
agent-evaluation ai-agents benchmark claude-code codex coding-agents droid evaluation-suite gemini headless-testing llm-judge mcp model-context-protocol web-search
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/youdotcom-oss/web-search-agent-evals
- Owner: youdotcom-oss
- Created: 2026-01-20T21:55:02.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-02-19T09:25:21.000Z (3 months ago)
- Last Synced: 2026-02-19T12:26:30.806Z (3 months ago)
- Topics: agent-evaluation, ai-agents, benchmark, claude-code, codex, coding-agents, droid, evaluation-suite, gemini, headless-testing, llm-judge, mcp, model-context-protocol, web-search
- Language: TypeScript
- Size: 11.8 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Codeowners: .github/CODEOWNERS
- Agents: AGENTS.md