https://github.com/youdotcom-oss/web-search-agent-evals

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

Host: GitHub
URL: https://github.com/youdotcom-oss/web-search-agent-evals
Owner: youdotcom-oss
Created: 2026-01-20T21:55:02.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-02-19T09:25:21.000Z (4 months ago)
Last Synced: 2026-02-19T12:26:30.806Z (4 months ago)
Topics: agent-evaluation, ai-agents, benchmark, claude-code, codex, coding-agents, droid, evaluation-suite, gemini, headless-testing, llm-judge, mcp, model-context-protocol, web-search
Language: TypeScript
Size: 11.8 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Codeowners: .github/CODEOWNERS
- Agents: AGENTS.md
