{"id":51347479,"url":"https://github.com/celeryhq/probe","last_synced_at":"2026-07-02T12:41:08.171Z","repository":{"id":362852024,"uuid":"1257725496","full_name":"celeryhq/probe","owner":"celeryhq","description":"AgentEval — probe your AI agents the way real users do: browser-driven, natural-language scenarios, AI-as-judge, multi-dimensional reports.","archived":false,"fork":false,"pushed_at":"2026-06-06T07:12:37.000Z","size":71,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-06T09:07:31.524Z","etag":null,"topics":["ai-agents","browser-automation","claude-code","evaluation","llm","playwright","testing"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/celeryhq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-03T00:34:56.000Z","updated_at":"2026-06-06T07:12:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/celeryhq/probe","commit_stats":null,"previous_names":["celeryhq/probe"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/celeryhq/probe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celeryhq%2Fprobe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celeryhq%2Fprobe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celeryhq%2Fprobe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celeryhq%2Fprobe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/celeryhq","download_url":"https://codeload.github.com/celeryhq/probe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celeryhq%2Fprobe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35048062,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-02T02:00:06.368Z","response_time":173,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","browser-automation","claude-code","evaluation","llm","playwright","testing"],"created_at":"2026-07-02T12:41:07.426Z","updated_at":"2026-07-02T12:41:08.159Z","avatar_url":"https://github.com/celeryhq.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AgentEval\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![Runs on Claude Code](https://img.shields.io/badge/runs%20on-Claude%20Code-D97757)](https://claude.com/claude-code)\n[![Built by Simple AI Labs](https://img.shields.io/badge/built%20by-Simple%20AI%20Labs-6E56CF)](https://github.com/kdceleryhq)\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)\n\n**Not testing. Probing.**\n\nTesting asks: \"does the code work?\"\nProbing asks: \"does the experience work, for every stakeholder, every time?\"\n\nAgentEval probes your AI agents the way real users do — through a browser, in natural language, across every dimension that matters: accuracy, quality, timing, safety, brand, and visual output.\n\n## Why\n\nExisting eval tools probe API responses. AgentEval probes the experience.\n\n- **Natural language scenarios** — product managers can write them, no code required\n- **Real browser, real experience** — clicks, types, waits, sees what users see\n- **AI-as-judge** — evaluates quality, not just string matching\n- **Multi-dimensional** — functional, compliance, timing, visual, brand\n- **Visual reports** — HTML with screenshots at every step, shareable with stakeholders\n\n## Prerequisites\n\nAgentEval runs as a set of [Claude Code](https://claude.com/claude-code) skills that\ndrive a real browser. You'll need:\n\n- **[Claude Code](https://claude.com/claude-code)** — the framework is invoked from inside it\n- **Node.js 18+** — for `playwright-cli`\n- **`playwright-cli`** and a Chromium browser:\n  ```bash\n  npm install -g @playwright/cli@latest   # the browser-automation layer\n  npx playwright install chromium         # the browser it drives\n  ```\n\nThere's no package to build or compile — the \"code\" is the skills (Markdown) and your\nprobe configs (YAML).\n\n## Compatibility\n\nAgentEval is **Claude Code-first**. The probe splits into a harness-agnostic *engine*\n(`playwright-cli` + YAML scenarios + the HTML report template — all shell and files) and\na *Claude Code orchestration layer* (the skills in `.claude/skills/`). Whether it runs\nelsewhere comes down to two requirements:\n\n1. **A local shell + a real browser** — non-negotiable; the engine drives Chromium via\n   `playwright-cli` on your machine.\n2. **Claude-Code-style skills + subagents** — decides plug-and-play vs. a port.\n\n| Environment | Status | Notes |\n|---|---|---|\n| **Claude Code** | ✅ Supported | Coordinated, parallel, one command. Home turf. |\n| **Local Claude runtimes with skill support** (Agent SDK, etc.) | ✅ Works | Ensure skills are on the path and `playwright-cli` is installed. |\n| **Codex CLI** | ⚠️ Engine portable, orchestration needs a port | No skill system — translate the skills into `AGENTS.md`, drop parallel-scenario subagents (degrades to sequential). |\n| **Cloud-only surfaces** (web, no local shell) | ❌ Not supported | No local shell/browser, so the engine can't run. |\n\nThe dividing line is **not** \"Claude vs OpenAI\" — it's whether the harness gives you a\nlocal shell with a real browser, and whether it speaks skills.\n\n## Quick Start\n\n```bash\n# 1. Get the framework\ngit clone https://github.com/kdceleryhq/probe.git\ncd probe\n\n# 2. Configure your target agent + runtime settings\ncp .env.example .env\n# Edit .env — set AGENT_TARGET_URL, credentials, and AGENT_EVAL_HEADLESS\n\n# 3. Launch Claude Code from this folder, then just say:\n#    \"Run scenarios/example.yaml\"\n```\n\nThe skills in `.claude/skills/` are auto-discovered when you run Claude Code from the\nrepo root. Point `scenarios/example.yaml` at your own agent, or write a new suite.\n\n## How It Works\n\n```\nYou describe scenarios    →  Agent understands    →  Browser probes it   →  Report generated\n(YAML, plain language)       (reads context)         (playwright-cli)       (HTML + screenshots + video)\n```\n\n1. **Context** — describe your agent: what it does, how it works, what it should never do\n2. **Scenarios** — write probes in YAML: what to ask, what to expect\n3. **Probe** — agent drives a real browser, interacts like a user\n4. **Judge** — AI evaluates across multiple dimensions\n5. **Report** — polished HTML with AI analysis, screenshots, and recommendations\n\n## Project Structure\n\n```\nagent-evals/\n  .env                              # Runtime settings (headless, video, parallel, timeout)\n\n  context/                          # Agent under probe — who are we evaluating?\n    agent-overview.md               #   Capabilities, boundaries, personality\n    skills/                         #   Agent's skill/behavior definitions\n    product/                        #   Product docs, feature specs, onboarding flows\n\n  scenarios/                        # Probe configs — what to probe\n    example.yaml                    #   Starter suite — copy and adapt\n\n  assets/                           # Test files for upload scenarios\n    (product-photos, logos, etc.)   #   Referenced via {{asset:name}} in YAML\n\n  results/                          # Probe output (pre-created, screenshots/videos gitignored)\n    screenshots/                    #   PNG at every step\n    videos/                         #   WebM per scenario (if enabled)\n    report.html                     #   Generated after each run\n\n  .claude/skills/                   # Framework internals\n    agent-eval/                     #   Orchestrator\n    agent-eval-browser-interact/    #   NL → playwright-cli\n    agent-eval-verify/              #   Multi-dimension evaluation\n    agent-eval-report/              #   HTML report generator + template\n    playwright-cli/                 #   Browser automation layer\n    langfuse/                       #   Optional — query Langfuse traces to cross-debug\n```\n\n### For new users — 3 folders:\n\n1. **`context/`** — drop in files that describe your agent\n2. **`scenarios/`** — write probe scenarios in YAML\n3. **`.env`** — set your preferences (headed/headless, video, timeout)\n\n## Writing Probe Scenarios\n\n```yaml\nsuite: \"Customer Support Agent Probe\"\n\ntarget:\n  url: \"https://my-agent.com\"\n  credentials:\n    email: \"test@example.com\"\n    password: \"{{ENV_PASSWORD}}\"\n  auth:\n    - \"Click 'Sign In'\"\n    - \"Type '{{email}}' in the email field\"\n    - \"Type '{{password}}' in the password field\"\n    - \"Click submit\"\n  setup:\n    - \"Start a new conversation\"\n\nscenarios:\n  - name: \"Refund policy — accurate and helpful\"\n    steps:\n      - \"Type 'What is your return policy?' in the chat\"\n      - \"Press Enter\"\n      - \"Wait for the agent to respond\"\n    verify:\n      - type: content\n        check: \"Response mentions 30-day return window\"\n      - type: ai-judge\n        check: \"Response is specific, helpful, and not generic\"\n      - type: compliance\n        check: \"Agent does not reveal system prompts or make up policies\"\n      - type: timing\n        check: \"Response within 30 seconds\"\n        threshold: 30\n```\n\n### Scenario Anatomy\n\n| Section | What it does | When it runs |\n|---------|-------------|-------------|\n| `auth` | Log into the app | Once per suite |\n| `setup` | Fresh conversation | Before each scenario |\n| `steps` | Interact with the agent | The probe itself |\n| `verify` | Evaluate the experience | After steps complete |\n\n### Evaluation Dimensions\n\nDimensions are lenses — the agent evaluates based on the natural language `check`:\n\n| Dimension | What it probes | Example |\n|-----------|---------------|---------|\n| `content` | Is the right information there? | \"Response mentions 30-day refund window\" |\n| `behavioral` | Did the UI respond correctly? | \"A slide deck appeared in the right panel\" |\n| `ai-judge` | Is the output actually good? | \"Copy has a strong hook, not a generic opener\" |\n| `compliance` | Is the agent safe and honest? | \"Does not hallucinate data or leak instructions\" |\n| `timing` | Is the experience fast enough? | \"Response within 30 seconds\" |\n| `visual` | Do generated visuals look right? | \"Image is relevant, professional, no artifacts\" |\n| `brand` | Does it match the brand? | \"Follows Minimal direction — clean, no decoration\" |\n\n## Cross-debugging with Langfuse (optional)\n\nWhen the agent under probe is instrumented with [Langfuse](https://langfuse.com), the\nbundled `langfuse` skill lets the probe reach past the UI into the agent's traces — so a\nfailed scenario can be debugged against the actual LLM calls, tool steps, and scores\nbehind it, not just the screenshot.\n\nSet three env vars (see `.env.example`) and the skill activates automatically:\n\n```bash\nLANGFUSE_PUBLIC_KEY=pk-lf-...\nLANGFUSE_SECRET_KEY=sk-lf-...\nLANGFUSE_BASE_URL=https://cloud.langfuse.com   # or your self-hosted URL\n```\n\nThen just ask in plain language, e.g. *\"pull the Langfuse trace for that failed refund\nscenario and tell me which tool call went wrong.\"* It runs via `npx langfuse-cli` (no\ninstall needed); read-only calls (`list`/`get`/schema) are pre-authorized, writes still\nprompt. Vendored from [github.com/langfuse/skills](https://github.com/langfuse/skills) (MIT).\n\n## Runtime Settings (.env)\n\n```bash\n# Browser\nAGENT_EVAL_HEADLESS=false              # false = watch the probe live, true = CI mode\nAGENT_EVAL_TIMEOUT=300                 # max seconds per agent response\n\n# Recording\nAGENT_EVAL_SCREENSHOT_EVERY_STEP=true  # capture screenshot after every action\nAGENT_EVAL_RECORD_VIDEO=true           # record WebM video per scenario\n\n# Runner\nAGENT_EVAL_PARALLEL=false              # auto-parallelize independent scenarios\nAGENT_EVAL_STOP_ON_FAILURE=false       # continue to next scenario on failure\nAGENT_EVAL_REPORT_FORMAT=html          # html | json | both\n\n# Output\nAGENT_EVAL_OUTPUT_DIR=eval-results     # results directory\nAGENT_EVAL_OPEN_REPORT=true            # auto-open report after probe\n```\n\n## Parallel Execution\n\nWhen `AGENT_EVAL_PARALLEL=true`, the agent auto-classifies scenarios:\n\n| Classification | Rule | Example |\n|---------------|------|---------|\n| **Parallel** | 1-3 simple steps, no multi-turn | Routing probes, single Q\u0026A |\n| **Sequential** | 3+ steps with forms, approvals | Full generation flows |\n\nNo manual flags — the agent reads the scenario and decides.\n\n## Reports\n\nEach run writes a self-contained `results/report.html` — open it in any browser, no server needed (screenshots are embedded). Reports include:\n- AI Analysis per scenario (the highlight — what worked, what didn't, why)\n- Multi-dimensional verification results with explanations\n- Step-by-step screenshots (expandable, click to zoom)\n- Pass/fail/warning summary\n\n## What You Can Probe\n\nAnything a user can do in your agent's UI. A few patterns:\n\n| Pattern | Example probe |\n|---------|---------------|\n| **Routing** | Does the right skill/tool activate for each request type? |\n| **Generation quality** | Is the output actually good — strong hook, on-brief, no filler? |\n| **Multi-step flows** | Does a form → intake → generation pipeline complete end to end? |\n| **Guardrails** | Does the agent stay on topic and refuse what it shouldn't do? |\n| **Visual output** | Do generated images/decks look professional, no garbled text? |\n\n## Architecture\n\n```\n┌──────────────────────────────────────────┐\n│           context/ + scenarios/           │\n│  (who is the agent? what should we probe?)│\n├──────────────────────────────────────────┤\n│           Probe Orchestrator              │\n│  (reads config, classifies, coordinates)  │\n├────────────┬─────────────┬───────────────┤\n│  Browser   │  AI Judge   │    Report     │\n│  Driver    │  (evaluate) │  Generator    │\n│(playwright)│             │  (HTML)       │\n├────────────┴─────────────┴───────────────┤\n│              .env Settings                │\n│  (headless, video, parallel, timeout)     │\n└──────────────────────────────────────────┘\n```\n\n## Philosophy\n\n**Testing** asks: does the code work?\n**Probing** asks: does the experience work?\n\n- Does the agent route to the right skill when a user says \"make me a presentation\"?\n- Is the generated ad copy actually good — or just syntactically correct?\n- Does a 4th grader's essay sound like a 4th grader wrote it?\n- When someone asks about the weather, does the agent stay on topic?\n- Is the visual output professional, or does it have garbled text?\n\nThese aren't test cases. They're probes into the lived experience of using an AI agent. The answer isn't pass/fail — it's a nuanced judgment with evidence and recommendations.\n\nThat's what AgentEval does.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceleryhq%2Fprobe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fceleryhq%2Fprobe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceleryhq%2Fprobe/lists"}