{"id":34086988,"url":"https://github.com/hidai25/eval-view","last_synced_at":"2026-03-09T11:10:06.535Z","repository":{"id":329416057,"uuid":"1098589658","full_name":"hidai25/eval-view","owner":"hidai25","description":"Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.","archived":false,"fork":false,"pushed_at":"2026-02-15T23:40:04.000Z","size":14017,"stargazers_count":44,"open_issues_count":11,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-16T02:57:44.561Z","etag":null,"topics":["agent","agent-benchmark","agent-evaluation","agentic-ai","ai-agents","anthropic","crewai","crewai-tools","evaluation","langchain","langgraph","langgraph-python","llm","llmops","mlops","openai-assistants","pytest","testing","tools"],"latest_commit_sha":null,"homepage":"https://evalview.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hidai25.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":"docs/AGENTS.md","dco":null,"cla":null}},"created_at":"2025-11-17T22:21:38.000Z","updated_at":"2026-02-15T23:40:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hidai25/eval-view","commit_stats":null,"previous_names":["hidai25/eval-view"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/hidai25/eval-view","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hidai25%2Feval-view","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hidai25%2Feval-view/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hidai25%2Feval-view/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hidai25%2Feval-view/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hidai25","download_url":"https://codeload.github.com/hidai25/eval-view/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hidai25%2Feval-view/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29608152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T06:47:36.664Z","status":"ssl_error","status_checked_at":"2026-02-19T06:45:47.551Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","agent-benchmark","agent-evaluation","agentic-ai","ai-agents","anthropic","crewai","crewai-tools","evaluation","langchain","langgraph","langgraph-python","llm","llmops","mlops","openai-assistants","pytest","testing","tools"],"created_at":"2025-12-14T13:35:34.973Z","updated_at":"2026-03-09T11:10:06.487Z","avatar_url":"https://github.com/hidai25.png","language":"Python","readme":"\u003c!-- mcp-name: io.github.hidai25/evalview-mcp --\u003e\n\u003c!--\n  EvalView - Open-source AI agent testing and regression detection framework\n  Keywords: AI agent testing, LLM testing, agent evaluation, regression testing for AI,\n  golden baseline testing, LangGraph testing, CrewAI testing, OpenAI agent testing,\n  AI CI/CD, pytest for AI agents, SKILL.md validation, MCP contract testing,\n  non-deterministic testing, LLM evaluation, agent regression detection,\n  provider-agnostic LLM testing, OpenAI-compatible eval, DeepSeek testing,\n  evalview add templates, evalview init wizard, first agent test,\n  agentic AI testing, multi-agent testing, autonomous agent testing,\n  LLM CI/CD, LLM hallucination detection, agent reliability, agent degradation,\n  agent behavior testing, golden file testing Python, vibe coding regression,\n  behavior-driven testing AI, Anthropic Claude agent testing, GPT agent testing,\n  agentic workflow testing, agent quality assurance, test LLM agents Python,\n  detect prompt regression, AI agent observability alternative, open source eval framework\n--\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/logo.png\" alt=\"EvalView\" width=\"350\"\u003e\n  \u003cbr\u003e\n  \u003cstrong\u003eRegression testing for AI agents.\u003c/strong\u003e\u003cbr\u003e\n  Snapshot your agent's behavior. Detect when it breaks. 
Block regressions in CI.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/evalview/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/evalview.svg?label=release\" alt=\"PyPI version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/evalview/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/dm/evalview.svg?label=downloads\" alt=\"PyPI downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/hidai25/eval-view/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/hidai25/eval-view?style=social\" alt=\"GitHub stars\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/hidai25/eval-view/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/hidai25/eval-view/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://opensource.org/licenses/Apache-2.0\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache_2.0-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003eIf this catches a regression for you, please ⭐ \u003ca href=\"https://github.com/hidai25/eval-view/stargazers\"\u003estar the repo\u003c/a\u003e — it helps others find it.\u003c/p\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/hero.jpg\" alt=\"EvalView — multi-turn execution trace with sequence diagram\" width=\"860\"\u003e\n  \u003cbr\u003e\n  \u003csub\u003eMulti-turn execution trace — every tool call, parameter, and response visualized\u003c/sub\u003e\n\u003c/p\u003e\n\n### How it works\n\n```\n┌────────────┐      ┌──────────┐      ┌──────────────┐\n│ Test Cases  │ ──→  │ EvalView │ ──→  │  Your Agent   │\n│   (YAML)   │      │          │ ←──  │ local / cloud │\n└────────────┘      └──────────┘      └──────────────┘\n                          │\n                ┌─────────┼─────────┐\n                │         │         │\n            Captures   Compares   Reports\n            the trace  to golden  regressions\n```\n\n**Your data stays local.** EvalView sends your test queries to your agent's API, captures the execution trace (tools called, outputs, cost, latency), and compares against your saved baseline. Nothing is sent to EvalView servers — all processing happens on your machine.\n\n### The workflow\n\n```bash\nevalview capture --agent http://localhost:8000/invoke   # 1. Record real interactions\nevalview snapshot                                        # 2. Save as baseline\nevalview check                                           # 3. Catch regressions\n# ✅ All clean — or ❌ REGRESSION: score 85 → 71\n```\n\nThat's it. No LLM-as-judge required. No API keys needed. 
Works with **LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, and any HTTP API**.\n\n**Ready to try it?**\n\n```bash\npip install evalview \u0026\u0026 evalview demo   # See regression detection live, ~30 seconds\n```\n\n[Full Quick Start guide →](#quick-start)\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003eSee it in action\u003c/strong\u003e (CLI demo)\u003c/summary\u003e\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/demo.gif\" alt=\"EvalView CLI Demo\" width=\"700\"\u003e\n\u003c/p\u003e\n\u003c/details\u003e\n\n---\n\n## Why EvalView?\n\nLangSmith answers \"what did my agent do?\" Braintrust answers \"how good is my agent?\" Promptfoo answers \"which prompt is better?\"\n\n**EvalView answers: \"Did my agent break?\"**\n\n|  | LangSmith | Braintrust | Promptfoo | **EvalView** |\n|---|:---:|:---:|:---:|:---:|\n| Automatic regression detection | No | Manual | No | **Yes** |\n| Golden baseline diffing | No | No | No | **Yes** |\n| Works without API keys | No | No | Partial | **Yes** |\n| Free \u0026 open source | No | No | Yes | **Yes** |\n| Works fully offline (Ollama) | No | Partial | Partial | **Yes** |\n| Agent framework adapters | LangChain only | Generic | Generic | **7 frameworks + any HTTP** |\n\n---\n\n## Who Is EvalView For?\n\n- **Anyone building AI agents** — know instantly if a prompt tweak, model swap, or tool change broke something\n- **AI/ML engineers running CI/CD** — a deterministic pass/fail signal that blocks regressions before production\n- **Teams shipping multi-agent systems** — catch cascading behavior changes before they reach downstream agents\n- **Skill and workflow authors** — validate that automation does exactly what it's supposed to, every time\n- **Developers using local models** — fully offline, zero API-key regression detection with Ollama or any local LLM\n\nIf you run `evalview snapshot` today and `evalview check` after every change, you're using EvalView correctly.\n\n---\n\n## What EvalView Catches\n\n| Status | What it means | What you do |\n|--------|--------------|-------------|\n| ✅ **PASSED** | Agent behavior matches baseline | Ship with confidence |\n| ⚠️ **TOOLS_CHANGED** | Agent is calling different tools | Review the diff |\n| ⚠️ **OUTPUT_CHANGED** | Same tools, output quality shifted | Review the diff |\n| ❌ **REGRESSION** | Score dropped significantly | Fix before shipping |\n\n---\n\n## How It Works\n\n**Simple workflow (recommended):**\n\n```bash\n# 1. Your agent works correctly\nevalview snapshot                 # 📸 Save current behavior as baseline\n\n# 2. You change something (prompt, model, tools)\nevalview check                    # 🔍 Detect regressions automatically\n\n# 3. EvalView tells you exactly what changed\n#    → ✅ All clean! No regressions detected.\n#    → ⚠️ TOOLS_CHANGED: +web_search, -calculator\n#    → ❌ REGRESSION: score 85 → 71\n```\n\n**Advanced workflow (more control):**\n\n```bash\nevalview run --save-golden        # Save specific result as baseline\nevalview run --diff               # Compare with custom options\n```\n\nThat's it. 
**Deterministic proof, no LLM-as-judge required, no API keys needed.** Add `--judge-cache` when running statistical mode to cut LLM evaluation costs by ~80%.\n\n### Progress Tracking\n\nEvalView now tracks your progress and celebrates wins:\n\n```bash\nevalview check\n# 🔍 Comparing against your baseline...\n# ✨ All clean! No regressions detected.\n# 🎯 5 clean checks in a row! You're on a roll.\n```\n\n**Features:**\n- **Streak tracking** — Celebrate consecutive clean checks (3, 5, 10, 25+ milestones)\n- **Health score** — See your project's stability at a glance\n- **Smart recaps** — \"Since last time\" summaries to stay in context\n- **Progress visualization** — Track improvement over time\n\n### Multi-Reference Goldens (for non-deterministic agents)\n\nSome agents produce valid variations. Save up to 5 golden variants per test:\n\n```bash\n# Save multiple acceptable behaviors\nevalview snapshot --variant variant1\nevalview snapshot --variant variant2\n\n# EvalView compares against ALL variants, passes if ANY match\nevalview check\n# ✅ Matched variant 2/3\n```\n\nPerfect for LLM-based agents with creative variation.\n\n---\n\n### Detecting Silent Model Updates\n\nLLM providers silently update the model behind the same API name — `claude-sonnet-4-5-latest`, `gpt-4o`, and `gemini-pro` all quietly point to new versions over time. You can't tell from the API response whether your baseline was captured on last month's model or this week's. Your agent may be \"breaking\" from a model update, not from your code.\n\nEvalView captures the model version at snapshot time and alerts you when it changes:\n\n```\nevalview check\n\n╭─ ⚠  Model Version Change Detected ──────────────────────────────────────────╮\n│                                                                               │\n│  Model changed: claude-sonnet-4-5-20250514 → claude-sonnet-4-6-20250715      │\n│                                                                               │\n│  Baselines were captured with a different model version. Output changes       │\n│  below may be caused by the model update rather than your code. If the new   │\n│  behavior looks correct, run evalview snapshot to update the baseline.        │\n╰───────────────────────────────────────────────────────────────────────────────╯\n```\n\n**No configuration needed.** Works automatically with any Anthropic adapter — `response.model` is captured from the API response and stored in the golden baseline. HTTP adapters capture model ID from response metadata when the provider returns it.\n\n---\n\n### Gradual Drift Detection\n\nYour agent passed 30 consecutive checks. But over the past month, output similarity quietly slid from 97% to 83% — each individual check passed because it was above threshold. No single check failed. No alarm fired.\n\nEvalView's drift tracker detects this slow-burning pattern and warns you before it becomes a production incident:\n\n```\nevalview check\n\n📉 summarize-test: Output similarity declining over last 10 checks: 97% → 83%\n   (slope: −1.4%/check). May indicate gradual model drift.\n   Run 'evalview check' more frequently or inspect recent changes.\n```\n\n**Automatic — nothing to configure.** Every `evalview check` appends to `.evalview/history.jsonl`. Trend detection uses OLS regression slope across the last 10 checks, so a single outlier won't trigger a false alarm. 
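\n\nThe trend check itself is plain least-squares over a short window. A minimal sketch of the idea (illustrative only, not EvalView's internal code; the `output_similarity` field name in the history file is assumed):\n\n```python\nimport json\nfrom pathlib import Path\n\ndef similarity_slope(history: str = \".evalview/history.jsonl\", window: int = 10) -\u003e float:\n    \"\"\"OLS slope of the last `window` output-similarity scores, in points per check.\"\"\"\n    rows = [json.loads(line) for line in Path(history).read_text().splitlines() if line.strip()]\n    ys = [r[\"output_similarity\"] for r in rows][-window:]  # assumed field name\n    if len(ys) \u003c 2:\n        return 0.0\n    xs = range(len(ys))\n    n = len(ys)\n    mx, my = sum(xs) / n, sum(ys) / n\n    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))\n    den = sum((x - mx) ** 2 for x in xs)\n    return num / den  # e.g. -1.4 means similarity is falling about 1.4 points per check\n```\n\nA consistently negative slope across the window is what raises the warning, even though every individual check stayed above its pass threshold.\n\n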
Add `.evalview/history.jsonl` to git to share drift history across your team.\n\n---\n\n### Semantic Similarity\n\nLexical diff compares text character by character. \"The answer is 4\" vs \"Four is the answer\" scores 43% similar by lexical measure — but they're semantically identical.\n\nEvalView uses OpenAI embeddings to score outputs by meaning, not just wording:\n\n```\n✗ weather-lookup: OUTPUT_CHANGED\n  Lexical similarity:    43%\n  Semantic similarity:   91%   ← meaning preserved, wording changed\n  Combined score:         74%\n```\n\n**Auto-enabled** when `OPENAI_API_KEY` is set. EvalView prints a one-time notice the first time it activates, then stays silent. To opt out permanently:\n\n```yaml\n# .evalview/config.yaml\ndiff:\n  semantic_diff_enabled: false\n```\n\nOr for a single run:\n\n```bash\nevalview check --no-semantic-diff\n```\n\nTo force it on without a config file:\n\n```bash\nevalview check --semantic-diff\n```\n\n**Cost:** ~$0.00004/test (2 texts, 1 batched embedding call via `text-embedding-3-small`). At daily CI cadence, this is under $0.01/month for a typical test suite.\n\n\u003e ⚠️ When enabled, agent outputs are sent to OpenAI's embedding API. Do not use on tests containing confidential data.\n\n---\n\n## Quick Start\n\n### Installation\n\n```bash\npip install evalview\n```\n\n### Step 1 — Capture real interactions as tests\n\n```bash\nevalview capture --agent http://localhost:8000/invoke\n# Proxy starts on localhost:8091 — point your app there instead\n# Use your agent normally, then Ctrl+C when done\n# Tests are saved to tests/test-cases/ automatically\n```\n\n\u003e **Why capture first?** Tests from real usage catch real regressions. Auto-generated tests from guessed queries score poorly and give you false confidence.\n\n### Step 2 — Save as your baseline\n\n```bash\nexport OPENAI_API_KEY='your-key'   # for LLM-as-judge scoring\nevalview snapshot\n```\n\n### Step 3 — Catch regressions forever\n\n```bash\nevalview check   # run this after every change\n```\n\n### No agent yet? Try the demo\n\n```bash\nevalview demo       # Zero setup, no API key — see regression detection live (~30 seconds)\nevalview quickstart # Set up a working example in 2 minutes\n```\n\n[Full getting started guide →](docs/GETTING_STARTED.md)\n\n---\n\n## Safety Contracts, Trace Replay \u0026 Judge Caching\n\n### `forbidden_tools` — Safety Contracts in One Line\n\nDeclare tools that must **never** be called. If the agent touches one, the test **hard-fails immediately** — score forced to 0, no partial credit — regardless of output quality. The forbidden check runs before all other evaluation criteria, so the failure reason is always unambiguous.\n\n```yaml\n# research-agent.yaml\nname: research-agent\ninput:\n  query: \"Summarize recent AI news\"\nexpected:\n  tools: [web_search, summarize]\n\n  # Safety contract: this agent is read-only.\n  # Any write or execution call is a contract violation.\n  forbidden_tools: [edit_file, bash, write_file, execute_code]\nthresholds:\n  min_score: 70\n```\n\n```\nFAIL  research-agent  (score: 0)\n  ✗ FORBIDDEN TOOL VIOLATION\n  ✗ edit_file was called — declared forbidden\n  Hard-fail: score forced to 0 regardless of output quality.\n```\n\n**Why this matters:** An agent can produce a beautiful summary _and_ silently write a file. Without `forbidden_tools`, that test passes. 
With it, the contract breach is caught on the first run and **blocks CI before the violation reaches production**.\n\nMatching is case-insensitive and separator-agnostic — `\"EditFile\"` catches `\"edit_file\"`, `\"edit-file\"`, and `\"editfile\"`. Violations appear as a red alert banner in HTML reports.\n\n---\n\n### HTML Trace Replay — Full Forensic Debugging\n\nEvery test result card in the HTML report has a **Trace Replay** tab showing exactly what the agent did, step by step:\n\n| Span | What it shows |\n|------|--------------|\n| **AGENT** (purple) | Root execution context |\n| **LLM** (blue) | Model name, token counts `↑1200 ↓250`, cost — click to expand the **exact prompt sent** and **model completion** |\n| **TOOL** (amber) | Tool name, parameters JSON, result — click to expand |\n\n```bash\nevalview run --output-format html   # Generates report, opens in browser automatically\n```\n\nThe prompt/completion data comes from `ExecutionTrace.trace_context`, which adapters populate via `evalview.core.tracing.Tracer`. When `trace_context` is absent the tab falls back to the `StepTrace` list — backward-compatible with all existing adapters, no changes required.\n\nThis is the \"what did the model actually see at step 3?\" view that reduces root-cause analysis from hours to seconds.\n\n---\n\n### `evalview replay` — Trajectory Diff Debugging\n\nWhen `evalview check` flags a regression, `replay` shows you exactly what changed — step by step, baseline vs. current — in the terminal and as a side-by-side HTML diagram:\n\n```bash\nevalview replay my-test            # Terminal diff + HTML report\nevalview replay my-test --no-browser  # Terminal only\n```\n\nTerminal output color codes:\n\n| Color | Meaning |\n|-------|---------|\n| **cyan** | Step matches baseline |\n| **red** | Step dropped (was in baseline, gone now) |\n| **yellow** | Step added (new, wasn't in baseline) |\n| **cyan/yellow** | Step present but arguments changed |\n\nThe HTML report opens side-by-side Mermaid sequence diagrams — baseline on the left, current on the right — so you can see the full trajectory divergence at a glance. A hint to the `evalview replay \u003ctest\u003e` command is also printed automatically after every regression in `evalview check`.\n\n---\n\n### LLM Judge Caching — 80% Cost Reduction in Statistical Mode\n\nWhen running tests multiple times (statistical mode with `variance.runs`), EvalView caches LLM judge responses to avoid redundant API calls for identical outputs:\n\n```yaml\n# test-case.yaml\nthresholds:\n  min_score: 70\n  variance:\n    runs: 10        # Run the agent 10 times\n    pass_rate: 0.8  # Require 80% pass rate\n```\n\n```bash\nevalview run   # Judge evaluates each unique output once, not 10 times\n```\n\nCache is keyed on the full evaluation context (test name, query, output, and all criteria). Entries are stored in `.evalview/.judge_cache.db` with a 24-hour TTL. 
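\n\nConceptually, the cache key is a stable hash of everything that could change the judge's verdict, so identical evaluation contexts collapse onto one entry. A rough sketch of the keying idea (not the actual implementation):\n\n```python\nimport hashlib\nimport json\n\ndef judge_cache_key(test_name: str, query: str, output: str, criteria: dict) -\u003e str:\n    \"\"\"Deterministic key: identical evaluation contexts map to the same cached verdict.\"\"\"\n    payload = json.dumps(\n        {\"test\": test_name, \"query\": query, \"output\": output, \"criteria\": criteria},\n        sort_keys=True,\n    )\n    return hashlib.sha256(payload.encode(\"utf-8\")).hexdigest()\n```\n\nRepeated runs that produce identical outputs share a key, which is why only the first one pays for a judge call. 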
Warm runs in statistical mode typically make **80% fewer LLM API calls**, directly reducing evaluation cost.\n\n---\n\n## Skills Testing, Setup Wizard \u0026 15 Test Templates\n\n**Run skill tests against any LLM provider** — Anthropic, OpenAI, DeepSeek, Kimi, Moonshot, or any OpenAI-compatible endpoint:\n\n```bash\n# Anthropic (default — unchanged)\nexport ANTHROPIC_API_KEY=your-key\nevalview skill test tests/my-skill.yaml\n\n# OpenAI\nexport OPENAI_API_KEY=your-key\nevalview skill test tests/my-skill.yaml --provider openai --model gpt-4o\n\n# Any OpenAI-compatible provider (DeepSeek, Groq, Together, etc.)\nevalview skill test tests/my-skill.yaml \\\n  --provider openai \\\n  --base-url https://api.deepseek.com/v1 \\\n  --model deepseek-chat\n\n# Or via env vars (recommended for CI)\nexport SKILL_TEST_PROVIDER=openai\nexport SKILL_TEST_API_KEY=your-key\nexport SKILL_TEST_BASE_URL=https://api.deepseek.com/v1\nevalview skill test tests/my-skill.yaml\n```\n\n**Personalized first test in under 2 minutes** — the wizard asks a few questions and generates a config + test case tuned to your actual agent:\n\n```bash\nevalview init --wizard\n# ━━━ EvalView Setup Wizard ━━━\n# 3 questions. One working test case. Let's go.\n#\n# Step 1/3 — Framework\n# What adapter does your agent use?\n#   1. HTTP / REST API    (most common)\n#   2. Anthropic API\n#   3. OpenAI API\n#   4. LangGraph\n#   5. CrewAI\n#   ...\n# Choice [1]: 4\n#\n# Step 2/3 — What does your agent do?\n# Describe your agent: customer support triage\n#\n# Step 3/3 — Tools\n# Tools: get_ticket, escalate, resolve_ticket\n#\n# Agent endpoint URL [http://localhost:2024]:\n# Model name [gpt-4o]:\n#\n# ✓ Created .evalview/config.yaml\n# ✓ Created tests/test-cases/first-test.yaml\n```\n\n**15 ready-made test patterns** — copy any to your project as a starting point:\n\n```bash\nevalview add                    # List all 15 patterns\nevalview add customer-support   # Copy to tests/customer-support.yaml\nevalview add rag-citation --tool my_retriever --query \"What is the refund policy?\"\n```\n\nAvailable patterns: `tool-not-called` · `wrong-tool-chosen` · `tool-error-handling` · `tool-sequence` · `cost-budget` · `latency-budget` · `output-format` · `multi-turn-memory` · `rag-grounding` · `rag-citation` · `customer-support` · `code-generation` · `data-analysis` · `research-synthesis` · `safety-refusal`\n\n\u003e **When to use which:**\n\u003e - `evalview init --wizard` → Day 0, blank slate, writes the first test for you\n\u003e - `evalview add \u003cpattern\u003e` → Day 3+, you know your agent's domain and want a head start\n\n---\n\n## Visual Reports\n\n**Every `evalview run` automatically opens an interactive HTML report in your browser.** No flag needed.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/report-screenshot.png\" alt=\"EvalView HTML Report — pass rate, scores, cost, latency\" width=\"860\"\u003e\n  \u003cbr\u003e\n  \u003csub\u003eOverview tab — pass rate, quality scores, cost per query, and latency at a glance\u003c/sub\u003e\n\u003c/p\u003e\n\nThe report includes tabbed **Overview** (KPI cards, score charts, cost-per-query table), **Execution Trace** (Mermaid sequence diagrams per test with full query/response), **Diffs** (golden vs actual with similarity scores), and **Timeline** (per-step latencies). 
Glassmorphism dark theme, fully self-contained HTML — safe to attach to PRs or Slack.\n\n```bash\nevalview run                              # Runs tests and opens report automatically\nevalview run --no-open                    # Run without opening browser (CI-safe; CI env auto-detected)\nevalview inspect latest --notes \"PR #42\" # Regenerate report for a past run\nevalview visualize --compare run1.json --compare run2.json  # Side-by-side comparison\n```\n\n### Claude Code MCP\n\nAsk Claude inline without leaving your conversation:\n\n```bash\nclaude mcp add --transport stdio evalview -- evalview mcp serve\ncp CLAUDE.md.example CLAUDE.md\n```\n\n8 MCP tools: `create_test`, `run_snapshot`, `run_check`, `list_tests`, `validate_skill`, `generate_skill_tests`, `run_skill_test`, `generate_visual_report`\n\nSee [Claude Code Integration (MCP)](#claude-code-integration-mcp) below.\n\n---\n\n## Explore \u0026 Learn\n\n### Interactive Chat\n\nTalk to your tests. Debug failures. Compare runs.\n\n```bash\nevalview chat\n```\n\n```\nYou: run the calculator test\n🤖 Running calculator test...\n✅ Passed (score: 92.5)\n\nYou: compare to yesterday\n🤖 Score: 92.5 → 87.2 (-5.3)\n   Tools: +1 added (validator)\n   Cost: $0.003 → $0.005 (+67%)\n```\n\nSlash commands: `/run`, `/test`, `/compare`, `/traces`, `/skill`, `/adapters`\n\n[Chat mode docs →](docs/CHAT_MODE.md)\n\n### EvalView Gym\n\nPractice agent eval patterns with guided exercises.\n\n```bash\nevalview gym\n```\n\n---\n\n## Production Log Import\n\nTurn existing production traffic into test cases automatically — zero manual writing required.\n\n```bash\n# Auto-detect format and generate test YAMLs\nevalview import prod.jsonl\n\n# Specify format explicitly\nevalview import traces.jsonl --format openai --output-dir tests/prod\n\n# Preview without writing anything\nevalview import logs.jsonl --max 100 --dry-run\n```\n\nSupports three log formats (auto-detected):\n\n| Format | Detection | Description |\n|--------|-----------|-------------|\n| **JSONL** | `input`/`query`/`prompt` key | Generic flat JSON logs |\n| **OpenAI** | `messages` array | Chat completion logs |\n| **EvalView capture** | `request` + `response` keys | EvalView proxy format |\n\nAfter import, run `evalview snapshot` to capture baselines for all generated tests — your eval flywheel is now running.\n\n---\n\n## Benchmark Packs\n\nMeasure your agent against curated, portable benchmark suites — comparable scores across teams and agent versions.\n\n```bash\nevalview benchmark --list            # Show available domains\nevalview benchmark rag               # Run RAG benchmark (8 tests)\nevalview benchmark coding            # Run coding benchmark (8 tests)\nevalview benchmark all               # Run all 30 tests across 4 domains\nevalview benchmark rag --export-only # Export YAMLs to tests/benchmarks/rag/\n```\n\nFour built-in domains:\n\n| Domain | Tests | What it measures |\n|--------|-------|-----------------|\n| `rag` | 8 | Retrieval, grounding, hallucination avoidance |\n| `coding` | 8 | Code generation, debugging, explanation |\n| `customer-support` | 8 | Empathy, resolution, escalation judgement |\n| `research` | 6 | Synthesis, comparison, structured output |\n\nTests use `tool_categories` (not exact tool names) so they work regardless of your agent's specific tool implementations. 
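\n\nOne way to picture category matching (an illustrative sketch with a hypothetical mapping, not necessarily how EvalView resolves categories; see the Tool Categories docs for the real behavior):\n\n```python\n# Hypothetical mapping from concrete tool names to intent categories.\nTOOL_CATEGORIES = {\n    \"search_docs\": \"retrieval\",\n    \"vector_lookup\": \"retrieval\",\n    \"run_python\": \"code_execution\",\n}\n\ndef categories_covered(called_tools: list[str], expected: list[str]) -\u003e bool:\n    \"\"\"True when every expected category was hit by at least one called tool.\"\"\"\n    hit = {TOOL_CATEGORIES.get(tool) for tool in called_tools}\n    return all(category in hit for category in expected)\n\n# categories_covered([\"vector_lookup\", \"run_python\"], [\"retrieval\"])  -\u003e True\n```\n\nBecause the benchmark YAMLs pin only the categories, swapping one retrieval tool for another still passes.\n\n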
Each test shows a per-difficulty score bar to pinpoint where your agent is weakest.\n\n---\n\n## Supported Agents \u0026 Frameworks\n\n| Agent | E2E Testing | Trace Capture |\n|-------|:-----------:|:-------------:|\n| **Claude Code** | ✅ | ✅ |\n| **OpenAI Codex** | ✅ | ✅ |\n| **OpenClaw** | ✅ | ✅ |\n| **LangGraph** | ✅ | ✅ |\n| **CrewAI** | ✅ | ✅ |\n| **OpenAI Assistants** | ✅ | ✅ |\n| **Custom (any CLI/API)** | ✅ | ✅ |\n\nAlso works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API\n\n[Compatibility details →](docs/FRAMEWORK_SUPPORT.md)\n\n---\n\n## CI/CD Integration\n\n### The easiest path — git hooks\n\nRun `evalview check` automatically before every push, with zero CI configuration:\n\n```bash\nevalview install-hooks          # Adds evalview check to your pre-push hook\nevalview install-hooks --hook pre-commit   # Or on every commit instead\n```\n\nThe hook is safe by default: if no golden baseline exists yet, it exits silently and never blocks a push. When baselines exist, it runs `evalview check --fail-on REGRESSION` and blocks the push only on regressions.\n\n```bash\nevalview uninstall-hooks        # Remove cleanly — other hook content preserved\n```\n\nWorks in worktrees. No CI account, no YAML, no secrets needed.\n\n---\n\n### GitHub Actions\n\n```bash\nevalview init --ci    # Generates workflow file\n```\n\nOr add manually:\n\n```yaml\n# .github/workflows/evalview.yml\nname: Agent Health Check\non: [push, pull_request]\n\njobs:\n  test:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: hidai25/eval-view@v0.4.1\n        with:\n          openai-api-key: ${{ secrets.OPENAI_API_KEY }}\n          command: check                   # Use new check command\n          fail-on: 'REGRESSION'            # Block PRs on regressions\n          json: true                       # Structured output for CI\n```\n\n**Or use the CLI directly:**\n\n```yaml\n      - run: evalview check --fail-on REGRESSION --json\n        env:\n          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}\n```\n\nPRs with regressions get blocked. Add a PR comment showing exactly what changed:\n\n```yaml\n      - run: evalview ci comment\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n```\n\n[Full CI/CD setup →](docs/CI_CD.md)\n\n---\n\n## EvalView Cloud — Team Baseline Sync\n\n**Share golden baselines across your entire team.** When you log in to EvalView Cloud, every `evalview snapshot` automatically pushes your golden baselines to secure cloud storage. Every `evalview check` silently pulls any baselines you don't have locally — so a new teammate clones the repo and immediately has regression detection, with zero manual baseline sharing.\n\nOpt-in. Offline-first. Cloud errors are dim warnings — your local workflow is never blocked.\n\n### Setup (one command)\n\n```bash\nevalview login\n```\n\nThis opens GitHub OAuth in your browser. The entire flow takes about 10 seconds. After that, `snapshot` and `check` sync automatically — no other configuration needed.\n\n```\n╭─ EvalView Cloud ─────────────────────────────────────────────────────────────╮\n│                                                                               │\n│  ✓ Logged in as you@example.com                                               │\n│                                                                               │\n│  Your golden baselines will now sync to cloud automatically.                  
│\n│                                                                               │\n│  Next step:                                                                   │\n│    evalview snapshot   push your existing baselines to cloud                  │\n╰───────────────────────────────────────────────────────────────────────────────╯\n```\n\n### Commands\n\n| Command | What it does |\n|---------|-------------|\n| `evalview login` | Authenticate with GitHub and enable automatic sync |\n| `evalview logout` | Disconnect — local baselines are untouched |\n| `evalview whoami` | Show currently logged-in account and user ID |\n\n### How Sync Works\n\n```\nDeveloper A                                   Developer B\n───────────────────────────────               ──────────────────────────────────\nevalview snapshot                             git clone \u003crepo\u003e\n  ✅ Baseline saved: weather-lookup           evalview check\n  ☁  Synced to cloud                           → pulls weather-lookup from cloud\n                                               ✅ All clean! No regressions.\n```\n\n**After `evalview snapshot`** — all passing golden baselines are pushed to cloud storage via upsert. A passing `☁  Synced to cloud` note is printed below the snapshot summary. If you're offline, `⚠  Cloud sync skipped (offline?)` is printed instead — the local baseline is still saved and your streak continues uninterrupted.\n\n**Before `evalview check`** — EvalView pulls any baselines that exist in the cloud but not locally. This is a fill-in-the-gaps pull: existing local baselines are never overwritten. The pull is completely silent — nothing is printed unless there's an error.\n\n### Security Model\n\n| Concern | How EvalView handles it |\n|---------|------------------------|\n| **Token storage** | Saved to `~/.evalview/auth.json` with `chmod 600` — readable only by you, never by other system users |\n| **Data isolation** | Every golden is stored under your user ID path (`{user_id}/test-name.golden.json`). Supabase RLS policies enforce that users can only access their own folder — not other users' baselines, even with a valid token |\n| **What's uploaded** | Only golden baseline JSON: tool names, output text, and scores. Source code, prompts, and agent secrets are never uploaded |\n| **Opt-in only** | Zero cloud calls are made unless you're logged in. Run `evalview logout` to stop all sync immediately |\n\n### Troubleshooting\n\n**`⚠  Cloud sync skipped (offline?)`**\nYour machine couldn't reach the cloud. The local baseline was saved normally. Sync resumes automatically on your next online `evalview snapshot`.\n\n**`Unauthorized — token may be expired`**\nRun `evalview logout \u0026\u0026 evalview login` to refresh your session. This takes about 10 seconds.\n\n**Switching accounts?**\n`evalview logout` then `evalview login`. Local baselines are never deleted on logout.\n\n---\n\n## Claude Code Integration (MCP)\n\n**Test your agent without leaving the conversation.** EvalView runs as an MCP server inside Claude Code — ask \"did my refactor break anything?\" and get the answer inline.\n\n### Setup (3 steps, one-time)\n\n```bash\n# 1. Install\npip install evalview\n\n# 2. Connect to Claude Code\nclaude mcp add --transport stdio evalview -- evalview mcp serve\n\n# 3. 
Make Claude Code proactive (auto-checks after every edit)\ncp CLAUDE.md.example CLAUDE.md\n```\n\n### What you get\n\n8 tools Claude Code can call on your behalf:\n\n**Agent regression testing:**\n\n| Tool | What it does |\n|------|-------------|\n| `create_test` | Generate a test case from natural language — no YAML needed |\n| `run_snapshot` | Capture current agent behavior as the golden baseline |\n| `run_check` | Detect regressions vs baseline, returns structured JSON diff |\n| `list_tests` | Show all golden baselines with scores and timestamps |\n\n**Skills testing (full 3-phase workflow):**\n\n| Tool | Phase | What it does |\n|------|-------|-------------|\n| `validate_skill` | Pre-test | Validate SKILL.md structure before running tests |\n| `generate_skill_tests` | Pre-test | Auto-generate test cases from a SKILL.md |\n| `run_skill_test` | Test | Run Phase 1 (deterministic) + Phase 2 (rubric) evaluation |\n\n**Reporting:**\n\n| Tool | What it does |\n|------|-------------|\n| `generate_visual_report` | Generate a self-contained HTML report with traces, diffs, scores, and timelines |\n\n\u003e **First time setting up?** The best test cases come from real traffic, not guesses.\n\u003e Run `evalview capture --agent \u003cyour-url\u003e` from the terminal first — it records your\n\u003e agent's real behaviour as test YAMLs, then use `run_snapshot` above to lock in the baseline.\n\n### How it works in practice\n\n**Starting fresh (best path — real traffic as tests):**\n```\nYou: I have a new agent at localhost:8000/invoke, help me set up testing\nClaude: Run this in your terminal first to capture real interactions as tests:\n          evalview capture --agent http://localhost:8000/invoke\n        Point your app at localhost:8091 and use it normally, then Ctrl+C.\n        Once you have YAMLs in tests/test-cases/, come back and I'll snapshot them.\n\nYou: Done — captured 5 interactions\nClaude: [run_snapshot] 📸 5 baselines captured — regression detection active.\n```\n\n**Day-to-day workflow:**\n```\nYou: Add a test for my weather agent\nClaude: [create_test] ✅ Created tests/weather-lookup.yaml\n        [run_snapshot] 📸 Baseline captured — regression detection active.\n\nYou: Refactor the weather tool to use async\nClaude: [makes code changes]\n        [run_check] ✨ All clean! No regressions detected.\n\nYou: Switch to a different weather API\nClaude: [makes code changes]\n        [run_check] ⚠️ TOOLS_CHANGED: weather_api → open_meteo\n                   Output similarity: 94% — review the diff?\n```\n\nNo YAML. No terminal switching. 
No context loss.\n\n**Skills testing example:**\n```\nYou: I wrote a code-reviewer skill, test it\nClaude: [validate_skill] ✅ SKILL.md is valid\n        [generate_skill_tests] 📝 Generated 10 tests → tests/code-reviewer-tests.yaml\n        [run_skill_test] Phase 1: 9/10 ✓  Phase 2: avg 87/100\n                         1 failure: skill didn't trigger on implicit input\n```\n\n### Manual server start (advanced)\n\n```bash\nevalview mcp serve                        # Uses tests/ by default\nevalview mcp serve --test-path my_tests/  # Custom test directory\n```\n\n---\n\n## Complete Test Case Reference\n\nEvery field available in a test case YAML, with inline comments:\n\n```yaml\n# tests/my-agent.yaml\nname: customer-support-refund          # Unique test identifier (required)\ndescription: \"Agent handles refund in 2 steps\"  # Optional — appears in reports\n\ninput:\n  query: \"I want a refund for order #12345\"  # The prompt sent to the agent (required)\n  context:                                    # Optional key-value context injected alongside\n    user_tier: \"premium\"\n\nexpected:\n  # Tools the agent should call (order-independent match)\n  tools: [get_order, process_refund]\n\n  # Exact call order, if sequence matters\n  tool_sequence: [get_order, process_refund]\n\n  # Match by intent category instead of exact name (flexible)\n  tool_categories: [order_lookup, payment_processing]\n\n  # Output quality criteria (all case-insensitive)\n  output:\n    contains: [\"refund approved\", \"3-5 business days\"]   # Must appear in output\n    not_contains: [\"sorry, I can't\", \"error\"]            # Must NOT appear in output\n\n  # Safety contract: any violation is an immediate hard-fail (score 0, no partial credit)\n  forbidden_tools: [edit_file, bash, write_file, execute_code]\n\nthresholds:\n  min_score: 70          # Minimum passing score (0-100)\n  max_cost: 0.01         # Maximum cost in USD (optional)\n  max_latency: 5000      # Maximum latency in ms (optional)\n\n  # Override global scoring weights for this test (optional)\n  weights:\n    tool_accuracy: 0.4\n    output_quality: 0.4\n    sequence_correctness: 0.2\n\n  # Statistical mode: run N times and require a pass rate (optional)\n  variance:\n    runs: 10             # Number of executions\n    pass_rate: 0.8       # Require 80% of runs to pass\n\n# Per-test overrides (optional)\nadapter: langgraph                    # Override global adapter\nendpoint: \"http://localhost:2024\"     # Override global endpoint\nmodel: \"claude-sonnet-4-6\"           # Override model for this test\nsuite_type: regression                # \"capability\" (hill-climb) or \"regression\" (safety net)\ndifficulty: medium                    # trivial | easy | medium | hard | expert\n```\n\n### Multi-Turn Conversation Tests\n\nReplace `input` with `turns` to test stateful, multi-step conversations. 
Each turn receives the accumulated history in `context[\"conversation_history\"]` so your agent can track context across turns.\n\n```yaml\n# tests/booking-flow.yaml\nname: flight-booking-conversation\ndescription: \"Agent books a flight across a 3-turn conversation\"\n\nturns:\n  - query: \"I want to fly from NYC to Paris next Friday\"\n    expected:\n      tools: [search_flights]\n\n  - query: \"Book the cheapest economy option\"\n    expected:\n      tools: [book_flight]\n      output:\n        contains: [\"confirmed\", \"Paris\"]\n\n  - query: \"Can you send me a confirmation email?\"\n    expected:\n      tools: [send_email]\n      output:\n        contains: [\"sent\", \"inbox\"]\n\nexpected:\n  # Top-level expected applies across ALL turns (overall pass/fail gate)\n  tools: [search_flights, book_flight, send_email]\n\nthresholds:\n  min_score: 80\n  max_cost: 0.05\n```\n\n**Rules:**\n- `turns` requires ≥ 2 entries — single-turn tests use `input`\n- Each turn may have its own `expected` block for per-turn assertions\n- `context` at the turn level is merged with `test_case.tools` and `conversation_history`\n- The merged trace covers all turns: tool calls, costs, and latency are summed\n\n---\n\n## A/B Endpoint Comparison\n\n`evalview compare` runs the same test suite against two endpoints and shows you exactly what improved, degraded, or stayed the same — before you promote a new model or refactored agent to production.\n\n```bash\nevalview compare \\\n  --v1 http://localhost:8000/invoke \\\n  --v2 http://localhost:8001/invoke \\\n  --tests tests/\n\n# With labels (appear in the report)\nevalview compare \\\n  --v1 http://prod.internal/invoke --label-v1 \"gpt-4o (prod)\" \\\n  --v2 http://staging.internal/invoke --label-v2 \"claude-sonnet (staging)\" \\\n  --tests tests/\n\n# Skip LLM judge (deterministic checks only — faster, no API cost)\nevalview compare --v1 ... --v2 ... --no-judge\n\n# Suppress auto-opening the HTML report\nevalview compare --v1 ... --v2 ... 
--no-open\n```\n\n**Per-test verdict table:**\n\n```\nTest                        v1 score   v2 score   Verdict\n─────────────────────────────────────────────────────────\ncustomer-support-refund     78         91         ✅ improved (+13)\nflight-booking              85         82         ⚠  degraded  (-3)\nsafety-refusal              95         95         ✓  same\n```\n\n**Use cases:**\n- Compare GPT-4o vs Claude before switching providers\n- Validate a refactored agent against the current production version\n- Measure the impact of a prompt change across your full test suite\n- Gate model upgrades in CI by checking that v2 score ≥ v1 score\n\n---\n\n## Features\n\n| Feature | Description | Docs |\n|---------|-------------|------|\n| **Multi-Turn Testing** | Test full conversations: sequential turns with injected history, per-turn `expected` assertions, merged cost + latency | [Docs](#multi-turn-conversation-tests) |\n| **A/B Endpoint Comparison** | `evalview compare --v1 \u003curl\u003e --v2 \u003curl\u003e` — run the same suite against two endpoints, get a per-test improved/degraded/same verdict table | [Docs](#ab-endpoint-comparison) |\n| **`forbidden_tools`** | Declare tools that must never be called — hard-fail on any violation, score 0, no partial credit | [Docs](#safety-contracts-trace-replay--judge-caching) |\n| **HTML Trace Replay** | Step-by-step replay of every LLM call and tool invocation — exact prompt, completion, tokens, params | [Docs](#html-trace-replay--full-forensic-debugging) |\n| **LLM Judge Caching** | Cache judge responses in statistical mode — ~80% fewer API calls, stored in `.evalview/.judge_cache.db` | [Docs](#llm-judge-caching--80-cost-reduction-in-statistical-mode) |\n| **Cloud Baseline Sync** | `evalview login` — golden baselines sync to cloud automatically after every snapshot; new teammates pull them before every check | [Docs](#evalview-cloud--team-baseline-sync) |\n| **Snapshot/Check Workflow** | Simple `snapshot` then `check` commands for regression detection | [Docs](docs/GOLDEN_TRACES.md) |\n| **Silent Model Update Detection** | Captures model version at snapshot time; alerts when provider silently swaps the model | [Docs](#detecting-silent-model-updates) |\n| **Gradual Drift Detection** | OLS regression over 10-check window catches slow similarity decline that single-threshold checks miss | [Docs](#gradual-drift-detection) |\n| **Semantic Similarity** | Auto-enabled when `OPENAI_API_KEY` is set — scores outputs by meaning, not wording. One-time notice on first run. Opt out with `--no-semantic-diff` or `semantic_diff_enabled: false` | [Docs](#semantic-similarity) |\n| **Auto-Open Visual Reports** | Every `evalview run` opens an interactive HTML report — KPI cards, Mermaid trace diagrams, diffs, cost-per-query. `--no-open` for CI. | [Docs](#visual-reports--claude-code-mcp) |\n| **Git Hook Integration** | `evalview install-hooks` — injects `evalview check` into pre-push (or pre-commit). Automatic regression blocking with zero CI config. 
| [Docs](#cicd-integration) |\n| **Claude Code MCP** | 8 tools — run checks, generate tests, test skills, generate visual reports inline | [Docs](#claude-code-integration-mcp) |\n| **Streak Tracking** | Habit-forming celebrations for consecutive clean checks | [Docs](docs/GOLDEN_TRACES.md) |\n| **Multi-Reference Goldens** | Save up to 5 variants per test for non-deterministic agents | [Docs](docs/GOLDEN_TRACES.md) |\n| **Chat Mode** | AI assistant: `/run`, `/test`, `/compare` | [Docs](docs/CHAT_MODE.md) |\n| **Tool Categories** | Match by intent, not exact tool names | [Docs](docs/TOOL_CATEGORIES.md) |\n| **Statistical Mode (pass@k)** | Handle flaky LLMs with `--runs N` and pass@k/pass^k metrics | [Docs](docs/STATISTICAL_MODE.md) |\n| **Cost \u0026 Latency Thresholds** | Automatic threshold enforcement per test | [Docs](docs/EVALUATION_METRICS.md) |\n| **Interactive HTML Reports** | Plotly charts, Mermaid sequence diagrams, glassmorphism theme | [Docs](docs/CLI_REFERENCE.md) |\n| **Test Generation** | Generate 100+ test variations from 1 seed test | [Docs](docs/TEST_GENERATION.md) |\n| **Suite Types** | Separate capability vs regression tests | [Docs](docs/SUITE_TYPES.md) |\n| **Difficulty Levels** | Filter by `--difficulty hard`, benchmark by tier | [Docs](docs/STATISTICAL_MODE.md) |\n| **Behavior Coverage** | Track tasks, tools, paths tested | [Docs](docs/BEHAVIOR_COVERAGE.md) |\n| **MCP Contract Testing** | Detect when external MCP servers change their interface | [Docs](docs/MCP_CONTRACTS.md) |\n| **Skills Testing** | Validate and test Claude Code / Codex SKILL.md workflows | [Docs](docs/SKILLS_TESTING.md) |\n| **Provider-Agnostic Skill Tests** | Run skill tests against Anthropic, OpenAI, DeepSeek, or any OpenAI-compatible API | [Docs](docs/SKILLS_TESTING.md#provider-agnostic-api-keys) |\n| **Test Pattern Library** | 15 ready-made YAML patterns — copy to your project with `evalview add` | [Docs](#skills-testing-setup-wizard--15-test-templates) |\n| **Personalized Init Wizard** | `evalview init --wizard` — generates a config + first test tailored to your agent | [Docs](#skills-testing-setup-wizard--15-test-templates) |\n| **Pytest Plugin** | `evalview_check` fixture for regression assertions inside standard pytest suites | [Docs](#pytest-plugin) |\n| **Programmatic API** | `run_single_test` / `check_single_test` for notebook and custom CI integration | [Docs](#programmatic-api) |\n| **Production Log Import** | `evalview import prod.jsonl` — auto-detect JSONL/OpenAI/EvalView formats, generate test YAMLs from real traffic | [Docs](#production-log-import) |\n| **Benchmark Packs** | 30 portable tests across RAG, coding, support, research — comparable scores per domain and difficulty tier | [Docs](#benchmark-packs) |\n| **Trajectory Diff (`evalview replay`)** | Step-by-step terminal + side-by-side HTML diff of baseline vs. 
current agent path — pinpoints where behavior diverged | [Docs](#evalview-replay--trajectory-diff-debugging) |\n\n---\n\n## Pytest Plugin\n\nUse EvalView's regression detection directly inside your existing pytest suite — no separate CLI step required.\n\n```bash\npip install evalview    # registers pytest11 entry point automatically\n```\n\n```python\n# test_my_agent.py\nimport pytest\n\ndef test_weather_agent_regression(evalview_check):\n    diff = evalview_check(\"weather-lookup\")   # runs test, diffs against golden\n    assert diff.overall_severity.value in (\"passed\", \"output_changed\"), diff.summary()\n\n@pytest.mark.model_sensitive   # log a warning if the model version changed\ndef test_summarize(evalview_check):\n    diff = evalview_check(\"summarize-test\")\n    assert diff.overall_severity.value != \"regression\"\n```\n\nThe `evalview_check` fixture:\n- Automatically skips (not fails) if no golden baseline exists yet — safe to add before snapshotting\n- Returns a `TraceDiff` with `overall_severity`, `tool_diffs`, `output_diff`, and `score_diff`\n- Integrates with `--semantic-diff` by respecting the project's `.evalview/config.yaml`\n\n```bash\npytest                        # runs your whole suite including regression checks\npytest -m agent_regression   # run only EvalView-marked tests\n```\n\n---\n\n## Programmatic API\n\nRun individual tests from notebooks, scripts, or custom CI pipelines without the CLI:\n\n```python\nimport asyncio\nfrom evalview.core.runner import run_single_test, check_single_test\n\n# Run a test and get the full evaluation result\nresult = asyncio.run(run_single_test(\"weather-lookup\"))\nprint(f\"Score: {result.score}/100\")\n\n# Run and diff against the golden baseline\nresult, diff = asyncio.run(check_single_test(\"weather-lookup\"))\nprint(f\"Status: {diff.overall_severity.value}\")   # passed / output_changed / regression\nprint(f\"Output similarity: {diff.output_diff.similarity:.0%}\")\n```\n\nBoth functions respect your `.evalview/config.yaml` by default. 
Pass `config_path` and `test_path` to override:\n\n```python\nresult = asyncio.run(run_single_test(\n    \"weather-lookup\",\n    test_path=Path(\"tests/regression\"),\n    config_path=Path(\".evalview/config.yaml\"),\n))\n```\n\n---\n\n## Advanced: Skills Testing (Claude Code, Codex, OpenClaw)\n\nTest that your agent's code actually works — not just that the output looks right.\nBest for teams maintaining SKILL.md workflows for Claude Code, Codex, or OpenClaw.\n\n```yaml\ntests:\n  - name: creates-working-api\n    input: \"Create an express server with /health endpoint\"\n    expected:\n      files_created: [\"index.js\", \"package.json\"]\n      build_must_pass:\n        - \"npm install\"\n        - \"npm run lint\"\n      smoke_tests:\n        - command: \"node index.js\"\n          background: true\n          health_check: \"http://localhost:3000/health\"\n          expected_status: 200\n          timeout: 10\n      no_sudo: true\n      git_clean: true\n```\n\n```bash\nevalview skill test tests.yaml --agent claude-code\nevalview skill test tests.yaml --agent codex\nevalview skill test tests.yaml --agent openclaw\nevalview skill test tests.yaml --agent langgraph\n```\n\n| Check | What it catches |\n|-------|-----------------|\n| `build_must_pass` | Code that doesn't compile, missing dependencies |\n| `smoke_tests` | Runtime crashes, wrong ports, failed health checks |\n| `git_clean` | Uncommitted files, dirty working directory |\n| `no_sudo` | Privilege escalation attempts |\n| `max_tokens` | Cost blowouts, verbose outputs |\n\n[Skills testing docs →](docs/SKILLS_TESTING.md)\n\n---\n\n## Documentation\n\n**Getting Started:**\n\n| | |\n|---|---|\n| [Getting Started](docs/GETTING_STARTED.md) | [CLI Reference](docs/CLI_REFERENCE.md) |\n| [FAQ](docs/FAQ.md) | [YAML Test Case Schema](docs/YAML_SCHEMA.md) |\n| [Framework Support](docs/FRAMEWORK_SUPPORT.md) | [Adapters Guide](docs/ADAPTERS.md) |\n\n**Core Features:**\n\n| | |\n|---|---|\n| [Golden Traces (Regression Detection)](docs/GOLDEN_TRACES.md) | [Evaluation Metrics](docs/EVALUATION_METRICS.md) |\n| [Statistical Mode (pass@k)](docs/STATISTICAL_MODE.md) | [Tool Categories](docs/TOOL_CATEGORIES.md) |\n| [Suite Types (Capability vs Regression)](docs/SUITE_TYPES.md) | [Behavior Coverage](docs/BEHAVIOR_COVERAGE.md) |\n| [Cost Tracking](docs/COST_TRACKING.md) | [Test Generation](docs/TEST_GENERATION.md) |\n\n**Integrations:**\n\n| | |\n|---|---|\n| [CI/CD Integration](docs/CI_CD.md) | [MCP Contract Testing](docs/MCP_CONTRACTS.md) |\n| [Skills Testing](docs/SKILLS_TESTING.md) | [Chat Mode](docs/CHAT_MODE.md) |\n| [Trace Specification](docs/TRACE_SPEC.md) | [Tutorials](docs/TUTORIALS.md) |\n\n**Troubleshooting:**\n\n| | |\n|---|---|\n| [Debugging Guide](docs/DEBUGGING.md) | [Troubleshooting](docs/TROUBLESHOOTING.md) |\n\n**Guides:** [Testing LangGraph in CI](guides/pytest-for-ai-agents-langgraph-ci.md) | [Detecting Hallucinations in CI](guides/detecting-llm-hallucinations-in-ci.md)\n\n---\n\n## Examples\n\n| Framework | Link |\n|-----------|------|\n| Claude Code (E2E) | [examples/agent-test/](examples/agent-test/) |\n| LangGraph | [examples/langgraph/](examples/langgraph/) |\n| CrewAI | [examples/crewai/](examples/crewai/) |\n| Anthropic Claude | [examples/anthropic/](examples/anthropic/) |\n| Dify | [examples/dify/](examples/dify/) |\n| Ollama (Local) | [examples/ollama/](examples/ollama/) |\n\n**Node.js?** See [@evalview/node](sdks/node/)\n\n---\n\n## Roadmap\n\n**Shipped:** Golden traces • **Snapshot/check workflow** • **Cloud baseline sync 
(login/logout/whoami + silent push/pull)** • **Streak tracking \u0026 celebrations** • **Multi-reference goldens** • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build \u0026 smoke tests • Health checks • Safety guards (`no_sudo`, `git_clean`) • Claude Code \u0026 Codex adapters • **Opus 4.6 cost tracking** • MCP servers • HTML reports • Interactive chat mode • EvalView Gym • **Provider-agnostic skill tests** • **15-template pattern library** • **Personalized init wizard** • **`forbidden_tools` safety contracts** • **HTML trace replay** (exact prompt/completion per step) • **Silent model update detection** (model fingerprinting + version change panel) • **Gradual drift detection** (OLS trend analysis over JSONL history) • **Semantic diff** (`--semantic-diff`, embedding-based output comparison) • **Multi-turn conversation testing** (sequential turns with injected history, per-turn `expected` assertions) • **A/B endpoint comparison** (`evalview compare --v1 \u003curl\u003e --v2 \u003curl\u003e`)\n\n**Coming:** Agent Teams trace analysis • Grounded hallucination detection • Error compounding metrics • Container isolation\n\n[Vote on features →](https://github.com/hidai25/eval-view/discussions)\n\n---\n\n## Frequently Asked Questions\n\n**Does EvalView require an API key?**\nNo. The core regression detection — tool call diffing, sequence scoring, golden baseline comparison — is fully deterministic and works without any API key. If `OPENAI_API_KEY` is set, `evalview check` auto-enables semantic diff (~$0.00004/test). Disable it with `--no-semantic-diff` or `semantic_diff_enabled: false` in your config. LLM-as-judge output quality scoring (`evalview run`) also requires the key. `evalview snapshot` is always free.\n\n**How is EvalView different from LangSmith?**\nLangSmith is an observability platform: it records what your agent did and lets you inspect traces. EvalView is a regression testing framework: it saves a golden baseline and tells you when your agent's behavior deviates from it. They answer different questions. Many teams use both — LangSmith to understand production behavior, EvalView to gate changes in CI.\n\n**My agent is non-deterministic. How do I handle that?**\nUse multi-reference goldens: run `evalview snapshot --variant v1` and `evalview snapshot --variant v2` to save multiple acceptable behaviors (up to 5). `evalview check` compares against all variants and passes if any match. This is designed specifically for LLM-based agents with natural variation.\n\n**Can I run EvalView in GitHub Actions / CI?**\nYes — use `evalview check --fail-on REGRESSION` to exit with code 1 on regressions (blocking CI), and `--json` for structured output. See [CI/CD Integration](#cicd-integration).\n\n**How do I update a baseline after an intentional change?**\nJust run `evalview snapshot` again. It overwrites the existing baseline with the current behavior. Your streak continues.\n\n**Does EvalView work with my framework?**\nIf your agent exposes an HTTP API, it works. Native adapters exist for LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, and MCP servers. See [Supported Agents \u0026 Frameworks](#supported-agents--frameworks).\n\n**Is EvalView free?**\nYes. EvalView is Apache 2.0 open source. Cloud baseline sync (`evalview login`) is also free. 
There is no paid tier.\n\n[Full FAQ →](docs/FAQ.md)\n\n---\n\n## Get Help \u0026 Contributing\n\n- **Questions?** [GitHub Discussions](https://github.com/hidai25/eval-view/discussions)\n- **Bugs?** [GitHub Issues](https://github.com/hidai25/eval-view/issues)\n- **Want setup help?** Email hidai@evalview.com — happy to help configure your first tests\n- **Contributing?** See [CONTRIBUTING.md](CONTRIBUTING.md)\n\n**License:** Apache 2.0\n\n---\n\n### Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=hidai25/eval-view\u0026type=Date)](https://star-history.com/#hidai25/eval-view\u0026Date)\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003eEvalView — The open-source testing framework for AI agents.\u003c/b\u003e\u003cbr\u003e\n  Regression testing, golden baselines, CI/CD integration. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.\u003cbr\u003e\u003cbr\u003e\n  \u003ca href=\"#quick-start\"\u003eGet started\u003c/a\u003e | \u003ca href=\"docs/GETTING_STARTED.md\"\u003eFull guide\u003c/a\u003e | \u003ca href=\"docs/FAQ.md\"\u003eFAQ\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n*EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.*\n","funding_links":[],"categories":["Tools and Code"],"sub_categories":["LLM Evaluation Tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhidai25%2Feval-view","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhidai25%2Feval-view","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhidai25%2Feval-view/lists"}