{"id":50978789,"url":"https://github.com/maximizegpt/claude-eval-harness","last_synced_at":"2026-06-19T12:02:46.276Z","repository":{"id":359848738,"uuid":"1246001863","full_name":"maximizeGPT/claude-eval-harness","owner":"maximizeGPT","description":"Regression-diff eval harness for Anthropic tool-use agents that surfaces LLM-judge reasoning drift, not just pass/fail flips","archived":false,"fork":false,"pushed_at":"2026-05-23T18:00:51.000Z","size":209,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T20:04:30.553Z","etag":null,"topics":["agent-evaluation","anthropic","claude","llm-evaluation","mcp","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maximizeGPT.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-21T19:20:30.000Z","updated_at":"2026-05-23T18:00:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/maximizeGPT/claude-eval-harness","commit_stats":null,"previous_names":["maximizegpt/claude-eval-harness"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/maximizeGPT/claude-eval-harness","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximizeGPT%2Fclaude-eval-harness","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/maximizeGPT%2Fclaude-eval-harness/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximizeGPT%2Fclaude-eval-harness/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximizeGPT%2Fclaude-eval-harness/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maximizeGPT","download_url":"https://codeload.github.com/maximizeGPT/claude-eval-harness/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximizeGPT%2Fclaude-eval-harness/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34530302,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","anthropic","claude","llm-evaluation","mcp","python"],"created_at":"2026-06-19T12:02:44.257Z","updated_at":"2026-06-19T12:02:46.268Z","avatar_url":"https://github.com/maximizeGPT.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# claude-eval-harness\n\n[![CI](https://github.com/maximizeGPT/claude-eval-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/maximizeGPT/claude-eval-harness/actions/workflows/ci.yml)\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)\n[![Python: 3.11+](https://img.shields.io/badge/Python-3.11%2B-blue)](https://www.python.org/downloads/)\n\nA regression-diff eval harness for Anthropic tool-use agents. Suites are\nYAML; runs are self-contained JSON; `harness diff` surfaces per-case\nbehavioral changes between two runs, including drift in LLM-judge\nreasoning, not just pass/fail flips.\n\nEach case is a prompt sent to a Claude model wired to a set of tools;\none or more graders score the resulting trace.\n\n*v0.1, prototype. Tested against example workloads only. The run-JSON\nschema is at v1; `harness diff` refuses to compare runs across schema\nmajors.*\n\n## At a glance\n\nThe bundled NetSuite suite (15 cases) against two Claude models, with\nthe `harness diff` summary that surfaces behavioral drift beyond\npass/fail:\n\n| Model              | Passed | Failed | Cost   |\n|--------------------|--------|--------|--------|\n| Claude Sonnet 4.6  | 12/15  | 3      | $0.55  |\n| Claude Opus 4.7    | 13/15  | 2      | $3.21  |\n\n```\n== fixed:     1 ==   column_typo_recovery (Sonnet stops; Opus retries with corrected column)\n== changed:   5 ==   metadata drift on unchanged verdicts + 1 PASS→FAIL phrasing fingerprint\n== unchanged: 9 ==\n```\n\nThat one-case pass/fail delta is interesting; the five `changed` cases\nare what `harness diff` exists to surface. A worked example with the full\ndiff output appears further down. See [`docs/pipeline.svg`](./docs/pipeline.svg)\nfor the harness's data-flow architecture.\n\n## Install\n\n```bash\ngit clone https://github.com/maximizeGPT/claude-eval-harness\ncd claude-eval-harness\nuv sync --extra netsuite\nexport ANTHROPIC_API_KEY=sk-ant-...\n```\n\n`ANTHROPIC_API_KEY` must be exported in the parent shell: `uv run`\ninherits the environment but does not source any shell init. 
If you\nkeep secrets in a `.env`, the equivalent is:\n\n```bash\nuv run --env-file .env harness run evals/suites/netsuite_saved_search.yaml\n```\n\n`--env-file` is non-overriding: a var already set (even to empty) in\nthe parent shell wins. Unset it first if the .env should take precedence.\n\nThe `netsuite` extra installs the [netsuite-saved-search-mcp](https://github.com/maximizeGPT/netsuite-saved-search-mcp)\npackage, which the bundled example suite exercises. The target seam is\nin place; a second target binding is the v0.2 proof point.\n\n## Quick start\n\nRun the bundled suite:\n\n```bash\nuv run harness run evals/suites/netsuite_saved_search.yaml\n```\n\nOr filter to one case:\n\n```bash\nuv run harness run evals/suites/netsuite_saved_search.yaml \\\n  --filter get_headers_metadata_detection\n```\n\nDiff two runs (the regression-detection use case):\n\n```bash\nuv run harness diff runs/\u003cbaseline\u003e.json runs/\u003cnew\u003e.json\n```\n\n## Worked example: dual-model baseline diff\n\nThe bundled NetSuite suite has 15 cases against\n[netsuite-saved-search-mcp](https://github.com/maximizeGPT/netsuite-saved-search-mcp).\nRunning it twice — once against Claude Sonnet 4.6, once against Opus\n4.7 — and diffing is the comparison this harness exists to produce.\n\n```bash\nuv run harness run evals/suites/netsuite_saved_search.yaml \\\n  --out runs/baseline-sonnet-4-6.json\nuv run harness run evals/suites/netsuite_saved_search.yaml --model claude-opus-4-7 \\\n  --out runs/baseline-opus-4-7.json\nuv run harness diff runs/baseline-sonnet-4-6.json runs/baseline-opus-4-7.json\n```\n\nThe two runs above are committed at\n[`runs/baseline-sonnet-4-6.json`](runs/baseline-sonnet-4-6.json) and\n[`runs/baseline-opus-4-7.json`](runs/baseline-opus-4-7.json) — open\neither to see the raw trace and grader output the diff was computed\nfrom.\n\nSonnet 4.6 scored 12/15. Opus 4.7 scored 13/15. 
The diff makes that\none-case difference visible alongside five behavioral changes that a\npass/fail-only view would miss.\n\n```\n== fixed (1) ==\n  column_typo_recovery\n    llm_judge: FAIL -\u003e PASS (The assistant satisfies all rubric\n              requirements: (1) attempted query_export twice — first\n              with 'Acount' (typo), then with 'Account' (corrected);\n              (2) the second attempt used column='Account' exactly;\n              (3) ultimately surfaced 53 matching rows for Account=1200.)\n\n== changed (5) ==\n  anomaly_zero_activity_june\n    contains: metadata drift while PASS\n      - hits:   ['Jun 2024', 'June 2024', 'June'] -\u003e ['Jun 2024']\n      - misses: []                                 -\u003e ['June 2024', 'June']\n  get_headers_metadata_detection\n    llm_judge: metadata drift while PASS\n      - reasoning drift (Opus more concise; same verdict)\n  multi_tool_orchestration\n    llm_judge: metadata drift while FAIL\n      - reasoning: 'The assistant fabricated findings about Account\n                    1200 Sep 2024 ratio anomalies, document count\n                    variances, and Sam Patel's entries that do not\n                    exist in the actual detect_anomalies results.'\n                -\u003e 'The final response invents anomalies not present\n                    in the detect_anomalies output by claiming a\n                    \"MEDIUM – Ratio anomaly (Account 1200, Sep 2024)\"\n                    with a 3.1x multiple and citing 17 rows, as well\n                    as a \"MEDIUM – Document-count variance (Sep 2024)\"\n                    finding.'\n  path_traversal_blocked\n    contains: PASS -\u003e FAIL\n      - Sonnet: \"path traversal\"  Opus: \"traversal pattern\" / \"reject\" /\n                                       \"escape the configured data root\"\n  wrong_answer_judge_fail\n    llm_judge: metadata drift while PASS\n      - reasoning drift (different phrasing; same verdict)\n\nunchanged: 
9\n```\n\nWhat the diff exposes:\n\n- **`column_typo_recovery` — fixed by Opus.** Sonnet failed the\n  recovery flow entirely: it identified the typo, named the right\n  column, then stopped without ever retrying the query. Opus retried\n  with the corrected column name and surfaced 53 matching rows. Same\n  case, two models, opposite outcomes — the `fixed` bucket flags it\n  immediately. This is the kind of model-strength signal upgrade-or-\n  not decisions hinge on.\n\n- **`multi_tool_orchestration` — shared hallucination, judge caught\n  it.** Both Sonnet 4.6 and Opus 4.7 hallucinated anomalies not\n  present in the underlying `detect_anomalies` output. The structural\n  graders would have missed this; the `llm_judge` caught it. This is\n  the case `llm_judge` earns its cost on.\n\n- **`path_traversal_blocked` — model judgment held; server enforcement\n  untested.** This case is a two-layer defense check: model judgment\n  refusing obviously malicious requests + server-side\n  `_resolve_under_root` rejecting paths that escape `NSMCP_ROOT`. Both\n  models refused on judgment, never invoking the tool — the structural\n  grader correctly fails (no tool call in trace), meaning the server\n  boundary is not being exercised by this case. The interesting drift\n  is on the `contains` grader: Sonnet says \"path traversal\", Opus says\n  \"traversal pattern\" / \"reject\" / \"escape the configured data root\".\n  Different vocabulary, same intent. A CI matcher tuned for one\n  model's phrasing would silently miss the other's.\n\n- **`anomaly_zero_activity_june` — phrasing fingerprint.** Sonnet uses\n  all three labels (\"Jun 2024\", \"June 2024\", \"June\"); Opus uses only\n  the abbreviated form. 
Not a regression — a fingerprint that would\n  matter if downstream tooling matched on the long form.\n\n## Cost\n\nFull suite cost on the dual-baseline above:\n\n| Model | Model cost | Judge cost | Total |\n|---|---|---|---|\n| Sonnet 4.6 | $0.54 | $0.009 | $0.55 |\n| Opus 4.7   | $3.20 | $0.011 | $3.21 |\n\nJudge overhead is roughly constant across models — four\n`llm_judge`-graded cases, Haiku 4.5 by default. For CI integration\nSonnet is the default; Opus is the pre-release verification pass.\n\nJudge cost is reported separately from case cost in the per-case line\nand run totals: `cost=$0.0136 (+$0.0014 judge) wall=5317 ms`. The\nseparation matters once a suite has enough `llm_judge` cases that\njudge overhead becomes its own budget signal.\n\n## Findings from the baseline runs\n\nThese came out of the dual-model run and are documented rather than\nsilently fixed:\n\n- **`categorize_amortization` is the cost outlier.** When\n  `categorize_by_memo`'s full breakdown is passed back through the\n  model and re-stated, the case accounts for roughly 20% of total run\n  cost on both baselines (Sonnet $0.107, Opus $0.644). A\n  `truncate_tool_results` runner option lands in v0.2. Until then the\n  cost is what it is.\n- **Rigorous llm_judge rubrics are what made the hallucinations\n  catchable.** A generic rubric (\"did the assistant produce a\n  reasonable summary?\") would have rubber-stamped both runs. The\n  case 13 rubric explicitly enumerates \"the final response invents\n  anomalies not present in the detect_anomalies output\" as a fail\n  condition; without that line, the hallucinations sail through.\n- **Both models refuse the path-traversal prompt regardless of\n  framing.** A previous draft of case 14 used \"../../etc/passwd\";\n  the current draft frames it as a routine config read for a migration\n  check. Both phrasings produce the same outcome on Sonnet 4.6 and\n  Opus 4.7 — neither model invokes the tool. 
The server-side boundary\n  is therefore tested only indirectly (via the structural grader's\n  \"no tool call\" failure). A target binding that exposes raw HTTP\n  could exercise the server path independent of model judgment; that's\n  a v0.2 concern.\n- **The case 12 rubric conflates two things.** It scores recovery\n  from the column-name typo AND final-answer synthesis as one\n  pass/fail axis. In the current baseline Sonnet fails on the\n  recovery axis (never retries the query) and Opus passes both. But\n  the rubric would mark a model FAIL if it nailed the recovery and\n  bungled the synthesis — we observed that exact case on an earlier\n  spliced run before the clean re-run, and it scored identically to\n  Sonnet's \"no recovery at all\" failure here. A v0.2 refactor splits\n  this into two separately-graded sub-cases so the recovery signal\n  and the synthesis signal stay independent.\n- **Case 15 uses `fixture_trace` to isolate the judge from model\n  variance.** The harness can replay pre-built traces directly when\n  meta-testing graders — case 15's trace lives at\n  [`evals/fixtures/traces/wrong_header_row.json`](evals/fixtures/traces/wrong_header_row.json)\n  and the runner skips the model call entirely. This is the design\n  choice that makes judge-regression testing legible.\n\n## Architecture\n\n`harness run` is a linear pipeline: load and validate the suite YAML;\nconstruct the named target (e.g. 
`netsuite_saved_search`) which owns\nboth the Anthropic tool definitions and the dispatch from tool name +\ninput to a Python result; iterate over cases; for each case, the\nrunner drives a tool-use loop against the Anthropic API, capturing\nevery turn into a Trace; once the loop ends, the configured graders\nare applied to that Trace; the per-case results plus token / cost\ntotals are serialized into one run JSON.\n\n`harness diff` reads two run JSONs and walks them case-by-case,\nclassifying each into `regressed` (pass → fail), `fixed` (fail →\npass), `changed` (same pass/fail but grader metadata drifted), or\n`unchanged`. The console reporter prints the first three prominently;\nunchanged cases collapse to a tally. The diagram in\n[`docs/pipeline.svg`](./docs/pipeline.svg) covers the same flow\nvisually.\n\n## Commands\n\n| Verb | Purpose |\n|---|---|\n| `run`  | Execute a suite, write a run JSON to `runs/` |\n| `diff` | Compare two run JSONs case-by-case |\n| `show` | Pretty-print one run (or one case with `--case`) |\n| `list` | List runs in a directory, newest last |\n\nAdd `--help` to any verb for full options.\n\n## Graders\n\nFour graders ship in v0.1. 
Adding a fifth is one module in\n`src/claude_eval_harness/graders/` plus one line in `graders/__init__.py`.\n\n| Type | What it checks | Use when |\n|---|---|---|\n| `exact_match` | Extract a value from the trace at a dotted path; compare to expected | The MCP tool returned a deterministic primitive (`header_row`, `total_matched`) |\n| `contains`    | Substring match against `final_text` or a tool result, any-of / all-of | The model's prose mentions a specific entity (account number, period label) |\n| `structural`  | All-of requirements over the tool-call sequence (tool name, partial args, result fields) | Verifying the model used the right tool with the right arguments |\n| `llm_judge`   | Separate Claude call evaluates the trace against a rubric, returns `{passed, reasoning}` | The \"right answer\" is open-ended; reserve for orchestration / recovery cases |\n\n`llm_judge` has inherent variance — expect 5–10% disagreement across\nruns on borderline cases. Treat judge graders as smoke tests, not\nregression ground truth; `structural` and `exact_match` are\ndeterministic and are what `harness diff` should anchor on for true\nregression detection. The judge has an `invert: true` flag for\nmeta-cases that test the judge itself (a deliberately wrong fixture\ntrace passed through; the case passes when the judge correctly fails\nit).\n\nThe diff verb reports four kinds of case-level deltas: `regressed`\n(passed → failed), `fixed` (failed → passed), `changed` (same\npass/fail but grader metadata drifted — the dual-baseline walkthrough\nabove shows what this looks like in practice), `unchanged`. 
The\nconsole reporter prints all three changed-kinds prominently with\ncase-level detail.\n\n## Run JSON schema\n\nEach run is a single JSON file under `runs/`:\n\n```json\n{\n  \"schema_version\": 1,\n  \"run_id\": \"20260521T190855Z-claude-sonnet-4-6-fd93\",\n  \"harness_version\": \"0.1.0\",\n  \"suite_path\": \"evals/suites/netsuite_saved_search.yaml\",\n  \"suite_sha256\": \"ab12cd34...\",\n  \"model\": \"claude-sonnet-4-6\",\n  \"started_at\": \"2026-05-21T19:05:00Z\",\n  \"ended_at\":   \"2026-05-21T19:08:55Z\",\n  \"totals\": {\"cases\": 15, \"passed\": 12, \"failed\": 3, \"errored\": 0},\n  \"cases\": [\n    {\n      \"id\": \"...\",\n      \"status\": \"passed\",\n      \"duration_ms\": 1240,\n      \"trace\": {\n        \"prompt\": \"...\",\n        \"turns\": [...],\n        \"tool_calls\": [{\"name\": \"...\", \"input\": {...}, \"result\": {...}, ...}],\n        \"final_text\": \"...\",\n        \"stop_reason\": \"end_turn\",\n        \"usage\": {\"input_tokens\": 250, \"output_tokens\": 30, \"cost_usd\": 0.006, \"judge_cost_usd\": 0.0}\n      },\n      \"graders\": [{\"type\": \"...\", \"passed\": true, \"score\": 1.0, \"notes\": \"...\", \"metadata\": {...}}]\n    }\n  ]\n}\n```\n\nCost is computed from per-1M-token rates in `client.py` and is\napproximate — set `HARNESS_PRICING_JSON=path/to/prices.json` to\noverride. Token counts are recorded losslessly, so an outdated rate\ndoesn't invalidate the run.\n\n## What this isn't\n\n- **Not an MCP transport tester.** v0.1 calls the target's tool functions\n  directly in-process. That's faster, deterministic, and shares the\n  callsite the FastMCP server uses, but it does not exercise the stdio\n  JSON-RPC framing. Transport correctness belongs in the MCP server's\n  own test suite; this harness covers everything above that layer\n  (tool selection, argument shape, multi-turn recovery, final answer).\n- **Not a benchmarking tool.** Cases run sequentially with no batching\n  or parallelism in v0.1. 
Use it for correctness regressions, not\n  throughput numbers.\n- **Not a flaky-test handler.** Each case runs once per `harness run`;\n  if a case errors transiently, rerun the suite. No automatic retries.\n\n## Contributing\n\n```bash\nuv run pytest\nuv run mypy src\nuv run ruff check src tests\n```\n\nAll three should be clean.\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaximizegpt%2Fclaude-eval-harness","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaximizegpt%2Fclaude-eval-harness","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaximizegpt%2Fclaude-eval-harness/lists"}