{"id":51278944,"url":"https://github.com/connorrmcd6/surface-bench","last_synced_at":"2026-06-30T00:02:04.470Z","repository":{"id":365091001,"uuid":"1269824902","full_name":"Connorrmcd6/surface-bench","owner":"Connorrmcd6","description":"Surface agent-impact benchmark: does accurate documentation change LLM-agent task outcomes?","archived":false,"fork":false,"pushed_at":"2026-06-15T19:54:46.000Z","size":696,"stargazers_count":0,"open_issues_count":5,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-15T21:25:32.539Z","etag":null,"topics":["agent","context-engineering","llm-tools","research"],"latest_commit_sha":null,"homepage":"https://surface.gradientdev.xyz","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Connorrmcd6.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-15T06:12:39.000Z","updated_at":"2026-06-15T19:54:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Connorrmcd6/surface-bench","commit_stats":null,"previous_names":["connorrmcd6/surface-bench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Connorrmcd6/surface-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Connorrmcd6%2Fsurface-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Connorrmcd6%2Fsurface-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Connorrmcd6%2Fsurface-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Connorrmcd6%2Fsurface-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Connorrmcd6","download_url":"https://codeload.github.com/Connorrmcd6/surface-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Connorrmcd6%2Fsurface-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34947088,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","context-engineering","llm-tools","research"],"created_at":"2026-06-30T00:02:03.647Z","updated_at":"2026-06-30T00:02:04.454Z","avatar_url":"https://github.com/Connorrmcd6.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# surface-bench — operator manual\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20722100.svg)](https://doi.org/10.5281/zenodo.20722100)\n\u0026nbsp;License: [MIT](LICENSE) (code) · [CC BY 4.0](LICENSE-DATA) (data)\n\nEmpirically measuring how much **documentation accuracy** changes an agent's task performance — the\ngap [Surface](https://github.com/Connorrmcd6/surface) exists to protect. Surface doesn't make an agent smarter; it stops docs\nsilently rotting. So its value to an agent equals the performance delta between working from **fresh**\ndocs and **rotted** docs, plus whether *surfacing* the drift recovers the loss. This bench measures\nthose deltas directly, using drift of exactly the kind `surf check` catches (flipped operators,\ndropped guards, changed constants, reordered keys).\n\nThis is the single place to start when you (or a future agent) pick this up cold. It covers **what the\nexperiment is, how to run it, how to read the results, how to author scenarios, and the sharp edges.**\nThis is a standalone repo; it has no dependency on the [Surface](https://github.com/Connorrmcd6/surface)\nRust core — it only *consumes* the `surf` binary's output.\n\n\u003e **Status: the study is complete.** The pre-registration was frozen and git-tagged\n\u003e (`prereg-v2-multi`) before the confirmatory run, which then executed the full multi-turn matrix —\n\u003e **5 models across 3 providers, 3,250 graded completions, 0 errors.** Headline: a confident stale doc\n\u003e drives the *misled* rate to **68–100% on every model** (capability does not buy resistance), via\n\u003e three distinct failure modes (blind obedience, suppressed-but-rescuable verification, and\n\u003e verify-then-defer). Full write-up and numbers in [`PAPER.md`](PAPER.md) §6.\n\n\u003e **Companion docs:** [`PAPER.md`](PAPER.md) (the paper — pilot in §5, confirmatory results in §6),\n\u003e [`PREREGISTRATION.md`](PREREGISTRATION.md) (frozen hypotheses + analysis plan),\n\u003e [`ABC_CHECKLIST.md`](ABC_CHECKLIST.md) (benchmark-rigor self-audit),\n\u003e [`DATASET.md`](DATASET.md) (dataset card — schema + how to reuse the released data),\n\u003e [`scenarios/CHECKLIST.md`](scenarios/CHECKLIST.md) (how to author one scenario). Committed result\n\u003e snapshots: the single-shot pilot\n\u003e [`results/2026-06-13-pilot-full-matrix/report.md`](results/2026-06-13-pilot-full-matrix/report.md)\n\u003e and the confirmatory multi-turn matrix\n\u003e [`results/confirmatory-20260616T172420Z/report.md`](results/confirmatory-20260616T172420Z/report.md).\n\n---\n\n## 1. The experiment in one screen\n\nSame code + same task in every run; **only the documentation block changes**. There are five\nconditions:\n\n| | Context shown to the agent | Represents |\n|---|---|---|\n| **C0** | code only (no doc) | baseline |\n| **C1** | code + **stale** doc (true at T0, code moved to T1) | the ungoverned world |\n| **C2** | code + **fresh** doc (matches T1) | the Surface-governed world |\n| **C3** | code + stale doc + real `surf check --format json` report | \"just surface the drift\" |\n| **Cw** | code + stale doc + a generic \"may be outdated\" warning (no corrected code) | control: is it the *fix* or just *suspicion*? |\n\nAnd two run **modes**:\n\n- **single** (v1): one prompt → one completion. Cheap, reproducible, the original pilot.\n- **multi** (v2, the centerpiece): a **multi-turn agent loop** with **read-only** tools\n  (`read_file`, `grep`, `list_dir`, `final_answer`). The agent can *choose* to read the hidden\n  dependency the doc describes. This is what makes the headline non-tautological — see §4.\n\n### Hypotheses\n\n| | Claim | Read on |\n|---|---|---|\n| **H1** | C2 \u003e C1 — accuracy beats rot (the core value) | success rate |\n| **H2** | C1 \u003c C0 — rotted docs are *worse than nothing* | misled rate |\n| **H3** | C3 ≈ C2 — surfacing drift recovers the loss | success rate |\n| **H4** | verification_rate(C1) \u003c verification_rate(C0) — a confident stale doc **suppresses verification** | verification rate (multi only) |\n| **H5** | within C1, agents that read the hidden dep are correct; those that don't are misled | mediation (multi only) |\n| **H6** | C3 \u003e Cw — recovery is Surface's *corrected code*, not mere suspicion | success rate |\n\n### Two scenario families\n\n- **Cascade** (the headline): the agent edits/answers about a **visible** thing whose correctness\n  depends on a **hidden dependency** (listed in `meta.toml` `hidden_paths` — present in `code/` for\n  grading and for `surf` to seal a real divergence, but withheld from the prompt). The dependency\n  has drifted from what the stale doc says, so the doc is the agent's only (single-shot) or\n  *optional* (multi-turn) window onto the truth. `cascade-*` scenarios.\n- **Comprehension**: the drifted code *and* its contradicting doc are both visible. The model can\n  just re-read the code, so success ceilings near 100% — useful for the **token-tax** story (a stale\n  doc costs extra generation to reconcile), not the success story.\n\n---\n\n## 2. Setup\n\nThe harness env is managed with [uv](https://docs.astral.sh/uv/) (the committed `uv.lock` pins it so\nthe spend figures are reproducible). Code-edit graders run under the *same* interpreter `uv run`\nselects — no `python3`-on-PATH guessing.\n\n```sh\nuv sync                                      # base deps (anthropic)\nuv sync --extra dev                          # + pytest (to run the test suite)\nuv sync --extra plots                        # + matplotlib (report figures)\nuv sync --extra providers                    # + openai, google-genai (cross-provider runs)\n```\n\n**Install `surf`** — only `tools/author.py` (re-sealing scenarios) and the C3 report generation\nneed the `surf` binary; running the matrix, grading, and reporting do **not**. Put it on PATH (or\npoint `$SURF_BIN` at it):\n\n```sh\ncargo install --git https://github.com/Connorrmcd6/surface surf-cli   # builds + installs `surf`\n# or download a release binary from https://github.com/Connorrmcd6/surface/releases\n# the bench finds it via: $SURF_BIN -\u003e `surf` on PATH -\u003e ./bin/surf\n```\n\n**Toolchains the graders need at run time** (only when those scenarios are actually graded):\n\n- **Python** — always (via `sys.executable`).\n- **Node ≥ 22.18** on PATH — for TypeScript code scenarios (`node --test`; that release strips TS\n  types at load, so no `npm install`/`tsc`). uv does not manage Node; install it separately.\n\n**API keys** (only for the providers you select): `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`,\n`GEMINI_API_KEY` (or `GOOGLE_API_KEY`). The convenience env file `~/.surface-bench.env` is sourced\nper command in practice (`set -a; source ~/.surface-bench.env; set +a`).\n\n---\n\n## 3. Running it\n\nThe pipeline is four commands. **Everything except `run` against a real provider is free/offline.**\n\n```sh\n# (a) RUN the matrix -\u003e results/\u003cts\u003e/{raw.jsonl, run.json}\nuv run python -m surface_bench.run --models mock                      # offline pipeline check, no key\nuv run python -m surface_bench.run --models haiku --mode single --trials 10\nuv run python -m surface_bench.run --models haiku --mode multi --trials 10 --max-turns 8\n\n# (b) REPORT -\u003e results/\u003cts\u003e/{summary.json, report.md, *.png}\nuv run python -m surface_bench.report results/\u003cts\u003e\n\n# (c) ORACLE — post-run sanity tripwires (exits non-zero if any fire)\nuv run python -m surface_bench.oracle results/\u003cts\u003e\n```\n\n**CLI flags** (`run`): `--models \u003cnames…\u003e` (subset of `config.toml` models; default all),\n`--scenarios \u003cids…\u003e`, `--conditions C0 C1 C2 C3 Cw`, `--trials N`, `--mode {single,multi}`,\n`--max-turns N` (multi), `--out \u003cdir\u003e`, `--config \u003cpath\u003e`. Config defaults live in `config.toml`\n(`trials`, `temperature`, `max_tokens`, `mode`, `max_turns`, and a `[models.\u003cname\u003e]` block per\nmodel with `provider`, `model_id`, and `input_per_mtok` / `output_per_mtok` pricing).\n\n**Providers:** `provider = \"mock\" | \"anthropic\" | \"openai\" | \"gemini\"`. Mock needs no key and is the\noffline workhorse; in `--mode multi` it scripts a canned answer so the whole loop runs for free.\n\n### The matrix size (and why you stage spend)\n\n`#scenarios × #conditions × #models × N`. Multi mode multiplies each cell by the number of agent\nturns *and* re-sends the growing transcript each turn, so it is **much** more expensive than single.\n**Always stage:** `mock` → one scenario × each provider (smoke the tool round-trip) → cascade-only\nmulti at small N (pilot the verification metric) → the full matrix. Gate each stage on the oracle\n(§5) passing. Cost levers: `--max-turns` cap, tiered `--trials` (e.g. N=10 cascade / N=5\ncomprehension), and dropping comprehension from multi mode (it doesn't test verification).\n\nFor the full cross-provider run, [`RUN_CONFIRMATORY_MATRIX.md`](RUN_CONFIRMATORY_MATRIX.md) is the\noperator runbook used for the committed confirmatory matrix: run each model into its **own** `--out`\ndir (so a mid-run failure only loses that one model — the runner has no resume), then concatenate the\nper-model `raw.jsonl` files into one directory for `report`/`oracle`.\n\n---\n\n## 4. Reading the results\n\n`run` writes **`raw.jsonl`** (one row per completion — the source of truth; grading can be re-run\noffline without re-spending) and **`run.json`** (the run's parameters). `report` turns those into\n`summary.json` (machine-readable) + `report.md` (the human read) + figures.\n\n**Per-row fields** (multi-only fields absent in single rows): `scenario, task_type, tier, condition,\nmodel, trial, mode, output, input_tokens, output_tokens, cost_usd, ok, misled, detail, parsed` and,\nfor multi: `turns, stop_reason, tool_calls, verified_hidden, per_turn_tokens`.\n\n**The metrics, and what each tells you:**\n\n- **success rate** (`ok`) — got the current (T1) answer. The H1/H3/H6 axis. Deterministic: code\n  scenarios run hidden tests; QA scenarios parse a `VERDICT:` line against a rubric. **No LLM judge.**\n- **misled rate** (`misled`) — asserted the *stale* (T0) claim. The H2 axis: a rotted doc doesn't\n  just fail to help, it *causes* the wrong answer.\n- **verification_rate** (multi) — did the agent read/grep a `hidden_path` before answering? The\n  **headline of the agentic track** (H4). `verified_then_correct` is its validity check (reading the\n  truth should rescue you).\n- **output tokens** — generation cost. Input tokens differ by construction (doc-block size) so are\n  ignored; output tokens carry the token-tax signal in the comprehension family.\n- **cost_usd** — estimated spend (tokens × `config.toml` prices). The report's total is for relative\n  comparison; your provider console invoice is authoritative.\n\n**Statistics:** rates carry 95% **Wilson** intervals; condition deltas use a 95% **bootstrap** CI\n*and* a bootstrap p-value, with **Holm–Bonferroni** applied across the whole success-delta family\n(the report flags `(Holm ✓/✗)` per delta). CI-significance is the headline; Holm is the conservative\ncross-check. The report also slices deltas **by tier** (the difficulty gradient) and **by scenario**\n(catches one broken fixture hiding in a family average).\n\n**`report.md` sections:** Spend · the gradient (C2−C1 by tier) · **Verification** (multi: the H4\ndeltas + H5 mediation) · per-model rate tables + deltas (with Holm flags) · output-token tables +\ndeltas · per-scenario success. **Figures** (all from the frozen `summary.json`, with 95% Wilson CIs):\n`cascade_success.png`, `misled_rate.png`, `cost_accuracy.png` (the cost–accuracy frontier — fresh docs\nare cheapest *and* most accurate), and (multi) `verification_rate.png` — the \"does a stale doc stop the\nagent checking?\" hero chart.\n\n---\n\n## 5. The oracle (sanity tripwires)\n\n`python -m surface_bench.oracle results/\u003cts\u003e` is a cheap post-run gate that catches authoring/harness\nbugs **before** they reach a write-up (the failure mode that bit us in issue #113). It exits non-zero\n(so it can gate CI / a staged run) if, per scenario × model:\n\n- **C2-fresh \u003c 90%** — with a *fresh* doc the task must be solvable; a low cell means the scenario is\n  mis-authored (leaked stale value, broken grader, impossible task), not a real effect.\n- **a cascade C1 never misleads** — if a stale doc never produces the wrong answer, the drift isn't\n  load-bearing and the scenario measures nothing.\n\nThe mock can't satisfy these (it doesn't actually solve scenarios), so run the oracle on **real**\noutputs.\n\n---\n\n## 6. Layout\n\n```\n.                            (repo root)\n  config.toml                models, trials, temperature, mode, max_turns, per-model pricing\n  PREREGISTRATION.md         frozen hypotheses + analysis plan for the headline run\n  ABC_CHECKLIST.md           benchmark-rigor self-audit (arXiv 2507.02825)\n  surface_bench/\n    run.py                   the matrix runner (single + multi); writes raw.jsonl + run.json\n    prompts.py               assembles (system, user) per condition; hides hidden_paths\n    models.py                provider adapters: complete() (single) + step() (multi, tool-use);\n                             Anthropic / OpenAI / Gemini + Mock; neutral Step/ToolCall types\n    agent.py                 run_agent() — the multi-turn loop over the read-only tools\n    tools_runtime.py         read-only tool surface (read_file/grep/list_dir/final_answer) + sandbox\n    grade_qa.py              VERDICT-line rubric grader (QA)\n    grade_code.py            FILE-block overlay + hidden-test grader (code)\n    metrics.py               rates, Wilson/bootstrap CIs, Holm, verification, by_tier, by_scenario\n    report.py                summary.json + report.md + figures\n    oracle.py                post-run tripwires\n  tools/\n    author.py                seal a scenario's hub hashes + emit genuine surf_report.json\n    validate_scenario.py     grader-polarization self-test (offline, no spend)\n  scenarios/\n    CHECKLIST.md             how to author one scenario\n    \u003cid\u003e/                    meta.toml · task.md · hub_stale.md · hub_fresh.md · surf_report.json\n                             code/ (T1) · .author/code_t0/ (T0 overlay) · .author/solution_{correct,stale}.*\n                             grader/ (rubric.toml | grader.toml + tests/)\n  tests/                     offline pytest (tools, agent loop, adapters, metrics, scenario polarity)\n  results/\u003ctimestamp\u003e/       raw.jsonl · run.json · summary.json · report.md · *.png  (gitignored,\n                             except committed snapshots: 2026-06-13-pilot-full-matrix and\n                             confirmatory-20260616T172420Z, each with a PROVENANCE.md)\n```\n\n---\n\n## 7. Authoring a scenario (summary — full steps in `scenarios/CHECKLIST.md`)\n\nA cascade scenario clones `scenarios/cascade-quota-batcher-code/`. The loop:\n\n1. Pick a **load-bearing** drift: name an input where T0 and T1 give *different* outputs (else the\n   doc carries no weight and the bench measures nothing).\n2. Write `code/` (T1, the visible file stubbed) + `.author/code_t0/` (only the changed files, T0) +\n   `hub_stale.md`/`hub_fresh.md` (placeholder `hash: 000000000000`) + a **neutral** `task.md` +\n   the grader + `.author/solution_{correct,stale}.*` reference solutions.\n3. `python tools/author.py scenarios/\u003cid\u003e` — seals the hub hashes against the real binary and emits\n   `surf_report.json`; **fails loudly if the stale hub doesn't actually diverge.**\n4. `python tools/validate_scenario.py scenarios/\u003cid\u003e` — runs the live graders on the two reference\n   solutions and proves they **discriminate** (correct → ok \u0026 not misled; stale → not ok \u0026 misled).\n5. `--models mock` run + `oracle` to confirm the pipeline + tripwires.\n\n**Two rules with teeth:**\n\n- **Graders probe the real hidden dependency** for ground truth — never hardcode the T1 value.\n  `check_misled` hardcodes the *stale* value.\n- **Neutrality** (the #113 lesson): the stale value appears *only* in `hub_stale.md` — never in\n  `task.md` or the visible code, and never a \"the doc may be wrong\" hint. A leak re-introduces the\n  doc-trust bias the neutral system prompt is designed to remove.\n\n---\n\n## 8. Sharp edges (read before authoring or debugging)\n\n- **`surf` alpha-renames identifiers when hashing.** A drift expressed only as a *named-constant\n  swap* (e.g. `ROUND_HALF_EVEN` → `ROUND_HALF_UP`) is **invisible** to `surf check` — the two hash\n  identically. A detectable drift must change a **literal, operator, or structure** (a number, a\n  string, `\u003c`→`\u003c=`, an added/removed statement, a reordered call). `author.py` fails with \"expected\n  a 'changed' divergence\" if you trip this — redesign the drift, don't fight the hasher.\n- **`author.py` seals only the *first* anchor's hash.** Multi-anchor hubs (a genuine T3 \"multi-claim\"\n  scenario) need a small tooling change to seal every anchor; until then keep hubs single-anchor.\n- **Read-only tools by design.** The multi-turn agent can read but not edit or run code. Giving it a\n  test runner would let it brute-force ground truth and wash out the doc-trust signal we measure. A\n  full edit/run \"thrash\" loop is deliberately deferred future work.\n- **Don't stack PRs.** Scenario PRs are additive and independent — open each against `main` directly.\n  (A stacked PR once merged into its base branch instead of `main` and silently lost its content.)\n- **Spend is in the *run*, not the docs/tooling.** `author.py`, `validate_scenario.py`, `report`,\n  `oracle`, the test suite, and any `mock` run cost nothing. Only `run` against a real provider does.\n\n---\n\n## 9. Reproducibility\n\n`uv.lock` pins the Python env. `run.json` records every parameter (models + ids, trials,\ntemperature, max_tokens, mode, max_turns, conditions, scenarios, `surf --version`). `raw.jsonl`\npreserves raw outputs so grading/metrics re-run offline. Two snapshots are committed and\nre-gradeable: the single-shot pilot (`results/2026-06-13-pilot-full-matrix/`) and the confirmatory\nmulti-turn matrix (`results/confirmatory-20260616T172420Z/`). To reproduce one, `uv sync` and re-run\n`report` on the snapshot dir (it regenerates `summary.json` + figures from the frozen `raw.jsonl`); see\neach snapshot's `PROVENANCE.md` for how it was produced.\n\n---\n\n## 10. License, citation \u0026 disclosures\n\n**License (dual).** The benchmark **software** — `surface_bench/`, `tools/`, `tests/`,\n`pyproject.toml`, `uv.lock` — is released under the **MIT License** ([`LICENSE`](LICENSE)). The\n**dataset and research artifacts** — `results/`, `scenarios/`, and the written docs (`PAPER.md`,\n`PREREGISTRATION.md`, `ABC_CHECKLIST.md`, `DATASET.md`, `analysis/`) — are released under\n**CC BY 4.0** ([`LICENSE-DATA`](LICENSE-DATA)): reuse freely, including commercially, with attribution.\n\n**Using the data.** Start with [`DATASET.md`](DATASET.md) — it documents the released snapshots, the\n`raw.jsonl` schema, and how to load and regenerate the metrics offline.\n\n**Citation.** Cite via the archived DOI **[10.5281/zenodo.20722100](https://doi.org/10.5281/zenodo.20722100)**\n(concept DOI — always resolves to the latest version). Citation metadata is in\n[`CITATION.cff`](CITATION.cff) (GitHub shows a \"Cite this repository\" button).\n\n**Funding \u0026 independence.** This study received **no external or institutional funding**; it was\nconducted independently and **self-funded** by the author to empirically validate Surface. \"Independent\"\nhere means free of outside sponsors — **not** disinterested: the author also authors\n[Surface](https://github.com/Connorrmcd6/surface), the tool this benchmark measures, which is a\ndeclared competing interest. That conflict is mitigated by the study's design — pre-registration,\nfully deterministic grading (no LLM judge), released raw per-call data, and the fact that a negative\nresult on any hypothesis is reportable (see `PAPER.md` §9).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconnorrmcd6%2Fsurface-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconnorrmcd6%2Fsurface-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconnorrmcd6%2Fsurface-bench/lists"}