https://github.com/connorrmcd6/surface-bench
Surface agent-impact benchmark: does accurate documentation change LLM-agent task outcomes?
https://github.com/connorrmcd6/surface-bench
agent context-engineering llm-tools research
Last synced: about 9 hours ago
JSON representation
Surface agent-impact benchmark: does accurate documentation change LLM-agent task outcomes?
- Host: GitHub
- URL: https://github.com/connorrmcd6/surface-bench
- Owner: Connorrmcd6
- Created: 2026-06-15T06:12:39.000Z (15 days ago)
- Default Branch: main
- Last Pushed: 2026-06-15T19:54:46.000Z (15 days ago)
- Last Synced: 2026-06-15T21:25:32.539Z (14 days ago)
- Topics: agent, context-engineering, llm-tools, research
- Language: Python
- Homepage: https://surface.gradientdev.xyz
- Size: 680 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# surface-bench — operator manual
[](https://doi.org/10.5281/zenodo.20722100)
License: [MIT](LICENSE) (code) · [CC BY 4.0](LICENSE-DATA) (data)
Empirically measuring how much **documentation accuracy** changes an agent's task performance — the
gap [Surface](https://github.com/Connorrmcd6/surface) exists to protect. Surface doesn't make an agent smarter; it stops docs
silently rotting. So its value to an agent equals the performance delta between working from **fresh**
docs and **rotted** docs, plus whether *surfacing* the drift recovers the loss. This bench measures
those deltas directly, using drift of exactly the kind `surf check` catches (flipped operators,
dropped guards, changed constants, reordered keys).
This is the single place to start when you (or a future agent) pick this up cold. It covers **what the
experiment is, how to run it, how to read the results, how to author scenarios, and the sharp edges.**
This is a standalone repo; it has no dependency on the [Surface](https://github.com/Connorrmcd6/surface)
Rust core — it only *consumes* the `surf` binary's output.
> **Status: the study is complete.** The pre-registration was frozen and git-tagged
> (`prereg-v2-multi`) before the confirmatory run, which then executed the full multi-turn matrix —
> **5 models across 3 providers, 3,250 graded completions, 0 errors.** Headline: a confident stale doc
> drives the *misled* rate to **68–100% on every model** (capability does not buy resistance), via
> three distinct failure modes (blind obedience, suppressed-but-rescuable verification, and
> verify-then-defer). Full write-up and numbers in [`PAPER.md`](PAPER.md) §6.
> **Companion docs:** [`PAPER.md`](PAPER.md) (the paper — pilot in §5, confirmatory results in §6),
> [`PREREGISTRATION.md`](PREREGISTRATION.md) (frozen hypotheses + analysis plan),
> [`ABC_CHECKLIST.md`](ABC_CHECKLIST.md) (benchmark-rigor self-audit),
> [`DATASET.md`](DATASET.md) (dataset card — schema + how to reuse the released data),
> [`scenarios/CHECKLIST.md`](scenarios/CHECKLIST.md) (how to author one scenario). Committed result
> snapshots: the single-shot pilot
> [`results/2026-06-13-pilot-full-matrix/report.md`](results/2026-06-13-pilot-full-matrix/report.md)
> and the confirmatory multi-turn matrix
> [`results/confirmatory-20260616T172420Z/report.md`](results/confirmatory-20260616T172420Z/report.md).
---
## 1. The experiment in one screen
Same code + same task in every run; **only the documentation block changes**. There are five
conditions:
| | Context shown to the agent | Represents |
|---|---|---|
| **C0** | code only (no doc) | baseline |
| **C1** | code + **stale** doc (true at T0, code moved to T1) | the ungoverned world |
| **C2** | code + **fresh** doc (matches T1) | the Surface-governed world |
| **C3** | code + stale doc + real `surf check --format json` report | "just surface the drift" |
| **Cw** | code + stale doc + a generic "may be outdated" warning (no corrected code) | control: is it the *fix* or just *suspicion*? |
And two run **modes**:
- **single** (v1): one prompt → one completion. Cheap, reproducible, the original pilot.
- **multi** (v2, the centerpiece): a **multi-turn agent loop** with **read-only** tools
(`read_file`, `grep`, `list_dir`, `final_answer`). The agent can *choose* to read the hidden
dependency the doc describes. This is what makes the headline non-tautological — see §4.
### Hypotheses
| | Claim | Read on |
|---|---|---|
| **H1** | C2 > C1 — accuracy beats rot (the core value) | success rate |
| **H2** | C1 < C0 — rotted docs are *worse than nothing* | misled rate |
| **H3** | C3 ≈ C2 — surfacing drift recovers the loss | success rate |
| **H4** | verification_rate(C1) < verification_rate(C0) — a confident stale doc **suppresses verification** | verification rate (multi only) |
| **H5** | within C1, agents that read the hidden dep are correct; those that don't are misled | mediation (multi only) |
| **H6** | C3 > Cw — recovery is Surface's *corrected code*, not mere suspicion | success rate |
### Two scenario families
- **Cascade** (the headline): the agent edits/answers about a **visible** thing whose correctness
depends on a **hidden dependency** (listed in `meta.toml` `hidden_paths` — present in `code/` for
grading and for `surf` to seal a real divergence, but withheld from the prompt). The dependency
has drifted from what the stale doc says, so the doc is the agent's only (single-shot) or
*optional* (multi-turn) window onto the truth. `cascade-*` scenarios.
- **Comprehension**: the drifted code *and* its contradicting doc are both visible. The model can
just re-read the code, so success ceilings near 100% — useful for the **token-tax** story (a stale
doc costs extra generation to reconcile), not the success story.
---
## 2. Setup
The harness env is managed with [uv](https://docs.astral.sh/uv/) (the committed `uv.lock` pins it so
the spend figures are reproducible). Code-edit graders run under the *same* interpreter `uv run`
selects — no `python3`-on-PATH guessing.
```sh
uv sync # base deps (anthropic)
uv sync --extra dev # + pytest (to run the test suite)
uv sync --extra plots # + matplotlib (report figures)
uv sync --extra providers # + openai, google-genai (cross-provider runs)
```
**Install `surf`** — only `tools/author.py` (re-sealing scenarios) and the C3 report generation
need the `surf` binary; running the matrix, grading, and reporting do **not**. Put it on PATH (or
point `$SURF_BIN` at it):
```sh
cargo install --git https://github.com/Connorrmcd6/surface surf-cli # builds + installs `surf`
# or download a release binary from https://github.com/Connorrmcd6/surface/releases
# the bench finds it via: $SURF_BIN -> `surf` on PATH -> ./bin/surf
```
**Toolchains the graders need at run time** (only when those scenarios are actually graded):
- **Python** — always (via `sys.executable`).
- **Node ≥ 22.18** on PATH — for TypeScript code scenarios (`node --test`; that release strips TS
types at load, so no `npm install`/`tsc`). uv does not manage Node; install it separately.
**API keys** (only for the providers you select): `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`,
`GEMINI_API_KEY` (or `GOOGLE_API_KEY`). The convenience env file `~/.surface-bench.env` is sourced
per command in practice (`set -a; source ~/.surface-bench.env; set +a`).
---
## 3. Running it
The pipeline is four commands. **Everything except `run` against a real provider is free/offline.**
```sh
# (a) RUN the matrix -> results//{raw.jsonl, run.json}
uv run python -m surface_bench.run --models mock # offline pipeline check, no key
uv run python -m surface_bench.run --models haiku --mode single --trials 10
uv run python -m surface_bench.run --models haiku --mode multi --trials 10 --max-turns 8
# (b) REPORT -> results//{summary.json, report.md, *.png}
uv run python -m surface_bench.report results/
# (c) ORACLE — post-run sanity tripwires (exits non-zero if any fire)
uv run python -m surface_bench.oracle results/
```
**CLI flags** (`run`): `--models ` (subset of `config.toml` models; default all),
`--scenarios `, `--conditions C0 C1 C2 C3 Cw`, `--trials N`, `--mode {single,multi}`,
`--max-turns N` (multi), `--out `, `--config `. Config defaults live in `config.toml`
(`trials`, `temperature`, `max_tokens`, `mode`, `max_turns`, and a `[models.]` block per
model with `provider`, `model_id`, and `input_per_mtok` / `output_per_mtok` pricing).
**Providers:** `provider = "mock" | "anthropic" | "openai" | "gemini"`. Mock needs no key and is the
offline workhorse; in `--mode multi` it scripts a canned answer so the whole loop runs for free.
### The matrix size (and why you stage spend)
`#scenarios × #conditions × #models × N`. Multi mode multiplies each cell by the number of agent
turns *and* re-sends the growing transcript each turn, so it is **much** more expensive than single.
**Always stage:** `mock` → one scenario × each provider (smoke the tool round-trip) → cascade-only
multi at small N (pilot the verification metric) → the full matrix. Gate each stage on the oracle
(§5) passing. Cost levers: `--max-turns` cap, tiered `--trials` (e.g. N=10 cascade / N=5
comprehension), and dropping comprehension from multi mode (it doesn't test verification).
For the full cross-provider run, [`RUN_CONFIRMATORY_MATRIX.md`](RUN_CONFIRMATORY_MATRIX.md) is the
operator runbook used for the committed confirmatory matrix: run each model into its **own** `--out`
dir (so a mid-run failure only loses that one model — the runner has no resume), then concatenate the
per-model `raw.jsonl` files into one directory for `report`/`oracle`.
---
## 4. Reading the results
`run` writes **`raw.jsonl`** (one row per completion — the source of truth; grading can be re-run
offline without re-spending) and **`run.json`** (the run's parameters). `report` turns those into
`summary.json` (machine-readable) + `report.md` (the human read) + figures.
**Per-row fields** (multi-only fields absent in single rows): `scenario, task_type, tier, condition,
model, trial, mode, output, input_tokens, output_tokens, cost_usd, ok, misled, detail, parsed` and,
for multi: `turns, stop_reason, tool_calls, verified_hidden, per_turn_tokens`.
**The metrics, and what each tells you:**
- **success rate** (`ok`) — got the current (T1) answer. The H1/H3/H6 axis. Deterministic: code
scenarios run hidden tests; QA scenarios parse a `VERDICT:` line against a rubric. **No LLM judge.**
- **misled rate** (`misled`) — asserted the *stale* (T0) claim. The H2 axis: a rotted doc doesn't
just fail to help, it *causes* the wrong answer.
- **verification_rate** (multi) — did the agent read/grep a `hidden_path` before answering? The
**headline of the agentic track** (H4). `verified_then_correct` is its validity check (reading the
truth should rescue you).
- **output tokens** — generation cost. Input tokens differ by construction (doc-block size) so are
ignored; output tokens carry the token-tax signal in the comprehension family.
- **cost_usd** — estimated spend (tokens × `config.toml` prices). The report's total is for relative
comparison; your provider console invoice is authoritative.
**Statistics:** rates carry 95% **Wilson** intervals; condition deltas use a 95% **bootstrap** CI
*and* a bootstrap p-value, with **Holm–Bonferroni** applied across the whole success-delta family
(the report flags `(Holm ✓/✗)` per delta). CI-significance is the headline; Holm is the conservative
cross-check. The report also slices deltas **by tier** (the difficulty gradient) and **by scenario**
(catches one broken fixture hiding in a family average).
**`report.md` sections:** Spend · the gradient (C2−C1 by tier) · **Verification** (multi: the H4
deltas + H5 mediation) · per-model rate tables + deltas (with Holm flags) · output-token tables +
deltas · per-scenario success. **Figures** (all from the frozen `summary.json`, with 95% Wilson CIs):
`cascade_success.png`, `misled_rate.png`, `cost_accuracy.png` (the cost–accuracy frontier — fresh docs
are cheapest *and* most accurate), and (multi) `verification_rate.png` — the "does a stale doc stop the
agent checking?" hero chart.
---
## 5. The oracle (sanity tripwires)
`python -m surface_bench.oracle results/` is a cheap post-run gate that catches authoring/harness
bugs **before** they reach a write-up (the failure mode that bit us in issue #113). It exits non-zero
(so it can gate CI / a staged run) if, per scenario × model:
- **C2-fresh < 90%** — with a *fresh* doc the task must be solvable; a low cell means the scenario is
mis-authored (leaked stale value, broken grader, impossible task), not a real effect.
- **a cascade C1 never misleads** — if a stale doc never produces the wrong answer, the drift isn't
load-bearing and the scenario measures nothing.
The mock can't satisfy these (it doesn't actually solve scenarios), so run the oracle on **real**
outputs.
---
## 6. Layout
```
. (repo root)
config.toml models, trials, temperature, mode, max_turns, per-model pricing
PREREGISTRATION.md frozen hypotheses + analysis plan for the headline run
ABC_CHECKLIST.md benchmark-rigor self-audit (arXiv 2507.02825)
surface_bench/
run.py the matrix runner (single + multi); writes raw.jsonl + run.json
prompts.py assembles (system, user) per condition; hides hidden_paths
models.py provider adapters: complete() (single) + step() (multi, tool-use);
Anthropic / OpenAI / Gemini + Mock; neutral Step/ToolCall types
agent.py run_agent() — the multi-turn loop over the read-only tools
tools_runtime.py read-only tool surface (read_file/grep/list_dir/final_answer) + sandbox
grade_qa.py VERDICT-line rubric grader (QA)
grade_code.py FILE-block overlay + hidden-test grader (code)
metrics.py rates, Wilson/bootstrap CIs, Holm, verification, by_tier, by_scenario
report.py summary.json + report.md + figures
oracle.py post-run tripwires
tools/
author.py seal a scenario's hub hashes + emit genuine surf_report.json
validate_scenario.py grader-polarization self-test (offline, no spend)
scenarios/
CHECKLIST.md how to author one scenario
/ meta.toml · task.md · hub_stale.md · hub_fresh.md · surf_report.json
code/ (T1) · .author/code_t0/ (T0 overlay) · .author/solution_{correct,stale}.*
grader/ (rubric.toml | grader.toml + tests/)
tests/ offline pytest (tools, agent loop, adapters, metrics, scenario polarity)
results// raw.jsonl · run.json · summary.json · report.md · *.png (gitignored,
except committed snapshots: 2026-06-13-pilot-full-matrix and
confirmatory-20260616T172420Z, each with a PROVENANCE.md)
```
---
## 7. Authoring a scenario (summary — full steps in `scenarios/CHECKLIST.md`)
A cascade scenario clones `scenarios/cascade-quota-batcher-code/`. The loop:
1. Pick a **load-bearing** drift: name an input where T0 and T1 give *different* outputs (else the
doc carries no weight and the bench measures nothing).
2. Write `code/` (T1, the visible file stubbed) + `.author/code_t0/` (only the changed files, T0) +
`hub_stale.md`/`hub_fresh.md` (placeholder `hash: 000000000000`) + a **neutral** `task.md` +
the grader + `.author/solution_{correct,stale}.*` reference solutions.
3. `python tools/author.py scenarios/` — seals the hub hashes against the real binary and emits
`surf_report.json`; **fails loudly if the stale hub doesn't actually diverge.**
4. `python tools/validate_scenario.py scenarios/` — runs the live graders on the two reference
solutions and proves they **discriminate** (correct → ok & not misled; stale → not ok & misled).
5. `--models mock` run + `oracle` to confirm the pipeline + tripwires.
**Two rules with teeth:**
- **Graders probe the real hidden dependency** for ground truth — never hardcode the T1 value.
`check_misled` hardcodes the *stale* value.
- **Neutrality** (the #113 lesson): the stale value appears *only* in `hub_stale.md` — never in
`task.md` or the visible code, and never a "the doc may be wrong" hint. A leak re-introduces the
doc-trust bias the neutral system prompt is designed to remove.
---
## 8. Sharp edges (read before authoring or debugging)
- **`surf` alpha-renames identifiers when hashing.** A drift expressed only as a *named-constant
swap* (e.g. `ROUND_HALF_EVEN` → `ROUND_HALF_UP`) is **invisible** to `surf check` — the two hash
identically. A detectable drift must change a **literal, operator, or structure** (a number, a
string, `<`→`<=`, an added/removed statement, a reordered call). `author.py` fails with "expected
a 'changed' divergence" if you trip this — redesign the drift, don't fight the hasher.
- **`author.py` seals only the *first* anchor's hash.** Multi-anchor hubs (a genuine T3 "multi-claim"
scenario) need a small tooling change to seal every anchor; until then keep hubs single-anchor.
- **Read-only tools by design.** The multi-turn agent can read but not edit or run code. Giving it a
test runner would let it brute-force ground truth and wash out the doc-trust signal we measure. A
full edit/run "thrash" loop is deliberately deferred future work.
- **Don't stack PRs.** Scenario PRs are additive and independent — open each against `main` directly.
(A stacked PR once merged into its base branch instead of `main` and silently lost its content.)
- **Spend is in the *run*, not the docs/tooling.** `author.py`, `validate_scenario.py`, `report`,
`oracle`, the test suite, and any `mock` run cost nothing. Only `run` against a real provider does.
---
## 9. Reproducibility
`uv.lock` pins the Python env. `run.json` records every parameter (models + ids, trials,
temperature, max_tokens, mode, max_turns, conditions, scenarios, `surf --version`). `raw.jsonl`
preserves raw outputs so grading/metrics re-run offline. Two snapshots are committed and
re-gradeable: the single-shot pilot (`results/2026-06-13-pilot-full-matrix/`) and the confirmatory
multi-turn matrix (`results/confirmatory-20260616T172420Z/`). To reproduce one, `uv sync` and re-run
`report` on the snapshot dir (it regenerates `summary.json` + figures from the frozen `raw.jsonl`); see
each snapshot's `PROVENANCE.md` for how it was produced.
---
## 10. License, citation & disclosures
**License (dual).** The benchmark **software** — `surface_bench/`, `tools/`, `tests/`,
`pyproject.toml`, `uv.lock` — is released under the **MIT License** ([`LICENSE`](LICENSE)). The
**dataset and research artifacts** — `results/`, `scenarios/`, and the written docs (`PAPER.md`,
`PREREGISTRATION.md`, `ABC_CHECKLIST.md`, `DATASET.md`, `analysis/`) — are released under
**CC BY 4.0** ([`LICENSE-DATA`](LICENSE-DATA)): reuse freely, including commercially, with attribution.
**Using the data.** Start with [`DATASET.md`](DATASET.md) — it documents the released snapshots, the
`raw.jsonl` schema, and how to load and regenerate the metrics offline.
**Citation.** Cite via the archived DOI **[10.5281/zenodo.20722100](https://doi.org/10.5281/zenodo.20722100)**
(concept DOI — always resolves to the latest version). Citation metadata is in
[`CITATION.cff`](CITATION.cff) (GitHub shows a "Cite this repository" button).
**Funding & independence.** This study received **no external or institutional funding**; it was
conducted independently and **self-funded** by the author to empirically validate Surface. "Independent"
here means free of outside sponsors — **not** disinterested: the author also authors
[Surface](https://github.com/Connorrmcd6/surface), the tool this benchmark measures, which is a
declared competing interest. That conflict is mitigated by the study's design — pre-registration,
fully deterministic grading (no LLM judge), released raw per-call data, and the fact that a negative
result on any hypothesis is reportable (see `PAPER.md` §9).