{"id":50873621,"url":"https://github.com/andygeiss/harness-benchmarking","last_synced_at":"2026-06-15T07:30:42.890Z","repository":{"id":364783889,"uuid":"1259232460","full_name":"andygeiss/harness-benchmarking","owner":"andygeiss","description":"Benchmarking small local LLMs as autonomous coding agents — where they stagnate, how their code quality compares to frontier baselines, and which harness levers move completion. The instrument: a Go 'Ralph loop' harness that runs a fixed prompt until the task verifiably passes its tests.","archived":false,"fork":false,"pushed_at":"2026-06-14T13:28:42.000Z","size":427,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-14T15:09:16.223Z","etag":null,"topics":["agentic-ai","ai-agent","apple-silicon","autonomous-agents","benchmarking","coding-agent","go","golang","llm","llm-as-judge","llm-evaluation","local-llm","mlx","qwen","ralph-loop"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andygeiss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-04T10:03:27.000Z","updated_at":"2026-06-14T13:28:46.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/andygeiss/harness-benchmarking","commit_stats":null,"previous_names":["andygeiss/harness-benchmarking"],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg
:github/andygeiss/harness-benchmarking","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andygeiss%2Fharness-benchmarking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andygeiss%2Fharness-benchmarking/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andygeiss%2Fharness-benchmarking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andygeiss%2Fharness-benchmarking/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andygeiss","download_url":"https://codeload.github.com/andygeiss/harness-benchmarking/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andygeiss%2Fharness-benchmarking/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34353189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","ai-agent","apple-silicon","autonomous-agents","benchmarking","coding-agent","go","golang","llm","llm-as-judge","llm-evaluation","local-llm","mlx","qwen","ralph-loop"],"created_at":"2026-06-15T07:30:38.632Z","updated_at":"2026-06-15T07:30:42.883Z","avatar_url":"https://github.com/andygeiss.png","language":"Go","funding_links":[],"categories":[],"s
ub_categories":[],"readme":"# Harness Benchmarking\n\nA research project that measures how small local LLMs behave when run as\nautonomous coding agents: where they stagnate and why, how their code quality\ncompares to frontier baselines, and which harness levers measurably move\ncompletion. The instrument is a purpose-built agent harness — a Go\nimplementation of the **\"Ralph loop\"** — that runs a single fixed prompt\nagainst a local LLM and lets the model act through tools until the task\n*verifiably* completes, with no human in the loop.\n\nThe default subject is `Qwen3.6-35B-A3B-oQ6-mtp` served by a local\n**oMLX** server (an OpenAI-compatible API) on an Apple Silicon Mac, but any\nOpenAI-compatible endpoint and model work via flags.\n\nThe measurements are the product. They live in\n[docs/stagnation.md](docs/stagnation.md) (the stagnation study: why\nfixed-window passes hit a re-orientation floor, and what clears it), in\n`logs/runs.jsonl` (one record per run), and in `logs/judgments.jsonl`\n(out-of-band code-quality scores, head-to-head against a real Sonnet-medium\nsolution — see [Code quality is a separate\naxis](#code-quality-is-a-separate-axis)). The rest of this README documents\nthe instrument and summarises the findings.\n\n\u003e **Working in this repo?** Read [CLAUDE.md](CLAUDE.md) first. It holds the\n\u003e engineering philosophy (disciplined minimalism, standard library only, Go as\n\u003e the only language) and the cross-file invariants you must not break. This\n\u003e README is the conceptual overview; CLAUDE.md is the contract.\n\n## What the harness does\n\nA run gives the model a task — a prompt plus a workspace seed — and a small set\nof sandboxed tools: read / write / edit files, list directories, run the Go\ntoolchain, and a `done` tool. The model works until it calls `done`; the harness\nthen runs a **verification command** (default `go test ./...`) and ends the run\nonly if it passes. 
Completion is never the model's word for it — it's a gate the\nmodel has to actually pass.\n\nTwo properties make that gate hard to game:\n\n- **The tests are the spec, and the model can't author them.** Writes to\n  `*_test.go` are refused, so the model can't pass by gutting the very test it is\n  graded against.\n- **\"Passed\" means the tests actually ran.** For a `go test` gate the harness\n  parses the `-json` event stream and accepts only when real tests passed — a\n  binary that prints `ok` and exits 0 without running anything (e.g. an\n  `os.Exit` in non-test code) is rejected.\n\nBoth raise the cost of cheating rather than making it impossible: the verdict is\nparsed from the test binary's own `-json` markers, so a non-test `.go` file\ncompiled beside the spec could forge a passing one — print a `--- PASS` line,\nthen `os.Exit` before the real test runs. It's a narrow, documented hole,\nacceptable for the local, non-adversarial model this targets; the\n[CLAUDE.md](CLAUDE.md) invariants spell it out.\n\n## Architecture\n\nTwo nested loops; the split is the whole idea.\n\n### Inner loop — one session\n\n`agent.Session.Run` (`internal/agent/loop.go`): a single tool-use session. Call\nthe model → run its tool calls → feed the results back → repeat, until the model\nstops, the task completes, or a budget trips (max steps, or context tokens).\n\n### Outer loop — the Ralph loop\n\nThe `for` in `cmd/harness/main.go`. It re-runs the session with a **fresh\ncontext** every pass. State survives between passes only on the **filesystem** —\nthe code being written, plus a `PROGRESS.md` the agent maintains as its plan\nmemory — never in process memory. 
That is how a run can exceed a single context\nwindow.\n\nBetween passes the loop:\n\n- **Fingerprints the workspace** and stops early if `-max-stale` consecutive\n  passes change nothing — a stuck model — instead of burning the whole budget.\n  `PROGRESS.md` is excluded from the fingerprint, since the agent rewrites it\n  every pass.\n- **Runs an end-of-pass verification probe:** if a pass changed the workspace\n  but the model stopped without a successful `done`, the loop runs the verifier\n  itself and finishes the run if it passes — so doing the work but forgetting to\n  signal it doesn't cost an extra pass. The probe shares the `done` gate's\n  verifier, so its strictness is identical.\n- **Distinguishes outcomes.** Completed, stagnated (stuck), budget (ran out of\n  passes), and fault (e.g. *every* pass errored because the endpoint was\n  unreachable) get distinct exit codes, so a transport outage is not misread as\n  a stuck model.\n\nOn exit it appends one JSON line to `logs/runs.jsonl` — config, outcome, and\naggregate token/timing metrics — written outside the sandbox and never seen by\nthe agent.\n\n### The `-memory` ablation\n\n`-memory=false` drops the `PROGRESS.md` guidance from the system prompt and\nwipes the file before each pass, so a run measures how well the model resumes\nfrom the persisted *code* alone. (See [the honest finding](#an-honest-finding-cross-pass-resume)\nbelow on when this actually matters.)\n\n### Packages\n\n- `internal/llm` — HTTP client + DTOs for the OpenAI-compatible API. 
Streaming\n  and non-streaming responses assemble to the same shape, so the loop is\n  identical either way.\n- `internal/tool` — the tool registry and the built-in tools: filesystem, the\n  Go-toolchain runner, and the `done` gate with its verifiers.\n- `internal/agent` — the inner session loop.\n- `cmd/harness` — wires it together; owns the Ralph loop and the system prompt.\n- `cmd/example` — a convenience runner that copies a bundled example's seed into\n  `./sandbox` and launches the harness against it.\n- `examples/` — the task catalogue (see [examples/README.md](examples/README.md)).\n\nFor the precise cross-file invariants — reasoning is never stored in history,\ncompletion always runs the verifier, execution is Go-toolchain-only, filesystem\ntools are sandboxed, and the rest — see the **Invariants** section of\n[CLAUDE.md](CLAUDE.md).\n\n## Requirements\n\n- Go 1.26+\n- A running oMLX (or other OpenAI-compatible) server — the default expects one at\n  `http://localhost:1234/v1`\n- `golangci-lint` v2.x for the lint gate (a local dev tool only; it adds nothing\n  to `go.mod`)\n\n## Quickstart\n\nThe development green gate — every change must keep all of these passing:\n\n```bash\ngo build ./...        # compile\ngo vet ./...          # static checks\ngo test ./...         # all tests\ngofmt -w .            
# format\ngolangci-lint run     # lint (config in .golangci.yml)\n```\n\nRun the harness against a task:\n\n```bash\ngo run ./cmd/harness -prompt task.md -workdir ./sandbox\n```\n\nUseful flags (full list via `go run ./cmd/harness -h`):\n\n- `-stream` — stream model tokens live to stderr\n- `-debug` — log the model's reasoning trace\n- `-verify` — verification command for the done-gate (default `go test ./...`)\n- `-memory=false` — ablate the cross-pass `PROGRESS.md` memory\n- `-elide-passing=false` — ablate the read-boundary spec elision (default on; see\n  the stagnation finding below)\n- `-protect-tests` — refuse agent writes to `*_test.go` (default on)\n- `-ctx-limit` / `-max-iters` / `-max-steps` — the per-pass and per-run budgets\n\nDefaults target the local setup (model name, `:1234` endpoint, Qwen3\nthinking-mode sampling: temp 0.6 / top_p 0.95 / top_k 20); all are\nflag-overridable.\n\n## Working across several task folders\n\nThe harness drives one task in one workspace, so several independent tasks are\nsimply several runs: give each its own folder holding its own `PROMPT.md`, and\nrun the harness once per folder. 
Keep it sequential — one local model serves one\nrequest at a time, so parallel runs only contend for the same weights.\n\n```bash\ngo run ./cmd/harness -workdir tasks/one   -prompt tasks/one/PROMPT.md   -verify \"go test ./...\"\ngo run ./cmd/harness -workdir tasks/two   -prompt tasks/two/PROMPT.md   -verify \"go build ./...\"\ngo run ./cmd/harness -workdir tasks/three -prompt tasks/three/PROMPT.md   # -verify defaults to go test ./...\n```\n\nThis is not new machinery — it falls out of flags the harness already has:\n\n- **`-prompt` (the task) and `-workdir` (the workspace) are independent**, so a\n  folder that holds both its prompt and its code is a complete, isolated unit of\n  work; nothing leaks between folders.\n- **`-verify` is per-invocation**, so each task gates on its own command — the\n  default `go test ./...` for a task that ships a spec, a plainer `go build ./...`\n  or `go vet ./...` for one that does not.\n- **All runs share one `logs/runs.jsonl`** (the default `-log-dir`), and each\n  record's `task` field is the prompt path, so the lines stay distinguishable\n  without a per-folder log split. 
Launch them from the same directory so the log\n  collects in one place.\n\nTwo things to keep right: give each folder its **own `go.mod`** so its\n`go test ./...` is scoped to that task alone, and remember that a non-`go test`\n`-verify` falls back to an **exit-status-only** check — it confirms the command\nexited 0, not that the work is correct (see the gate caveats in\n[CLAUDE.md](CLAUDE.md)).\n\n## Examples\n\nRun a bundled example — this wipes `./sandbox`, copies the seed in, and runs the\nharness against it (extra flags are forwarded):\n\n```bash\ngo run ./cmd/example reverse     # rune-aware string reversal — smallest end-to-end check\ngo run ./cmd/example calc        # arithmetic evaluator: lexer → parser → eval\ngo run ./cmd/example todo        # htmx web app: net/http + html/template + embed\ngo run ./cmd/example graphkit    # six-package graph algorithms library\ngo run ./cmd/example apikit      # modular JSON HTTP API (5 packages) — built for multi-pass runs\ngo run ./cmd/example datakit     # five independent container packages — flat control for the stagnation study\ngo run ./cmd/example pipeline    # context-aware concurrent fan-out/fan-in — first concurrency example\ngo run ./cmd/example stuck       # adversarial: unsatisfiable test, exercises the stagnation guard\n```\n\nEach example is a `PROMPT.md` plus a `workspace/` seed that ships the spec (a\ntest) but no implementation. Every seed is its own Go module, so an\nunimplemented seed never reddens the repo's own `go test ./...`. Full catalogue\nand details: [examples/README.md](examples/README.md).\n\n## An honest finding: cross-pass resume\n\nThe Ralph loop exists to carry a task across context resets — but at the **default\nper-pass budget, every example completes in a single pass.** Even `graphkit` (six\npackages, ~730 lines) one-shots: ~730 lines is not enough tokens to overflow a\n52k-token pass when the model's window is 64k+ and prompt-caching keeps\naccumulation cheap. 
So at default budget the `PROGRESS.md` hand-off and the\n`-memory` A/B are effectively un-exercised by real work.\n\nCapping the per-pass budget *below* a task's single-pass peak does drive the loop\ninto multiple passes. Two measured `graphkit` runs on the oQ4 default bracket it,\nand the result is more nuanced than a clean \"resume works\":\n\n- **`-ctx-limit 11000` → stagnates at 4/6 packages** (12 passes, all `context`).\n  Here resume is *real but incomplete*: `graph`, `traverse`, `paths` (including a\n  cross-pass fix to its Dijkstra heap) and `toposort` accrete across resets — then\n  it stalls, because the model writes `PROGRESS.md` *once* and never updates it, so\n  every fresh pass re-derives state by re-reading and re-testing the existing\n  packages, exhausting the budget before new code is written. With the plan-memory\n  neglected, `-memory=true` quietly degrades to `-memory=false`.\n- **`-ctx-limit 16000` → completes in 2 passes** (`context` → `completed`) — the\n  first run to finish across a context reset, but a *thin* multi-pass win: pass 1\n  writes all six packages and is guillotined one step before it can verify; pass 2\n  only resumes, re-verifies, and calls `done` — it writes no new code. A clean\n  \"implementation split across passes\" still hasn't landed on this model; the band\n  between stagnation and one-shot is narrow.\n\n**Two nudges, tried and removed.** Because the first bullet pins the stall partly on\nneglected notes, an experiment added two opt-in reminders between passes — one to\nupdate `PROGRESS.md` when code advanced without it, one to implement rather than\nre-read when a pass made no code progress. Measured on `graphkit`/oQ4, neither\nlifted completion. At `-ctx-limit 11000` every run jammed at the same 3-of-6 wall\n(baseline 0/4, nudges 0–1/5): resuming means re-reading the done packages plus the\nnext spec, which fills the 11k window *before* any code can be written — a per-pass\n**budget floor** a prompt cannot lift. 
Raise the budget to 13000 and the floor\nclears, but there the baseline already completes on its own (3/4) and the stall\nnudge only matches it (3/4). The real lever is per-pass budget (`-ctx-limit`), not\nprompting — so both nudges were removed rather than kept as dead weight. (Raw rows\nin `logs/runs.jsonl`; single-digit samples, ordinal.)\n\n**Why per-pass budget is the lever.** Digging into that floor isolated the\nmechanism. At `-ctx-limit 11000` the model re-reads ~70% of the workspace *every\npass* — 8–9 `.go`-file reads per pass (the six specs plus the implementations) — and\n`-memory` does not change it: memory=true and memory=false land at the same rate\n(8.2 / 8.4 / 9.2 vs 8.0 / 8.2 / 9.8 reads per pass, three runs each). It is **not**\nthat the model ignores its notes — traces show it reads `PROGRESS.md` *first* on\nevery resume pass, as instructed, then re-sweeps the code anyway. The re-sweep is\nmostly **structural, not distrust**: to implement the next package the model needs\nthat package's *test* in context, and `PROGRESS.md` records *status* (\"toposort:\ntodo\") but not the *spec content* the model must read to write the code. A reset\ncontext therefore re-pays to load the working set every pass; the part notes could\nsave — re-verifying already-done packages — is the smaller slice. So the budget floor\nis inherent to the Ralph design, and per-pass budget (`-ctx-limit`) is the only lever\nthat moved completion (11k jams, 13k ~3/4, 16k two-shots) — neither prompting nor the\n`PROGRESS.md` memory reduces the re-derivation cost beneath it.\n\nThat same completing run also exposed a gap in the gate: the `components` it\ncertified is **flaky** (its SCC mispartitions on ~8% of runs, by map-iteration\norder). 
At the time, `go test -count=1` defeated the test *cache* but sampled a *single*\nexecution, so a non-deterministic implementation passed on a lucky roll — a\nnon-adversarial path to a falsely-green gate, distinct from the adversarial one in\n[CLAUDE.md](CLAUDE.md). The gate has since been hardened to `-count=3` (three\nindependent rolls): enough for high-rate flakes, but at ~8% no guarantee. The full\nwrite-up — including a measured 27B-oQ4 run that also one-shot — is in\n[examples/README.md](examples/README.md).\n\n**Confirmed on a second, purpose-built task.** `apikit` — five independent packages,\nbuilt for exactly this — reproduces both halves on fresh ground. At `-ctx-limit 26000`\nit one-shots (**5/5**, one pass, ~3.7 min); at `-ctx-limit 11000` it never completes —\n**8 runs, 0 completions**, all stagnating at 3–4 of 5 packages with `api` reached by\nnone. A **replicated `-memory` A/B** (n=4 each) settles the memory question on\n*outcome*, not read-counts: packages reached `{4,3,4,4}` (memory) vs `{3,3,3,4}` (no\nmemory) overlap, and \"reached 4\" (3/4 vs 1/4) is **not significant (Fisher p ≈ 0.5)**.\nThe first single pair looked like a memory win and replication dissolved it — the n=1\ntrap the \"ordinal, single-digit samples\" caveat is about. Detail in\n[examples/README.md](examples/README.md).\n\n**The mechanism, quantified — and why more budget is not the fix.** One `apikit`@11k\nstagnation log makes the floor concrete: across 12 passes (all cut by `context`, with\n`done` never called) the model spent **~192 orientation ops (`read_file` + `list_dir`)\nagainst just 8 file-mutations**, and never once wrote `api`. 
Each fresh pass splits its\nbudget between *loading* context — re-reading the spec and code — and *acting* on it;\nbelow a floor, loading crowds out acting entirely, and the **costliest increment sets\nthat floor**: `api`, which must hold all four feature packages plus its spec at once, is\nunreachable in any single pass, so more passes cannot help. The obvious fix — raise\n`-ctx-limit` — does not generalise: it caps at the model's context window, and the tasks\na Ralph loop exists for are *larger* than any window, so escalation only walks to the\nwall. Completing under a fixed window instead needs, per pass, a **bounded working set**\n(the slice a step needs fits the window), **cheap loading** of it, and **durable,\nmonotonic progress** — and the working set must be bounded by the *interface* a step\ntouches, not the *implementation* behind it (`api` needs four signatures, not four files).\nThat is how bounded passes compose into an unbounded task. The full mechanism — and the one\nlever that moves it — is in [docs/stagnation.md](docs/stagnation.md). The proposed pass-start\ndigest was built, A/B-tested, and reverted (null); what works is a now-built-in read-boundary elision\nthat stubs a spec once its package verifies, so a fresh pass cannot re-spend its budget re-reading\nalready-satisfied specs — fed by the verifier the loop already runs, so it adds no test execution. \nAt `apikit`@11k it is the first change other than raising the budget to complete the task — \n**4/7 runs vs 0/24 across every recorded no-elision run** (Fisher p≈0.015 against the A/B's pooled \n0/10 baseline; p≈0.035 against the interleaved 0/7 arm alone — fragile at this n) — and it \n**replicates on a second task, `graphkit`@11k, 6/6 vs 0/6** (p≈0.001 — clean separation in that A/B, \nthough the cumulative no-elision ledger at that budget is 1/20, not zero: the floor is near-\ndeterministic, not absolute). 
It raises completion *probability* rather than guaranteeing it \n(3/7 `apikit` runs still stagnate, and the rate is session-variable — a later clean batch ran 7/8 on this \nconfig, 15/16 pooling its other arm, which also carried the null `go doc`-interface lever; \n`graphkit` did not stagnate, at n=6). A third, deliberately *flat* task (`datakit` — five \nindependent packages, no composer) **bounds** the lever: the byte-reduction still generalises \n(−21% reads), but `datakit` has no hard floor to clear — baseline completion is noisy and \n*non-monotonic* in budget (3/6 at 8k, 5/6 at 6k) — so no significant completion effect was measured \n(the 8k arm even trends *pro*-elision, 6/6 vs 3/6, p≈0.09, underpowered at n=6) and elision rides \nalong harmlessly. Its *completion* benefit needs the hard floor that a costly composing increment \ncreates, which independent-package kits lack. The proposed `go doc`-interface lever was likewise \nbuilt, measured null at the relocated 9k floor (1/8 vs 1/8, Part 9), and reverted; the soft-limit \ncheckpoint (Lever 3) was never built — see the closing ledger\n([docs/stagnation.md](docs/stagnation.md) Part 11) for the explicit waiver.\n\nA **second model** bounds the lever on the other axis. `gemma-4-26B-A4B-it` (sampled hot, no thinking\ntrace) floors *lower* than Qwen — it completes `graphkit`@11k where Qwen, without elision, completes\n1/20 — because its **~1.6–2.4:1 load:act ratio** (vs Qwen's ~12:1) shifts the flip-point down. Yet\nelision shows **no completion effect** on the floors Gemma does have — `graphkit`@9k 0/4 vs 0/4 and\n`apikit`@11k 1/4 vs 0/4 (small-n nulls: at n=4/arm only a near-deterministic effect is detectable) —\nbecause those floors\nare *budget-bound* (too tight to bank an increment) or *correctness-bound* (Gemma writes `api` but gets\nthe routing wrong), not the spec re-reading elision removes. 
So the lever's completion benefit needs\n**both** a composer floor (task) **and** a re-read-bound model (Qwen's re-sweep habit) — read-byte\nreduction generalises across models, completion does not. See [docs/stagnation.md](docs/stagnation.md)\nPart 10.\n\n## Code quality is a separate axis\n\nPassing the gate means the code is *correct*, not that it is *good*; the harness\nonly gates on the first. The **judge** skill (`.claude/skills/judge/SKILL.md`)\nmeasures the second, out of band: an Opus-as-judge rubric scores the produced\ncode on eight Go dimensions (contract fidelity, simplicity, idiomaticity,\nreadability, robustness, security, concurrency safety, performance), **blind to\nthe spec's tests** and head-to-head against a real Sonnet-medium solution to the\nsame contract. Each\nrecord also carries a deterministic `modernize`-finding count as a noise-free\nidiomaticity signal. It appends to `logs/judgments.jsonl`, is never a gate, and\nis never seen by the agent — keeping it outside the loop is what keeps it honest.\n\n**First measured result — `pipeline`, three-way.** Judged blind under an Opus\nreferee against a Sonnet-medium bar, the concurrency example lands both local\nmodels at a near-identical weighted score — **27B-oQ8 0.856, 35B-A3B-oQ6 0.852**,\nabout **0.045 below Sonnet (0.897)**. They trail for the same reasons: a C-style\nworker loop where `for range workers` reads cleaner, two redundant index types,\nand `len(in)`-sized channel buffers rather than `workers`-sized. Contract fidelity\nand security barely separate the three; the gap is idiomaticity and allocation\ndiscipline. Notably the 35B one-shot the task ~17× faster in wall-clock (37.6 s vs\n648.7 s) and without the 27B's write-then-rewrite, yet its *final code* scores no\nhigher — speed and one-shot cleanliness are not code quality. 
Scores are ordinal\nand single-sample (trust the subject−bar gap, not the absolute number); rows in\n`logs/judgments.jsonl`.\n\n## License\n\n[MIT](LICENSE) © Andy Geiss\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandygeiss%2Fharness-benchmarking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandygeiss%2Fharness-benchmarking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandygeiss%2Fharness-benchmarking/lists"}