{"id":50484167,"url":"https://github.com/stone16/swe-bench-harness-eval","last_synced_at":"2026-06-01T20:05:34.401Z","repository":{"id":359690212,"uuid":"1246773544","full_name":"stone16/swe-bench-harness-eval","owner":"stone16","description":"Head-to-head: harness-engineering-skills (multi-agent orchestration) vs 5 public SWE-bench Verified baselines. 7/10 resolved, including 2 hard-tier instances no public agent solved.","archived":false,"fork":false,"pushed_at":"2026-05-23T00:56:09.000Z","size":93,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T01:25:12.092Z","etag":null,"topics":["agent-evaluation","benchmark","claude","claude-code","llm","multi-agent","swe-bench"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stone16.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-22T14:37:59.000Z","updated_at":"2026-05-23T00:56:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/stone16/swe-bench-harness-eval","commit_stats":null,"previous_names":["stone16/swe-bench-harness-eval"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/stone16/swe-bench-harness-eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stone16%2Fswe-bench-harness-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/stone16%2Fswe-bench-harness-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stone16%2Fswe-bench-harness-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stone16%2Fswe-bench-harness-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stone16","download_url":"https://codeload.github.com/stone16/swe-bench-harness-eval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stone16%2Fswe-bench-harness-eval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33790982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","benchmark","claude","claude-code","llm","multi-agent","swe-bench"],"created_at":"2026-06-01T20:05:33.356Z","updated_at":"2026-06-01T20:05:34.395Z","avatar_url":"https://github.com/stone16.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SWE-bench Harness Evaluation\n\n\u003e Does multi-agent orchestration actually beat single-agent approaches **when\n\u003e controlled for model**? 
We took the [harness-engineering-skills][hes] plugin\n\u003e (running in a stripped-down \"harness-lite\" config; see caveats) and ran it\n\u003e on Claude Opus 4.7 against the **same model** in OpenHands' SWE-bench\n\u003e Verified evaluation. Then we widened the comparison to include 5 older\n\u003e public baselines.\n\n[hes]: https://github.com/stone16/harness-engineering-skills\n\n## Headline numbers: same model (Opus 4.7) comparison\n\n| Tier | Harness-lite + Opus 4.7 (this repo) | OpenHands + Opus 4.7 | Δ |\n|---|:-:|:-:|:-:|\n| **Overall (10 instances)** | **7/10 = 70%** | 6/10 = 60% | **+10pp** |\n| Easy (3 instances) | 1/3 = 33% | 2/3 = 67% | −33pp |\n| Medium (4 instances) | **4/4 = 100%** | 4/4 = 100% | tie |\n| **Hard (3 instances)** | **2/3 = 67%** | **0/3 = 0%** | **+67pp** 🚀 |\n\nThe lead comes **entirely from the hard tier**. On easy bugs, harness's\nmulti-agent overhead actually hurts. On medium bugs, modern OpenHands +\nOpus 4.7 already saturates. On **hard bugs, OpenHands + Opus 4.7 went\n0/3, while harness solved 2 of those 3**.\n\n📄 **[Full per-instance verdict matrix and failure analysis →\nRESULTS.md][RESULTS.md]**\n\n## The two instances that prove the architecture lift\n\nBoth are SWE-bench Verified hard-tier instances. Both were failed by\nOpenHands with Opus 4.7 AND Opus 4.6 AND every older-model baseline we\nchecked (Sonar, OpenHands/Sonnet 4, bash-only Claude, SWE-Agent). 
Both\nwere resolved by harness-lite with Opus 4.7:\n\n| Instance | OpenHands o4.7 | OpenHands o4.6 | Sonar o4.5 | SWE-Agent s4 | bash o4 | **Harness-lite o4.7** |\n|---|:-:|:-:|:-:|:-:|:-:|:-:|\n| `django__django-10554` (hard, ORDER BY in Union) | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** |\n| `pydata__xarray-6992` (hard, set_index/reset_index refactor) | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** |\n\nThis is the result we set out to find: instances that **no public\nagent (even using the same Opus 4.7 model) could resolve, but that\nmulti-agent orchestration with a Generator/Evaluator/fault-path loop\ncould**.\n\n## What's \"harness-lite\"?\n\nThe harness skill ships with a default pipeline that's heavier than what\nwe ran. We stripped several features to make the per-instance evaluation\naffordable on our budget. **Keep this in mind when reading the\nnumbers**, because what we tested is only one slice of the full design:\n\n| Harness feature | Default | What we ran | Why |\n|---|:-:|:-:|---|\n| `max_spec_rounds` | 3 | **1** | Spec was pre-generated offline from the GitHub issue; no Spec Evaluator loop needed |\n| Per-checkpoint Generator + Evaluator loop | ✓ | **✓** | This IS the load-bearing anti-drift mechanism, so we kept it |\n| `cross_model_review` (Codex/Gemini peer) | ✓ | **✗** | Disabled to save cost (~3× per instance) |\n| `auto_retro` (post-PR retro) | ✓ | **✗** | Disabled: single-task eval, not multi-task learning |\n| `skip_full_verify` | false | **true** | No upstream PR to verify against |\n| `coverage_threshold` | 85 | **0** | Most repos don't expose conftest-compatible coverage |\n\n**Update: we ran the harness-full A/B experiment.** We re-ran all 3\nfailed instances (`astropy-14369`, `django-10097`, `matplotlib-20676`)\nwith `cross_model_review=true` (Codex peer review enabled). 
**Result:\n0/3 resolved.** Cross-model peer review changed the patch strategy on\n2/3 instances (one going from hacky string preprocessor to proper LALR\ngrammar rewrite, one moving from wrong layer to right layer) — but\n**did not flip any failure to a pass**. Both Claude and Codex share\nthe same bias toward \"make positive cases pass\" and miss negative-case\nregression risk. Full A/B writeup: **[EXPERIMENT_AB.md][AB]**.\n\n[AB]: ./EXPERIMENT_AB.md\n\n## Per-instance matrix (8-system comparison)\n\nLegend: ✓ resolved · ✗ failed\n\n| # | Instance | Tier | Harness-lite | OH o4.7 | OH o4.6 | OH o4.5 | Sonar o4.5 | OH s4 | bash o4 | SWE-A s4 |\n|---|---|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|\n| 1 | astropy-14309 | easy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |\n| 2 | django-10097 | easy | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |\n| 3 | matplotlib-20676 | easy | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |\n| 4 | astropy-14539 | medium | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |\n| 5 | django-11149 | medium | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |\n| 6 | matplotlib-20488 | medium | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |\n| 7 | xarray-6599 | medium | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |\n| 8 | astropy-14369 | hard | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ |\n| 9 | **django-10554** | **hard** | **✓ 🚀** | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |\n| 10 | **xarray-6992** | **hard** | **✓ 🚀** | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |\n\n## Methodology\n\nStrict adherence to the [public SWE-bench Verified protocol][swebench]:\n\n| Constraint | Choice | Rationale |\n|---|---|---|\n| Dataset | SWE-bench Verified (500 instances) | OpenAI-annotated, human-verified |\n| Sample | 10 stratified | 3 easy + 4 medium + 3 hard |\n| Input | `problem_statement` only | Strict apples-to-apples |\n| Hidden tests | **Never** exposed to harness | Grader supplies its own |\n| Grader | Official `swebench.harness.run_evaluation` | Docker-isolated, deterministic |\n| Host model | **Claude Opus 4.7** | Same as OpenHands primary baseline |\n| Harness config | **harness-lite** (see above) | 
Cost-pruned, evaluator loop preserved |\n\n## Baselines\n\n| Alias | Agent | Model | Source | Per-instance data |\n|---|---|---|---|---|\n| **oh/o47** | OpenHands ACP | Claude Opus 4.7 | [OpenHands/benchmarks #576][oh] | ✓ tarball |\n| oh/o46 | OpenHands ACP | Claude Opus 4.6 | OpenHands/benchmarks #576 | ✓ tarball |\n| sonar/o45 | Sonar Foundation Agent | Claude Opus 4.5 | [swe-bench/experiments][exp] | ✓ JSON |\n| oh/o45 | OpenHands | Claude Opus 4.5 | swe-bench/experiments | ✓ JSON |\n| oh/s4 | OpenHands | Claude Sonnet 4 | swe-bench/experiments | ✓ JSON |\n| tools/o4 | bash-tools-only Claude | Claude Opus 4 | swe-bench/experiments | ✓ JSON |\n| swea/s4 | SWE-Agent | Claude Sonnet 4 | swe-bench/experiments | ✓ JSON |\n\nThe first two (OH o4.7 and OH o4.6) are the **load-bearing same-model\ncomparison points**. The other five give historical context but used\nolder model classes (Opus 4.5 era and below).\n\n## Reproduce\n\n```bash\ngit clone \u003cthis-repo\u003e \u0026\u0026 cd swe-bench-harness-eval\npython3 -m venv .venv \u0026\u0026 .venv/bin/pip install swebench\n\n.venv/bin/python scripts/fetch_baselines.py     # 5 swe-bench/experiments baselines\n# OpenHands Opus 4.7/4.6 tarballs: download from OpenHands/benchmarks #576\n\n.venv/bin/python scripts/select_instances.py\n.venv/bin/python scripts/build_spec.py\n\nbash scripts/run_harness.sh astropy__astropy-14309  # ~17 min, ~$5-15\nbash scripts/grade.sh harness_v1                    # Docker, ~5-30 min\n.venv/bin/python scripts/compare.py --run-id '*'\n```\n\n## Caveats\n\n- **Sample size is 10.** This is enough for an existence proof — showing\n  harness-lite CAN solve instances no agent (including OpenHands+Opus\n  4.7) could — but not for a statistical claim about overall\n  resolve-rate gap. A full 500-instance run is the natural next step.\n- **harness-lite ≠ full harness.** We disabled cross-model peer review,\n  full-verify, retro, and reduced spec rounds. 
The full harness might\n  resolve more instances at higher cost.\n- **Cost asymmetry.** Even harness-lite's two-agent loop is roughly 2×\n  the cost per instance of OpenHands' single-agent loop. Whether the\n  +1 instance (70% vs 60%) and +2 hard breakthroughs justify the spend\n  depends on use case.\n- **Easy-tier underperformance is real.** Harness lost on `django-10097`\n  (over-restrictive regex) where OpenHands won. Multi-agent overhead\n  can push fixes toward over-specification on simple bugs.\n- **Apple Silicon grader caveat.** matplotlib instances required\n  `--namespace swebench` (Docker Hub prebuilt images) due to conda\n  network failures under amd64 emulation. Infrastructure, not harness.\n\n## License\n\nApache 2.0 (same as upstream harness-engineering-skills).\n\n[RESULTS.md]: ./RESULTS.md\n[swebench]: https://github.com/swe-bench/SWE-bench\n[exp]: https://github.com/swe-bench/experiments\n[oh]: https://github.com/OpenHands/benchmarks/issues/576\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstone16%2Fswe-bench-harness-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstone16%2Fswe-bench-harness-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstone16%2Fswe-bench-harness-eval/lists"}