{"id":51017325,"url":"https://github.com/matte1782/sota-bench","last_synced_at":"2026-06-21T12:30:20.143Z","repository":{"id":363568088,"uuid":"1258817474","full_name":"matte1782/sota-bench","owner":"matte1782","description":"Open AI-for-security validation benchmark: non-LLM scorer + a SOTA-validation loop. Labeled positive corpus withheld pending coordinated disclosure.","archived":false,"fork":false,"pushed_at":"2026-06-18T18:35:28.000Z","size":323,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-18T20:24:32.554Z","etag":null,"topics":["agent-security","ai-security","cwe-862","evaluation","llm-security","ml-security","security-benchmark","vulnerability-detection"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/matte1782.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-04T00:25:00.000Z","updated_at":"2026-06-18T18:35:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/matte1782/sota-bench","commit_stats":null,"previous_names":["matte1782/sota-bench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/matte1782/sota-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matte1782%2Fsota-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matte1782%2Fsota-b
ench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matte1782%2Fsota-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matte1782%2Fsota-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/matte1782","download_url":"https://codeload.github.com/matte1782/sota-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matte1782%2Fsota-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34610826,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-security","ai-security","cwe-862","evaluation","llm-security","ml-security","security-benchmark","vulnerability-detection"],"created_at":"2026-06-21T12:30:18.389Z","updated_at":"2026-06-21T12:30:20.121Z","avatar_url":"https://github.com/matte1782.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sota_bench\n\n**An open, model-agnostic benchmark for agent tool-dispatch\nauthorization-confusion vulnerability detection and severity calibration, plus\na non-LLM scorer and a pre-registered SOTA-validation loop.**\n\n`sota_bench` is a standalone benchmark for evaluating 
automated\nvulnerability-discovery systems (LLM agents, static analyzers, hybrid pipelines)\non a security bug *class* that frontier coding agents now produce and miss in\nroughly equal measure: **agent tool-dispatch authorization confusion**, a\nprivileged operation dispatched through an agent / MCP / tool-calling surface\n*without re-checking the caller's authorization at the point of dispatch*, or\ngated on one path (e.g. REST) but not its equivalent tool/agent twin. Each item\nis a labeled finding pinned to a precise `(repo, commit_sha, file, line)`\nlocation with a ground-truth disposition, OWASP/CWE taxonomy, the decisive\nruntime-gating check that resolves it, an expected CVSS band/vector, and its\nrealized disclosure outcome. The core data model, dataset loader, and scorer have\n**zero third-party dependencies**; an optional adapter packages the dataset as a\n[UK-AISI Inspect](https://inspect.aisi.org.uk/) eval `Task`.\n\n## The measurement-first thesis\n\nRaw absolute scores on a vuln-detection benchmark are not decision-useful: they\nrise automatically every time a better base model ships, with or without any\nmethodological contribution. `sota_bench` is built around a different question:\n*does a method add value on top of a naive single frontier-model call, and how\ndoes that value move as the frontier improves?* The unit of measurement is the\n**signed delta** (`method_metrics − naive_metrics`), pre-registered in\n[`PROTOCOL.md`](PROTOCOL.md) so the headline number cannot be chosen after seeing\nresults. Three properties make the numbers trustworthy:\n\n- **No LLM-as-judge.** Every metric is a closed-form function of labels and\n  predictions. 
The loop only does the arithmetic of differencing metric maps.\n- **Exonerated negatives are first-class.** Each vulnerable finding is paired with\n  near-duplicate `secure` / patched twins at the same code location, so a system\n  cannot win by pattern-matching the surrounding code: it must reason about the\n  gating check. Correctly *clearing* a secure variant counts exactly as much as\n  flagging a real bug.\n- **Severity is calibrated, both ways.** CVSS error is reported as separate\n  non-negative inflation (over-rating) and deflation (under-rating) magnitudes, so\n  a system that systematically over- or under-states severity cannot hide behind a\n  symmetric mean that cancels to zero.\n\n## Install\n\n```bash\npip install -e .                 # core, stdlib-only, no runtime dependencies\npip install -e \".[inspect]\"      # + optional UK-AISI Inspect adapter\npip install -e \".[dev]\"          # + pytest, ruff (for development)\n```\n\nRequires Python \u003e= 3.11. The core (`sota_bench.schema`, `sota_bench.scorer`,\n`sota_bench.loop`, `sota_bench.cvss`, `sota_bench.triad`, and the stdlib\nadapters) imports nothing outside the standard library; `inspect_ai` is imported\nlazily and only when the optional adapter is actually called.\n\n## Usage\n\n### 1. Load a labeled dataset\n\n```python\nfrom sota_bench import load_dataset, BenchEntry\n\nentries: list[BenchEntry] = load_dataset(\"datasets/authz_v1.jsonl\")\n```\n\n`load_dataset` validates every row with a strict, fail-closed, line-aware\nvalidator (`validate_entry`); a bad field names the offending 1-based line.\n\n### 2. 
Score predictions (non-LLM)\n\n```python\nfrom sota_bench import Prediction\nfrom sota_bench.scorer import score\n\npredictions = [\n    Prediction(finding_id=e.finding_id, predicted_label=\"vuln\",\n               predicted_cvss_score=None, predicted_cvss_band=\"high\")\n    for e in entries\n]\n\nresult = score(entries, predictions)\nprint(result.recall, result.precision, result.youden_j)\nprint(result.inflation_mae, result.deflation_mae)   # severity error, both ways\nprint(result.to_metrics_dict())                      # flat dict[str, float]\n```\n\n`score` matches predictions to entries by `finding_id` and returns a frozen\n`ScoreResult` with the OWASP confusion matrix (TP/FP/TN/FN, recall, precision,\nspecificity, Youden's J), pairwise accuracy over every `(vuln, secure)` pair, the\nPrimeVul VD-S operating point, and both-ways CVSS calibration.\n\n### 3. Run the SOTA-validation loop\n\n```python\nfrom sota_bench.loop import run_delta, pin_baseline, load_baseline, delta_vs_baseline\nfrom sota_bench.scorer import scorer_fn\n\n# First release: measure method-minus-naive and pin it.\nresult = run_delta(\n    entries, naive_adapter, method_adapter, predict_fn, scorer_fn,\n    model_label=\"frontier-2026.06\", dataset_fingerprint=\"corpus-v1\",\n)\npin_baseline(result, \"baselines/frontier-2026.06.json\")\nprint(result.delta)                                  # signed method − naive\n\n# Next release: re-run on the same frozen corpus and report movement vs the pin.\nbaseline = load_baseline(\"baselines/frontier-2026.06.json\")\nnew_result = run_delta(\n    entries, naive_adapter, method_adapter, predict_fn, scorer_fn,\n    model_label=\"frontier-2026.07\", dataset_fingerprint=\"corpus-v1\",\n)\nprint(delta_vs_baseline(new_result, baseline))       # change in the gap\n```\n\n`naive_adapter` and `method_adapter` are both just `ModelAdapter` subclasses\n(the single seam is `run(prompt: str) -\u003e str`), so the identical scoring code\nmeasures the signed delta 
between them. A deterministic, offline `StubAdapter`\nships for reproducible fixtures and tests. The loop is model-agnostic and imports\nno vendor SDK.\n\n## The two pinned baselines (reported honestly)\n\nBoth baselines were produced by a frontier (Opus-class) agent and scored with\nsota_bench's **own non-LLM scorer**, no LLM-as-judge anywhere. The full\nmethodology, headline numbers, and caveats are in [`PROTOCOL.md`](PROTOCOL.md);\nthe public dataset slice is under `datasets/`; the labeled positive corpus and the\ndated baselines are withheld pending coordinated disclosure.\n\n### Static baseline\n\nA blind, code-only read over the full labeled `authz_v1` set (the positive items\nand two secure twins are withheld from the public slice pending coordinated\ndisclosure), no advisory lookup. **Result: the method did NOT beat naive on this\nslice: naive was strictly better on detection** (naive recall 0.83, precision 1.00,\nYouden J 0.83; the runtime-gating \"method\" proxy lower at recall 0.67, precision\n0.80, Youden J 0.57; signed method-minus-naive delta: recall −0.167, Youden J\n−0.267, over the pinned 16-item 2026-06-03 baseline). The method's extra skepticism\nflipped two calls the wrong way: it cleared\none real vuln to `secure` (a false negative, where an owner-scoped argument made\nthe dispatch *look* gated) and flagged a by-design shared-capability surface as\n`vuln` (a false positive). We publish this negative result as-is. The specific\nlabeled POSITIVE items behind these counts, and the naive-vs-method baselines that\nturn on them, are WITHHELD PENDING COORDINATED DISCLOSURE and will be published\nonce the underlying advisories are public.\n\n### Runtime baseline (a diagnostic, not a demonstrated edge over naive)\n\nA second, **dynamic** baseline stands the target up, drives the agent/tool-dispatch\npath with a fresh per-run CSPRNG sentinel, and lets the runtime gating check fire (or\nnot) against a live oracle. 
On this small multi-finding subset the runtime method\nmatches ground truth, and its one decisive correction flips a static *false negative*\nback to `vuln`: a low-privilege member denied on the REST entitlement sibling reached\nowner-private content through the agent dispatch sink.\n\nRead honestly, that correction is narrower than it first looks, and we label runtime a\n**diagnostic, not an edge over a naive call**. The false negative it repaired was\nintroduced by the static *method's* own extra skepticism; the **naive single call\nalready flagged that same row `vuln`**. So on the one row that was truly re-run live,\nruntime's delta over naive is **zero**: it undid a mistake the method made and naive\ndid not. Runtime has not been shown to beat a naive call, and it has not cleared the\nmethod's other v1 error (the AnythingLLM false *positive*), whose runtime exoneration\nremains future \"expand\" work. Two further caveats are stated in the artifact and\n`PROTOCOL.md`: only part of the subset was truly re-run live (the rest relies on recorded\nevidence with no new sentinel minted), and this is not a full re-scoring of the v1\nslice. Runtime is a per-row validation technique under test here, not the benchmark's\nthesis. The per-target identities, the labeled POSITIVE corpus, and the naive-vs-method\nbaselines behind this subset are WITHHELD PENDING COORDINATED DISCLOSURE and will be\npublished once the underlying advisories are public.\n\n## Differentiation\n\n`sota_bench` is positioned against three reference points.\n\n### vs. ZeroPath\nZeroPath is a strong commercial agentic scanner. `sota_bench` adds two axes it\ndoes not score explicitly: **REST-vs-agent path-divergence** (the same operation\ngated on one entry point but not its tool/agent twin) and **severity calibration**\n(not just \"is it a bug?\" but \"how bad, signed both ways?\"). It is also open and\nreproducible.\n\n### vs. BACFuzz\nBACFuzz is a dynamic broken-access-control fuzzer. 
`sota_bench` is\n**model/language-agnostic and static-capable**, it scores systems that never run\nthe target, and it treats **exonerated negatives as first-class** labels rather\nthan as the absence of a crash, so a system is rewarded for correctly clearing\nsecure code, not only for triggering a failure.\n\n### vs. Anthropic Mythos\nMythos is a large internal evaluation effort. `sota_bench` is **open,\nmodel-agnostic, and honest-band-anchored**: severity is anchored to hand-assessed\nCVSS bands rather than self-reported confidence. Be honest about provenance: the\nwidely-cited Mythos throughput figures are Anthropic *estimates*. The defensible,\ncomparable signal is the hand-assessed slice: high/critical true-positive rate,\nexact-band severity agreement versus security firms, and the small fraction of\ndisclosed findings that reach a CVE/GHSA. `sota_bench` is built so the numbers it\nreports are of the *hand-assessed, reproducible* kind, not the estimated kind.\n\n## Durability: a dated corpus, a non-LLM oracle, and published losses\n\nThe durable asset here is not secrecy. It is three things a stronger model cannot\nquietly erase:\n\n- **A deterministic, non-LLM differential oracle over a DATED corpus.** Each scored\n  item turns on a closed-form check (`fp_killer`) and a paired `secure` twin at the\n  same code location, graded by a pure function of labels and predictions (no\n  LLM-as-judge). A frontier agent can read the source and even execute it; it cannot\n  fabricate a non-LLM oracle's verdict, and it cannot train on a finding whose public\n  `evidence_date` postdates its training cutoff.\n- **Temporal contamination control, not a secret split.** A benchmark whose labels are\n  merely hidden is brittle and unverifiable: a reviewer cannot reproduce it, and the\n  hidden labels still leak the day anyone indexes them. 
Instead, the public `authz_v1`\n  slice is the open, reproducible front door, and contamination is controlled by DATE:\n  a finding is scored against a model M only if its `evidence_date` is strictly after\n  M's training cutoff ([`PROTOCOL.md`](PROTOCOL.md) L3). The same row can therefore be\n  fully public AND a valid held-out test for every model whose cutoff predates it. Each\n  run is stamped with a content-addressed `dataset_hash` so a published delta is\n  attributable to an exact corpus version.\n- **Published losses.** The headline is reported as-is, including where the method\n  LOSES to a naive single call (the `authz_v1` static baseline above). A method whose\n  edge erodes is published as eroded; the pre-registered SOTA loop is the mechanism\n  that says so, on schedule.\n\nThere is a separate, disclosure-safety reason to hold a finding back: an UNFIXED\nadvisory. An embargoed vulnerability sits in a private vault, never scored and never\nsent to a hosted model, until its advisory publishes, at which point it becomes a\nPUBLIC DATED row (verifiable, and contamination-controlled for every prior model).\nThat vault is a coordinated-disclosure safeguard, not the durability mechanism, and\nnothing in it is ever the headline.\n\n## Selected public findings\n\nThe track record behind the method: coordinated security disclosures by the author\nthat are now public (severity as the official advisory rates it). 
Generated from a\nsingle source of truth; each row is verified against the live advisory state before\nit ships.\n\n\u003c!-- PUBLIC-FINDINGS:START (generated by the portfolio's tools/generate_disclosures.py from public-findings.json; do not edit by hand) --\u003e\n| finding | severity | identifier | advisory |\n|---|---|---|---|\n| Open WebUI | High 8.5 | CVE-2026-54008 | [CVE-2026-54008](https://github.com/open-webui/open-webui/security/advisories/GHSA-226f-f24g-524w) |\n| dex | High | GHSA-7qjx | [GHSA-7qjx](https://github.com/dexidp/dex/security/advisories/GHSA-7qjx-gp9h-65qj) |\n| Langroid | High | GHSA-2pq5 | [GHSA-2pq5](https://github.com/langroid/langroid/security/advisories/GHSA-2pq5-3q89-j7cc) |\n| ouroboros | High | GHSA-jv2h | [GHSA-jv2h](https://github.com/Q00/ouroboros/security/advisories/GHSA-jv2h-4p9v-wf5w) |\n| Open WebUI | High 7.3 | GHSA-3wgj | [GHSA-3wgj](https://github.com/advisories/GHSA-3wgj-c2hg-vm6q) |\n| MCP Registry | Mod 6.3 | CVE-2026-44430 | [CVE-2026-44430](https://github.com/advisories/GHSA-r48c-v28r-pf6v) |\n| GitHub MCP Server | Mod 6.0 | CVE-2026-48529 | [CVE-2026-48529](https://github.com/github/github-mcp-server/security/advisories/GHSA-pjp5-fpmr-3349) |\n| Kirby | Mod 5.3 | CVE-2026-45334 | [CVE-2026-45334](https://github.com/getkirby/kirby/security/advisories/GHSA-39vq-49qm-r2mc) |\n| Outline | Mod | CVE-2026-43890 | [CVE-2026-43890](https://github.com/outline/outline/security/advisories/GHSA-gf8h-cv9v-q4fw) |\n| Google MCP Toolbox | Mod | PR #3324 | [PR #3324](https://github.com/googleapis/mcp-toolbox/pull/3324) |\n\u003c!-- PUBLIC-FINDINGS:END --\u003e\n\nAdditional findings are in private coordination and are not listed until their\nadvisories publish.\n\n## Verticals\n\nThe flagship vertical is `authz` (agent tool-dispatch authorization confusion):\nbroken object-/function-level authorization on tool handlers, REST-vs-agent path\ndivergence, and confused-deputy delegation through a tool-calling layer. 
A second\n`decode` vertical is reserved for parsing/decoding-primitive bugs.\n\n## Metrics, in brief\n\n- **OWASP confusion matrix**: TP/FP/TN/FN with **Youden's J**\n  (`sensitivity + specificity − 1`), rewarding systems that both catch vulns *and*\n  exonerate secure variants.\n- **PrimeVul VD-S**: false-negative rate at a fixed false-positive operating\n  point, for comparability with the vulnerability-detection literature.\n- **Signed both-ways CVSS-v3.1 calibration MAE**: inflation and deflation kept\n  separate, plus exact-band agreement.\n- **SOTA-validation delta loop**: continuous, pre-registered re-evaluation of the\n  signed method-minus-naive gap on a frozen slice, so regression and saturation\n  surface over time.\n\n## References\n\n- OWASP API Security Top 10 (2023): `API1:2023` Broken Object Level\n  Authorization, `API5:2023` Broken Function Level Authorization.\n  \u003chttps://owasp.org/API-Security/editions/2023/en/0x11-t10/\u003e\n- CWE-862 Missing Authorization; CWE-863 Incorrect Authorization; CWE-285\n  Improper Authorization. \u003chttps://cwe.mitre.org/\u003e\n- CVSS v3.1 Specification (FIRST). \u003chttps://www.first.org/cvss/v3-1/specification-document\u003e\n- PrimeVul / VD-S: Ding et al., *Vulnerability Detection with Code Language\n  Models: How Far Are We?* \u003chttps://arxiv.org/abs/2403.18624\u003e\n- UK AI Safety Institute, *Inspect* evaluation framework.\n  \u003chttps://inspect.aisi.org.uk/\u003e\n- Model Context Protocol (MCP) specification. \u003chttps://modelcontextprotocol.io/\u003e\n\n## License\n\nApache-2.0. See [`LICENSE`](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatte1782%2Fsota-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmatte1782%2Fsota-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatte1782%2Fsota-bench/lists"}