https://github.com/matte1782/sota-bench

Open AI-for-security validation benchmark: non-LLM scorer + a SOTA-validation loop. Labeled positive corpus withheld pending coordinated disclosure.
https://github.com/matte1782/sota-bench
agent-security ai-security cwe-862 evaluation llm-security ml-security security-benchmark vulnerability-detection
Last synced: 5 days ago
JSON representation
Open AI-for-security validation benchmark: non-LLM scorer + a SOTA-validation loop. Labeled positive corpus withheld pending coordinated disclosure.
Host: GitHub
URL: https://github.com/matte1782/sota-bench
Owner: matte1782
License: apache-2.0
Created: 2026-06-04T00:25:00.000Z (22 days ago)
Default Branch: main
Last Pushed: 2026-06-18T18:35:28.000Z (8 days ago)
Last Synced: 2026-06-18T20:24:32.554Z (7 days ago)
Topics: agent-security, ai-security, cwe-862, evaluation, llm-security, ml-security, security-benchmark, vulnerability-detection
Language: Python
Size: 315 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # sota_bench

**An open, model-agnostic benchmark for agent tool-dispatch

authorization-confusion vulnerability detection and severity calibration, plus

a non-LLM scorer and a pre-registered SOTA-validation loop.**

`sota_bench` is a standalone benchmark for evaluating automated

vulnerability-discovery systems (LLM agents, static analyzers, hybrid pipelines)

on a security bug *class* that frontier coding agents now produce and miss in

roughly equal measure: **agent tool-dispatch authorization confusion**, a

privileged operation dispatched through an agent / MCP / tool-calling surface

*without re-checking the caller's authorization at the point of dispatch*, or

gated on one path (e.g. REST) but not its equivalent tool/agent twin. Each item

is a labeled finding pinned to a precise `(repo, commit_sha, file, line)`

location with a ground-truth disposition, OWASP/CWE taxonomy, the decisive

runtime-gating check that resolves it, an expected CVSS band/vector, and its

realized disclosure outcome. The core data model, dataset loader, and scorer have

**zero third-party dependencies**; an optional adapter packages the dataset as a

[UK-AISI Inspect](https://inspect.aisi.org.uk/) eval `Task`.

## The measurement-first thesis

Raw absolute scores on a vuln-detection benchmark are not decision-useful: they

rise automatically every time a better base model ships, with or without any

methodological contribution. `sota_bench` is built around a different question:

*does a method add value on top of a naive single frontier-model call, and how

does that value move as the frontier improves?* The unit of measurement is the

**signed delta** (`method_metrics − naive_metrics`), pre-registered in

[`PROTOCOL.md`](PROTOCOL.md) so the headline number cannot be chosen after seeing

results. Three properties make the numbers trustworthy:

- **No LLM-as-judge.** Every metric is a closed-form function of labels and

  predictions. The loop only does the arithmetic of differencing metric maps.

- **Exonerated negatives are first-class.** Each vulnerable finding is paired with

  near-duplicate `secure` / patched twins at the same code location, so a system

  cannot win by pattern-matching the surrounding code: it must reason about the

  gating check. Correctly *clearing* a secure variant counts exactly as much as

  flagging a real bug.

- **Severity is calibrated, both ways.** CVSS error is reported as separate

  non-negative inflation (over-rating) and deflation (under-rating) magnitudes, so

  a system that systematically over- or under-states severity cannot hide behind a

  symmetric mean that cancels to zero.

## Install

```bash

pip install -e .                 # core, stdlib-only, no runtime dependencies

pip install -e ".[inspect]"      # + optional UK-AISI Inspect adapter

pip install -e ".[dev]"          # + pytest, ruff (for development)

```

Requires Python >= 3.11. The core (`sota_bench.schema`, `sota_bench.scorer`,

`sota_bench.loop`, `sota_bench.cvss`, `sota_bench.triad`, and the stdlib

adapters) imports nothing outside the standard library; `inspect_ai` is imported

lazily and only when the optional adapter is actually called.

## Usage

### 1. Load a labeled dataset

```python

from sota_bench import load_dataset, BenchEntry

entries: list[BenchEntry] = load_dataset("datasets/authz_v1.jsonl")

```

`load_dataset` validates every row with a strict, fail-closed, line-aware

validator (`validate_entry`); a bad field names the offending 1-based line.

### 2. Score predictions (non-LLM)

```python

from sota_bench import Prediction

from sota_bench.scorer import score

predictions = [

    Prediction(finding_id=e.finding_id, predicted_label="vuln",

               predicted_cvss_score=None, predicted_cvss_band="high")

    for e in entries

]

result = score(entries, predictions)

print(result.recall, result.precision, result.youden_j)

print(result.inflation_mae, result.deflation_mae)   # severity error, both ways

print(result.to_metrics_dict())                      # flat dict[str, float]

```

`score` matches predictions to entries by `finding_id` and returns a frozen

`ScoreResult` with the OWASP confusion matrix (TP/FP/TN/FN, recall, precision,

specificity, Youden's J), pairwise accuracy over every `(vuln, secure)` pair, the

PrimeVul VD-S operating point, and both-ways CVSS calibration.

### 3. Run the SOTA-validation loop

```python

from sota_bench.loop import run_delta, pin_baseline, load_baseline, delta_vs_baseline

from sota_bench.scorer import scorer_fn

# First release: measure method-minus-naive and pin it.

result = run_delta(

    entries, naive_adapter, method_adapter, predict_fn, scorer_fn,

    model_label="frontier-2026.06", dataset_fingerprint="corpus-v1",

)

pin_baseline(result, "baselines/frontier-2026.06.json")

print(result.delta)                                  # signed method − naive

# Next release: re-run on the same frozen corpus and report movement vs the pin.

baseline = load_baseline("baselines/frontier-2026.06.json")

new_result = run_delta(

    entries, naive_adapter, method_adapter, predict_fn, scorer_fn,

    model_label="frontier-2026.07", dataset_fingerprint="corpus-v1",

)

print(delta_vs_baseline(new_result, baseline))       # change in the gap

```

`naive_adapter` and `method_adapter` are both just `ModelAdapter` subclasses

(the single seam is `run(prompt: str) -> str`), so the identical scoring code

measures the signed delta between them. A deterministic, offline `StubAdapter`

ships for reproducible fixtures and tests. The loop is model-agnostic and imports

no vendor SDK.

## The two pinned baselines (reported honestly)

Both baselines were produced by a frontier (Opus-class) agent and scored with

sota_bench's **own non-LLM scorer**, no LLM-as-judge anywhere. The full

methodology, headline numbers, and caveats are in [`PROTOCOL.md`](PROTOCOL.md);

the public dataset slice is under `datasets/`; the labeled positive corpus and the

dated baselines are withheld pending coordinated disclosure.

### Static baseline

A blind, code-only read over the full labeled `authz_v1` set (the positive items

and two secure twins are withheld from the public slice pending coordinated

disclosure), no advisory lookup. **Result: the method did NOT beat naive on this

slice: naive was strictly better on detection** (naive recall 0.83, precision 1.00,

Youden J 0.83; the runtime-gating "method" proxy lower at recall 0.67, precision

0.80, Youden J 0.57; signed method-minus-naive delta: recall −0.167, Youden J

−0.267, over the pinned 16-item 2026-06-03 baseline). The method's extra skepticism

flipped two calls the wrong way: it cleared

one real vuln to `secure` (a false negative, where an owner-scoped argument made

the dispatch *look* gated) and flagged a by-design shared-capability surface as

`vuln` (a false positive). We publish this negative result as-is. The specific

labeled POSITIVE items behind these counts, and the naive-vs-method baselines that

turn on them, are WITHHELD PENDING COORDINATED DISCLOSURE and will be published

once the underlying advisories are public.

### Runtime baseline (a diagnostic, not a demonstrated edge over naive)

A second, **dynamic** baseline stands the target up, drives the agent/tool-dispatch

path with a fresh per-run CSPRNG sentinel, and lets the runtime gating check fire (or

not) against a live oracle. On this small multi-finding subset the runtime method

matches ground truth, and its one decisive correction flips a static *false negative*

back to `vuln`: a low-privilege member denied on the REST entitlement sibling reached

owner-private content through the agent dispatch sink.

Read honestly, that correction is narrower than it first looks, and we label runtime a

**diagnostic, not an edge over a naive call**. The false negative it repaired was

introduced by the static *method's* own extra skepticism; the **naive single call

already flagged that same row `vuln`**. So on the one row that was truly re-run live,

runtime's delta over naive is **zero**: it undid a mistake the method made and naive

did not. Runtime has not been shown to beat a naive call, and it has not cleared the

method's other v1 error (the AnythingLLM false *positive*), whose runtime exoneration

remains future "expand" work. Two further caveats stated in the artifact and

`PROTOCOL.md`: only part of the subset was truly live-reran (the rest rest on recorded

evidence with no new sentinel minted), and this is not a full re-scoring of the v1

slice. Runtime is a per-row validation technique under test here, not the benchmark's

thesis. The per-target identities, the labeled POSITIVE corpus, and the naive-vs-method

baselines behind this subset are WITHHELD PENDING COORDINATED DISCLOSURE and will be

published once the underlying advisories are public.

## Differentiation

`sota_bench` is positioned against three reference points.

### vs. ZeroPath

ZeroPath is a strong commercial agentic scanner. `sota_bench` adds two axes it

does not score explicitly: **REST-vs-agent path-divergence** (the same operation

gated on one entry point but not its tool/agent twin) and **severity calibration**

(not just "is it a bug?" but "how bad, signed both ways?"). It is also open and

reproducible.

### vs. BACFuzz

BACFuzz is a dynamic broken-access-control fuzzer. `sota_bench` is

**model/language-agnostic and static-capable**, it scores systems that never run

the target, and it treats **exonerated negatives as first-class** labels rather

than as the absence of a crash, so a system is rewarded for correctly clearing

secure code, not only for triggering a failure.

### vs. Anthropic Mythos

Mythos is a large internal evaluation effort. `sota_bench` is **open,

model-agnostic, and honest-band-anchored**: severity is anchored to hand-assessed

CVSS bands rather than self-reported confidence. Be honest about provenance: the

widely-cited Mythos throughput figures are Anthropic *estimates*. The defensible,

comparable signal is the hand-assessed slice: high/critical true-positive rate,

exact-band severity agreement versus security firms, and the small fraction of

disclosed findings that reach a CVE/GHSA. `sota_bench` is built so the numbers it

reports are of the *hand-assessed, reproducible* kind, not the estimated kind.

## Durability: a dated corpus, a non-LLM oracle, and published losses

The durable asset here is not secrecy. It is three things a stronger model cannot

quietly erase:

- **A deterministic, non-LLM differential oracle over a DATED corpus.** Each scored

  item turns on a closed-form check (`fp_killer`) and a paired `secure` twin at the

  same code location, graded by a pure function of labels and predictions (no

  LLM-as-judge). A frontier agent can read the source and even execute it; it cannot

  fabricate a non-LLM oracle's verdict, and it cannot train on a finding whose public

  `evidence_date` postdates its training cutoff.

- **Temporal contamination control, not a secret split.** A benchmark whose labels are

  merely hidden is brittle and unverifiable: a reviewer cannot reproduce it, and the

  hidden labels still leak the day anyone indexes them. Instead, the public `authz_v1`

  slice is the open, reproducible front door, and contamination is controlled by DATE:

  a finding is scored against a model M only if its `evidence_date` is strictly after

  M's training cutoff ([`PROTOCOL.md`](PROTOCOL.md) L3). The same row can therefore be

  fully public AND a valid held-out test for every model whose cutoff predates it. Each

  run is stamped with a content-addressed `dataset_hash` so a published delta is

  attributable to an exact corpus version.

- **Published losses.** The headline is reported as-is, including where the method

  LOSES to a naive single call (the `authz_v1` static baseline below). A method whose

  edge erodes is published as eroded; the pre-registered SOTA loop is the mechanism

  that says so, on schedule.

There is a separate, disclosure-safety reason to hold a finding back: an UNFIXED

advisory. An embargoed vulnerability sits in a private vault, never scored and never

sent to a hosted model, until its advisory publishes, at which point it becomes a

PUBLIC DATED row (verifiable, and contamination-controlled for every prior model).

That vault is a coordinated-disclosure safeguard, not the durability mechanism, and

nothing in it is ever the headline.

## Selected public findings

The track record behind the method: coordinated security disclosures by the author

that are now public (severity as the official advisory rates it). Generated from a

single source of truth; each row is verified against the live advisory state before

it ships.

| finding | severity | identifier | advisory |

|---|---|---|---|

| Open WebUI | High 8.5 | CVE-2026-54008 | [CVE-2026-54008](https://github.com/open-webui/open-webui/security/advisories/GHSA-226f-f24g-524w) |

| dex | High | GHSA-7qjx | [GHSA-7qjx](https://github.com/dexidp/dex/security/advisories/GHSA-7qjx-gp9h-65qj) |

| Langroid | High | GHSA-2pq5 | [GHSA-2pq5](https://github.com/langroid/langroid/security/advisories/GHSA-2pq5-3q89-j7cc) |

| ouroboros | High | GHSA-jv2h | [GHSA-jv2h](https://github.com/Q00/ouroboros/security/advisories/GHSA-jv2h-4p9v-wf5w) |

| Open WebUI | High 7.3 | GHSA-3wgj | [GHSA-3wgj](https://github.com/advisories/GHSA-3wgj-c2hg-vm6q) |

| MCP Registry | Mod 6.3 | CVE-2026-44430 | [CVE-2026-44430](https://github.com/advisories/GHSA-r48c-v28r-pf6v) |

| GitHub MCP Server | Mod 6.0 | CVE-2026-48529 | [CVE-2026-48529](https://github.com/github/github-mcp-server/security/advisories/GHSA-pjp5-fpmr-3349) |

| Kirby | Mod 5.3 | CVE-2026-45334 | [CVE-2026-45334](https://github.com/getkirby/kirby/security/advisories/GHSA-39vq-49qm-r2mc) |

| Outline | Mod | CVE-2026-43890 | [CVE-2026-43890](https://github.com/outline/outline/security/advisories/GHSA-gf8h-cv9v-q4fw) |

| Google MCP Toolbox | Mod | PR #3324 | [PR #3324](https://github.com/googleapis/mcp-toolbox/pull/3324) |

Additional findings are in private coordination and are not listed until their

advisories publish.

## Verticals

The flagship vertical is `authz` (agent tool-dispatch authorization confusion):

broken object-/function-level authorization on tool handlers, REST-vs-agent path

divergence, and confused-deputy delegation through a tool-calling layer. A second

`decode` vertical is reserved for parsing/decoding-primitive bugs.

## Metrics, in brief

- **OWASP confusion matrix**: TP/FP/TN/FN with **Youden's J**

  (`sensitivity + specificity − 1`), rewarding systems that both catch vulns *and*

  exonerate secure variants.

- **PrimeVul VD-S**: false-negative rate at a fixed false-positive operating

  point, for comparability with the vulnerability-detection literature.

- **Signed both-ways CVSS-v3.1 calibration MAE**: inflation and deflation kept

  separate, plus exact-band agreement.

- **SOTA-validation delta loop**: continuous, pre-registered re-evaluation of the

  signed method-minus-naive gap on a frozen slice, so regression and saturation

  surface over time.

## References

- OWASP API Security Top 10 (2023): `API1:2023` Broken Object Level

  Authorization, `API5:2023` Broken Function Level Authorization.

  

- CWE-862 Missing Authorization; CWE-863 Incorrect Authorization; CWE-285

  Improper Authorization. 

- CVSS v3.1 Specification (FIRST). 

- PrimeVul / VD-S: Ding et al., *Vulnerability Detection with Code Language

  Models: How Far Are We?* 

- UK AI Safety Institute, *Inspect* evaluation framework.

  

- Model Context Protocol (MCP) specification. 

## License

Apache-2.0. See [`LICENSE`](LICENSE).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/matte1782/sota-bench

Awesome Lists containing this project

README