https://github.com/bes-dev/reharness-bench
Execution-based benchmark for the reharness reasoning compiler
- Host: GitHub
- URL: https://github.com/bes-dev/reharness-bench
- Owner: bes-dev
- Created: 2026-06-05T20:00:50.000Z (21 days ago)
- Default Branch: master
- Last Pushed: 2026-06-12T22:21:56.000Z (14 days ago)
- Last Synced: 2026-06-13T00:16:29.280Z (14 days ago)
- Language: TypeScript
- Size: 313 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# reharness-bench
Execution-based benchmark for the [reharness](../pi-fsm) reasoning compiler. It compiles real and
authored agent sessions (and NL requests) into FSM pipelines, then **runs the compiled pipeline on a
fixture and checks the real outcome** with a deterministic verifier — "compiles green" is not the bar.
## Layers (gated, cheap → expensive)
- **L0 ingest** — the session/request was read + staged (format-agnostic)
- **L1 compile** — verify is green (tsc + structural + dataflow)
- **L2 non-hollow** — the lib isn't a verify-passing stub
- **L3 fidelity** — the PRD captures the task + the external target is parameterised (`` / manifest)
- **L4 execution** — the compiled command runs on a fixture; a deterministic verifier checks the outcome
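The gating described above can be sketched roughly as follows: layers run in order from cheapest to most expensive, and the first failure stops the run so later layers are never paid for. This is a minimal illustration, not the harness in `run.mts`; the `LayerCheck`, `GateResult`, and `runGated` names are hypothetical, and the check bodies are stand-ins.

```typescript
// Hypothetical sketch of gated layer evaluation (not the actual run.mts logic).
type Layer = "L0" | "L1" | "L2" | "L3" | "L4";

interface LayerCheck {
  layer: Layer;
  run: () => boolean; // true means the layer passed
}

interface GateResult {
  reached: Layer | null;  // highest layer that passed
  failedAt: Layer | null; // first layer that failed, if any
}

// Run checks in order and stop at the first failure, so expensive
// layers are skipped whenever a cheaper one already failed.
function runGated(checks: LayerCheck[]): GateResult {
  let reached: Layer | null = null;
  for (const check of checks) {
    if (!check.run()) {
      return { reached, failedAt: check.layer };
    }
    reached = check.layer;
  }
  return { reached, failedAt: null };
}

// Usage: L2 fails here, so L3 and L4 are never executed.
const calls: Layer[] = [];
const record = (layer: Layer, pass: boolean) => () => {
  calls.push(layer);
  return pass;
};
const result = runGated([
  { layer: "L0", run: record("L0", true) },
  { layer: "L1", run: record("L1", true) },
  { layer: "L2", run: record("L2", false) },
  { layer: "L3", run: record("L3", true) },
  { layer: "L4", run: record("L4", true) },
]);
console.log(result, calls);
```

Reporting both the highest passing layer and the failing layer keeps a partial result (for example, "compiles but is hollow") distinguishable from a total failure.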
## Run
```
npm install
npm link reharness # put the compiler under test on PATH (or: REHARNESS_CLI=/path/to/dist/cli.js)
npm run bench # all cases
npm run bench -- <case> # one case, e.g. trace-fixbug
npm run bench -- corpus 3 # compile the 3 largest real synthtraces (L0–L2)
npm run mine # stratify the corpus by capability (heuristic, no LLM)
```
Set `REHARNESS_CLI` to a working-tree `dist/cli.js` to bench an unreleased build.
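One way a harness could honour this override is sketched below. It is an assumption-laden illustration: the `resolveCli` helper and `CliInvocation` type are hypothetical, and the choices to treat an empty value as unset and to launch the `.js` entry point through `node` are guesses, not documented behaviour of this bench.

```typescript
// Hypothetical sketch: pick the compiler under test from the environment.
interface CliInvocation {
  command: string;
  args: string[];
}

function resolveCli(env: Record<string, string | undefined>): CliInvocation {
  const override = env["REHARNESS_CLI"]?.trim();
  if (override) {
    // Assumption: the override points at a JS entry point, so run it with node.
    return { command: "node", args: [override] };
  }
  // Assumption: otherwise rely on the linked `reharness` binary on PATH.
  return { command: "reharness", args: [] };
}

// Usage with an explicit environment object (the path is the placeholder from the text above).
const fromOverride = resolveCli({ REHARNESS_CLI: "/path/to/dist/cli.js" });
const fromPath = resolveCli({});
console.log(fromOverride, fromPath);
```

Passing the environment in as a parameter, rather than reading it globally, keeps the resolution logic easy to test.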
## Layout
- `cases/<case>/` — a case: `meta.json`, `session.jsonl|md` (or an NL `request` in meta), `fixture/`, optional `verify.mjs`
- `run.mts` — the harness (compile → L0–L4)
- `mine.mts` — corpus property-stratifier
- `CAPABILITIES.md` — the compiler capability matrix this bench targets
- `corpus/` — third-party trajectory data (gitignored): `synthtraces/` (julien-c/synthtraces), `nemotron/` (nvidia/Nemotron-Agentic-v1), `traces/` (captured subagent runs)
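To make the idea of a deterministic verifier concrete, here is a minimal sketch of what a per-case check could look like. The contract of `verify.mjs` is not documented in this README, so everything here is hypothetical: the `FixtureFiles` and `Verdict` types, the `verifyOutcome` function, and the example file names.

```typescript
// Hypothetical verifier shape; the real verify.mjs contract may differ.
type FixtureFiles = Map<string, string>; // relative path -> contents after the run

interface Verdict {
  ok: boolean;
  reason: string;
}

// Deterministic check: every expected file must exist with the expected
// contents (ignoring leading and trailing whitespace). No clock, network, or LLM.
function verifyOutcome(files: FixtureFiles, expected: Record<string, string>): Verdict {
  for (const [path, want] of Object.entries(expected)) {
    const got = files.get(path);
    if (got === undefined) {
      return { ok: false, reason: `missing ${path}` };
    }
    if (got.trim() !== want.trim()) {
      return { ok: false, reason: `unexpected content in ${path}` };
    }
  }
  return { ok: true, reason: "all expected files match" };
}

// Usage with made-up fixture state.
const after: FixtureFiles = new Map([["out/result.txt", "42\n"]]);
const pass = verifyOutcome(after, { "out/result.txt": "42" });
const fail = verifyOutcome(after, { "out/report.txt": "done" });
console.log(pass, fail);
```

Because the verdict depends only on the fixture state, the same run always produces the same result, which is what allows a check like this to gate the execution layer.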
## Provenance of cases
- **mined real-trace** (`trace-*`): captured from real subagent runs on self-verifying tasks (unbiased trajectories; tasks authored)
- **authored sessions**: hand-written demonstrations
- **NL-request**: compiled from a natural-language request (no session)
- **corpus**: third-party HF datasets — L0–L2 only (no fixtures/gold)