https://github.com/hinanohart/rollproof
Contamination-aware sequential A/B for robot-policy evaluation: propagate success-detector label noise and initial-condition mismatch into an anytime-valid verdict, on CPU, from logs alone (synthetic-validated, pre-alpha).
- Host: GitHub
- URL: https://github.com/hinanohart/rollproof
- Owner: hinanohart
- License: mit
- Created: 2026-06-06T00:36:30.000Z (19 days ago)
- Default Branch: main
- Last Pushed: 2026-06-06T05:47:55.000Z (19 days ago)
- Last Synced: 2026-06-06T07:13:24.577Z (19 days ago)
- Topics: anytime-valid, confidence-sequence, covariate-shift, evaluation, label-noise, robotics, statistics
- Language: Python
- Homepage: https://github.com/hinanohart/rollproof
- Size: 51.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
# rollproof
**Contamination-aware sequential A/B for robot-policy evaluation.**
`rollproof` is a CPU-only, hardware-free reference implementation. It takes the
*logs* of two robot policies being compared (A vs B) and asks a single,
honest question:
> *Given that the success detector is noisy and the two policies were not run
> under identical initial conditions, can we still say — at any stopping time —
> which policy is better, without inflating the false-positive rate?*
It does this by propagating two contamination sources into the verdict:
1. **Success-detector label noise**: estimated from a small audit set as a
confusion matrix `(TPR, FPR)` plus Krippendorff's alpha, then removed with the
Natarajan unbiased transform `y_tilde = (y_hat - FPR) / J`, where `y_hat` is the
raw detector flag and `J = TPR - FPR`.
2. **Initial-condition mismatch**: initial conditions measured from fiducials
are reweighted to a common reference distribution via a covariate-shift
importance weight `w(x)`, so A and B are compared on the same footing.
The corrected, reweighted payoffs `z = w * y_tilde` are fed to a
**weighted two-sample anytime-valid confidence sequence** (self-implemented;
cites Howard et al. 2021 / Waudby-Smith & Ramdas 2020 rather than vendoring).
The result is a verdict that is valid at *every* peek, and that **fails closed**
(returns `HOLD`/`INVALID`, never `PASS`) when the labels or initial conditions
are too unreliable to support a decision.
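
The label-noise correction can be checked by hand. The sketch below is illustrative only and is not taken from the `rollproof` source: the `Detector` class, `unbias`, and `expected_y_tilde` are hypothetical names, and the detector rates and the weight `w` are made-up numbers. It shows that, for a detector with known `(TPR, FPR)`, the corrected label `y_tilde` has the true label as its expectation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    """Point estimates of the success detector's confusion matrix (hypothetical)."""
    tpr: float  # P(y_hat = 1 | y = 1)
    fpr: float  # P(y_hat = 1 | y = 0)

    @property
    def j(self) -> float:
        # J = TPR - FPR (Youden's J); the transform is undefined when J <= 0.
        return self.tpr - self.fpr


def unbias(y_hat: int, det: Detector) -> float:
    """Corrected label y_tilde = (y_hat - FPR) / J."""
    if det.j <= 0:
        raise ValueError("detector is no better than chance (J <= 0)")
    return (y_hat - det.fpr) / det.j


def expected_y_tilde(y_true: int, det: Detector) -> float:
    """Exact E[y_tilde | y], summing over the detector's two possible outputs."""
    p_flag = det.tpr if y_true == 1 else det.fpr  # P(y_hat = 1 | y)
    return p_flag * unbias(1, det) + (1 - p_flag) * unbias(0, det)


det = Detector(tpr=0.9, fpr=0.2)  # made-up rates for illustration
# Up to floating-point rounding, these are 1.0 and 0.0: the correction is unbiased.
print(expected_y_tilde(1, det), expected_y_tilde(0, det))

# Corrected, reweighted payoff for one flagged episode (w is a made-up weight).
w = 1.5
z = w * unbias(1, det)
```

Note that `y_tilde` itself leaves `[0, 1]`: a flagged episode maps to `(1 - FPR) / J > 1` and an unflagged one to `-FPR / J < 0`. Only its expectation matches the true label, so the downstream confidence sequence has to work with the wider bounded range.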
## Why
Robot-policy leaderboards compare success rates measured by an imperfect
auto-detector, on episodes that did not start from the same physical state.
Both effects bias naive A/B comparisons. `rollproof` does not build a
leaderboard or a detector — it is a **measurement instrument** that sits on top
of existing eval logs (LeRobot / AutoEval / RoboArena style) and reports a
contamination-aware, anytime-valid verdict.
## Install
```bash
pip install rollproof # core (numpy, scipy, typer)
pip install "rollproof[mcap]" # + MCAP container
pip install "rollproof[adapters]" # + LeRobot/HF adapters
```
## Quickstart
```bash
roboeval doctor # report optional deps honestly
roboeval ab AB.jsonl --alpha 0.05 # anytime-valid A/B verdict
```
```python
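from rollproof.seq import compare  # generic spine, reusable beyond robotics

# z_a, z_b: bounded payoff streams for policies A and B
# (for robot evals, the corrected payoffs z = w * y_tilde described above)
verdict = compare(z_a, z_b, alpha=0.05)
print(verdict.decision)  # PASS_A | PASS_B | NO_DIFF_YET | HOLD | INVALID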
```
## How it works
The pipeline has six layers, each in its own submodule:
| Layer | Module | Responsibility |
|---|---|---|
| Schema | `rollproof.schema` | Parse JSONL/MCAP rollout records into a typed `Ledger` |
| Reliability | `rollproof.seq.reliability` | Estimate detector confusion matrix (TPR/FPR) and Krippendorff alpha from an audit set |
| Decontamination | `rollproof.seq.unbias` | Apply the Natarajan transform to remove label noise from raw success flags |
| Initial conditions | `rollproof.ic` | Fingerprint episode start states; compute covariate-shift importance weights |
| Confidence sequence | `rollproof.seq.cs` | Run a bounded two-sample anytime-valid CS (Howard 2021 style) on corrected payoffs |
| Decision | `rollproof.seq.verdict` | Apply trust gates; emit `PASS_A`, `PASS_B`, `NO_DIFF_YET`, `HOLD`, or `INVALID` |
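
To make the initial-conditions layer concrete, here is a toy sketch of covariate-shift reweighting. It is not the `rollproof.ic` implementation. It assumes initial conditions have already been reduced to discrete bins, and the function names, the bin labels, and the `clip` default are all hypothetical. Two policies with identical per-bin behavior show a large naive gap purely because they started from different bins; reweighting to a common reference removes it.

```python
from collections import Counter


def importance_weights(bins, ref_probs, clip=10.0):
    """Per-episode weight w(x) = p_ref(x) / p_hat(x), clipped from above.

    bins      : discrete initial-condition bin label for each episode
    ref_probs : dict mapping bin label -> probability under the reference
    """
    counts = Counter(bins)
    n = len(bins)
    weights = []
    for b in bins:
        p_hat = counts[b] / n  # empirical frequency of this bin for this policy
        weights.append(min(ref_probs.get(b, 0.0) / p_hat, clip))
    return weights


def weighted_mean(values, weights):
    """Plain (not self-normalized) importance-weighted mean."""
    return sum(w * v for w, v in zip(weights, values)) / len(values)


# Toy logs: policy A mostly started from "easy" states, B mostly from "hard".
bins_a = ["easy"] * 8 + ["hard"] * 2
bins_b = ["easy"] * 2 + ["hard"] * 8
# Both policies behave identically: success on "easy", failure on "hard".
y_a = [1.0 if b == "easy" else 0.0 for b in bins_a]
y_b = [1.0 if b == "easy" else 0.0 for b in bins_b]

ref = {"easy": 0.5, "hard": 0.5}  # common reference distribution
naive_gap = sum(y_a) / len(y_a) - sum(y_b) / len(y_b)
fair_gap = (weighted_mean(y_a, importance_weights(bins_a, ref))
            - weighted_mean(y_b, importance_weights(bins_b, ref)))
print(naive_gap, fair_gap)  # spurious gap vs. reweighted gap
```

Because the weights here are estimated from the same logs they reweight, this toy version shares the caveat listed under the known limitations: plug-in weights are not a fixed predictable sequence.
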
The generic entry point (`rollproof.seq.compare`) is robotics-agnostic and can
be used for any bounded payoff streams. The robot-eval entry point
(`rollproof.seq.compare_rollouts`) adds the trust gates and the fail-closed
contract.
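
For intuition about the confidence-sequence and decision layers, the following sketch uses one textbook construction: the two-sided normal-mixture boundary for sub-Gaussian sums described in Howard et al. 2021. It is not the `rollproof.seq.cs` code. It simplifies the two-sample problem to paired episodes, and `mixture_radius`, `compare_paired`, and the tuning constant `rho` are hypothetical names. The trust gates that produce `HOLD` or `INVALID` are not modeled.

```python
import math
import random


def mixture_radius(t: int, sigma: float, alpha: float, rho: float = 1.0) -> float:
    """Half-width at time t of a two-sided normal-mixture confidence sequence
    for the running mean of sigma-sub-Gaussian observations."""
    v = t * sigma ** 2  # accumulated variance proxy
    return math.sqrt((v + rho) * math.log((v + rho) / (rho * alpha ** 2))) / t


def compare_paired(z_a, z_b, lo: float, hi: float, alpha: float = 0.05):
    """Scan paired payoffs in order; stop the first time 0 leaves the interval."""
    sigma = hi - lo  # each difference lies in [-(hi - lo), hi - lo]
    total, t = 0.0, 0
    for a, b in zip(z_a, z_b):
        t += 1
        total += a - b
        mean, r = total / t, mixture_radius(t, sigma, alpha)
        if mean - r > 0:
            return "PASS_A", t
        if mean + r < 0:
            return "PASS_B", t
    return "NO_DIFF_YET", t


# Synthetic payoffs: A succeeds 80% of the time, B 50% (made-up rates).
rng = random.Random(0)
z_a = [float(rng.random() < 0.8) for _ in range(2000)]
z_b = [float(rng.random() < 0.5) for _ in range(2000)]
print(compare_paired(z_a, z_b, lo=0.0, hi=1.0))
```

The interval is checked after every episode, yet the probability that it ever excludes the true mean difference stays below `alpha`. That is the property that makes stopping at any peek safe. The price is a wider interval than a fixed-sample test would give at the same `t`.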
## Measured results
All numbers below are machine-generated from `results/v0.1.0a2_metrics.json`
(`python scripts/run_metrics.py`), on synthetic ground truth.
| metric | value | target | pass |
|---|---|---|---|
| type-I false-stop rate (null) | 0.0000 | <= 0.05 | True |
| detection rate (true delta 0.30) | 1.0000 | > 0.8 | True |
| detector TPR recovery (MAE) | 0.0208 | < 0.05 | True |
| theta recovery (MAE) | 0.0144 | < 0.05 | True |
| IC spurious gap, naive | 0.2167 | (baseline) | - |
| IC spurious gap, reweighted | 0.0316 | < naive | True |
| IC-reweight false-pass (null+shift) | 0.0000 | <= 0.05 | True |
| negative-control false-pass rate | 0.0000 | <= 0.05 | True |
| delta estimate error, naive | 0.1057 | (baseline) | - |
| delta estimate error, rollproof | 0.0277 | < naive | True |
_Generated on 2026-06-06 (python 3.12.3, seed 20260606, mode synthetic). True delta in the estimate test = 0.30._
## Scope & honesty
`rollproof` is a CPU reference implementation that propagates label noise (Krippendorff alpha) and initial-condition mismatch into anytime-valid A/B verdicts over a provenance record of robot-eval rollouts. It consumes MCAP or JSON Lines records and cites confidence-sequence theory rather than reinventing it. It is validated on algorithm-correctness metrics with synthetic ground truth only: it does NOT measure or improve real-robot evaluation accuracy or reproducibility (that requires GPU policy rollouts and is out of scope).
- The verified core (`rollproof.seq`) is checked on synthetic ground-truth.
- The `schema` / `ic` / `cli` adapters ship with synthetic-log demos only.
- Real-hardware eval-log ingestion is deferred (v0.1.1+).
### Known limitations (v0.1.0a2)
- The detector audit uncertainty (Clopper-Pearson on TPR/FPR) enters the
**trust gate** (`j_lo > j_min`) but is **not yet propagated into the
confidence-sequence width**; the CS uses the point estimates of TPR/FPR. The
audit set is treated as a **fixed sample** (not sequentially peeked). Both are
planned for v0.2.
- The initial-condition reweighting path (`ic_reweight=True`) uses **estimated,
clipped importance weights** (no self-normalization yet) with a bounded
`clip * [lo, hi]` proxy. Because the plug-in weights are not a fixed
predictable sequence, its type-I control is **approximate** (empirically
checked on synthetic nulls, not proven) and it is treated as conservative /
experimental. The label-noise-corrected default path (`ic_reweight=False`) is
the rigorous, recommended one. Self-normalized, variance-adaptive weighting is
planned for v0.2.
- The CLI focuses on the anytime-valid path; a classical **fixed-sample exact**
test (Barnard) is available via the Python API
(`rollproof.seq.barnard_two_proportion`). Eval-log ingestion in a2 is via the
synthetic-log adapters / `rollproof.synth`; real-directory ingestion is
deferred (see above).
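
The first limitation mentions a trust gate of the form `j_lo > j_min` built from Clopper-Pearson intervals on TPR and FPR. The sketch below shows one way such a gate could be computed with the standard library only. It is an assumption, not the library's code: the source does not say how `j_lo` is formed, so defining it as the lower TPR bound minus the upper FPR bound is a guess. The `j_min` default and the audit counts are made up.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    def solve(f):  # f is increasing on [0, 1]; find x with f(x) = alpha / 2
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisection, far below float precision after 60 steps
            mid = (lo + hi) / 2
            if f(mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2. Upper bound: P(X <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p))
    upper = 1.0 if k == n else 1 - solve(lambda q: binom_cdf(k, n, 1 - q))
    return lower, upper


def trust_gate(tp, n_pos, fp, n_neg, j_min=0.3, alpha=0.05):
    """Return (j_lo, ok), where j_lo is a conservative lower bound on J."""
    tpr_lo, _ = clopper_pearson(tp, n_pos, alpha)
    _, fpr_hi = clopper_pearson(fp, n_neg, alpha)
    j_lo = tpr_lo - fpr_hi  # assumed definition, see the note above
    return j_lo, j_lo > j_min


# Same observed rates (90% TPR, 10% FPR), different audit sizes.
j_large, ok_large = trust_gate(tp=45, n_pos=50, fp=5, n_neg=50)
j_small, ok_small = trust_gate(tp=9, n_pos=10, fp=1, n_neg=10)
print(ok_large, ok_small)
```

With identical point estimates, the larger audit clears the gate and the smaller one does not. This is the fail-closed behavior described earlier: a detector that looks good on too few audited episodes is not trusted.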
## License
MIT — see [LICENSE](LICENSE). Third-party notices in [NOTICE](NOTICE).