https://github.com/hinanohart/rollproof

Contamination-aware sequential A/B for robot-policy evaluation: propagate success-detector label noise and initial-condition mismatch into an anytime-valid verdict, on CPU, from logs alone (synthetic-validated, pre-alpha).

anytime-valid confidence-sequence covariate-shift evaluation label-noise robotics statistics

Host: GitHub
URL: https://github.com/hinanohart/rollproof
Owner: hinanohart
License: mit
Created: 2026-06-06T00:36:30.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-06-06T05:47:55.000Z (about 2 months ago)
Last Synced: 2026-06-06T07:13:24.577Z (about 2 months ago)
Topics: anytime-valid, confidence-sequence, covariate-shift, evaluation, label-noise, robotics, statistics
Language: Python
Homepage: https://github.com/hinanohart/rollproof
Size: 51.8 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE

# rollproof

**Contamination-aware sequential A/B for robot-policy evaluation.**

`rollproof` is a CPU-only, hardware-free reference implementation. It takes the
*logs* of two robot policies being compared (A vs B) and asks a single,
honest question:

> *Given that the success detector is noisy and the two policies were not run
> under identical initial conditions, can we still say, at any stopping time,
> which policy is better, without inflating the false-positive rate?*

It does this by propagating two contamination sources into the verdict:

1. **Success-detector label noise**: estimated from a small audit set as a
   confusion matrix `(TPR, FPR)` and Krippendorff's alpha, then removed with the
   Natarajan unbiased transform `y_tilde = (y_hat - FPR) / J`, `J = TPR - FPR`.
2. **Initial-condition mismatch**: fiducial-measured initial conditions are
   reweighted to a common reference distribution via a covariate-shift
   importance weight `w(x)`, so A and B are compared on the same footing.
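
The transform in item 1 can be checked numerically. The following is a minimal sketch, not rollproof's own code: the function name `unbias`, the detector rates, and the true success rate are all hypothetical values chosen for illustration. It simulates a noisy detector and compares the naive success rate with the corrected one.

```python
import random


def unbias(y_hat, tpr, fpr):
    """Apply y_tilde = (y_hat - FPR) / J with J = TPR - FPR to one 0/1 flag."""
    j = tpr - fpr
    if j <= 0:
        raise ValueError("detector is uninformative (TPR <= FPR)")
    return (y_hat - fpr) / j


# Hypothetical detector quality and true success rate (illustration only).
tpr, fpr = 0.8, 0.3
true_rate = 0.3

rng = random.Random(0)
n = 50_000
y_hat = []
for _ in range(n):
    y = rng.random() < true_rate          # latent true outcome
    flag_prob = tpr if y else fpr         # detector fires with TPR or FPR
    y_hat.append(1 if rng.random() < flag_prob else 0)

naive = sum(y_hat) / n
corrected = sum(unbias(v, tpr, fpr) for v in y_hat) / n
print(f"naive={naive:.3f} corrected={corrected:.3f} truth={true_rate}")
```

Each individual `y_tilde` lies outside `[0, 1]` (here it is either `-0.6` or `1.4`), so only the average is interpretable as a success rate, and any bounded-payoff method applied afterwards has to use these widened bounds.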

The corrected, reweighted payoffs `z = w * y_tilde` are fed to a
**weighted two-sample anytime-valid confidence sequence** (self-implemented;
cites Howard et al. 2021 / Waudby-Smith & Ramdas 2020 rather than vendoring).

The result is a verdict that is valid at *every* peek, and that **fails closed**
(returns `HOLD`/`INVALID`, never `PASS`) when the labels or initial conditions
are too unreliable to support a decision.
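
To make "valid at every peek" concrete, here is a minimal sketch of an anytime-valid confidence sequence. It is not the boundary from the cited papers and not rollproof's implementation: it uses a cruder construction, a Hoeffding interval at each step with the error budget split as `alpha / (t * (t + 1))` so the per-step errors sum to `alpha`. The names `hoeffding_cs` and `decide` are hypothetical, and the sketch assumes paired differences `z_a[t] - z_b[t]` that arrive in lockstep and lie in `[lo, hi]`.

```python
import math


def hoeffding_cs(diffs, alpha=0.05, lo=-1.0, hi=1.0):
    """Yield (t, lower, upper) bounds on the mean difference after each step.

    Union bound over time: step t spends alpha_t = alpha / (t * (t + 1)),
    and sum_t alpha_t = alpha, so all intervals hold simultaneously
    with probability at least 1 - alpha.
    """
    total = 0.0
    for t, d in enumerate(diffs, start=1):
        if not lo <= d <= hi:
            raise ValueError("observation outside the declared bounds")
        total += d
        alpha_t = alpha / (t * (t + 1))
        # Two-sided Hoeffding radius for t observations with range (hi - lo).
        radius = (hi - lo) * math.sqrt(math.log(2.0 / alpha_t) / (2.0 * t))
        mean = total / t
        yield t, mean - radius, mean + radius


def decide(diffs, alpha=0.05, lo=-1.0, hi=1.0):
    """Stop at the first step whose interval excludes zero."""
    t = 0
    for t, lower, upper in hoeffding_cs(diffs, alpha, lo, hi):
        if lower > 0:
            return "PASS_A", t
        if upper < 0:
            return "PASS_B", t
    return "NO_DIFF_YET", t


print(decide([1.0] * 200))   # A always beats B
print(decide([0.0] * 200))   # no difference
```

The union-bound intervals are wider than those of the boundaries in the cited literature, so this sketch needs more episodes to reach a decision. The trust gates that produce `HOLD` or `INVALID` are omitted here.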

## Why

Robot-policy leaderboards compare success rates measured by an imperfect
auto-detector, on episodes that did not start from the same physical state.
Both effects bias naive A/B comparisons. `rollproof` does not build a
leaderboard or a detector; it is a **measurement instrument** that sits on top
of existing eval logs (LeRobot / AutoEval / RoboArena style) and reports a
contamination-aware, anytime-valid verdict.

## Install

```bash
pip install rollproof              # core (numpy, scipy, typer)
pip install "rollproof[mcap]"      # + MCAP container
pip install "rollproof[adapters]"  # + LeRobot/HF adapters
```

## Quickstart

```bash
roboeval doctor                    # report optional deps honestly
roboeval ab AB.jsonl --alpha 0.05  # anytime-valid A/B verdict
```

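```python
from rollproof.seq import compare  # generic spine, reusable beyond robotics

# z_a, z_b: bounded payoff streams for policies A and B
# (for robot evals, the corrected and reweighted payoffs z = w * y_tilde)
verdict = compare(z_a, z_b, alpha=0.05)
print(verdict.decision)  # PASS_A | PASS_B | NO_DIFF_YET | HOLD | INVALID
```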

## How it works

The pipeline has six layers, each in its own submodule:

| Layer | Module | Responsibility |
|---|---|---|
| Schema | `rollproof.schema` | Parse JSONL/MCAP rollout records into a typed `Ledger` |
| Reliability | `rollproof.seq.reliability` | Estimate detector confusion matrix (TPR/FPR) and Krippendorff's alpha from an audit set |
| Decontamination | `rollproof.seq.unbias` | Apply the Natarajan transform to remove label noise from raw success flags |
| Initial conditions | `rollproof.ic` | Fingerprint episode start states; compute covariate-shift importance weights |
| Confidence sequence | `rollproof.seq.cs` | Run a bounded two-sample anytime-valid CS (Howard et al. 2021 style) on corrected payoffs |
| Decision | `rollproof.seq.verdict` | Apply trust gates; emit `PASS_A`, `PASS_B`, `NO_DIFF_YET`, `HOLD`, or `INVALID` |

The generic entry point (`rollproof.seq.compare`) is robotics-agnostic and can
be used for any bounded payoff streams. The robot-eval entry point
(`rollproof.seq.compare_rollouts`) adds the trust gates and the fail-closed
contract.
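
The initial-conditions layer is easiest to see on a toy case. The sketch below is an illustration under stated assumptions, not the `rollproof.ic` API: the bin names, success probabilities, run distributions, and the helpers `expected_rate` and `importance_weights` are all hypothetical. Both policies are identical, success depends only on the start-state bin, and A happened to be run from easier starts than B.

```python
# Hypothetical setup: success depends only on the initial-condition bin,
# and policies A and B are identical, so the true gap is exactly 0.
p_success = {"easy": 0.9, "hard": 0.3}

ic_a = {"easy": 0.8, "hard": 0.2}   # A was mostly run from easy starts
ic_b = {"easy": 0.4, "hard": 0.6}   # B was mostly run from hard starts
ref = {"easy": 0.5, "hard": 0.5}    # common reference distribution


def importance_weights(ic_dist, ref_dist, clip=10.0):
    """Density ratio w(x) = ref(x) / ic(x), clipped from above."""
    return {x: min(ref_dist[x] / ic_dist[x], clip) for x in ic_dist}


def expected_rate(ic_dist, weights=None):
    """Expected (optionally weighted) success rate under ic_dist."""
    weights = weights or {x: 1.0 for x in ic_dist}
    return sum(ic_dist[x] * weights[x] * p_success[x] for x in ic_dist)


naive_gap = expected_rate(ic_a) - expected_rate(ic_b)

w_a = importance_weights(ic_a, ref)
w_b = importance_weights(ic_b, ref)
reweighted_gap = expected_rate(ic_a, w_a) - expected_rate(ic_b, w_b)

print(f"naive gap      = {naive_gap:.4f}")
print(f"reweighted gap = {reweighted_gap:.4f}")
```

With exact, unclipped weights the spurious gap vanishes in expectation. In practice the weights are estimated from logged initial conditions and clipped, so a residual gap generally remains.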

## Measured results

All numbers below are machine-generated from `results/v0.1.0a2_metrics.json`
(`python scripts/run_metrics.py`), on synthetic ground truth.

| metric | value | target | pass |
|---|---|---|---|
| type-I false-stop rate (null) | 0.0000 | <= 0.05 | True |
| detection rate (true delta 0.30) | 1.0000 | > 0.8 | True |
| detector TPR recovery (MAE) | 0.0208 | < 0.05 | True |
| theta recovery (MAE) | 0.0144 | < 0.05 | True |
| IC spurious gap, naive | 0.2167 | (baseline) | - |
| IC spurious gap, reweighted | 0.0316 | < naive | True |
| IC-reweight false-pass (null+shift) | 0.0000 | <= 0.05 | True |
| negative-control false-pass rate | 0.0000 | <= 0.05 | True |
| delta estimate error, naive | 0.1057 | (baseline) | - |
| delta estimate error, rollproof | 0.0277 | < naive | True |

_Generated on 2026-06-06 (python 3.12.3, seed 20260606, mode synthetic). True delta in the estimate test = 0.30._

## Scope & honesty

rollproof is a CPU reference implementation that propagates label-noise (Krippendorff alpha) and initial-condition mismatch into anytime-valid A/B verdicts over a provenance record of robot-eval rollouts. It consumes MCAP or JSON-Lines records and cites confidence-sequence theory rather than reinventing it. Validated on algorithm-correctness metrics with synthetic ground-truth only — it does NOT measure or improve real-robot evaluation accuracy or reproducibility (that requires GPU policy rollouts and is out of scope).

- The verified core (`rollproof.seq`) is checked on synthetic ground-truth.
- The `schema` / `ic` / `cli` adapters ship with synthetic-log demos only.
- Real-hardware eval-log ingestion is deferred (v0.1.1+).

### Known limitations (a2)

- The detector audit uncertainty (Clopper-Pearson on TPR/FPR) enters the
  **trust gate** (`j_lo > j_min`) but is **not yet propagated into the
  confidence-sequence width**; the CS uses the point estimates of TPR/FPR. The
  audit set is treated as a **fixed sample** (not sequentially peeked).
  Propagating the audit uncertainty into the CS width and supporting a
  sequentially peeked audit set are both planned for v0.2.
- The initial-condition reweighting path (`ic_reweight=True`) uses **estimated,
  clipped importance weights** (no self-normalization yet) with a bounded
  `clip * [lo, hi]` proxy. Because the plug-in weights are not a fixed
  predictable sequence, its type-I control is **approximate** (empirically
  checked on synthetic nulls, not proven) and it is treated as conservative /
  experimental. The label-noise-corrected default path (`ic_reweight=False`) is
  the rigorous, recommended one. Self-normalized, variance-adaptive weighting is
  planned for v0.2.
- The CLI focuses on the anytime-valid path; a classical **fixed-sample exact**
  test (Barnard) is available via the Python API
  (`rollproof.seq.barnard_two_proportion`). Eval-log ingestion in a2 is via the
  synthetic-log adapters / `rollproof.synth`; real-directory ingestion is
  deferred (see above).
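
The first limitation mentions a trust gate of the form `j_lo > j_min` built from Clopper-Pearson bounds. A minimal sketch of such a gate follows. How rollproof forms `j_lo` is not stated above, so the choice here (lower bound on TPR minus upper bound on FPR) is an assumption, as are the names `clopper_pearson` and `trust_gate`, the threshold `j_min=0.2`, and the audit counts in the usage lines.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n."""

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2, i.e. cdf(k - 1; p) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def trust_gate(tp, n_pos, fp, n_neg, j_min=0.2, alpha=0.05):
    """Fail closed unless a conservative bound on J = TPR - FPR clears j_min."""
    tpr_lo, _ = clopper_pearson(tp, n_pos, alpha)
    _, fpr_hi = clopper_pearson(fp, n_neg, alpha)
    j_lo = tpr_lo - fpr_hi  # assumed construction of j_lo
    return ("OK" if j_lo > j_min else "HOLD"), j_lo


print(trust_gate(tp=95, n_pos=100, fp=5, n_neg=100))  # large, clean audit
print(trust_gate(tp=6, n_pos=10, fp=4, n_neg=10))     # small, noisy audit
```

A small audit set fails the gate even when the point estimate of `J` is positive, because the interval on each rate is wide. That is the fail-closed behaviour: too little audit evidence yields `HOLD`, not a verdict.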

## License

MIT — see [LICENSE](LICENSE). Third-party notices in [NOTICE](NOTICE).
