{"id":50873673,"url":"https://github.com/hinanohart/rollproof","last_synced_at":"2026-06-15T07:31:17.493Z","repository":{"id":362837526,"uuid":"1260859151","full_name":"hinanohart/rollproof","owner":"hinanohart","description":"Contamination-aware sequential A/B for robot-policy evaluation: propagate success-detector label noise and initial-condition mismatch into an anytime-valid verdict, on CPU, from logs alone (synthetic-validated, pre-alpha).","archived":false,"fork":false,"pushed_at":"2026-06-06T05:47:55.000Z","size":53,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-06T07:13:24.577Z","etag":null,"topics":["anytime-valid","confidence-sequence","covariate-shift","evaluation","label-noise","robotics","statistics"],"latest_commit_sha":null,"homepage":"https://github.com/hinanohart/rollproof","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hinanohart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-06T00:36:30.000Z","updated_at":"2026-06-06T05:45:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hinanohart/rollproof","commit_stats":null,"previous_names":["hinanohart/rollproof"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/hinanohart/rollproof","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Frollproof","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Frollproof/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Frollproof/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Frollproof/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hinanohart","download_url":"https://codeload.github.com/hinanohart/rollproof/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Frollproof/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34353189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anytime-valid","confidence-sequence","covariate-shift","evaluation","label-noise","robotics","statistics"],"created_at":"2026-06-15T07:31:16.510Z","updated_at":"2026-06-15T07:31:17.488Z","avatar_url":"https://github.com/hinanohart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# rollproof\n\n**Contamination-aware sequential A/B for robot-policy evaluation.**\n\n`rollproof` is a CPU-only, hardware-free reference implementation. It takes the\n*logs* of two robot policies being compared (A vs B) and asks a single,\nhonest question:\n\n\u003e *Given that the success detector is noisy and the two policies were not run\n\u003e under identical initial conditions, can we still say — at any stopping time —\n\u003e which policy is better, without inflating the false-positive rate?*\n\nIt does this by propagating two contamination sources into the verdict:\n\n1. **Success-detector label noise** — estimated from a small audit set as a\n   confusion matrix `(TPR, FPR)` and Krippendorff alpha, then removed with the\n   Natarajan unbiased transform `y_tilde = (y_hat - FPR) / J`, `J = TPR - FPR`.\n2. **Initial-condition mismatch** — fiducial-measured initial conditions are\n   reweighted to a common reference distribution via a covariate-shift\n   importance weight `w(x)`, so A and B are compared on the same footing.\n\nThe corrected, reweighted payoffs `z = w * y_tilde` are fed to a\n**weighted two-sample anytime-valid confidence sequence** (self-implemented;\ncites Howard et al. 2021 / Waudby-Smith \u0026 Ramdas 2020 rather than vendoring).\nThe result is a verdict that is valid at *every* peek, and that **fails closed**\n(returns `HOLD`/`INVALID`, never `PASS`) when the labels or initial conditions\nare too unreliable to support a decision.\n\n## Architecture\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/architecture.png\" alt=\"rollproof architecture\" width=\"840\"\u003e\n\u003c/div\u003e\n\n## Why\n\nRobot-policy leaderboards compare success rates measured by an imperfect\nauto-detector, on episodes that did not start from the same physical state.\nBoth effects bias naive A/B comparisons. `rollproof` does not build a\nleaderboard or a detector — it is a **measurement instrument** that sits on top\nof existing eval logs (LeRobot / AutoEval / RoboArena style) and reports a\ncontamination-aware, anytime-valid verdict.\n\n## Install\n\n```bash\npip install rollproof              # core (numpy, scipy, typer)\npip install \"rollproof[mcap]\"      # + MCAP container\npip install \"rollproof[adapters]\"  # + LeRobot/HF adapters\n```\n\n## Quickstart\n\n```bash\nroboeval doctor                    # report optional deps honestly\nroboeval ab AB.jsonl --alpha 0.05  # anytime-valid A/B verdict\n```\n\n```python\nfrom rollproof.seq import compare  # generic spine — reusable beyond robotics\nverdict = compare(z_a, z_b, alpha=0.05)\nprint(verdict.decision)  # PASS_A | PASS_B | NO_DIFF_YET | HOLD | INVALID\n```\n\n## How it works\n\nThe pipeline has six layers, each in its own submodule:\n\n| Layer | Module | Responsibility |\n|---|---|---|\n| Schema | `rollproof.schema` | Parse JSONL/MCAP rollout records into a typed `Ledger` |\n| Reliability | `rollproof.seq.reliability` | Estimate detector confusion matrix (TPR/FPR) and Krippendorff alpha from an audit set |\n| Decontamination | `rollproof.seq.unbias` | Apply the Natarajan transform to remove label noise from raw success flags |\n| Initial conditions | `rollproof.ic` | Fingerprint episode start states; compute covariate-shift importance weights |\n| Confidence sequence | `rollproof.seq.cs` | Run a bounded two-sample anytime-valid CS (Howard 2021 style) on corrected payoffs |\n| Decision | `rollproof.seq.verdict` | Apply trust gates; emit `PASS_A`, `PASS_B`, `NO_DIFF_YET`, `HOLD`, or `INVALID` |\n\nThe generic entry point (`rollproof.seq.compare`) is robotics-agnostic and can\nbe used for any bounded payoff streams. The robot-eval entry point\n(`rollproof.seq.compare_rollouts`) adds the trust gates and the fail-closed\ncontract.\n\n## Measured results\n\nAll numbers below are machine-generated from `results/v0.1.0a2_metrics.json`\n(`python scripts/run_metrics.py`), on synthetic ground truth.\n\n\u003c!--METRICS:START--\u003e\n| metric | value | target | pass |\n|---|---|---|---|\n| type-I false-stop rate (null) | 0.0000 | \u003c= 0.05 | True |\n| detection rate (true delta 0.30) | 1.0000 | \u003e 0.8 | True |\n| detector TPR recovery (MAE) | 0.0208 | \u003c 0.05 | True |\n| theta recovery (MAE) | 0.0144 | \u003c 0.05 | True |\n| IC spurious gap, naive | 0.2167 | (baseline) | - |\n| IC spurious gap, reweighted | 0.0316 | \u003c naive | True |\n| IC-reweight false-pass (null+shift) | 0.0000 | \u003c= 0.05 | True |\n| negative-control false-pass rate | 0.0000 | \u003c= 0.05 | True |\n| delta estimate error, naive | 0.1057 | (baseline) | - |\n| delta estimate error, rollproof | 0.0277 | \u003c naive | True |\n\n_Generated on 2026-06-06 (python 3.12.3, seed 20260606, mode synthetic). True delta in the estimate test = 0.30._\n\u003c!--METRICS:END--\u003e\n\n## Scope \u0026 honesty\n\n\u003c!--DISCLAIMER--\u003e\nrollproof is a CPU reference implementation that propagates label-noise (Krippendorff alpha) and initial-condition mismatch into anytime-valid A/B verdicts over a provenance record of robot-eval rollouts. It consumes MCAP or JSON-Lines records and cites confidence-sequence theory rather than reinventing it. Validated on algorithm-correctness metrics with synthetic ground-truth only — it does NOT measure or improve real-robot evaluation accuracy or reproducibility (that requires GPU policy rollouts and is out of scope).\n\u003c!--/DISCLAIMER--\u003e\n\n- The verified core (`rollproof.seq`) is checked on synthetic ground-truth.\n- The `schema` / `ic` / `cli` adapters ship with synthetic-log demos only.\n- Real-hardware eval-log ingestion is deferred (v0.1.1+).\n\n### Known limitations (a2)\n\n- The detector audit uncertainty (Clopper-Pearson on TPR/FPR) enters the\n  **trust gate** (`j_lo \u003e j_min`) but is **not yet propagated into the\n  confidence-sequence width**; the CS uses the point estimates of TPR/FPR. The\n  audit set is treated as a **fixed sample** (not sequentially peeked). Both are\n  planned for v0.2.\n- The initial-condition reweighting path (`ic_reweight=True`) uses **estimated,\n  clipped importance weights** (no self-normalization yet) with a bounded\n  `clip * [lo, hi]` proxy. Because the plug-in weights are not a fixed\n  predictable sequence, its type-I control is **approximate** (empirically\n  checked on synthetic nulls, not proven) and it is treated as conservative /\n  experimental. The label-noise-corrected default path (`ic_reweight=False`) is\n  the rigorous, recommended one. Self-normalized, variance-adaptive weighting is\n  planned for v0.2.\n- The CLI focuses on the anytime-valid path; a classical **fixed-sample exact**\n  test (Barnard) is available via the Python API\n  (`rollproof.seq.barnard_two_proportion`). Eval-log ingestion in a2 is via the\n  synthetic-log adapters / `rollproof.synth`; real-directory ingestion is\n  deferred (see above).\n\n## License\n\nMIT — see [LICENSE](LICENSE). Third-party notices in [NOTICE](NOTICE).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Frollproof","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhinanohart%2Frollproof","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Frollproof/lists"}