{"id":50873656,"url":"https://github.com/hinanohart/vlatrust","last_synced_at":"2026-06-15T07:31:14.144Z","repository":{"id":360778959,"uuid":"1251670375","full_name":"hinanohart/vlatrust","owner":"hinanohart","description":"Calibration-under-shift trust harness for Vision-Language-Action (VLA) policies: does a policy's confidence degrade in step with its competence as input shift rises?","archived":false,"fork":false,"pushed_at":"2026-06-10T09:29:17.000Z","size":111,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T10:11:54.620Z","etag":null,"topics":["calibration","conformal-prediction","distribution-shift","robotics","trustworthy-ai","uncertainty-quantification","vla"],"latest_commit_sha":null,"homepage":"https://github.com/hinanohart/vlatrust","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hinanohart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-27T20:03:58.000Z","updated_at":"2026-06-10T09:29:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hinanohart/vlatrust","commit_stats":null,"previous_names":["hinanohart/vlatrust"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/hinanohart/vlatrust","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fvlatrust","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fvlatrust/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fvlatrust/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fvlatrust/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hinanohart","download_url":"https://codeload.github.com/hinanohart/vlatrust/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hinanohart%2Fvlatrust/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34353189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["calibration","conformal-prediction","distribution-shift","robotics","trustworthy-ai","uncertainty-quantification","vla"],"created_at":"2026-06-15T07:31:13.100Z","updated_at":"2026-06-15T07:31:14.130Z","avatar_url":"https://github.com/hinanohart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vlatrust\n\n**Calibration-under-shift trust harness for Vision-Language-Action (VLA) policies.**\n\nA VLA policy can score ~97% on its in-distribution benchmark and then drop to\nnear-0% the moment the scene, lighting, instruction phrasing, or initial state\nshifts — *while remaining just as confident*. `vlatrust` measures exactly that\nfailure mode: **does a policy's confidence degrade in step with its competence\nas input distribution shift rises?**\n\nIt works over **recorded action traces** (no GPU, no simulator required for the\ncore), emitting:\n\n- a **conformal abstention gate** with distribution-free finite-sample coverage,\n- a **Reliability-Shift / collapse curve** (success rate vs. perturbation intensity, per modality),\n- a single **Trust-Shift score** that is high only when the policy *flags its own* degradation.\n\n## Architecture overview\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/architecture.png\" alt=\"vlatrust architecture\" width=\"840\"\u003e\n\u003c/div\u003e\n\n## Status \u0026 honesty (read this first)\n\n**v0.1.0a1 is a pre-alpha _framework_.** What is and isn't backed by data:\n\n- ✅ **The metric's behaviour is validated** — a 163-test CPU suite, plus an\n  env-stamped measurement (`bench_results/v0.1.0a1_falsification.json`).\n- ⚠️ **The falsification numbers below are from deterministic _synthetic_ fixtures**\n  (`MockPolicy`), **not** recorded real-robot traces. They show the *metric* is\n  correct, not that any specific real policy behaves a certain way.\n- ⏳ **Empirical real-trace validation is deferred to v0.1.1.** Reproducing\n  high-success-then-collapse on a *recorded real trace* needs either live OpenVLA\n  inference (GPU) or a permissively-licensed graded-perturbation benchmark\n  (LIBERO-plus ships no license, so it is not a dependency — see NOTICE).\n\nNo real-policy claim is made or implied by the a1 numbers.\n\n## The one claim (falsifiable)\n\n\u003e For VLA policies that expose token-level confidence (Tier-A, e.g. OpenVLA),\n\u003e a well-calibrated policy earns a **high** Trust-Shift score and a\n\u003e confidently-wrong policy earns a **low** one.\n\nFalsification test (run in CI, on synthetic fixtures): a confidently-collapsing\npolicy **must** score below a gracefully-degrading one, and an abstention-enabled\nvariant **must** out-score its abstention-disabled twin. Measured\n(`bench_results/`, synthetic, seed 0):\n\n| quantity (synthetic fixtures) | α=0.1 | α=0.25 |\n|---|---|---|\n| gracefully-degrading Trust-Shift | **0.749** | **0.795** |\n| confidently-collapsing Trust-Shift | **0.496** | **0.503** |\n| abstention on vs. off | 0.749 / 0.749 (no-op¹) | 0.795 / 0.749 |\n| every-trajectory-physically-invalid Trust-Shift | **0.000** (hard gate) | |\n\n¹ At α=0.1 the gate correctly abstains on nothing here: the worst stratum's\nmiscoverage exceeds the 10% budget, so the conformal threshold admits all — the\nhonest, designed behaviour, not a bug. The gate engages at α=0.25.\n\nReal-math anchors that need no policy (same record): the Tier-A token-entropy\nextractor gives confidence ≈1.0 on a peaked action-token distribution and\n≈1/256 on a uniform one; split-conformal marginal coverage is **0.9004** over 300\nsplits against a 0.90 nominal target.\n\n## Why calibration, not just perturbation\n\nPerturbation-robustness benchmarking for VLAs already exists (e.g. RobustVLA /\nLIBERO-plus), and dataset-quality heuristics (e.g. LeRobot episode scorers) score\n*data*, not *trust*. `vlatrust`'s layer is the **cross-model\ncalibration-under-shift** measurement on top: conformal coverage, a reliability\ngap (reference-vs-target, calibrated-vs-actual), and a fail-closed\nout-of-distribution action gate — as one harness over recorded traces.\n\n## Tiers (what confidence means per policy family)\n\n| Tier | Policy family | Confidence source | Trust-Shift claim |\n|------|---------------|-------------------|-------------------|\n| A | token/autoregressive (OpenVLA) | token entropy (native) | full claim |\n| B | flow-matching (π0, SmolVLA, GR00T) | sampling variance (opt-in, GPU) | NON-claim (v0.1.1) |\n| — | no exposable confidence | `ConfidenceSource.NONE` | abstention axis returns `N/A` (fail-closed) |\n\n`vlatrust doctor` reports which backends are live vs. mock vs. unavailable on\nyour machine, so a mock run is never mistaken for a live one.\n\n## Install\n\n```bash\npip install vlatrust                 # core: recorded-trace path, numpy only\npip install \"vlatrust[openvla]\"      # Tier-A token-confidence backend (torch)\npip install \"vlatrust[lerobot]\"      # ingest LeRobot datasets (pyarrow)\n```\n\n(PyPI publication is part of v0.1.1; install from the GitHub release/source for a1.)\n\n## Quickstart\n\n```bash\nvlatrust doctor                      # which backends are live vs. mock\nvlatrust score   --mock --html report.html   # synthetic demo -\u003e scorecard + HTML\nvlatrust score   \u003ctrace.json\u003e        # full scorecard (JSON + self-contained HTML)\nvlatrust calibrate \u003ctrace.json\u003e      # calibration report only\n```\n\n```text\n$ vlatrust score --mock              # illustrative (deterministic synthetic input)\n** Input is a deterministic MOCK TraceSet — illustrative, not an empirical measurement.\nTrust-Shift: 0.749  (source=token_entropy, physically_valid=True)\n  tracking=0.970\n  inverse_brier=0.867\n  retained_reliability=0.410\n  hard_valid_factor=1.000\n  abstention_gate=on\n```\n\n## How it works\n\n### Trust-Shift score composition\n\nThe headline score is composed multiplicatively through four axes:\n\n```\nTrust-Shift = hard_valid_factor × blend(tracking, calibration, retained_reliability)\n```\n\n- **hard_valid_factor** (`h`): fraction of trajectories with physically valid\n  actions (joint limits, velocity caps). A policy commanding invalid actions\n  is pulled toward 0 regardless of confidence.\n- **tracking** (`T`): `1 - mean_τ |confidence(τ) - success_rate(τ)|` across\n  perturbation intensities. This is the core claim — confidence must fall in\n  step with success as shift rises.\n- **calibration** (`C`): inverse-Brier score of confidence vs. success.\n- **retained_reliability** (`R`): success rate among trajectories the conformal\n  abstention gate accepts. A useful gate raises `R` above the accept-all baseline.\n\nIf `ConfidenceSource.NONE` the score is `None` — never fabricated (fail-closed).\n\n### Perturbation injector\n\n14 post-hoc perturbation dimensions span 6 modalities:\n\n| Modality | Example dims |\n|----------|-------------|\n| `language` | instruction rephrasing, negation |\n| `init_state` | object pose jitter |\n| `sensor_noise` | brightness shift, gaussian noise, salt-pepper |\n| `dynamics` | latency shift, step dropout |\n| `camera` | viewpoint shift |\n| `actuation` | action delay, scale |\n\nPerturbations are applied post-hoc to recorded traces; no simulator is needed\nfor the CPU-only core path.\n\n### Conformal abstention gate\n\nSplit conformal prediction (+ Mondrian stratification by modality + weighted\nvariant) computes a finite-sample-valid nonconformity threshold at coverage level\n`1-α`. Steps whose nonconformity exceeds the threshold are abstained; the OOD\ngate is fail-closed (unknown → ABSTAIN).\n\n## Scope\n\n- **In (v0.1.0a1):** TraceSet core; 14 post-hoc perturbation dims (own injector);\n  `ConfidenceSource` enum + OpenVLA Tier-A confidence extractor; spike-preserving\n  sequence-level conformal (+ Mondrian, + weighted); reliability gap (Δ_succ,\n  Δ_cov); fail-closed OOD action gate; collapse curve + fragility; PAVA /\n  inverse-Brier / multiplicative gate / bootstrap CI; self-contained HTML\n  scorecard; CPU-only tests.\n- **Deferred (v0.1.1):** real-trace empirical validation; renderer-heavy 3\n  perturbation dims (GPU); flow-matching sampling-variance adapter (Tier-B); live\n  OpenVLA / π0 / GR00T inference; sim integration; PyPI.\n- **Deferred (v0.2):** sim→real gap on real-robot traces; GR00T backend;\n  evolutionary score-hardening.\n\n## License\n\nMIT. See [LICENSE](LICENSE). vlatrust bundles no\nthird-party model code or weights, and does **not** depend on LIBERO-plus\n(which ships no license).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fvlatrust","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhinanohart%2Fvlatrust","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhinanohart%2Fvlatrust/lists"}