https://github.com/hinanohart/tracecal
Conformal-calibrated, URDF physics-gated validity & abstention auditing for LeRobot robot-learning datasets (GPU-free, Apache-2.0)
- Host: GitHub
- URL: https://github.com/hinanohart/tracecal
- Owner: hinanohart
- License: mit
- Created: 2026-05-27T17:40:37.000Z (22 days ago)
- Default Branch: main
- Last Pushed: 2026-06-10T13:37:35.000Z (8 days ago)
- Last Synced: 2026-06-11T06:31:32.434Z (7 days ago)
- Topics: abstention, calibration, conformal-prediction, dataset-quality, lerobot, physics-validation, robot-learning, urdf
- Language: Python
- Homepage: https://github.com/hinanohart/tracecal
- Size: 522 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
# tracecal
**Conformal-calibrated, URDF physics-gated validity & abstention auditing for LeRobot datasets.**
`tracecal` audits a [LeRobot](https://github.com/huggingface/lerobot) robot-learning dataset and
returns, per episode, a calibrated verdict — **accept / hold / reject** — built on three things no
existing LeRobot data-quality tool combines:
- **Multiplicative URDF joint-limit physics gate.** For industrial arms (Franka Panda, KUKA iiwa,
and any arm you point at a plain URDF) an episode that drives a joint past its hard
position/velocity limit is *kinematically impossible* — its score is forced to `Q = 0`
irrespective of how clean the video or smooth the motion looks.
- **Distribution-free conformal abstention.** When validity cannot be called confidently the
episode is held back (`abstain`) rather than guessed.
- **Degrade-first-class honesty.** Embodiments with no resolvable URDF (SO-101, Koch, LeKiwi —
the majority of Hub datasets by count) are reported in a physics-skipped `hold` state with
`coverage = None`; they are never silently treated as validated.
This is deliberately **not** a CV/heuristic quality scorer like
[`score_lerobot_episodes`](https://github.com/RoboticsData/score_lerobot_episodes) (blur, motion
smoothness, optional VLM checks) or a motion-consistency filter — those are complementary.
tracecal's value is a *physically grounded, distribution-free* validity gate with explicit
abstention. GPU-free, torch-free at runtime, MIT.
> **Scope (honest).** The physics-gate claim targets **industrial-arm** datasets — fewer in number
> on the Hub but the largest by episode volume (professional labs). Hobbyist arms run in degrade
> mode. See *Measured results* for exactly what is and isn't demonstrated in v0.1.0a1.
**v0.1.0a1 — pre-alpha.** The validated claim is the physics gate; conformal coverage is a
diagnostic that becomes a guarantee only under the label precondition below.
---
## Install
```bash
pip install tracecal # core (numpy / scipy / pydantic)
pip install "tracecal[physics]" # + URDF joint-limit gate (yourdfpy, robot_descriptions)
pip install "tracecal[hub]" # + load real LeRobotDataset v3 from the Hub (pyarrow)
```
## Quickstart
```python
from tracecal import evaluate_dataset
report = evaluate_dataset("lerobot/pusht", confidence=0.9) # local dir or HF repo id
print(report.n_accept, report.n_hold, report.n_reject, report.n_degraded)
```
```bash
tracecal run path/to/dataset --confidence 0.9 --format html -o report.html
tracecal selftest # self-contained physics-gate check (no network/GPU)
tracecal list-embodiments # which robot_types resolve to a URDF vs degrade
```
Use it as a CI gate via the pytest plugin:
```python
def test_dataset_is_clean(tracecal_audit):
    report = tracecal_audit("path/or/repo_id", confidence=0.9)
    tracecal_audit.assert_coverage_holds(report)  # no-op in reference-mode; fails on a breach
## How it works
Each episode passes through four sequential stages and ends in a verdict:
```
load v3 episodes ─▶ resolve embodiment URDF ─▶ kinematic gate   ─▶ conformal / reference-mode ─▶ verdict
(pyarrow)           (robot_descriptions)       (Q=hard·quality)    (split/Mondrian, abstain)     accept/hold/reject
```
**Verdicts:**
- `reject` — a hard kinematic violation forced `Q = 0`. The gate is multiplicative:
`Q = hard_valid(episode) * quality(episode)` where `hard_valid ∈ {0, 1}`.
- `hold` — degrade-mode (no URDF resolved) or non-confident (non-singleton conformal prediction
set / reference-mode outlier).
- `accept` — hard-valid and confidently called valid.
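The verdict rules above can be illustrated with a minimal, standard-library sketch. It is not tracecal's implementation: the function names (`hard_valid`, `gated_score`, `verdict`), the list-of-lists trajectory layout, and the finite-difference velocity estimate are all assumptions made for illustration. The real gate lives in `tracecal.physics.gate` and reads limits from a URDF.

```python
from typing import Optional, Sequence

Trajectory = Sequence[Sequence[float]]  # one row of joint positions per timestep


def hard_valid(traj: Trajectory, lower, upper, vel_limit, dt: float) -> int:
    """Return 1 if every sample respects position and velocity limits, else 0."""
    for t, q in enumerate(traj):
        for j, angle in enumerate(q):
            if not lower[j] <= angle <= upper[j]:
                return 0  # hard position limit exceeded
            if t > 0 and abs(angle - traj[t - 1][j]) / dt > vel_limit[j]:
                return 0  # finite-difference velocity exceeds the limit
    return 1


def gated_score(traj: Trajectory, quality: float, lower, upper, vel_limit, dt: float) -> float:
    """Multiplicative gate: Q = hard_valid(episode) * quality(episode)."""
    return hard_valid(traj, lower, upper, vel_limit, dt) * quality


def verdict(urdf_resolved: bool, hard: Optional[int], confident: bool) -> str:
    """Map gate and calibration outcomes to accept / hold / reject."""
    if not urdf_resolved:
        return "hold"    # degrade mode: physics skipped, never treated as validated
    if hard == 0:
        return "reject"  # kinematic violation forced Q = 0
    return "accept" if confident else "hold"


# Hypothetical two-joint arm with limits of +/-1.0 rad and 2.0 rad/s, sampled at 10 Hz.
limits = dict(lower=[-1.0, -1.0], upper=[1.0, 1.0], vel_limit=[2.0, 2.0], dt=0.1)
ok = [[0.0, 0.0], [0.1, -0.1], [0.2, -0.2]]
bad = [[0.0, 0.0], [0.1, -0.1], [1.5, -0.2]]  # joint 0 jumps past its upper limit

print(gated_score(ok, 0.8, **limits), gated_score(bad, 0.8, **limits))
```

Because the gate multiplies rather than adds, no quality score, however high, can rescue an episode that violates a hard limit.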
**Key modules:**
| Module | Role |
|-------------------------------|----------------------------------------------------------------------|
| `tracecal.physics.gate` | Multiplicative hard gate; forces `Q = 0` on joint-limit violations |
| `tracecal.conformal.calibrate`| Split-conformal and Mondrian conformal calibration |
| `tracecal.report.card` | Translates calibrated scores + physics results into `accept/hold/reject` |
## Conformal coverage: the label precondition
Conformal coverage is a **validated finite-sample guarantee only when real binary validity labels
are supplied** for calibration. With no such labels tracecal runs in **reference-mode**
(`coverage = None`) and reports the physics gate plus a self-supervised abstention heuristic
without claiming a coverage number. Calibration figures in this repo are synthetic and for
**algorithm validation only**.
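To make the label precondition concrete, here is a generic split-conformal sketch for a binary valid/invalid label. It is textbook split conformal prediction, not code from `tracecal.conformal.calibrate`: the function names, the nonconformity score `1 - p(label)`, and the example numbers are assumptions. The calibration scores must come from episodes with real validity labels, which is exactly what reference-mode lacks.

```python
import math


def conformal_threshold(cal_scores, confidence):
    """Finite-sample split-conformal quantile of calibration nonconformity scores.

    Each calibration score is 1 - p(true label), so true labels are required.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * confidence)  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points: no label can be excluded
    return sorted(cal_scores)[rank - 1]


def prediction_set(p_valid, qhat):
    """Labels whose nonconformity score 1 - p(label) is at most qhat."""
    probs = {"valid": p_valid, "invalid": 1.0 - p_valid}
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


def is_confident(pred_set):
    """Only a singleton set is a confident call; anything else is held back."""
    return len(pred_set) == 1


# Made-up calibration scores for nine labeled episodes.
cal_scores = [0.05, 0.1, 0.15, 0.2, 0.3, 0.3, 0.4, 0.1, 0.05]
qhat = conformal_threshold(cal_scores, confidence=0.8)

for p in (0.95, 0.5):
    s = prediction_set(p, qhat)
    print(p, sorted(s), "confident" if is_confident(s) else "hold")
```

With too few labeled episodes the threshold becomes infinite, every prediction set contains both labels, and every episode is held. The sketch then abstains everywhere instead of reporting coverage it cannot support.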
## Measured results (v0.1.0a1)
All numbers below are generated from `results/*.json` (`python scripts/measure.py`); they are not
hand-written.
| Tier | What | Result |
|------------------------------------------------|---------------------------------------------------------------------------------------------------------------|---------------------------|
| Physics gate (CLAIM) | Adversarial synthetic trajectories on a **real Franka Panda URDF**: fraction of kinematically-invalid episodes forced to `Q = 0` | **1.00** |
| Calibration (synthetic, `algorithm validation only`) | Split-conformal holdout coverage vs target `0.90` over 80 trials | **0.90** empirical |
| Calibration (synthetic) | Isotonic reliability error (ECE) of the calibrated validity score | **0.045** |
| Real data (live) | `lerobot/pusht`, 50 episodes — unmapped robot_type → degrade-first-class (`hold`, `coverage = None`) | degrade demonstrated |
What is **not** claimed in v0.1.0a1: a physics-gate firing on a *real* industrial-arm episode (real
robots stay within their own limits, so the gate is demonstrated with adversarial synthetic
trajectories on a real URDF); cross-embodiment normalization; and a success-probability coverage
guarantee. Live industrial-arm dataset runs are **deferred to v0.1.1**.
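For readers unfamiliar with the ECE figure in the table, the sketch below computes the standard equal-width-bin expected calibration error. It is a generic definition offered for orientation only; whether `scripts/measure.py` uses this binning scheme is not stated here, and the function name and `n_bins` default are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_valid = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(frac_valid - mean_conf)
    return ece


# A perfectly calibrated toy case and a maximally miscalibrated one.
print(expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1]))
print(expected_calibration_error([1.0, 1.0], [0, 0]))
```

An ECE of 0 means predicted validity probabilities match observed frequencies in every bin; larger values mean the score is over- or under-confident.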
## License
MIT. See `LICENSE` and `NOTICE` (third-party / URDF source license matrix). tracecal bundles
no manufacturer URDF, dataset, or model weights.