{"id":51263961,"url":"https://github.com/kaushikb11/robocurate","last_synced_at":"2026-06-29T14:02:21.144Z","repository":{"id":367299896,"uuid":"1280034161","full_name":"kaushikb11/robocurate","owner":"kaushikb11","description":"LeRobotDataset-native data curation for robot learning — find the trajectories hurting your policy and get a cleaner subset back. Deterministic, multi-source, honest.","archived":false,"fork":false,"pushed_at":"2026-06-25T11:43:51.000Z","size":894,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-25T12:21:11.411Z","etag":null,"topics":["data-curation","embodied-ai","imitation-learning","lerobot","machine-learning","robot-learning","robotics"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kaushikb11.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-25T07:58:59.000Z","updated_at":"2026-06-25T11:43:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kaushikb11/robocurate","commit_stats":null,"previous_names":["kaushikb11/robocurate"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/kaushikb11/robocurate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaushikb11%2Frobocurate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaushikb11%2Frobocurate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaushikb11%2Frobocurate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaushikb11%2Frobocurate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kaushikb11","download_url":"https://codeload.github.com/kaushikb11/robocurate/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaushikb11%2Frobocurate/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34929703,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-curation","embodied-ai","imitation-learning","lerobot","machine-learning","robot-learning","robotics"],"created_at":"2026-06-29T14:02:17.794Z","updated_at":"2026-06-29T14:02:21.135Z","avatar_url":"https://github.com/kaushikb11.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RoboCurate\n\n![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)\n![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)\n![Status: pre-alpha](https://img.shields.io/badge/status-pre--alpha-orange)\n[![CI](https://github.com/kaushikb11/robocurate/actions/workflows/ci.yml/badge.svg)](https://github.com/kaushikb11/robocurate/actions/workflows/ci.yml)\n\n\u003e Point it at any robot dataset and it tells you which trajectories are hurting your\n\u003e policy, and hands you back the clean subset that trains a better one.\n\nRoboCurate is a data-curation framework for robot-learning / embodied-AI datasets. It is\n[LeRobotDataset](https://github.com/huggingface/lerobot)-native — it reads and writes the\nLeRobotDataset format so you incur near-zero switching cost — and it curates **both**\nreal/teleop data (Open X, DROID, LeRobot Hub) **and** simulation-generated data\n(ManiSkill3, RoboCasa, RoboTwin).\n\n\u003e **Status: pre-alpha.** The framework is built and validated the way data-centric tools earn trust\n\u003e — faithful multi-source I/O (incl. real LeRobot v3), deterministic + reproducible curation,\n\u003e known-answer corruption recovery, a trivial equal-N fair comparison, and honest reporting — *not*\n\u003e by claiming our own signals are state-of-the-art. A trustworthy downstream rollout gate (and a\n\u003e real influence signal) is the next milestone; see [`docs/ROADMAP.md`](docs/ROADMAP.md). We're just\n\u003e getting started — here is an honest map of what's real today versus what's still ahead.\n\n### Where this is going\n\nRoboCurate is built as a 4-rung ladder, climbed in order — each rung earned only after the one\nbelow it is real:\n\n1. **Curation core** *(now, built)* — point at any robot dataset, get the clean subset + a manifest.\n2. **Influence flagship + an open \"DataComp-for-robotics\" benchmark** *(next)* — a real\n   policy-impact signal, and the open leaderboard the field is asking for.\n3. **Verify the generated** *(later)* — a calibrated, physics-aware checker for the\n   simulation-generated data that every generator ships without quality control.\n4. **An open data-engine harness** *(horizon)* — reproducible generate → verify → curate → retrain.\n\nThe full strategy, the honest competitive picture, and where we're weak today are in\n[`docs/ROADMAP.md`](docs/ROADMAP.md).\n\n### What you can run today on a laptop (no GPU)\n\n- **Frozen core abstractions** — canonical trajectory, `Signal` protocol, adapters,\n  curator, scorecard. See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).\n- **Twelve quality signals** (Tier 0→2): jerk, action-noise, path-efficiency (directness),\n  spectral-smoothness (SPARC — spectral arc length), redundancy, structural-validity\n  (truncation / stall / non-finite — the *structural* defects the geometric signals miss),\n  sim physics-validity, a Demo-SCORE-inspired classifier, CUPID-inspired proxy-influence, and\n  three CPU image-quality signals — image-blur (variance-of-Laplacian sharpness), visual-stall\n  (a frozen camera), and visual-diversity (image-space near-duplicate detection). The cheap\n  heuristic signals need only NumPy + PyArrow; the learned two live behind extras, and the\n  image trio behind the `video` extra (PyAV, CPU-only decode).\n- **Honest self-checks you can run** — a known-answer corruption study (we inject defects we\n  control and report each signal's blind spots — e.g. directness/smoothness *invert* on a\n  truncated demo, which `structural-validity` then catches) and a sim-free held-out\n  behavior-cloning-loss evaluator (a CPU-only downstream comparison of curated vs equal-N and\n  length-matched random subsets — an independent cross-check of the GPU rollout gate).\n- **Dataset adapters**: LeRobotDataset — **v3.0 read+write** (the current Hub default; low-dim\n  features, version auto-detected, validated on a real Hub dataset) **and v2.1 read+write** — so\n  curating a v3 dataset emits a v3 dataset. Plus RLDS / Open X-Embodiment, ManiSkill demonstrations,\n  robomimic, and configurable **generic HDF5 and Zarr** readers (`GenericHDF5Reader` / `ZarrReader`\n  + a shared schema) that curate any one-group-per-episode HDF5/Zarr dataset. (v3 video-frame decode\n  is a follow-up; low-dim curation needs only pyarrow.)\n- **Curator + CLI**: target-budget selection (three modes, see below), equal-N random baseline,\n  hard validity-gate. CLI `curate` / `report` / `diff`, plus `list-signals` (every loadable signal\n  and its install extra), `validate` (alias `doctor` — a read-only dataset health check),\n  `profile` (a dataset EDA report: length/feature distributions, task balance, a diversity\n  estimate), `inspect` (one episode's per-signal values + per-transition trace), `explain` (why an\n  episode was kept/removed from a saved manifest), `compare` (diff two curation runs — kept-set\n  overlap + flips), `verify` (re-run a manifest and prove byte-identical decisions), and the\n  `benchmark` group (the open \"DataComp-for-robotics\" v0).\n- **Shareable, reproducible curation runs.** Every `curate` write emits a provenance manifest\n  (what was removed and why, the equal-N baseline, the config + seed + code version) and, by\n  default, a Hugging Face `README.md` dataset card summarizing the run (`--no-card` to skip).\n  `--save-recipe`/`--recipe` round-trip the full config as a JSON *recipe* so anyone can\n  reproduce byte-identical decisions; `--report-html` writes a self-contained HTML scorecard;\n  and `--push-to-hub \u003crepo_id\u003e` optionally publishes the curated **output** to the HF Hub after\n  the local write is validated (reads only the output, never the source; needs the `lerobot`\n  extra). v3 image/video frame data is preserved through curation (Stage-1 pass-through).\n- **We test our own signals and honestly report their blind spots.** Two scripts you can run\n  on real and synthetic data, framed as methodology rather than as headline numbers:\n  - `experiments/robomimic_scorecard.py` — a **ground-truth diagnostic** on robomimic\n    Multi-Human teleop: how well does each cheap signal track the dataset's operator-skill\n    labels, against an equal-size random baseline? It is *orientation-aware* (respects each\n    signal's `higher_is_better`) and reports every signal warts-and-all — including where one\n    is flat or where its keep-direction is *backwards* on this data. It also runs a confound\n    probe (\"are we just keeping short episodes?\").\n  - `experiments/corruption_recovery.py` — a **known-answer test**: inject known-bad\n    trajectories into a synthetic dataset and check the signals recover them. This is where we\n    surface honest blind spots — e.g. directness and smoothness can invert on *truncated*\n    demos and risk discarding rare recovery/corrective trajectories.\n- **Honest caveat (a strength, not a footnote):** label-AUC is a *diagnostic*, not\n  validation. Recovering operator labels need not mean better curation — CUPID found, on\n  robomimic, that perceived quality can diverge from what maximizes policy success. The only\n  real proof is the downstream gate, and we have **not** passed it yet (see below).\n\n### Validated as machinery (synthetic data)\n\n- **GPU pipeline on Modal** — curate → train → eval runs end-to-end. This is a *harness\n  sanity check on a synthetic 16-demo `identity_synthetic` dataset*, confirming the plumbing\n  works; it is **not** a real-data curation result and we report no metric from it. See\n  `experiments/modal_app.py`.\n- **Open-benchmark v0 scaffolding** (`robocurate benchmark`, \"DataComp-for-robotics\") — the data\n  is the submission: a frozen pool + fixed held-out eval split + fixed BC config; a submission is\n  a selection (recipe or index-set), scored by held-out BC loss vs an equal-N random control. This\n  is *scaffolding + a runnable synthetic proof* on a **proxy** metric (a documented coverage bias\n  toward the random control), **not** the field's adopted benchmark; the real pool + an unbiased\n  rollout-success metric + a public leaderboard are the funded next step. See\n  [`docs/BENCHMARK.md`](docs/BENCHMARK.md) and `examples/benchmark_identity.py`.\n\n### Pending — the real downstream gate (honestly not passed yet)\n\n- **Downstream rollout validation is a Rung-2 capability, not a v1 claim.**\n  `experiments/robomimic_bc_validation_modal.py` curates robomimic MH by a signal, trains a BC\n  policy on Modal, and compares rollout success against **two** random controls — equal-N and\n  length-matched — with paired CIs. The *pipeline* runs end-to-end, but the harness does **not yet\n  reproduce robomimic's published BC numbers** (Can 0.36 vs 0.86, gap grows with task difficulty —\n  a robosuite-1.5 / v1.5-dataset issue, not curation), so a curated-vs-random rollout delta would be\n  measured on an untrustworthy instrument. We therefore do **not** report one. Making the rollout\n  harness trustworthy (reproduce a published baseline, then a published *method* like CUPID/DataMIL\n  inside it) is Rung-2 work — see [`docs/ROADMAP.md`](docs/ROADMAP.md). Our cheap signals' downstream\n  efficacy is an honest open question; the CPU held-out-loss proxy already suggests they don't beat\n  random — reported as a finding, not hidden. Platform note: robosuite/MuJoCo state-eval runs fine on\n  Modal's gVisor (no Vulkan needed), unlike ManiSkill below.\n\n### Blocked\n\n- **ManiSkill3 sim-environment rollouts.** The integration (env, demo reader, image recipe) is\n  code-complete but **never executed**: it is blocked by Modal's gVisor/Vulkan sandbox\n  (diagnosis in `experiments/maniskill_modal.py`) and needs a non-gVisor GPU host (RunPod /\n  Lambda / bare metal).\n\n## Install\n\n```bash\nuv sync\n```\n\nThe core installs clean on a laptop with no GPU — the cheap Tier-0 signals (jerk, action-noise,\npath-efficiency, spectral-smoothness, redundancy, sim physics-validity) need only NumPy +\nPyArrow. Learned signals and optional tooling live behind extras:\n\n```bash\nuv sync --extra demo-score   # Demo-SCORE-inspired classifier (torch, CPU-ok)\nuv sync --extra influence    # CUPID-inspired proxy-influence signal (torch)\nuv sync --extra policy       # the behavior-cloning policy for the experiment harness (torch)\nuv sync --extra rlds         # read Open X-Embodiment / DROID RLDS datasets (tensorflow-datasets)\nuv sync --extra viz          # scorecard plots (matplotlib): per-signal distributions,\n                             # kept-vs-removed, per-signal values by operator tier\nuv sync --all-extras         # everything\n```\n\nA learned signal is always discoverable by name; if its extra isn't installed, requesting it\nreturns a clear message telling you which extra to install — it never breaks the cheap\nsignals. The RLDS reader itself is TF-free (the `rlds` extra is only needed to *load* real\ndatasets via `RLDSReader.from_tfds`).\n\n## Quickstart (target shape)\n\n```python\nfrom robocurate import Dataset, Curator, Budget, signals\n\nds = Dataset.from_lerobot(\"./aloha_sim_insertion\")            # local LeRobotDataset dir\nresult = Curator([signals.Jerk()], budget=Budget.fraction(0.8)).run(ds)\nresult.save(\"./aloha_curated\")            # new dataset + manifest; source untouched\nprint(result.scorecard().to_markdown())   # what was removed and why, + equal-N baseline\n```\n\nOr from the CLI:\n\n```bash\nrobocurate curate ./aloha_sim_insertion --out ./aloha_curated --signals jerk --budget 0.8\n```\n\n### Selection modes\n\nThe curator turns per-trajectory keep-scores into a kept set under a budget via one of three\nmodes (`--selection`, or `selection=SelectionMode.…` in the API). All three keep exactly the\nbudgeted `k`, and the equal-N random baseline is always drawn from the same valid pool with the\nsame `k`, so the curated-vs-random comparison is fair regardless of mode (Invariant 5).\n\n- **`top_k`** (default) — keep the highest keep-scoring trajectories. Simple and fastest;\n  ignores diversity, so a high-scoring majority cluster can crowd out everything else.\n- **`greedy_dedup`** — keep one representative per near-duplicate cluster (the highest-scoring\n  member), collapsing redundant bloat that top-K cannot. Tuned by `dedup_epsilon`.\n- **`coverage`** — greedy submodular **facility-location** over the embedding distribution: keep\n  a representative, *diverse* subset that best covers the whole distribution. This preserves\n  rare-but-valid modes (recovery/corrective demos, uncommon object poses) that top-K would\n  discard in favour of the dense majority. CPU-only, reuses the same statistical embedding as\n  dedup; `--coverage-quality-weight` tilts the objective from pure diversity toward keep-score.\n\n## Guarantees\n\n- **Source data is read-only.** Curation emits a *new* dataset plus a manifest describing\n  what was removed and why. There is no code path that writes back to the source.\n- **No silent data corruption.** Every write is validated against the LeRobotDataset\n  schema and checksummed; a curated dataset that fails round-trip reload is a hard\n  failure.\n- **Deterministic outputs.** Same input + config + seed produces byte-identical selection\n  decisions.\n- **Honest reporting.** Scorecards report effect sizes and uncertainty, never a single\n  cherry-picked number, and always explain *why* a trajectory was removed.\n\nSee [`CONTRIBUTING.md`](CONTRIBUTING.md) for the full project invariants.\n\n## Get involved\n\nRoboCurate is open and early. We're looking for **compute / GPU sponsorship** (to close the Rung-2\ndownstream gate), **real robot datasets** to validate curation on, **research collaboration** on the\ninfluence flagship + the open benchmark, and **adoption + feedback**. If any of these fit your lab or\nteam, open a [GitHub issue or discussion](https://github.com/kaushikb11/robocurate/issues) — the\ndetail is in [`docs/ROADMAP.md`](docs/ROADMAP.md#8-get-involved).\n\n## License\n\nApache-2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaushikb11%2Frobocurate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkaushikb11%2Frobocurate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaushikb11%2Frobocurate/lists"}