{"id":51076945,"url":"https://github.com/zozo123/genomics-sandboxes","last_synced_at":"2026-06-23T15:01:58.038Z","repository":{"id":364787080,"uuid":"1269189030","full_name":"zozo123/genomics-sandboxes","owner":"zozo123","description":"Bring Your Own Genome — map-reduce the human genome on disposable islo.dev sandboxes. Real demo: snapshot a warm reference, fork per chromosome, reduce. Live: https://zozo123.github.io/genomics-sandboxes/","archived":false,"fork":false,"pushed_at":"2026-06-14T13:58:09.000Z","size":1094,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-14T15:17:48.554Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zozo123.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-14T12:10:41.000Z","updated_at":"2026-06-14T13:58:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zozo123/genomics-sandboxes","commit_stats":null,"previous_names":["zozo123/genomics-sandboxes"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/zozo123/genomics-sandboxes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zozo123%2Fgenomics-sandboxes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zozo123%2Fgenomics-sandboxes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zozo123%2Fgenomics-sandboxes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zozo123%2Fgenomics-sandboxes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zozo123","download_url":"https://codeload.github.com/zozo123/genomics-sandboxes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zozo123%2Fgenomics-sandboxes/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34694786,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-23T15:01:57.357Z","updated_at":"2026-06-23T15:01:58.030Z","avatar_url":"https://github.com/zozo123.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Reference broadcast by VM snapshot\n\n**Copy-on-write fan-out for genomic scatter-gather — a real, genome-wide, reproducible demo on [islo.dev](https://islo.dev), orchestrated end-to-end by a Claude Code agent.**\n\n🔗 **Live:** https://zozo123.github.io/genomics-sandboxes/\n\n![Warm one box. Fork the genome.](./og.png)\n\n---\n\nSequencing collapsed to ~$100 a genome ([Ultima UG 100](https://www.statnews.com/2024/01/30/ultima-genomics-dna-sequencing-100-dollars/), 2024; ~$80 in 2025).\nReading DNA isn't the cost anymore — computing on it is, and the expensive part is the\n*read-only* state every pipeline shares: the reference FASTA (~3 GB) and its indices (BWA ~5 GB,\nSTAR 27–30 GB, plus GATK bundles) — 8–40 GB of immutable bytes. Best-practice scatter-gather\nworkflows ([nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/), GATK Best Practices) fan a run\nacross 50–1000 shards and re-localize that identical reference to every one of them.\n\n**The one idea: a VM snapshot is the reference broadcast.** Warm one box (open the reference,\nload the indices, page them resident), snapshot the initialized address space, and fork it\ncopy-on-write per shard — the same mechanism Firecracker and Lambda SnapStart use. The (N+1)th\nshard costs a page-table setup, not another multi-GB read, and every fork is byte-identical\nbecause they map the same physical pages.\n\nThe fan-out is validated against a **positive control**, not a discovery: per-chromosome\nCpG-island density is known to track gene density, so a correct map-reduce *must* return the\ngene-dense chromosomes on top (chr19, chr17, chr22) and the gene-poor ones at the bottom (chr4,\nchr3, chr13). Genome-wide, it does. Recovering a known gradient from an independently written\nkernel is evidence the sharding, compute, and reduce are all wired correctly — exactly the check\nyou want when the pipeline is model-written.\n\n**The harness is a Claude Code agent.** The whole pipeline — warm → snapshot → fork(24) → reduce\n→ teardown — is driven end-to-end by an agent running [`crabbox.sh`](./crabbox.sh) / the islo CLI;\nin the run below a Claude Code sub-agent orchestrated the entire 24-way fan-out. No scheduler, no\ncluster account. Genome-wide receipts (all 24 GRCh38 chromosomes, ~3.09 Gb): warm base built once\nin **88.6 s**, snapshot **962 MB** in **10.3 s**, 24-way fan-out in **~60 s** wall-clock; staging\nthe reference per worker instead would be ≈13 min serially (**~13× on snapshot reuse**, **862 MB**\nof redundant reference transfer avoided). Every number on the site is fetched live from\n[`data/receipts.json`](./data/receipts.json).\n\n## What it computes\n\nPer human chromosome (GRCh38), in 1 Mb bins:\n\n- **GC %** and the GC landscape (isochores)\n- **CpG observed/expected** ratio\n- **CpG-island candidates** — Gardiner-Garden \u0026 Frommer (1987): 200 bp window, GC \u003e 50 %, obs/exp \u003e 0.6\n- assembly-gap (N) fraction\n\nCpG islands aren't trivia: methylation at CpG sites is the switch behind epigenetic age clocks,\ncancer screens (promoter hypermethylation), and cell identity — the same signal consumer\nepigenetic tests are built on.\n\n## The pattern\n\n```\nwarm one box ──▶ snapshot it ──▶ fork per chromosome ──▶ reduce\n (toolchain +     (the read-only   (MAP: each shard       (merge per-shard\n  reference +      base, broadcast   restores warm,         JSON → genome-wide\n  index, once)     to every worker)  just computes)         landscape, delete boxes)\n```\n\nFour verbs of the islo CLI, genome-wide:\n\n```bash\n# 1 · warm base: toolchain + all 24 GRCh38 chromosomes + index (paid once)\nislo use gx-warm -- bash -lc './warmup.sh chr1 chr2 ... chr22 chrX chrY'\n\n# 2 · broadcast: freeze the warm box to a snapshot\nislo snapshot save gx-warm --name genomics-wg\n\n# 3 · MAP: fork one warm box per chromosome (waves of 8)\nfor chr in chr1 ... chrY; do\n  islo use gx-$chr --snapshot genomics-wg -- python3 compute.py $chr \u0026\ndone; wait\n\n# 4 · REDUCE: merge per-shard JSON, then delete the boxes\n```\n\nThe whole thing is `./crabbox.sh run`, driven end-to-end by a **Claude Code agent** as the harness.\n\n## Real receipts — genome-wide (this run)\n\n| | |\n|---|---|\n| Chromosomes (shards) | 24 — chr1–chr22, chrX, chrY |\n| Bases scanned | 3,088,269,832 (~3.09 Gb) |\n| CpG sites | 29,401,360 |\n| CpG-island candidates | 264,816 |\n| Warm base built (once) | 88.6 s |\n| Snapshot | 962 MB, saved in 10.3 s |\n| Warm 24-way fan-out | ~60 s wall-clock (3 waves of 8) |\n| Cold-equivalent serial staging | ~13 min |\n| Re-run speedup (snapshot reuse) | ~13× |\n| Redundant reference downloads avoided | ~862 MB |\n| Orchestrator | a Claude Code sub-agent |\n\nRaw numbers behind every figure: [`data/receipts.json`](./data/receipts.json) (the site fetches\nit live — nothing is hardcoded). Per-shard outputs are in `data/wg_warm_*.json`.\n\n### The free correctness check\n\nThe fan-out recovers a known biological fact, genome-wide: **CpG-island density tracks gene density.**\n\n| chromosome | islands / sequenced Mb | |\n|---|---|---|\n| chr19 | **287.6** | densest in the genome |\n| chr17 | 178.8 | |\n| chr22 | 174.0 | gene-rich |\n| chr16 | 147.8 | |\n| … | … | (24 chromosomes, sorted) |\n| chr13 | 68.5 | |\n| chr3  | 63.2 | gene-poor |\n| chr4  | **62.4** | sparsest |\n\nThe gene-dense chromosomes (chr19/17/22) sort to the top and the gene-poor ones (chr4/3/13) to\nthe bottom — across all 24, from an independently written kernel. If the map-reduce were wrong,\nthe gradient would be wrong. It isn't.\n\n## Reproduce\n\n```bash\n# islo CLI + login required (https://islo.dev)\n./crabbox.sh run                   # genome-wide: warm → snapshot → fork(24) → reduce → data/\n#   (or the 5-chromosome quick version: bash scripts/run_demo.sh)\npython3 -m http.server 8799        # then open http://localhost:8799\n```\n\n| File | Purpose |\n|------|---------|\n| `crabbox.sh` | genome-wide harness (warm → snapshot → 24-way fan-out → reduce); run by a Claude Code agent |\n| `index.html` / `styles.css` / `script.js` | the interactive explainer (vanilla, no build, fetches `data/*.json`) |\n| `scripts/compute.py` | the MAP kernel — one chromosome → JSON (numpy, memory-frugal) |\n| `scripts/reduce_wg.py` | genome-wide reduce → `data/receipts.json` + `data/landscape.json` |\n| `scripts/warmup.sh` | the warm-up that gets snapshotted (toolchain + reference + index) |\n| `scripts/run_demo.sh` | host orchestrator: warm → snapshot → cold/warm fan-out → reduce |\n| `data/` | measured receipts + reduced landscape + raw per-shard outputs |\n| `og-card.html` | self-contained 1200×630 social card (rendered to `og.png`) |\n\n## Caveats (read these)\n\n**Not medical advice.** This computes sequence statistics on the *public* human reference. It is\nnot a clinical test, not a diagnosis, and says nothing about any individual. CpG-island counts\nare candidate calls (Gardiner-Garden \u0026 Frommer 1987; Takai \u0026 Jones 2002 tightened the rule), not\ncurated annotations like ENCODE's cCRE Registry (Moore et al., *Nature* 2020).\n\nThe snapshot/fork-for-startup mechanism is **standard systems infrastructure** (CRIU, Firecracker\nCOW restore, AWS Lambda SnapStart) — pointed here at a heavy, read-only genomics reference\nfan-out. Nothing about the mechanism is claimed as new. At this toy scale the first-run speedup\nis modest; the real payoff is **amortization** (re-runs pay only the map wall-clock) and\n**byte-identical reproducibility** across shards. Scale to a 3 GB reference + BWA indices across a\ncohort and the snapshot becomes the only sane way to do it.\n\n## Related\n\n- [The Sandbox Shift](https://zozo123.github.io/sandboxes-why-how-when/)\n- [The Living Layer](https://zozo123.github.io/the-living-layer/)\n- [Databases in the AI Era](https://zozo123.github.io/databases-in-the-ai-era/)\n\nIdeas owed to Dean \u0026 Ghemawat (MapReduce, 2004) and the ENCODE Consortium. Reference: GRCh38\n(UCSC goldenPath). By [Yossi Eliaz](https://www.linkedin.com/in/yossi-eliaz), 2026.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzozo123%2Fgenomics-sandboxes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzozo123%2Fgenomics-sandboxes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzozo123%2Fgenomics-sandboxes/lists"}