{"id":50939942,"url":"https://github.com/mandarwagh9/dvd-jepa","last_synced_at":"2026-06-17T13:02:38.974Z","repository":{"id":364554359,"uuid":"1268289071","full_name":"mandarwagh9/dvd-jepa","owner":"mandarwagh9","description":"A tiny, fully-reproducible JEPA world model that learns the physics of a bouncing DVD logo in representation space, dreams its future, and detects anomalies. Trains on a CPU in ~10s. Interactive browser demo.","archived":false,"fork":false,"pushed_at":"2026-06-13T12:42:21.000Z","size":3093,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-13T14:24:42.795Z","etag":null,"topics":["anomaly-detection","deep-learning","i-jepa","interactive-demo","jepa","machine-learning","pytorch","representation-learning","reproducible-research","self-supervised-learning","v-jepa","video-prediction","world-models"],"latest_commit_sha":null,"homepage":"https://dvd-jepa.vercel.app","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mandarwagh9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-13T11:02:30.000Z","updated_at":"2026-06-13T12:42:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mandarwagh9/dvd-jepa","commit_stats":null,"previous_names":["mandarwagh9/dvd-jepa"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/mandarwagh9/dvd-jepa","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fdvd-jepa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fdvd-jepa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fdvd-jepa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fdvd-jepa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mandarwagh9","download_url":"https://codeload.github.com/mandarwagh9/dvd-jepa/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mandarwagh9%2Fdvd-jepa/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34449283,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-17T02:00:05.408Z","response_time":127,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anomaly-detection","deep-learning","i-jepa","interactive-demo","jepa","machine-learning","pytorch","representation-learning","reproducible-research","self-supervised-learning","v-jepa","video-prediction","world-models"],"created_at":"2026-06-17T13:02:38.334Z","updated_at":"2026-06-17T13:02:38.968Z","avatar_url":"https://github.com/mandarwagh9.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# DVD-JEPA\n\n### A tiny, fully-reproducible **Joint-Embedding Predictive Architecture** world model — that learns the physics of a bouncing DVD logo in representation space, dreams its future, and detects anomalies. Trains on a **CPU in ~10 seconds**.\n\n[![Paper (PDF)](https://img.shields.io/badge/📄_paper-PDF-b31b1b)](paper/main.pdf)\n[![Live demo](https://img.shields.io/badge/▶_live_demo-run_in_browser-2bd4ff)](https://dvd-jepa.vercel.app)\n[![HF Space](https://img.shields.io/badge/🤗_Spaces-demo-yellow)](https://huggingface.co/spaces/mandarwagh/dvd-jepa)\n[![Open in Colab](https://img.shields.io/badge/Colab-train_it_yourself-F9AB00?logo=googlecolab\u0026logoColor=white)](https://colab.research.google.com/github/mandarwagh9/dvd-jepa/blob/main/notebooks/dvd_jepa.ipynb)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n![CPU only](https://img.shields.io/badge/hardware-CPU_only-lightgrey)\n\n\u003cimg src=\"assets/dvd_jepa_dream.gif\" width=\"560\" alt=\"Reality vs. the JEPA's rendered latent dream\"/\u003e\n\n*Left: reality. Right: the model's dream — rolled forward purely in latent space and decoded to pixels.*\n\n\u003c/div\u003e\n\n---\n\n## Abstract\n\nMost attempts to learn a **world model** from video try to predict the next frame pixel-by-pixel, and drown in detail that is fundamentally unpredictable. **JEPA** (Joint-Embedding Predictive Architecture, [LeCun 2022](#references)) makes a different bet: predict the *representation* of the future, not the pixels, and let the encoder discard whatever it cannot predict.\n\n**DVD-JEPA** is the smallest honest demonstration of that idea we could build. The \"world\" is a DVD logo bouncing in a 16×16 box. A context encoder, an EMA target encoder, and a latent predictor are trained — with no labels and no decoder — to predict the next observation **in a 32-dimensional representation space**. We then show three things:\n\n1. **It learned the world.** A linear probe recovers the logo's exact (y, x) position from the frozen 32-d latent to within **0.73 px** — though it was never given a coordinate.\n2. **It can dream (once you add a decoder).** Bolt an optional decoder onto the frozen latents and roll the predictor forward: it renders a correct **future-frame video** of the bounce, including wall reflections, for ~20 steps before latent drift sets in.\n3. **It is useful.** Run it as a 1-step predictive monitor and the prediction error becomes an **anomaly signal**: inject a teleport and surprise spikes **88×** over baseline, on the right frame.\n\nThe whole thing runs **client-side in your browser** at [dvd-jepa.vercel.app](https://dvd-jepa.vercel.app) — the trained MLPs are re-implemented in ~40 lines of JavaScript. It is a joke and it is also a correct, working instance of the architecture behind I-JEPA, V-JEPA, and V-JEPA 2.\n\n## 📄 Paper\n\nThere's a full arXiv-style write-up (method, anti-collapse ablation, forecast-horizon curve, anomaly detection, references): **[`paper/main.pdf`](paper/main.pdf)** — also attached to the [latest release](https://github.com/mandarwagh9/dvd-jepa/releases/latest).\n\nThe paper is fully reproducible: [`paper/main.tex`](paper/main.tex) is the LaTeX source and [`paper/figures.py`](paper/figures.py) regenerates every figure and number in it.\n\n```bash\npython paper/figures.py     # regenerate figures + metrics.tex\ntectonic paper/main.tex     # compile the PDF (any LaTeX engine works)\n```\n\n## The idea in one picture\n\n```\n            ┌──────────────────────── trained without labels, without a decoder ───────────────────────┐\n            │                                                                                            │\n obs_t  ──▶ │ Encoder Eθ ─▶ z_t ──▶ Predictor P ─▶ ẑ_{t+1} ─────────────▶  ‖ ẑ_{t+1} − sg(z̄_{t+1}) ‖²  │ ◀── loss is in\n (2 frames) │                                                              ▲   (prediction in latent      │     LATENT space,\n            │ obs_{t+1} ─▶ Encoder E_ema (EMA, stop-grad) ─▶ z̄_{t+1} ──────┘    space, never pixels)       │     never pixels\n            └────────────────────────────────────────────────────────────────────────────────────────────┘\n                                              │  + VICReg variance term  →  no representation collapse\n                                              ▼\n        (optional, separate) Decoder D : z → 16×16 frame      ←  the \"sellout\" that makes the dream visible \u0026 useful\n```\n\n## Why a bouncing logo?\n\nIt is the simplest system that still has the property that matters: **the future is unreadable from a single frame** (you can't tell which way a static dot is going), but **perfectly predictable from two** (position + velocity → the entire deterministic future, bounces included). So a context of two stacked frames is necessary and sufficient — exactly the spatio-temporal setup real video JEPAs use, minus a million hours of internet video.\n\n## Method\n\n| Component | Shape | Role |\n|---|---|---|\n| **Context encoder** `Eθ` | `2·16·16 → 256 → 128 → 32` | encodes an observation (2 stacked frames) to a latent |\n| **Target encoder** `E_ema` | same, EMA of `Eθ`, stop-grad | produces the prediction target — the anti-collapse asymmetry |\n| **Predictor** `P` | `32 → 64 → 32` | **the world model**: one step forward in latent space |\n| **Decoder** `D` *(optional)* | `32 → 64 → 256 → 256` | readout to pixels; a *pure* JEPA omits this |\n\n**Training objective.** Minimise the latent prediction error plus a variance term:\n\n```\nL = ‖ P(Eθ(obs_t)) − sg(E_ema(obs_{t+1})) ‖²   +   Σ_d relu(1 − std(z_d))\n       └──────── predict the future in representation space ────────┘     └─ VICReg anti-collapse ─┘\n```\n\nThe target encoder is an exponential moving average (`τ = 0.99`) of the online encoder with a stop-gradient — the [BYOL](#references) trick. Without the variance term the embedding std starts at **0.007** (collapsing to a constant); with it, std holds at **~2.4–3.0** throughout. The decoder is trained *separately* on the frozen encoder, so the JEPA does all the understanding and the decoder is only a readout.\n\n## Results\n\nAll numbers are produced by `python -m dvd_jepa.train` (seed 0, CPU, ~10 s) and saved to [`assets/metrics.json`](assets/metrics.json).\n\n| Result | Value | What it shows |\n|---|---:|---|\n| Linear-probe position RMSE | **0.73 px** (box is 16 px) | the 32-d latent secretly encodes exact world state |\n| Forecast MSE, 1 step ahead | **0.0005** | near-perfect short-horizon prediction |\n| Forecast MSE, 30 steps ahead | **0.028** | graceful latent-rollout drift, not collapse |\n| Anomaly peak / baseline | **88×** | a teleport is detected via prediction error… |\n| Anomaly detected at frame | **22** (injected at 24) | …on the correct frame (2 early: the monitor looks 2 ahead) |\n| Embedding std (collapse check) | **~3.0** (not 0) | the representation never collapsed |\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"assets/dvd_jepa_anomaly.png\" width=\"620\" alt=\"Predictive surprise spikes exactly on the injected anomaly\"/\u003e\n\u003c/div\u003e\n\n## Try it — interactive demo\n\n**▶ [dvd-jepa.vercel.app](https://dvd-jepa.vercel.app)** — the trained model running entirely in your browser (no server, no GPU). Also mirrored on [🤗 Hugging Face Spaces](https://huggingface.co/spaces/mandarwagh/dvd-jepa). Things to do:\n\n- **Toggle the decoder off.** This is the *pure JEPA*. It understands the bounce perfectly and gives you nothing but 32 latent bars — it literally cannot draw. Toggle it back on and the dream renders. This is the whole joke, made interactive.\n- **Inject an anomaly.** Teleport the logo and watch the surprise meter spike past the threshold.\n- **Dream 30 steps ahead.** Freeze reality and let the predictor roll forward on its own — watch it imagine the future, then slowly drift.\n\n\u003cdiv align=\"center\"\u003e\u003cimg src=\"assets/demo.png\" width=\"640\" alt=\"The interactive browser demo\"/\u003e\u003c/div\u003e\n\n## Reproduce\n\n```bash\ngit clone https://github.com/mandarwagh9/dvd-jepa\ncd dvd-jepa\npip install -r requirements.txt\n\npython -m dvd_jepa.train      # trains everything, writes checkpoints/, web/weights.json, assets/\npython scripts/pure_jepa.py   # the original no-decoder version: prints the ASCII latent dream\n```\n\nTo run the browser demo locally (ES modules need a server, not `file://`):\n\n```bash\ncd web \u0026\u0026 python -m http.server 8000   # then open http://localhost:8000\n```\n\nOr **[open the Colab notebook](https://colab.research.google.com/github/mandarwagh9/dvd-jepa/blob/main/notebooks/dvd_jepa.ipynb)** and run it cell by cell.\n\n## Repository layout\n\n```\ndvd_jepa/            the package\n  world.py           the bouncing-logo environment + observation pairs\n  models.py          Encoder, Predictor, Decoder, variance term\n  train.py           train, evaluate, export browser weights, render assets\nweb/                 the client-side interactive demo (index.html + jepa.js + weights.json)\nscripts/pure_jepa.py the original decoder-free \"it only does vectors\" version\nnotebooks/           Colab notebook\nassets/              rendered gif/png + metrics.json\ncheckpoints/         trained PyTorch weights\n```\n\n## How this relates to real systems\n\nDVD-JEPA is a toy, but every moving part has a full-scale counterpart:\n\n- **I-JEPA** (images) and **V-JEPA / V-JEPA 2** (video) use exactly this predict-in-representation-space objective with an EMA target encoder, at ViT scale on real data.\n- **V-JEPA 2-AC** makes the predictor *action-conditioned* and plans a real robot in latent space — the same \"imagine the future, pick the best\" loop, with actions added.\n- The two capabilities shown here — **forecast the next frames** and **flag when reality diverges from the forecast** — are exactly what a world model contributes to an egocentric-video data pipeline: predict what the person does next, and auto-surface the unexpected moment.\n\n## Limitations (honest)\n\n- **Latent rollout drifts** after ~20 steps: the predictor is trained for a single step, so errors compound. Multi-step rollout training or a recurrent predictor would extend the horizon.\n- **It's 16×16 and deterministic.** There is no stochastic latent `z` for multi-modal futures (real JEPAs add one) because the bouncing logo has exactly one future.\n- **The decoder is a crutch.** A pure JEPA has none; we add it only to *visualise* and to compute a pixel-space surprise score.\n\n## References\n\n1. Y. LeCun. *A Path Towards Autonomous Machine Intelligence.* 2022.\n2. M. Assran et al. *Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA).* CVPR 2023.\n3. A. Bardes et al. *Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA).* 2024.\n4. Meta AI. *V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.* 2025.\n5. A. Bardes, J. Ponce, Y. LeCun. *VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.* ICLR 2022.\n6. J.-B. Grill et al. *Bootstrap Your Own Latent (BYOL).* NeurIPS 2020.\n\n## Citation\n\n```bibtex\n@software{dvdjepa2026,\n  title  = {DVD-JEPA: a tiny reproducible JEPA world model of a bouncing logo},\n  author = {Wagh, Mandar},\n  year   = {2026},\n  url    = {https://github.com/mandarwagh9/dvd-jepa}\n}\n```\n\n## License\n\nMIT — see [LICENSE](LICENSE). Built as the rigorous sequel to *DVD Dreamer*.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmandarwagh9%2Fdvd-jepa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmandarwagh9%2Fdvd-jepa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmandarwagh9%2Fdvd-jepa/lists"}