https://github.com/mandarwagh9/dvd-jepa
A tiny, fully-reproducible JEPA world model that learns the physics of a bouncing DVD logo in representation space, dreams its future, and detects anomalies. Trains on a CPU in ~10s. Interactive browser demo.
anomaly-detection deep-learning i-jepa interactive-demo jepa machine-learning pytorch representation-learning reproducible-research self-supervised-learning v-jepa video-prediction world-models
- Host: GitHub
- URL: https://github.com/mandarwagh9/dvd-jepa
- Owner: mandarwagh9
- License: mit
- Created: 2026-06-13T11:02:30.000Z (9 days ago)
- Default Branch: main
- Last Pushed: 2026-06-13T12:42:21.000Z (9 days ago)
- Last Synced: 2026-06-13T14:24:42.795Z (9 days ago)
- Topics: anomaly-detection, deep-learning, i-jepa, interactive-demo, jepa, machine-learning, pytorch, representation-learning, reproducible-research, self-supervised-learning, v-jepa, video-prediction, world-models
- Language: Python
- Homepage: https://dvd-jepa.vercel.app
- Size: 2.95 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DVD-JEPA
### A tiny, fully-reproducible **Joint-Embedding Predictive Architecture** world model that learns the physics of a bouncing DVD logo in representation space, dreams its future, and detects anomalies. Trains on a **CPU in ~10 seconds**.
[Paper (PDF)](paper/main.pdf) ·
[Live demo](https://dvd-jepa.vercel.app) ·
[Hugging Face Space](https://huggingface.co/spaces/mandarwagh/dvd-jepa) ·
[Colab notebook](https://colab.research.google.com/github/mandarwagh9/dvd-jepa/blob/main/notebooks/dvd_jepa.ipynb) ·
[License](LICENSE)


*Left: reality. Right: the model's dream, rolled forward purely in latent space and decoded to pixels.*
---
## Abstract
Most attempts to learn a **world model** from video try to predict the next frame pixel-by-pixel, and drown in detail that is fundamentally unpredictable. **JEPA** (Joint-Embedding Predictive Architecture, [LeCun 2022](#references)) makes a different bet: predict the *representation* of the future, not the pixels, and let the encoder discard whatever it cannot predict.
**DVD-JEPA** is the smallest honest demonstration of that idea we could build. The "world" is a DVD logo bouncing in a 16×16 box. A context encoder, an EMA target encoder, and a latent predictor are trained, with no labels and no decoder, to predict the next observation **in a 32-dimensional representation space**. We then show three things:
1. **It learned the world.** A linear probe recovers the logo's exact (y, x) position from the frozen 32-d latent to within **0.73 px**, though it was never given a coordinate.
2. **It can dream (once you add a decoder).** Bolt an optional decoder onto the frozen latents and roll the predictor forward: it renders a correct **future-frame video** of the bounce, including wall reflections, for ~20 steps before latent drift sets in.
3. **It is useful.** Run it as a 1-step predictive monitor and the prediction error becomes an **anomaly signal**: inject a teleport and surprise spikes **88×** over baseline, on the right frame.
The whole thing runs **client-side in your browser** at [dvd-jepa.vercel.app](https://dvd-jepa.vercel.app); the trained MLPs are re-implemented in ~40 lines of JavaScript. It is a joke and it is also a correct, working instance of the architecture behind I-JEPA, V-JEPA, and V-JEPA 2.
## Paper
There's a full arXiv-style write-up (method, anti-collapse ablation, forecast-horizon curve, anomaly detection, references): **[`paper/main.pdf`](paper/main.pdf)**, also attached to the [latest release](https://github.com/mandarwagh9/dvd-jepa/releases/latest).
The paper is fully reproducible: [`paper/main.tex`](paper/main.tex) is the LaTeX source and [`paper/figures.py`](paper/figures.py) regenerates every figure and number in it.
```bash
python paper/figures.py # regenerate figures + metrics.tex
tectonic paper/main.tex # compile the PDF (any LaTeX engine works)
```
## The idea in one picture
```
            ┌──────────────── trained without labels, without a decoder ─────────────────┐
obs_t ────▶ │ Encoder Eθ ─▶ z_t ─▶ Predictor P ─▶ ẑ_{t+1} ──▶ ‖ ẑ_{t+1} − sg(z̄_{t+1}) ‖² │ ◀── loss is in LATENT space,
(2 frames)  │                                                                   ▲        │     never pixels
            │ obs_{t+1} ─▶ Encoder E_ema (EMA, stop-grad) ─▶ z̄_{t+1} ───────────┘        │
            └────────────────────────────────────────────────────────────────────────────┘
                             │ + VICReg variance term → no representation collapse
                             ▼
(optional, separate)    Decoder D : z → 16×16 frame   ← the "sellout" that makes the dream visible & useful
```
## Why a bouncing logo?
It is the simplest system that still has the property that matters: **the future is unreadable from a single frame** (you can't tell which way a static dot is going), but **perfectly predictable from two** (position + velocity give the entire deterministic future, bounces included). So a context of two stacked frames is necessary and sufficient, which is exactly the spatio-temporal setup real video JEPAs use, minus a million hours of internet video.
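A minimal sketch of that argument, assuming a one-pixel "logo" moving one pixel per tick along each axis. The names `step`, `render`, and `BOX` are illustrative and are not taken from `dvd_jepa/world.py`.

```python
BOX = 16  # side length of the square world, in pixels


def step(pos, vel):
    """Advance one tick; flip a velocity component when it would leave the box."""
    new_pos, new_vel = [], []
    for p, v in zip(pos, vel):
        q = p + v
        if q < 0 or q > BOX - 1:
            v = -v
            q = p + v
        new_pos.append(q)
        new_vel.append(v)
    return tuple(new_pos), tuple(new_vel)


def render(pos):
    """Draw a one-pixel logo as a BOX x BOX grid of 0/1 values."""
    return [[int((r, c) == pos) for c in range(BOX)] for r in range(BOX)]


# Two worlds: same position now, opposite velocities.
pos_a, vel_a = (5, 7), (1, 1)
pos_b, vel_b = (5, 7), (-1, -1)

same_frame = render(pos_a) == render(pos_b)        # one frame cannot tell them apart
next_a, _ = step(pos_a, vel_a)
next_b, _ = step(pos_b, vel_b)
different_next = render(next_a) != render(next_b)  # yet their futures differ

# With two consecutive frames the velocity is just a difference (away from a wall).
recovered = tuple(n - p for n, p in zip(next_a, pos_a))
print(same_frame, different_next, recovered)
```

Stacking two frames hands the encoder both the position and this difference, which is all the state the deterministic bounce has.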
## Method
| Component | Shape | Role |
|---|---|---|
| **Context encoder** `Eθ` | `2·16·16 → 256 → 128 → 32` | encodes an observation (2 stacked frames) to a latent |
| **Target encoder** `E_ema` | same, EMA of `Eθ`, stop-grad | produces the prediction target; this is the anti-collapse asymmetry |
| **Predictor** `P` | `32 → 64 → 32` | **the world model**: one step forward in latent space |
| **Decoder** `D` *(optional)* | `32 → 64 → 256 → 256` | readout to pixels; a *pure* JEPA omits this |
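As a rough size check, the layer widths in the table imply the parameter counts below. This assumes every arrow is a plain fully connected layer with a bias vector; the table gives only widths, so the layer type and the biases are assumptions. `count_params` is an illustrative helper, not part of the repository.

```python
def count_params(widths):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


shapes = {
    "encoder": [2 * 16 * 16, 256, 128, 32],  # one copy; E_ema mirrors it
    "predictor": [32, 64, 32],
    "decoder": [32, 64, 256, 256],           # 256 outputs = one 16x16 frame
}

for name, widths in shapes.items():
    print(f"{name:9s} {count_params(widths):>7,d} parameters")
```

Under these assumptions the encoder has 168,352 parameters, the predictor 4,192, and the decoder 84,544.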
**Training objective.** Minimise the latent prediction error plus a variance term:
```
L = ‖ P(Eθ(obs_t)) − sg(E_ema(obs_{t+1})) ‖²     ← predict the future in representation space
  + Σ_d relu(1 − std(z_d))                        ← VICReg anti-collapse
```
The target encoder is an exponential moving average (`τ = 0.99`) of the online encoder with a stop-gradient, the [BYOL](#references) trick. Without the variance term the embedding std starts at **0.007** (collapsing to a constant); with it, std holds at **~2.4–3.0** throughout. The decoder is trained *separately* on the frozen encoder, so the JEPA does all the understanding and the decoder is only a readout.
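A dependency-free sketch of the pieces described above: the latent prediction error, the variance hinge, and the EMA target update. It uses plain Python lists in place of tensors. The function names are illustrative rather than those of `dvd_jepa/models.py`, and two details are assumptions: the error is averaged over the batch, and the standard deviation is the population one.

```python
from statistics import pstdev


def prediction_error(pred, target):
    """Mean over the batch of the squared L2 distance between latents."""
    per_sample = [
        sum((p - t) ** 2 for p, t in zip(ps, ts)) for ps, ts in zip(pred, target)
    ]
    return sum(per_sample) / len(per_sample)


def variance_term(batch):
    """Sum over latent dimensions of relu(1 - std)."""
    dims = zip(*batch)  # transpose: one tuple of values per dimension
    return sum(max(0.0, 1.0 - pstdev(d)) for d in dims)


def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter a fraction (1 - tau) towards the online one."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # every sample identical
spread = [[-2.0, 3.0], [0.0, 0.0], [2.0, -3.0]]   # std > 1 in both dimensions

print(variance_term(collapsed))  # each collapsed dimension is penalised
print(variance_term(spread))     # no penalty once std >= 1
print(ema_update([0.0, 1.0], [1.0, 1.0]))
```

The hinge only pushes a dimension while its std is below 1, so it prevents collapse without otherwise constraining the latent.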
## Results
All numbers are produced by `python -m dvd_jepa.train` (seed 0, CPU, ~10 s) and saved to [`assets/metrics.json`](assets/metrics.json).
| Result | Value | What it shows |
|---|---:|---|
| Linear-probe position RMSE | **0.73 px** (box is 16 px) | the 32-d latent secretly encodes exact world state |
| Forecast MSE, 1 step ahead | **0.0005** | near-perfect short-horizon prediction |
| Forecast MSE, 30 steps ahead | **0.028** | graceful latent-rollout drift, not collapse |
| Anomaly peak / baseline | **88×** | a teleport is detected via prediction error… |
| Anomaly detected at frame | **22** (injected at 24) | …on the correct frame (2 early: the monitor looks 2 ahead) |
| Embedding std (collapse check) | **~3.0** (not 0) | the representation never collapsed |
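The anomaly rows can be read as a thresholding rule on a per-frame surprise score. The sketch below runs on made-up error values, not the repository's measurements. The `threshold_factor` parameter and the choice of the median as the baseline are assumptions, and none of it is taken from `dvd_jepa/train.py`.

```python
from statistics import median


def detect_anomalies(errors, threshold_factor=10.0):
    """Return (peak/baseline ratio, frames whose error exceeds the threshold)."""
    baseline = median(errors)  # robust to a few spikes
    threshold = threshold_factor * baseline
    flagged = [t for t, e in enumerate(errors) if e > threshold]
    return max(errors) / baseline, flagged


# Synthetic surprise trace: small errors with one spike at frame 6.
errors = [0.001, 0.001, 0.002, 0.001, 0.001, 0.001, 0.080, 0.001, 0.002, 0.001]
ratio, flagged = detect_anomalies(errors)
print(f"peak/baseline = {ratio:.0f}x, flagged frames = {flagged}")
```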
## Try it: interactive demo
**▶ [dvd-jepa.vercel.app](https://dvd-jepa.vercel.app)**: the trained model running entirely in your browser (no server, no GPU). Also mirrored on [Hugging Face Spaces](https://huggingface.co/spaces/mandarwagh/dvd-jepa). Things to do:
- **Toggle the decoder off.** This is the *pure JEPA*. It understands the bounce perfectly and gives you nothing but 32 latent bars; it literally cannot draw. Toggle it back on and the dream renders. This is the whole joke, made interactive.
- **Inject an anomaly.** Teleport the logo and watch the surprise meter spike past the threshold.
- **Dream 30 steps ahead.** Freeze reality and let the predictor roll forward on its own; watch it imagine the future, then slowly drift.
## Reproduce
```bash
git clone https://github.com/mandarwagh9/dvd-jepa
cd dvd-jepa
pip install -r requirements.txt
python -m dvd_jepa.train # trains everything, writes checkpoints/, web/weights.json, assets/
python scripts/pure_jepa.py # the original no-decoder version: prints the ASCII latent dream
```
To run the browser demo locally (ES modules need a server, not `file://`):
```bash
cd web && python -m http.server 8000 # then open http://localhost:8000
```
Or **[open the Colab notebook](https://colab.research.google.com/github/mandarwagh9/dvd-jepa/blob/main/notebooks/dvd_jepa.ipynb)** and run it cell by cell.
## Repository layout
```
dvd_jepa/               the package
  world.py              the bouncing-logo environment + observation pairs
  models.py             Encoder, Predictor, Decoder, variance term
  train.py              train, evaluate, export browser weights, render assets
web/                    the client-side interactive demo (index.html + jepa.js + weights.json)
scripts/pure_jepa.py    the original decoder-free "it only does vectors" version
notebooks/              Colab notebook
assets/                 rendered gif/png + metrics.json
checkpoints/            trained PyTorch weights
```
## How this relates to real systems
DVD-JEPA is a toy, but every moving part has a full-scale counterpart:
- **I-JEPA** (images) and **V-JEPA / V-JEPA 2** (video) use exactly this predict-in-representation-space objective with an EMA target encoder, at ViT scale on real data.
- **V-JEPA 2-AC** makes the predictor *action-conditioned* and plans a real robot in latent space: the same "imagine the future, pick the best" loop, with actions added.
- The two capabilities shown here, **forecast the next frames** and **flag when reality diverges from the forecast**, are exactly what a world model contributes to an egocentric-video data pipeline: predict what the person does next, and auto-surface the unexpected moment.
## Limitations (honest)
- **Latent rollout drifts** after ~20 steps: the predictor is trained for a single step, so errors compound. Multi-step rollout training or a recurrent predictor would extend the horizon.
- **It's 16×16 and deterministic.** There is no stochastic latent `z` for multi-modal futures (real JEPAs add one) because the bouncing logo has exactly one future.
- **The decoder is a crutch.** A pure JEPA has none; we add it only to *visualise* and to compute a pixel-space surprise score.
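The first limitation, compounding error in an autoregressive rollout, can be illustrated without a neural network. In this sketch the "predictor" is the true one-dimensional bounce dynamics with a small, deliberately injected velocity error (`bias`, a made-up value). Feeding the model its own output makes the gap to the ground truth tend to grow with the horizon. The function names and constants are illustrative only.

```python
def bounce(x, v, lo=0.0, hi=15.0):
    """One step of 1-D motion with reflecting walls at lo and hi."""
    x += v
    if x < lo:
        x, v = 2 * lo - x, -v
    elif x > hi:
        x, v = 2 * hi - x, -v
    return x, v


def rollout_error(steps, bias=0.02):
    """Absolute position error when a slightly wrong model is fed its own output."""
    true_x, true_v = 3.0, 1.0
    pred_x, pred_v = 3.0, 1.0 + bias  # the one-step model is off by `bias`
    errs = []
    for _ in range(steps):
        true_x, true_v = bounce(true_x, true_v)
        pred_x, pred_v = bounce(pred_x, pred_v)  # no correction from reality
        errs.append(abs(pred_x - true_x))
    return errs


errs = rollout_error(30)
print(f"error after 1 step: {errs[0]:.2f}, after 30 steps: {errs[-1]:.2f}")
```

A one-step monitor avoids this drift because it re-reads reality at every frame; only the free-running dream accumulates the error.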
## References
1. Y. LeCun. *A Path Towards Autonomous Machine Intelligence.* 2022.
2. M. Assran et al. *Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA).* CVPR 2023.
3. A. Bardes et al. *Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA).* 2024.
4. Meta AI. *V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.* 2025.
5. A. Bardes, J. Ponce, Y. LeCun. *VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.* ICLR 2022.
6. J.-B. Grill et al. *Bootstrap Your Own Latent (BYOL).* NeurIPS 2020.
## Citation
```bibtex
@software{dvdjepa2026,
title = {DVD-JEPA: a tiny reproducible JEPA world model of a bouncing logo},
author = {Wagh, Mandar},
year = {2026},
url = {https://github.com/mandarwagh9/dvd-jepa}
}
```
## License
MIT; see [LICENSE](LICENSE). Built as the rigorous sequel to *DVD Dreamer*.