# DVD-JEPA

### A tiny, fully-reproducible **Joint-Embedding Predictive Architecture** world model that learns the physics of a bouncing DVD logo in representation space, dreams its future, and detects anomalies. Trains on a **CPU in ~10 seconds**.

[![Paper (PDF)](https://img.shields.io/badge/📄_paper-PDF-b31b1b)](paper/main.pdf)
[![Live demo](https://img.shields.io/badge/▶_live_demo-run_in_browser-2bd4ff)](https://dvd-jepa.vercel.app)
[![HF Space](https://img.shields.io/badge/🤗_Spaces-demo-yellow)](https://huggingface.co/spaces/mandarwagh/dvd-jepa)
[![Open in Colab](https://img.shields.io/badge/Colab-train_it_yourself-F9AB00?logo=googlecolab&logoColor=white)](https://colab.research.google.com/github/mandarwagh9/dvd-jepa/blob/main/notebooks/dvd_jepa.ipynb)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
![CPU only](https://img.shields.io/badge/hardware-CPU_only-lightgrey)

Reality vs. the JEPA's rendered latent dream

*Left: reality. Right: the model's dream, rolled forward purely in latent space and decoded to pixels.*

---

## Abstract

Most attempts to learn a **world model** from video try to predict the next frame pixel-by-pixel, and drown in detail that is fundamentally unpredictable. **JEPA** (Joint-Embedding Predictive Architecture, [LeCun 2022](#references)) makes a different bet: predict the *representation* of the future, not the pixels, and let the encoder discard whatever it cannot predict.

**DVD-JEPA** is the smallest honest demonstration of that idea we could build. The "world" is a DVD logo bouncing in a 16×16 box. A context encoder, an EMA target encoder, and a latent predictor are trained, with no labels and no decoder, to predict the next observation **in a 32-dimensional representation space**. We then show three things:

1. **It learned the world.** A linear probe recovers the logo's exact (y, x) position from the frozen 32-d latent to within **0.73 px**, though it was never given a coordinate.
2. **It can dream (once you add a decoder).** Bolt an optional decoder onto the frozen latents and roll the predictor forward: it renders a correct **future-frame video** of the bounce, including wall reflections, for ~20 steps before latent drift sets in.
3. **It is useful.** Run it as a 1-step predictive monitor and the prediction error becomes an **anomaly signal**: inject a teleport and surprise spikes **88×** over baseline, on the right frame.

The whole thing runs **client-side in your browser** at [dvd-jepa.vercel.app](https://dvd-jepa.vercel.app): the trained MLPs are re-implemented in ~40 lines of JavaScript. It is a joke and it is also a correct, working instance of the architecture behind I-JEPA, V-JEPA, and V-JEPA 2.

## 📄 Paper

There's a full arXiv-style write-up (method, anti-collapse ablation, forecast-horizon curve, anomaly detection, references): **[`paper/main.pdf`](paper/main.pdf)**, also attached to the [latest release](https://github.com/mandarwagh9/dvd-jepa/releases/latest).

The paper is fully reproducible: [`paper/main.tex`](paper/main.tex) is the LaTeX source and [`paper/figures.py`](paper/figures.py) regenerates every figure and number in it.

```bash
python paper/figures.py # regenerate figures + metrics.tex
tectonic paper/main.tex # compile the PDF (any LaTeX engine works)
```

## The idea in one picture

```
            ┌───────────── trained without labels, without a decoder ──────────────┐
            │                                                                      │
obs_t ────▶ │  Encoder Eθ ─▶ z_t ─▶ Predictor P ─▶ ẑ_{t+1} ──┐                     │
(2 frames)  │                                                ▼                     │
            │                                   ‖ ẑ_{t+1} − sg(z̄_{t+1}) ‖²         │ ◀── loss is in LATENT space,
            │                                                ▲                     │     never pixels
obs_{t+1} ─▶│  Encoder E_ema (EMA, stop-grad) ─▶ z̄_{t+1} ────┘                     │
            └──────────────────────────────────────────────────────────────────────┘
                 │  + VICReg variance term → no representation collapse
                 ▼
   (optional, separate)  Decoder D : z → 16×16 frame   ← the "sellout" that makes the dream visible & useful
```

## Why a bouncing logo?

It is the simplest system that still has the property that matters: **the future is unreadable from a single frame** (you can't tell which way a static dot is going), but **perfectly predictable from two** (position + velocity → the entire deterministic future, bounces included). So a context of two stacked frames is necessary and sufficient, exactly the spatio-temporal setup real video JEPAs use, minus a million hours of internet video.
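
The two-frame argument can be checked with a few lines of plain Python. This is a minimal sketch, not the repository's `world.py`: it treats the logo as a single point on a 16×16 grid, assumes no bounce happens between the two context frames, and the helper names (`step_axis`, `rollout`, `future_from_two_frames`) are made up for illustration.

```python
def step_axis(pos, vel, lo, hi):
    """Advance one coordinate by one frame, reflecting off the walls at lo and hi."""
    nxt = pos + vel
    if nxt < lo:
        return 2 * lo - nxt, -vel
    if nxt > hi:
        return 2 * hi - nxt, -vel
    return nxt, vel


def rollout(pos, vel, n_steps, size=16):
    """Deterministic future of a point bouncing inside a size x size box."""
    (y, x), (vy, vx) = pos, vel
    path = []
    for _ in range(n_steps):
        y, vy = step_axis(y, vy, 0, size - 1)
        x, vx = step_axis(x, vx, 0, size - 1)
        path.append((y, x))
    return path


def future_from_two_frames(prev_pos, curr_pos, n_steps, size=16):
    """Two consecutive positions give the velocity, which fixes the whole future."""
    vel = (curr_pos[0] - prev_pos[0], curr_pos[1] - prev_pos[1])
    return rollout(curr_pos, vel, n_steps, size)


# Same current frame, different previous frame: the futures differ.
future_a = future_from_two_frames((4, 12), (5, 13), 4)  # moving down-right, hits the wall
future_b = future_from_two_frames((6, 14), (5, 13), 4)  # moving up-left
print(future_a)
print(future_b)
```

Both calls share the current position `(5, 13)`, so a single frame cannot distinguish them; only the previous frame does.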

## Method

| Component | Shape | Role |
|---|---|---|
| **Context encoder** `Eθ` | `2·16·16 → 256 → 128 → 32` | encodes an observation (2 stacked frames) to a latent |
| **Target encoder** `E_ema` | same, EMA of `Eθ`, stop-grad | produces the prediction target; the anti-collapse asymmetry |
| **Predictor** `P` | `32 → 64 → 32` | **the world model**: one step forward in latent space |
| **Decoder** `D` *(optional)* | `32 → 64 → 256 → 256` | readout to pixels; a *pure* JEPA omits this |
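
To get a feel for how small these networks are, the layer widths in the table can be turned into parameter counts. The sketch below assumes every arrow is a plain fully connected layer with a bias and nothing else (no normalisation layers); the real `models.py` may differ, so treat the totals as an estimate. `mlp_params` and `SHAPES` are illustrative names.

```python
def mlp_params(widths):
    """Weights plus biases for a fully connected stack with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


# Layer widths copied from the Method table.
SHAPES = {
    "encoder": [2 * 16 * 16, 256, 128, 32],
    "predictor": [32, 64, 32],
    "decoder": [32, 64, 256, 256],
}

counts = {name: mlp_params(widths) for name, widths in SHAPES.items()}
for name, n in counts.items():
    print(f"{name:>9}: {n:,} parameters")
```

Under these assumptions the encoder dominates (its first layer alone maps 512 inputs to 256 units), while the predictor, the part that acts as the world model, is by far the smallest piece. The target encoder has the same count as the context encoder but receives no gradient updates.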

**Training objective.** Minimise the latent prediction error plus a variance term:

```
L = ‖ P(Eθ(obs_t)) − sg(E_ema(obs_{t+1})) ‖²      +      Σ_d relu(1 − std(z_d))
    └─ predict the future in representation space ─┘   └─ VICReg anti-collapse ─┘
```
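
The two terms of the objective can be written out in plain Python to make the arithmetic concrete. This is a sketch under stated assumptions rather than the code in `dvd_jepa/models.py`: latents are lists of floats, the prediction term is averaged over the batch, `std` is the population standard deviation across the batch, and the stop-gradient is implicit because nothing here is differentiated. The function names are illustrative.

```python
import math


def latent_mse(pred, target):
    """Batch mean of the squared L2 distance between predicted and target latents."""
    total = 0.0
    for p, t in zip(pred, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(pred)


def variance_penalty(batch):
    """Sum over latent dimensions d of relu(1 - std(z_d)), std taken across the batch."""
    n, dim = len(batch), len(batch[0])
    penalty = 0.0
    for d in range(dim):
        column = [z[d] for z in batch]
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
        penalty += max(0.0, 1.0 - std)
    return penalty


collapsed = [[0.5, 0.5]] * 4                         # every sample maps to the same point
spread = [[-2, -2], [2, 2], [-2, 2], [2, -2]]        # std of 2 in each dimension

print(variance_penalty(collapsed))  # one unit of penalty per collapsed dimension
print(variance_penalty(spread))     # no penalty once std >= 1
print(latent_mse([[1.0, 2.0]], [[1.0, 0.0]]))
```

A collapsed encoder makes the prediction term trivially zero, so the hinge on the standard deviation is the only thing that penalises it.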

The target encoder is an exponential moving average (`τ = 0.99`) of the online encoder with a stop-gradient: the [BYOL](#references) trick. Without the variance term the embedding std falls to **0.007** (collapsing to a constant); with it, std holds at **~2.4–3.0** throughout. The decoder is trained *separately* on the frozen encoder, so the JEPA does all the understanding and the decoder is only a readout.
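
The EMA update itself is one line per parameter. The sketch below applies it to flat lists of floats standing in for the two encoders' weights; `ema_update` is an illustrative name, and in PyTorch the same rule would be applied tensor by tensor with gradients disabled.

```python
def ema_update(target, online, tau=0.99):
    """Return tau * target + (1 - tau) * online, element by element."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


# Toy "weights": the target starts at zero and chases a fixed online encoder.
target, online = [0.0, 0.0], [1.0, -2.0]
history = []
for _ in range(100):
    target = ema_update(target, online)
    history.append(target[0])

print(history[0])   # after one step the target has moved 1% of the way
print(history[-1])  # after 100 steps it has covered 1 - 0.99**100 of the gap
```

With `τ = 0.99` the target lags the online encoder with a time constant of roughly `1 / (1 − τ) = 100` updates, which is what keeps the prediction target slow-moving.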

## Results

All numbers are produced by `python -m dvd_jepa.train` (seed 0, CPU, ~10 s) and saved to [`assets/metrics.json`](assets/metrics.json).

| Result | Value | What it shows |
|---|---:|---|
| Linear-probe position RMSE | **0.73 px** (box is 16 px) | the 32-d latent secretly encodes exact world state |
| Forecast MSE, 1 step ahead | **0.0005** | near-perfect short-horizon prediction |
| Forecast MSE, 30 steps ahead | **0.028** | graceful latent-rollout drift, not collapse |
| Anomaly peak / baseline | **88×** | a teleport is detected via prediction error… |
| Anomaly detected at frame | **22** (injected at 24) | …on the correct frame (2 early: the monitor looks 2 ahead) |
| Embedding std (collapse check) | **~3.0** (not 0) | the representation never collapsed |


*Figure: predictive surprise spikes exactly on the injected anomaly.*
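
The peak-over-baseline idea behind the anomaly rows can be sketched as follows. This is an assumed formulation, not the repository's monitor: the baseline is taken to be the median per-frame error, the `threshold` of 10× is arbitrary, and the error series is synthetic data invented for the example.

```python
from statistics import median


def surprise_ratio(errors):
    """Scale each per-frame prediction error by the median error (the baseline)."""
    baseline = median(errors)  # assumes a non-zero baseline
    return [e / baseline for e in errors]


def flag_anomalies(errors, threshold=10.0):
    """Return the frame indices whose surprise exceeds threshold x baseline."""
    return [i for i, r in enumerate(surprise_ratio(errors)) if r > threshold]


# Synthetic per-frame errors: flat baseline with one injected spike at frame 7.
errors = [0.001] * 30
errors[7] = 0.05

flagged = flag_anomalies(errors)
print(flagged)
```

Using the median rather than the mean keeps a single large spike from inflating its own baseline.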

## Try it: interactive demo

**▶ [dvd-jepa.vercel.app](https://dvd-jepa.vercel.app)**: the trained model running entirely in your browser (no server, no GPU). Also mirrored on [🤗 Hugging Face Spaces](https://huggingface.co/spaces/mandarwagh/dvd-jepa). Things to do:

- **Toggle the decoder off.** This is the *pure JEPA*. It understands the bounce perfectly and gives you nothing but 32 latent bars: it literally cannot draw. Toggle it back on and the dream renders. This is the whole joke, made interactive.
- **Inject an anomaly.** Teleport the logo and watch the surprise meter spike past the threshold.
- **Dream 30 steps ahead.** Freeze reality and let the predictor roll forward on its own; watch it imagine the future, then slowly drift.

*Figure: the interactive browser demo.*

## Reproduce

```bash
git clone https://github.com/mandarwagh9/dvd-jepa
cd dvd-jepa
pip install -r requirements.txt

python -m dvd_jepa.train # trains everything, writes checkpoints/, web/weights.json, assets/
python scripts/pure_jepa.py # the original no-decoder version: prints the ASCII latent dream
```

To run the browser demo locally (ES modules need a server, not `file://`):

```bash
cd web && python -m http.server 8000 # then open http://localhost:8000
```

Or **[open the Colab notebook](https://colab.research.google.com/github/mandarwagh9/dvd-jepa/blob/main/notebooks/dvd_jepa.ipynb)** and run it cell by cell.

## Repository layout

```
dvd_jepa/               the package
  world.py              the bouncing-logo environment + observation pairs
  models.py             Encoder, Predictor, Decoder, variance term
  train.py              train, evaluate, export browser weights, render assets
web/                    the client-side interactive demo (index.html + jepa.js + weights.json)
scripts/pure_jepa.py    the original decoder-free "it only does vectors" version
notebooks/              Colab notebook
assets/                 rendered gif/png + metrics.json
checkpoints/            trained PyTorch weights
```

## How this relates to real systems

DVD-JEPA is a toy, but every moving part has a full-scale counterpart:

- **I-JEPA** (images) and **V-JEPA / V-JEPA 2** (video) use exactly this predict-in-representation-space objective with an EMA target encoder, at ViT scale on real data.
- **V-JEPA 2-AC** makes the predictor *action-conditioned* and plans a real robot's actions in latent space: the same "imagine the future, pick the best" loop, with actions added.
- The two capabilities shown here, **forecast the next frames** and **flag when reality diverges from the forecast**, are exactly what a world model contributes to an egocentric-video data pipeline: predict what the person does next, and auto-surface the unexpected moment.

## Limitations (honest)

- **Latent rollout drifts** after ~20 steps: the predictor is trained for a single step, so errors compound. Multi-step rollout training or a recurrent predictor would extend the horizon.
- **It's 16×16 and deterministic.** There is no stochastic latent `z` for multi-modal futures (real JEPAs add one) because the bouncing logo has exactly one future.
- **The decoder is a crutch.** A pure JEPA has none; we add it only to *visualise* and to compute a pixel-space surprise score.
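
The first limitation, compounding error in open-loop rollout, shows up even in a one-dimensional toy. In the sketch below `true_step` and `learned_step` are hypothetical stand-ins (a real predictor is an MLP over 32-d latents), and the 1% bias is invented; the point is only the contrast between the one-step error and the open-loop error.

```python
def rollout(predict, z0, n_steps):
    """Feed the predictor its own output n_steps times (open loop)."""
    z, path = z0, []
    for _ in range(n_steps):
        z = predict(z)
        path.append(z)
    return path


def true_step(z):
    return z + 1.0


def learned_step(z):
    return z + 1.01  # hypothetical predictor with a small per-step bias


truth = rollout(true_step, 0.0, 30)
dream = rollout(learned_step, 0.0, 30)

# Open loop: the predictor consumes its own outputs, so the bias accumulates.
open_loop_err = [abs(d - t) for d, t in zip(dream, truth)]
# One step: the predictor is re-anchored on the true state every frame.
one_step_err = [abs(learned_step(t) - true_step(t)) for t in truth]

print(open_loop_err[0], open_loop_err[-1])
print(max(one_step_err))
```

The one-step error stays at the size of the bias while the open-loop error grows with the horizon, which is why training on multi-step rollouts is the usual remedy.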

## References

1. Y. LeCun. *A Path Towards Autonomous Machine Intelligence.* 2022.
2. M. Assran et al. *Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA).* CVPR 2023.
3. A. Bardes et al. *Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA).* 2024.
4. Meta AI. *V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.* 2025.
5. A. Bardes, J. Ponce, Y. LeCun. *VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.* ICLR 2022.
6. J.-B. Grill et al. *Bootstrap Your Own Latent (BYOL).* NeurIPS 2020.

## Citation

```bibtex
@software{dvdjepa2026,
  title  = {DVD-JEPA: a tiny reproducible JEPA world model of a bouncing logo},
  author = {Wagh, Mandar},
  year   = {2026},
  url    = {https://github.com/mandarwagh9/dvd-jepa}
}
```

## License

MIT, see [LICENSE](LICENSE). Built as the rigorous sequel to *DVD Dreamer*.