https://github.com/OpenDriveLab/SparseVideoNav
Sparse Video Generation Model for Embodied Navigation conditioned on loose language guidance, 100% real world verification
embodied-navigation video-generation-model vln
- Host: GitHub
- URL: https://github.com/OpenDriveLab/SparseVideoNav
- Owner: OpenDriveLab
- License: other
- Created: 2026-02-04T04:27:22.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-31T01:54:41.000Z (2 months ago)
- Last Synced: 2026-04-04T22:57:04.886Z (2 months ago)
- Topics: embodied-navigation, video-generation-model, vln
- Language: Python
- Homepage: https://opendrivelab.com/SparseVideoNav
- Size: 8.17 MB
- Stars: 68
- Watchers: 3
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-and-novel-works-in-slam - [Code
README
SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Hai Zhang*
Siqi Liang*
Li Chen
Yuxian Li
Yukuan Xu
Yichao Zhong
Fu Zhang†
Hongyang Li†
The University of Hong Kong
## 📖 Introduction
SparseVideoNav introduces video generation models to real-world beyond-the-view vision-language navigation for the first time. It achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, a 27× speed-up over the unoptimized counterpart. In real-world zero-shot experiments, it reaches a 2.5× higher success rate than state-of-the-art LLM baselines and is the first to realize beyond-the-view navigation in challenging night scenes.
**Developers**: [Hai Zhang](https://zhanghenryhai12138.github.io/) and [Siqi Liang](https://github.com/stdcat)
## 📢 News
> [!IMPORTANT]
> 🌟 Stay up to date at [opendrivelab.com](https://opendrivelab.com/#news)!
- 🎉 **2026-02-05**: [Project Page](https://opendrivelab.com/SparseVideoNav) is now available!
- 🎉 **2026-02-06**: [arXiv preprint](https://arxiv.org/abs/2602.05827) is now available!
- 🎉 **2026-03-31**: Inference code and model checkpoint released!
## 📌 Table of Contents
- 📖 [Introduction](#-introduction)
- 📢 [News](#-news)
- 🔥 [Highlights](#-highlights)
- 🔧 [Installation](#-installation)
- 📥 [Checkpoint](#-checkpoint)
- 🚀 [Usage](#-usage)
- 📝 [TODO List](#-todo-list)
- 📬 [Contact](#-contact)
- 📄 [License and Citation](#-license-and-citation)
## 🔥 Highlights
- We investigate beyond-the-view navigation tasks in the real world by introducing video generation models to this field for the first time.
- We pioneer a paradigm shift from continuous to sparse video generation for a longer prediction horizon.
- We achieve sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a 27× speed-up compared to the unoptimized counterpart.
- We achieve the first realization of beyond-the-view navigation in challenging night scenes with a 17.5% success rate.
## 🔧 Installation
**Requirements**
- Linux (tested on Ubuntu)
- Python 3.10
- NVIDIA GPU with ≥ 16 GB VRAM
- [uv](https://docs.astral.sh/uv/) ≥ 0.7
```bash
# 1. Clone the repository
git clone https://github.com/OpenDriveLab/SparseVideoNav.git
cd SparseVideoNav
# 2. Create virtual environment and install dependencies
uv sync --all-groups
# 3. Activate the environment
source .venv/bin/activate
```
> `--all-groups` also installs `flash-attn`. Building it from source takes a few minutes on first install.
## 📥 Checkpoint
Download the SparseVideoNav pipeline checkpoint and place it under `models/SparseVideoNav-Models/`:
| Component | Download |
|-----------|----------|
| **SparseVideoNav pipeline checkpoint** | 🤗 [HuggingFace](https://huggingface.co/OpenDriveLab/SparseVideoNav_VGM) |
Expected directory layout after download:
```
models/SparseVideoNav-Models/
├── google/
│   └── umt5-xxl/
│       ├── special_tokens_map.json
│       ├── spiece.model
│       ├── tokenizer.json
│       └── tokenizer_config.json
├── models_t5_umt5-xxl-enc-bf16.pth
├── Wan2.1_VAE.pth
└── svn_ckpt/
    ├── config.json
    └── diffusion_pytorch_model.safetensors
```
If you place the checkpoint elsewhere, update `ckpt_path` in `config/inference.yaml` or override it on the command line.
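Before running inference, it can help to confirm that the download landed where the pipeline expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoint_files` is hypothetical, and the file list is copied from the layout above, assuming the tokenizer files sit under `google/umt5-xxl/` and the two model files under `svn_ckpt/`.

```python
from pathlib import Path

# Relative paths copied from the expected directory layout
# (nesting is an assumption based on that listing).
EXPECTED_FILES = [
    "google/umt5-xxl/special_tokens_map.json",
    "google/umt5-xxl/spiece.model",
    "google/umt5-xxl/tokenizer.json",
    "google/umt5-xxl/tokenizer_config.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
    "svn_ckpt/config.json",
    "svn_ckpt/diffusion_pytorch_model.safetensors",
]


def missing_checkpoint_files(ckpt_path):
    """Return the expected files that are absent under ckpt_path (hypothetical helper)."""
    root = Path(ckpt_path)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("models/SparseVideoNav-Models")
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Checkpoint layout looks complete.")
```

If the checkpoint lives elsewhere, pass that directory instead of the default path.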
## 🚀 Usage
### 1. Command-line inference
```bash
python inference.py video_path=/path/to/input.mp4 'prompt=turn right'
```
Results are written to `outputs/_/`:
- `predicted_video.mp4` — generated future video
Key overrides:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `video_path` | — | Input video path (required) |
| `prompt` | — | Language instruction (required) |
| `output_path` | `outputs` | Root output directory |
| `ckpt_path` | `models/SparseVideoNav-Models` | Pipeline checkpoint directory |
| `inference.device` | `cuda:0` | Target device |
| `inference.denoise_steps` | `4` | Denoising steps (higher → better quality) |
Example with overrides:
```bash
python inference.py \
    video_path=/path/to/input.mp4 \
    'prompt=walk forward and turn left' \
    ckpt_path=/path/to/checkpoint \
    inference.device=cuda:0 \
    inference.denoise_steps=8
```
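When inference is launched from another script (for example, to loop over several clips), the same `key=value` overrides can be assembled programmatically. This is a sketch under assumptions: `build_inference_command` and its double-underscore convention for dotted keys are hypothetical, not part of the repository, and only the override names from the table above are used. Passing the arguments as a list avoids the shell quoting otherwise needed for prompts that contain spaces.

```python
import subprocess
import sys


def build_inference_command(video_path, prompt, **overrides):
    """Build the argument list for inference.py (hypothetical helper).

    Dotted keys are written with double underscores,
    e.g. inference__denoise_steps=8 becomes inference.denoise_steps=8.
    """
    cmd = [
        sys.executable,
        "inference.py",
        f"video_path={video_path}",
        f"prompt={prompt}",
    ]
    for key, value in overrides.items():
        cmd.append(f"{key.replace('__', '.')}={value}")
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command(
        "/path/to/input.mp4",
        "walk forward and turn left",
        ckpt_path="/path/to/checkpoint",
        inference__denoise_steps=8,
    )
    print(cmd)
    # Only launch when explicitly requested; run from the repository root.
    if "--run" in sys.argv:
        subprocess.run(cmd, check=True)
```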
### 2. Gradio web demo
```bash
python gradio_interface.py
```
Opens a local demo at `http://0.0.0.0:7860`. Upload a video, enter a navigation instruction, and click **Run Prediction**.
Common options:
| Flag | Default | Description |
|------|---------|-------------|
| `--ckpt_path` | from config | Override checkpoint directory |
| `--device` | `cuda:0` | Target device |
| `--port` | `7860` | Server port |
| `--share` | `False` | Create a public Gradio share link |
### 3. Python API
```python
from omegaconf import OmegaConf

from inference import SVNPipeline

# Load the default configuration, then override fields as needed.
cfg = OmegaConf.load("config/inference.yaml")
cfg.ckpt_path = "/path/to/checkpoint"
cfg.inference.device = "cuda:0"

# Build the pipeline from the checkpoint directory named in the config.
pipeline = SVNPipeline.from_pretrained(cfg)

# Returns np.ndarray (T, H, W, C) uint8
video = pipeline(video="/path/to/input.mp4", text="turn right")
```
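The returned array can be inspected without any video tooling. The sketch below is an illustration rather than repository code: `save_frames_as_ppm` is a hypothetical helper, and it assumes the channel axis holds three RGB values, which the README does not state explicitly. It writes each frame as a binary PPM image using only NumPy and the standard library.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_frames_as_ppm(video, out_dir):
    """Write each frame of a (T, H, W, 3) uint8 array as a binary PPM file.

    Hypothetical helper; assumes RGB channel order.
    """
    video = np.asarray(video)
    if video.ndim != 4 or video.shape[-1] != 3 or video.dtype != np.uint8:
        raise ValueError(
            f"expected (T, H, W, 3) uint8, got {video.shape} {video.dtype}"
        )
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    _, height, width, _ = video.shape
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    paths = []
    for index, frame in enumerate(video):
        path = out / f"frame_{index:04d}.ppm"
        # PPM stores raw RGB bytes in row-major order after the header.
        path.write_bytes(header + frame.tobytes())
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Stand-in for the pipeline output: 3 black frames of 48x64 pixels.
    dummy = np.zeros((3, 48, 64, 3), dtype=np.uint8)
    for p in save_frames_as_ppm(dummy, tempfile.mkdtemp()):
        print(p)
```

In practice, `dummy` would be replaced by the `video` array returned by the pipeline call.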
For direct access to the latent-space model:
```python
from sparseVideoNav.svn_model import SVNModel
model = SVNModel.from_pretrained("/path/to/checkpoint/svn_ckpt")
```
## 📝 TODO List
- [x] SparseVideoNav Paper Release.
  - [x] arXiv preprint is now available!
- [x] SparseVideoNav Code Release.
  - [x] Inference code of distilled video generation model and model checkpoint.
  - [ ] Inference code of continuous action head and model checkpoint (estimated 2026 Q3).
- [ ] SparseVideoNav Dataset Release.
  - [ ] ~140h real-world VLN data (estimated 2026 Q3).
## 📬 Contact
For further inquiries or assistance, please contact [zhanghenryhai12138@gmail.com](mailto:zhanghenryhai12138@gmail.com) or [liangsiqi@connect.hku.hk](mailto:liangsiqi@connect.hku.hk).
## 📄 License and Citation
All the data and code within this repo are under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
- Please consider citing our work if it helps your research.
```BibTeX
@article{zhang2026sparse,
title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
journal={arXiv preprint arXiv:2602.05827},
year={2026}
}
```