https://github.com/OpenDriveLab/SparseVideoNav

Sparse Video Generation Model for Embodied Navigation conditioned on loose language guidance, 100% real world verification

Host: GitHub
URL: https://github.com/OpenDriveLab/SparseVideoNav
Owner: OpenDriveLab
License: other
Created: 2026-02-04T04:27:22.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-03-31T01:54:41.000Z (3 months ago)
Last Synced: 2026-04-04T22:57:04.886Z (3 months ago)
Topics: embodied-navigation, video-generation-model, vln
Language: Python
Homepage: https://opendrivelab.com/SparseVideoNav
Size: 8.17 MB
Stars: 68
Watchers: 3
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Hai Zhang^*, Siqi Liang^*, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang^†, Hongyang Li^†

The University of Hong Kong

## 📖 Introduction

SparseVideoNav introduces video generation models to real-world beyond-the-view vision-language navigation for the first time. It achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, a 27× speed-up over the unoptimized counterpart. In real-world zero-shot experiments it reaches a 2.5× higher success rate than state-of-the-art LLM baselines and is the first to realize beyond-the-view navigation in challenging night scenes.

**Developers**: [Hai Zhang](https://zhanghenryhai12138.github.io/) and [Siqi Liang](https://github.com/stdcat)

## 📢 News

> [!IMPORTANT]
> 🌟 Stay up to date at [opendrivelab.com](https://opendrivelab.com/#news)!

- 🎉 **2026-02-05**: [Project Page](https://opendrivelab.com/SparseVideoNav) is now available!

- 🎉 **2026-02-06**: [arXiv preprint](https://arxiv.org/abs/2602.05827) is now available!

- 🎉 **2026-03-31**: Inference code and model checkpoint released!

## 📌 Table of Contents

- 📖 [Introduction](#-introduction)

- 📢 [News](#-news)

- 🔥 [Highlights](#-highlights)

- 🔧 [Installation](#-installation)

- 📥 [Checkpoint](#-checkpoint)

- 🚀 [Usage](#-usage)

- 📝 [TODO List](#-todo-list)

- 📬 [Contact](#-contact)

- 📄 [License and Citation](#-license-and-citation)

## 🔥 Highlights

- We investigate beyond-the-view navigation tasks in the real world by introducing video generation models to this field for the first time.

- We pioneer a paradigm shift from continuous to sparse video generation for a longer prediction horizon.

- We achieve sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a 27× speed-up compared to the unoptimized counterpart.

- We achieve the first realization of beyond-the-view navigation in challenging night scenes with a 17.5% success rate.

## 🔧 Installation

**Requirements**

- Linux (tested on Ubuntu)

- Python 3.10

- NVIDIA GPU with ≥ 16 GB VRAM

- [uv](https://docs.astral.sh/uv/) ≥ 0.7

```bash
# 1. Clone the repository
git clone https://github.com/OpenDriveLab/SparseVideoNav.git
cd SparseVideoNav

# 2. Create virtual environment and install dependencies
uv sync --all-groups

# 3. Activate the environment
source .venv/bin/activate
```

> `--all-groups` also installs `flash-attn`. Building it from source takes a few minutes on first install.
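
Before running inference, it can help to compare the machine against the requirements listed above. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, it assumes PyTorch is among the dependencies installed by `uv sync`, and it only reports values rather than enforcing the 16 GB VRAM or uv version thresholds.

```python
import platform
import shutil
import sys


def check_environment() -> dict:
    """Compare the local machine against the requirements listed above."""
    report = {
        # The requirements list Linux (tested on Ubuntu).
        "linux": platform.system() == "Linux",
        # The requirements list Python 3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # Only checks that uv is on PATH; its version is not inspected.
        "uv_on_path": shutil.which("uv") is not None,
        # Filled in below only if PyTorch and a CUDA device are available.
        "gpu_vram_gib": None,
    }
    try:
        import torch  # assumption: installed as a project dependency
    except ImportError:
        return report
    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["gpu_vram_gib"] = round(total_bytes / 2**30, 1)
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

A `gpu_vram_gib` value below 16, or `None`, suggests the GPU requirement above is not met on the first CUDA device.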

## 📥 Checkpoint

Download the SparseVideoNav pipeline checkpoint and place it under `models/SparseVideoNav-Models/`:

| Component | Download |
|-----------|----------|
| **SparseVideoNav pipeline checkpoint** | 🤗 [HuggingFace](https://huggingface.co/OpenDriveLab/SparseVideoNav_VGM) |

Expected directory layout after download:

```
models/SparseVideoNav-Models/
├── google/
│   └── umt5-xxl/
│       ├── special_tokens_map.json
│       ├── spiece.model
│       ├── tokenizer.json
│       └── tokenizer_config.json
├── models_t5_umt5-xxl-enc-bf16.pth
├── Wan2.1_VAE.pth
└── svn_ckpt/
    ├── config.json
    └── diffusion_pytorch_model.safetensors
```

If you place the checkpoint elsewhere, update `ckpt_path` in `config/inference.yaml` or override it on the command line.
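
To confirm that a download matches the expected layout, a small check along these lines may be useful. This is a sketch rather than a script shipped with the repository: `missing_checkpoint_files` is a hypothetical helper, and the file list is copied from the directory tree above.

```python
from pathlib import Path

# Relative paths taken from the expected directory layout above.
EXPECTED_FILES = [
    "google/umt5-xxl/special_tokens_map.json",
    "google/umt5-xxl/spiece.model",
    "google/umt5-xxl/tokenizer.json",
    "google/umt5-xxl/tokenizer_config.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
    "svn_ckpt/config.json",
    "svn_ckpt/diffusion_pytorch_model.safetensors",
]


def missing_checkpoint_files(ckpt_path: str) -> list[str]:
    """Return the expected files that are absent under ckpt_path."""
    root = Path(ckpt_path)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("models/SparseVideoNav-Models")
    print("layout OK" if not missing else f"missing: {missing}")
```

Pass the same directory you set as `ckpt_path` if the checkpoint lives somewhere other than the default location.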

## 🚀 Usage

### 1. Command-line inference

```bash
python inference.py video_path=/path/to/input.mp4 'prompt=turn right'
```

Results are written to `outputs/_/`:

- `predicted_video.mp4` — generated future video

Key overrides:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `video_path` | — | Input video path (required) |
| `prompt` | — | Language instruction (required) |
| `output_path` | `outputs` | Root output directory |
| `ckpt_path` | `models/SparseVideoNav-Models` | Pipeline checkpoint directory |
| `inference.device` | `cuda:0` | Target device |
| `inference.denoise_steps` | `4` | Denoising steps (higher → better quality) |

Example with overrides:

```bash
python inference.py \
    video_path=/path/to/input.mp4 \
    'prompt=walk forward and turn left' \
    ckpt_path=/path/to/checkpoint \
    inference.device=cuda:0 \
    inference.denoise_steps=8
```
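
When launching inference from another Python script (for example, to loop over several clips), the same `key=value` overrides can be assembled programmatically. The sketch below is illustrative only: `build_inference_command` is a hypothetical helper, and it assumes nothing about how `inference.py` parses its arguments beyond the `key=value` form shown above.

```python
import shlex


def build_inference_command(video_path: str, prompt: str, **overrides) -> list[str]:
    """Build the argv list for inference.py from key=value overrides."""
    argv = [
        "python",
        "inference.py",
        f"video_path={video_path}",
        f"prompt={prompt}",
    ]
    # Dotted keys such as inference.denoise_steps are not valid Python
    # keyword names, so pass them by unpacking a dict (see below).
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


cmd = build_inference_command(
    "/path/to/input.mp4",
    "walk forward and turn left",
    **{"ckpt_path": "/path/to/checkpoint", "inference.denoise_steps": 8},
)
# shlex.join quotes the prompt so the printed line can be pasted into a shell.
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting issues with prompts that contain spaces.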

### 2. Gradio web demo

```bash
python gradio_interface.py
```

Opens a local demo at `http://0.0.0.0:7860`. Upload a video, enter a navigation instruction, and click **Run Prediction**.

Common options:

| Flag | Default | Description |
|------|---------|-------------|
| `--ckpt_path` | from config | Override checkpoint directory |
| `--device` | `cuda:0` | Target device |
| `--port` | `7860` | Server port |
| `--share` | `False` | Create a public Gradio share link |

### 3. Python API

```python
from omegaconf import OmegaConf
from inference import SVNPipeline

cfg = OmegaConf.load("config/inference.yaml")
cfg.ckpt_path = "/path/to/checkpoint"
cfg.inference.device = "cuda:0"

pipeline = SVNPipeline.from_pretrained(cfg)

# Returns np.ndarray (T, H, W, C) uint8
video = pipeline(video="/path/to/input.mp4", text="turn right")
```
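
Since the pipeline call returns a `(T, H, W, C)` `uint8` array, a quick sanity check before further processing can catch shape or dtype surprises. The following is a minimal sketch: `summarize_video` is a hypothetical helper, and the dummy array merely stands in for a real pipeline output (its sizes are arbitrary placeholders, not the model's actual resolution).

```python
import numpy as np


def summarize_video(video: np.ndarray) -> dict:
    """Validate a (T, H, W, C) uint8 array and report its dimensions."""
    if video.ndim != 4:
        raise ValueError(f"expected 4 dimensions (T, H, W, C), got {video.ndim}")
    if video.dtype != np.uint8:
        raise ValueError(f"expected uint8 frames, got {video.dtype}")
    frames, height, width, channels = video.shape
    return {"frames": frames, "height": height, "width": width, "channels": channels}


# Stand-in for the pipeline output; replace with the real `video` array.
dummy = np.zeros((5, 48, 64, 3), dtype=np.uint8)
print(summarize_video(dummy))
```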

For direct access to the latent-space model:

```python
from sparseVideoNav.svn_model import SVNModel

model = SVNModel.from_pretrained("/path/to/checkpoint/svn_ckpt")
```

## 📝 TODO List

- [x] SparseVideoNav Paper Release.

  - [x] arXiv preprint is now available!

- [x] SparseVideoNav Code Release.

  - [x] Inference code of distilled video generation model and model checkpoint.

  - [ ] Inference code of continuous action head and model checkpoint (estimated 2026 Q3).

- [ ] SparseVideoNav Dataset Release.

  - [ ] ~140h real-world VLN data (estimated 2026 Q3).

## 📬 Contact

For further inquiries or assistance, please contact [zhanghenryhai12138@gmail.com](mailto:zhanghenryhai12138@gmail.com) or [liangsiqi@connect.hku.hk](mailto:liangsiqi@connect.hku.hk).

## 📄 License and Citation

All the data and code within this repo are under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

- Please consider citing our work if it helps your research.

```BibTeX
@article{zhang2026sparse,
  title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
  author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
  journal={arXiv preprint arXiv:2602.05827},
  year={2026}
}
```