https://github.com/InternRobotics/StreamVLN
Official implementation of the paper: "StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling"
- Host: GitHub
- URL: https://github.com/InternRobotics/StreamVLN
- Owner: InternRobotics
- Created: 2025-07-07T09:06:25.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-08-09T10:12:42.000Z (2 months ago)
- Last Synced: 2025-08-09T12:14:13.673Z (2 months ago)
- Language: Python
- Homepage: https://streamvln.github.io/
- Size: 12.9 MB
- Stars: 176
- Watchers: 6
- Forks: 7
- Open Issues: 10
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-and-novel-works-in-slam
README
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei*
Chenyang Wan*
Xiqian Yu*
Tai Wang*‡
Yuqiang Yang
Xiaohan Mao
Chenming Zhu
Wenzhe Cai
Hanqing Wang
Yilun Chen
Xihui Liu†
Jiangmiao Pang†
Shanghai AI Laboratory · The University of Hong Kong · Zhejiang University · Shanghai Jiao Tong University
[Paper](http://arxiv.org/abs/2507.05240)
[Project Page](https://streamvln.github.io/)
[Hugging Face](https://huggingface.co/papers/2507.05240/)
[Video](https://www.youtube.com/watch?v=gG3mpefOBjc)

## 🏠 About
StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on **LLaVA-Video** as the foundational Video-LLM, we extend it to interleaved vision, language, and action modeling. To achieve both effective context modeling over long sequences and efficient computation for real-time interaction, StreamVLN uses: (1) a **fast-streaming** dialogue context with a sliding-window KV cache; and (2) a **slow-updating** memory via token pruning.
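For intuition, below is a minimal, illustrative sketch of this slow/fast split. It is **not** the StreamVLN implementation: the class name, window size, memory budget, and strided pruning rule are placeholders, and the actual model operates on KV-cache entries with a more selective pruning strategy.

```python
# Illustrative sketch only -- not the StreamVLN code. Names, sizes, and the
# pruning rule below are placeholders meant to show the slow/fast data flow.
from collections import deque

class SlowFastContext:
    def __init__(self, window_size=8, memory_budget=256):
        # Fast stream: context for only the most recent dialogue turns,
        # analogous to a sliding-window KV cache.
        self.fast_window = deque(maxlen=window_size)
        # Slow stream: long-horizon memory kept compact via token pruning.
        self.slow_memory = []
        self.memory_budget = memory_budget

    def add_turn(self, turn_tokens):
        """Register the tokens of a new observation/dialogue turn."""
        if len(self.fast_window) == self.fast_window.maxlen:
            # The oldest turn is about to slide out of the fast window;
            # a pruned version of it is folded into the slow memory.
            oldest = self.fast_window[0]
            self.slow_memory.extend(self._prune(oldest))
            self.slow_memory = self.slow_memory[-self.memory_budget:]
        self.fast_window.append(turn_tokens)

    def _prune(self, tokens, keep_every=4):
        # Placeholder rule: keep every k-th token. The paper's memory update
        # is more selective; this only illustrates "slow-updating via pruning".
        return tokens[::keep_every]

    def current_context(self):
        # Tokens that would be fed to the Video-LLM at the current step.
        recent = [tok for turn in self.fast_window for tok in turn]
        return self.slow_memory + recent
```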
## 📢 News
- **[2025-07-30]** We have released the ScaleVLN training data, including a subset of ~150k episodes converted from the discrete-environment setting to the VLN-CE format. For usage details, see [here](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/README.md#envdrop--scalevln-dataset-note).
- **[2025-07-18]** We fixed a bug where `num_history` was not correctly passed to the model during evaluation, causing it to default to `None`. This had a significant impact on performance. Please make sure to pull the latest code for correct evaluation.
## 🛠 Getting Started
We test under the following environment:
* Python 3.9
* PyTorch 2.1.2
* CUDA 12.4

1. **Prepare a conda env with `Python 3.9` and install habitat-sim and habitat-lab** (an optional sanity check is provided after step 2)
```bash
conda create -n streamvln python=3.9
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab # install habitat_lab
pip install -e habitat-baselines # install habitat_baselines
```

2. **Clone this repository**
```bash
git clone https://github.com/OpenRobotLab/StreamVLN.git
cd StreamVLN
```
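Optionally, you can verify the installation from step 1 with a short Python snippet of our own (not part of this repo); run it inside the `streamvln` env:

```python
# Optional sanity check -- run inside the `streamvln` conda env.
import torch
import habitat_sim
import habitat  # habitat-lab

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("habitat-sim:", getattr(habitat_sim, "__version__", "unknown"))
print("habitat-lab imported from:", habitat.__file__)
```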
## 📁 Data Preparation

To get started, you need to prepare three types of data:
1. **Scene Datasets**
- For **R2R**, **RxR** and **EnvDrop**: Download the MP3D scenes from the [official project page](https://niessner.github.io/Matterport/), and place them under `data/scene_datasets/mp3d/`.
- For **ScaleVLN**: Download the HM3D scenes from the [official GitHub page](https://github.com/matterport/habitat-matterport-3dresearch), and place the `train` split under `data/scene_datasets/hm3d/`.

2. **VLN-CE Episodes**
Download the VLN-CE episodes:
- [r2r](https://drive.google.com/file/d/18DCrNcpxESnps1IbXVjXSbGLDzcSOqzD/view) (Rename `R2R_VLNCE_v1/` -> `r2r/`)
- [rxr](https://drive.google.com/file/d/145xzLjxBaNTbVgBfQ8e9EsBAV8W-SM0t/view) (Rename `RxR_VLNCE_v0/` -> `rxr/`)
- [envdrop](https://drive.google.com/file/d/1fo8F4NKgZDH-bPSdVU3cONAkt5EW-tyr/view) (Rename `R2R_VLNCE_v1-3_preprocessed/envdrop/` -> `envdrop/`)
- [scalevln](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/ScaleVLN/scalevln_subset_150k.json.gz) (This is a subset of the ScaleVLN dataset, converted to the VLN-CE format. For the original dataset, please refer to the [official repository](https://github.com/wz0919/ScaleVLN).)
Extract them into the `data/datasets/` directory.

3. **Collected Trajectory Data**
We provide pre-collected observation-action trajectory data for training. These trajectories were collected using the **training episodes** from **R2R** and **RxR** under the Matterport3D environment. For the **EnvDrop** and **ScaleVLN** subset, please refer to [here](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/README.md) for instructions on how to collect it yourself.
Download the observation-action trajectory data from [Hugging Face](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data), and extract it to `data/trajectory_data/`.

Your final folder structure should look like this:
```bash
data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   ├── envdrop/
│   │   ├── envdrop.json.gz
│   │   └── ...
│   └── scalevln/
│       └── scalevln_subset_150k.json.gz
├── scene_datasets/
│   ├── hm3d/
│   │   ├── 00000-kfPV7w3FaU5/
│   │   ├── 00001-UVdNNRcVyV1/
│   │   └── ...
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
└── trajectory_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    ├── RxR/
    │   ├── images/
    │   └── annotations.json
    ├── EnvDrop/
    │   ├── images/
    │   └── annotations.json
    └── ScaleVLN/
        ├── images/
        └── annotations.json
```
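Optionally, a small script like the one below (ours, not part of the repo) can verify that the key paths are in place before training or evaluation; adjust the list if you only prepared a subset of the datasets:

```python
# Optional layout check -- the paths mirror the tree above.
from pathlib import Path

expected = [
    "data/datasets/r2r/val_unseen/val_unseen.json.gz",
    "data/datasets/rxr/val_unseen/val_unseen_guide.json.gz",
    "data/datasets/scalevln/scalevln_subset_150k.json.gz",
    "data/scene_datasets/mp3d",
    "data/scene_datasets/hm3d",
    "data/trajectory_data/R2R/annotations.json",
    "data/trajectory_data/RxR/annotations.json",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing:\n  " + "\n  ".join(missing))
else:
    print("Data layout looks complete.")
```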
## 🏆 Model Zoo
We provide two model checkpoints for different use cases (an optional download snippet follows the list):
- **Benchmark Reproduction**
Use this [checkpoint](https://huggingface.co/mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln) to reproduce results on the VLN-CE benchmark.

- **Real-World Deployment**
This [checkpoint](https://huggingface.co/mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln_real_world) is recommended for deployment on physical robots. We made two modifications:
1. **Remove redundant initial spinning actions**: The initial left/right turns not mentioned in the instructions are removed for better instruction alignment.
2. **Trajectory safety**: Enhanced obstacle avoidance ensures more reliable navigation in real-world environments.
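As noted above, either checkpoint can be pre-downloaded with `huggingface_hub` (an optional convenience, assuming `pip install huggingface_hub`; the local directory below is only an example, and the evaluation scripts in this repo show how a checkpoint is actually loaded):

```python
# Optional: pre-download a checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln",
    local_dir="checkpoints/streamvln_benchmark",  # example path, not required by the repo
)
print("Checkpoint downloaded to:", ckpt_dir)
```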
## 🚀 Training

To perform **multi-node multi-GPU training** with a distributed setup, run:
```bash
sbatch scripts/streamvln_train_slurm.sh
```

## 🤖 Evaluation
To perform multi-GPU evaluation with key-value cache support, simply run:
```bash
sh scripts/streamvln_eval_multi_gpu.sh
```

## 📝 TODO List
- ✅ Release the arXiv paper (Jul. 8, 2025)
- ✅ Provide inference scripts and model checkpoints
- ✅ Release training code and configurations
- ✅ Release training data
- ⏳ Support co-training with LLaVA-Video-178K, ScanQA, MMC4
- ⏳ DAgger data collection

## 🙋‍♂️ Questions or Issues
If you encounter any problems or have questions about StreamVLN, please feel free to [open an issue](https://github.com/OpenRobotLab/StreamVLN/issues).
## 🔗 Citation
If you find our work helpful, please consider starring this repo 🌟 and citing:
```bibtex
@article{wei2025streamvln,
title={StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling},
author={Wei, Meng and Wan, Chenyang and Yu, Xiqian and Wang, Tai and Yang, Yuqiang and Mao, Xiaohan and Zhu, Chenming and Cai, Wenzhe and Wang, Hanqing and Chen, Yilun and others},
journal={arXiv preprint arXiv:2507.05240},
year={2025}
}
```

## 📄 License
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## 👏 Acknowledgements
This repo is based on [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).