https://github.com/InternRobotics/StreamVLN
Official implementation of the paper: "StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling"
- Host: GitHub
- URL: https://github.com/InternRobotics/StreamVLN
- Owner: InternRobotics
- Created: 2025-07-07T09:06:25.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-08-09T10:12:42.000Z (2 months ago)
- Last Synced: 2025-08-09T12:14:13.673Z (2 months ago)
- Language: Python
- Homepage: https://streamvln.github.io/
- Size: 12.9 MB
- Stars: 176
- Watchers: 6
- Forks: 7
- Open Issues: 10
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-and-novel-works-in-slam
README
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei*
Chenyang Wan*
Xiqian Yu*
Tai Wang*‡
Yuqiang Yang
Xiaohan Mao
Chenming Zhu
Wenzhe Cai
Hanqing Wang
Yilun Chen
Xihui Liu†
Jiangmiao Pang†
Shanghai AI Laboratory · The University of Hong Kong · Zhejiang University · Shanghai Jiao Tong University
[Paper](http://arxiv.org/abs/2507.05240)
[Project Page](https://streamvln.github.io/)
[Hugging Face](https://huggingface.co/papers/2507.05240/)
[Video](https://www.youtube.com/watch?v=gG3mpefOBjc)

## 🏠 About
StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on **LLaVA-Video** as the foundational Video-LLM, we extend it to interleaved vision, language, and action modeling. To achieve both effective context modeling over long sequences and efficient computation for real-time interaction, StreamVLN uses: (1) a **fast-streaming** dialogue context with a sliding-window KV cache; and (2) a **slow-updating** memory via token pruning.
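For intuition, below is a minimal, illustrative sketch of this slow/fast split. It is **not** the StreamVLN implementation: the class name, window size, memory budget, and strided pruning rule are placeholders, and the actual model operates on KV-cache entries with a more selective pruning strategy.

```python
# Illustrative sketch only -- not the StreamVLN code. Names, sizes, and the
# pruning rule below are placeholders meant to show the slow/fast data flow.
from collections import deque

class SlowFastContext:
    def __init__(self, window_size=8, memory_budget=256):
        # Fast stream: context for only the most recent dialogue turns,
        # analogous to a sliding-window KV cache.
        self.fast_window = deque(maxlen=window_size)
        # Slow stream: long-horizon memory kept compact via token pruning.
        self.slow_memory = []
        self.memory_budget = memory_budget

    def add_turn(self, turn_tokens):
        """Register the tokens of a new observation/dialogue turn."""
        if len(self.fast_window) == self.fast_window.maxlen:
            # The oldest turn is about to slide out of the fast window;
            # a pruned version of it is folded into the slow memory.
            oldest = self.fast_window[0]
            self.slow_memory.extend(self._prune(oldest))
            self.slow_memory = self.slow_memory[-self.memory_budget:]
        self.fast_window.append(turn_tokens)

    def _prune(self, tokens, keep_every=4):
        # Placeholder rule: keep every k-th token. The paper's memory update
        # is more selective; this only illustrates "slow-updating via pruning".
        return tokens[::keep_every]

    def current_context(self):
        # Tokens that would be fed to the Video-LLM at the current step.
        recent = [tok for turn in self.fast_window for tok in turn]
        return self.slow_memory + recent
```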
## 📢 News
- **[2025-07-30]** We have released the ScaleVLN training data, including a subset of ~150k episodes converted from the discrete-environment setting to the VLN-CE format. For usage details, see [here](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/README.md#envdrop--scalevln-dataset-note).
- **[2025-07-18]** We fixed a bug where `num_history` was not correctly passed to the model during evaluation, causing it to default to `None`. This had a significant impact on performance. Please make sure to pull the latest code for correct evaluation.
## 🛠 Getting Started
We test under the following environment:
* Python 3.9
* PyTorch 2.1.2
* CUDA 12.4

1. **Prepare a conda env with `Python 3.9` and install habitat-sim and habitat-lab** (an optional sanity check is provided after step 2)
```bash
conda create -n streamvln python=3.9
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab # install habitat_lab
pip install -e habitat-baselines # install habitat_baselines
```

2. **Clone this repository**
```bash
git clone https://github.com/OpenRobotLab/StreamVLN.git
cd StreamVLN
```
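Optionally, you can verify the installation from step 1 with a short Python snippet of our own (not part of this repo); run it inside the `streamvln` env:

```python
# Optional sanity check -- run inside the `streamvln` conda env.
import torch
import habitat_sim
import habitat  # habitat-lab

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("habitat-sim:", getattr(habitat_sim, "__version__", "unknown"))
print("habitat-lab imported from:", habitat.__file__)
```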
## 📁 Data Preparation

To get started, you need to prepare three types of data:
1. **Scene Datasets**
- For **R2R**, **RxR** and **EnvDrop**: Download the MP3D scenes from the [official project page](https://niessner.github.io/Matterport/), and place them under `data/scene_datasets/mp3d/`.
- For **ScaleVLN**: Download the HM3D scenes from the [official GitHub page](https://github.com/matterport/habitat-matterport-3dresearch), and place the `train` split under `data/scene_datasets/hm3d/`.

2. **VLN-CE Episodes**
Download the VLN-CE episodes:
- [r2r](https://drive.google.com/file/d/18DCrNcpxESnps1IbXVjXSbGLDzcSOqzD/view) (Rename `R2R_VLNCE_v1/` -> `r2r/`)
- [rxr](https://drive.google.com/file/d/145xzLjxBaNTbVgBfQ8e9EsBAV8W-SM0t/view) (Rename `RxR_VLNCE_v0/` -> `rxr/`)
- [envdrop](https://drive.google.com/file/d/1fo8F4NKgZDH-bPSdVU3cONAkt5EW-tyr/view) (Rename `R2R_VLNCE_v1-3_preprocessed/envdrop/` -> `envdrop/`)
- [scalevln](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/ScaleVLN/scalevln_subset_150k.json.gz) (This is a subset of the ScaleVLN dataset, converted to the VLN-CE format. For the original dataset, please refer to the [official repository](https://github.com/wz0919/ScaleVLN).)
Extract them into the `data/datasets/` directory.

3. **Collected Trajectory Data**
We provide pre-collected observation-action trajectory data for training. These trajectories were collected using the **training episodes** from **R2R** and **RxR** under the Matterport3D environment. For the **EnvDrop** and **ScaleVLN** subset, please refer to [here](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data/blob/main/README.md) for instructions on how to collect it yourself.
Download the observation-action trajectory data from [Hugging Face](https://huggingface.co/datasets/cywan/StreamVLN-Trajectory-Data), and extract it to `data/trajectory_data/`.

Your final folder structure should look like this:
```bash
data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   ├── envdrop/
│   │   ├── envdrop.json.gz
│   │   └── ...
│   └── scalevln/
│       └── scalevln_subset_150k.json.gz
├── scene_datasets/
│   ├── hm3d/
│   │   ├── 00000-kfPV7w3FaU5/
│   │   ├── 00001-UVdNNRcVyV1/
│   │   └── ...
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
└── trajectory_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    ├── RxR/
    │   ├── images/
    │   └── annotations.json
    ├── EnvDrop/
    │   ├── images/
    │   └── annotations.json
    └── ScaleVLN/
        ├── images/
        └── annotations.json
```
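Optionally, a small script like the one below (ours, not part of the repo) can verify that the key paths are in place before training or evaluation; adjust the list if you only prepared a subset of the datasets:

```python
# Optional layout check -- the paths mirror the tree above.
from pathlib import Path

expected = [
    "data/datasets/r2r/val_unseen/val_unseen.json.gz",
    "data/datasets/rxr/val_unseen/val_unseen_guide.json.gz",
    "data/datasets/scalevln/scalevln_subset_150k.json.gz",
    "data/scene_datasets/mp3d",
    "data/scene_datasets/hm3d",
    "data/trajectory_data/R2R/annotations.json",
    "data/trajectory_data/RxR/annotations.json",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing:\n  " + "\n  ".join(missing))
else:
    print("Data layout looks complete.")
```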
## 🏆 Model Zoo
We provide two model checkpoints for different use cases (an optional download snippet follows the list):
- **Benchmark Reproduction**
Use this [checkpoint](https://huggingface.co/mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln) to reproduce results on the VLN-CE benchmark.

- **Real-World Deployment**
This [checkpoint](https://huggingface.co/mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln_real_world) is recommended for deployment on physical robots. We made two modifications:
1. **Remove redundant initial spinning actions**: The initial left/right turns not mentioned in the instructions are removed for better instruction alignment.
2. **Trajectory safety**: Enhanced obstacle avoidance ensures more reliable navigation in real-world environments.
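As noted above, either checkpoint can be pre-downloaded with `huggingface_hub` (an optional convenience, assuming `pip install huggingface_hub`; the local directory below is only an example, and the evaluation scripts in this repo show how a checkpoint is actually loaded):

```python
# Optional: pre-download a checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln",
    local_dir="checkpoints/streamvln_benchmark",  # example path, not required by the repo
)
print("Checkpoint downloaded to:", ckpt_dir)
```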
## 🚀 Training

To perform **multi-node multi-GPU training** with a distributed setup, run:
```bash
sbatch scripts/streamvln_train_slurm.sh
```

## 🤖 Evaluation
To perform multi-GPU evaluation with key-value cache support, simply run:
```bash
sh scripts/streamvln_eval_multi_gpu.sh
```

## 📝 TODO List
- ✅ Release the arXiv paper (Jul. 8, 2025)
- ✅ Provide inference scripts and model checkpoints
- ✅ Release training code and configurations
- ✅ Release training data
- ⏳ Support co-training with LLaVA-Video-178K, ScanQA, MMC4
- ⏳ DAgger data collection

## 🙋‍♂️ Questions or Issues
If you encounter any problems or have questions about StreamVLN, please feel free to [open an issue](https://github.com/OpenRobotLab/StreamVLN/issues).
## 🔗 Citation
If you find our work helpful, please consider starring this repo 🌟 and citing:
```bibtex
@article{wei2025streamvln,
title={StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling},
author={Wei, Meng and Wan, Chenyang and Yu, Xiqian and Wang, Tai and Yang, Yuqiang and Mao, Xiaohan and Zhu, Chenming and Cai, Wenzhe and Wang, Hanqing and Chen, Yilun and others},
journal={arXiv preprint arXiv:2507.05240},
year={2025}
}
```

## 📄 License
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## 👏 Acknowledgements
This repo is based on [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).