https://github.com/ZJU-REAL/ViewSpatial-Bench
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
- Host: GitHub
- URL: https://github.com/ZJU-REAL/ViewSpatial-Bench
- Owner: ZJU-REAL
- License: apache-2.0
- Created: 2025-05-23T15:05:02.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2026-03-09T09:17:26.000Z (about 2 months ago)
- Last Synced: 2026-03-09T13:57:58.574Z (about 2 months ago)
- Language: Python
- Homepage:
- Size: 3.11 MB
- Stars: 71
- Watchers: 0
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - ZJU-REAL/ViewSpatial-Bench - ViewSpatial-Bench is a benchmark for evaluating the multi-perspective spatial localization capabilities of vision-language models, aimed at the problem that existing models understand spatial relationships poorly across different viewpoints. The project builds a dataset of multi-view scenes and designs multi-dimensional evaluation metrics covering tasks such as relative-position prediction, viewpoint consistency, and multimodal alignment, comprehensively measuring a model's localization accuracy and robustness in complex spatial scenes. Its core working principle is based on view transformation: images of the same scene are generated from different viewpoints together with corresponding text descriptions, and the model is required to keep spatial relationships logically consistent across viewpoints, for example judging whether the viewpoint difference between "the camera is to the left of the table" and "the camera is to the right of the table" is reasonable. The project also introduces a viewpoint-sensitivity evaluation module that quantifies the stability of a model's spatial perception by measuring its robustness to viewpoint changes. In addition, ViewSpatial-Bench provides visualization and analysis tools supporting fine-grained analysis of localization errors, including spatial-relationship confusion matrices and viewpoint-bias distributions. The benchmark already integrates baseline results for a range of mainstream vision-language models and can serve as a standard test platform for researchers to verify models' spatial understanding, particularly for applications requiring cross-view reasoning such as indoor navigation and robot vision. (Object detection & segmentation / resource download)
README
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Our work presents a range of spatial localization tasks requiring reasoning from both camera-centric and human-centric perspectives, revealing the challenges vision-language models (VLMs) face in multi-viewpoint spatial understanding. Current VLMs are predominantly trained on image-text pairs from the web that lack explicit 3D spatial annotations, limiting their cross-perspective spatial reasoning capabilities.
## 📖ViewSpatial-Bench
To address this gap, we introduce **ViewSpatial-Bench**, a comprehensive benchmark with over 5,700 question-answer pairs across 1,000+ 3D scenes from ScanNet and MS-COCO validation sets. This benchmark evaluates VLMs' spatial localization capabilities from multiple perspectives, specifically testing both egocentric (camera) and allocentric (human subject) viewpoints across five distinct task types. The figure below shows the construction pipeline and example demonstrations of our benchmark.
*(Figure: ViewSpatial-Bench construction pipeline and example demonstrations.)*
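To make the two viewpoint types concrete, the hypothetical item pair below illustrates how the same scene can yield opposite answers under camera-centric and person-centric questions. The field names and wording are illustrative assumptions, not the actual dataset schema.
```py
# Hypothetical illustration of the two viewpoint types evaluated by ViewSpatial-Bench;
# field names and wording are assumptions, not the real dataset schema.
camera_perspective_item = {
    "question": "From the camera's viewpoint, is the chair to the left or to the right of the table?",
    "choices": ["left", "right", "front", "back"],
    "answer": "left",
}
person_perspective_item = {
    "question": "From the standing person's viewpoint, is the chair to their left or their right of the table?",
    "choices": ["left", "right", "front", "back"],
    "answer": "right",  # the same scene can flip the relation under a different viewpoint
}
```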
## 🤖Multi-View Spatial Model
We present Multi-View Spatial Model (MVSM), developed to address limitations in perspective-dependent spatial reasoning in vision-language models. Following the ViewSpatial-Bench pipeline, we constructed a training dataset of ~43K diverse spatial relationship samples across five task categories, utilizing automated spatial annotations from ScanNet and MS-COCO data, supplemented with Spatial-MM for person-perspective tasks. Using consistent language templates and standardized directional classifications, we implemented a Multi-Perspective Fine-Tuning strategy on Qwen2.5-VL (3B) to enhance reasoning across different observational viewpoints. This approach enables MVSM to develop unified 3D spatial relationship representations that robustly support both camera and human perspective reasoning.
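Since MVSM is fine-tuned from Qwen2.5-VL (3B), inference can follow the standard Qwen2.5-VL pipeline in `transformers`. The sketch below is a minimal example of that pipeline; the model path is a placeholder and should be swapped for the actual MVSM checkpoint if and when it is released, and the image path and question are illustrative.
```py
# Minimal inference sketch for a Qwen2.5-VL-based checkpoint such as MVSM.
# The model path is a placeholder; substitute the released MVSM weights if available.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"  # placeholder base model, not the MVSM checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "scene.jpg"},  # hypothetical local image
        {"type": "text", "text": "From the person's perspective, is the chair to their left or their right?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```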
## 👁️🗨️Results

Accuracy comparison across multiple VLMs on camera and human perspective spatial tasks. Our Multi-View Spatial Model (MVSM) significantly outperforms all baseline models across all task categories, demonstrating the effectiveness of our multi-perspective spatial fine-tuning approach. These results reveal fundamental limitations in perspective-based spatial reasoning capabilities among current VLMs. Even powerful proprietary models like GPT-4o (34.98%) and Gemini-2.0-Flash (32.56%) perform only marginally above random chance (26.33%), confirming our hypothesis that standard VLMs struggle with perspective-dependent spatial reasoning despite their strong performance on other vision-language tasks.
## ⚒️QuickStart
```plaintext
ViewSpatial-Bench
├── data_process # Script code for processing raw datasets to obtain metadata
├── eval # Used to store the raw dataset of ViewSpatial-Bench
├── ViewSpatial-Bench # Used to store the source images in ViewSpatial-Bench (can be downloaded from Huggingface)
├── README.md
├── evaluate.py # Script code for evaluating multiple VLMs on ViewSpatial-Bench
└── requirements.txt # Dependencies for evaluation
```
**Note**: [COCO dataset](https://cocodataset.org/) processing in `data_process` uses the original dataset's annotation files (download them from the official source). Head-orientation calculations use [Orient Anything](https://github.com/SpatialVision/Orient-Anything)'s open-source code and model; place `head2body_orientation_data.py` in its root directory to run it.
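As a rough illustration of how a continuous head-orientation angle could be mapped onto coarse directional classes for person-perspective annotations, a hypothetical binning helper might look like the following. The bin boundaries and labels are assumptions, not the scheme implemented in `data_process`.
```py
# Hypothetical helper: map a relative yaw angle (degrees) to a coarse directional class.
# Bin boundaries and labels are illustrative assumptions, not the repository's actual scheme.
def yaw_to_direction(yaw_deg: float) -> str:
    yaw = yaw_deg % 360.0
    if yaw < 45 or yaw >= 315:
        return "front"
    if yaw < 135:
        return "right"
    if yaw < 225:
        return "back"
    return "left"

print(yaw_to_direction(100))   # right
print(yaw_to_direction(-30))   # front
```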
## 👀Evaluation on Your Own Model
**I. Using EASI (Third-Party Evaluation)**
ViewSpatial-Bench is officially supported by **EASI (Holistic Evaluation of Spatial Intelligence)**. This allows you to compare your model's performance on a broader leaderboard.🎉🎉🎉
- **GitHub**: [EvolvingLMMs-Lab/EASI](https://github.com/EvolvingLMMs-Lab/EASI)
- **Leaderboard**: [EASI Hugging Face Space](https://huggingface.co/spaces/lmms-lab-si/EASI-Leaderboard)
- **Paper**: [Holistic Evaluation of Multimodal LLMs on Spatial Intelligence](https://arxiv.org/abs/2508.13142)
> **A Note of Appreciation:** We would like to express our sincere gratitude to the **EASI team** for including ViewSpatial-Bench as a supported benchmark. We share a common vision that Spatial Intelligence is a pivotal frontier for multimodal foundation models, and we are honored to collaborate in advancing research in this field.
**II. With the Hugging Face `datasets` library.**
```py
# NOTE: pip install datasets
from datasets import load_dataset
ds = load_dataset("lidingm/ViewSpatial-Bench")
```
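Once loaded, you can inspect a few items directly. The split and field handling below are assumptions for illustration and may differ from the actual dataset schema.
```py
# Sketch: peek at a few samples after load_dataset (split/field names are assumptions).
from datasets import load_dataset

ds = load_dataset("lidingm/ViewSpatial-Bench")
split = list(ds.keys())[0]  # use whichever split the dataset exposes
for sample in ds[split].select(range(3)):
    # print everything except any raw image field, for readability
    print({k: v for k, v in sample.items() if k != "image"})
```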
**III. Evaluation using Open-Source Code.**
Evaluate using our open-source evaluation code, available on GitHub (coming soon).
```bash
# Clone the repository
git clone https://github.com/ZJU-REAL/ViewSpatial-Bench.git
cd ViewSpatial-Bench
# Install dependencies
pip install -r requirements.txt
# Run evaluation
python evaluate.py --model_path your_model_path
```
You can configure the model parameters and evaluation settings required by your framework to obtain performance results on the ViewSpatial-Bench dataset.
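If you prefer to score saved predictions yourself instead of running `evaluate.py` end-to-end, a minimal accuracy computation might look like the sketch below. The file layout and keys (`category`, `answer`, `prediction`) are assumptions, not the format produced by `evaluate.py`.
```py
# Minimal sketch: overall and per-category accuracy from a JSON file of predictions.
# The file layout and key names are assumptions, not the output format of evaluate.py.
import json
from collections import defaultdict

with open("predictions.json") as f:  # hypothetical predictions file
    records = json.load(f)

totals, correct = defaultdict(int), defaultdict(int)
for r in records:
    cat = r.get("category", "all")
    totals[cat] += 1
    if r["prediction"].strip().lower() == r["answer"].strip().lower():
        correct[cat] += 1

for cat in sorted(totals):
    print(f"{cat}: {correct[cat] / totals[cat]:.2%} ({correct[cat]}/{totals[cat]})")
print(f"overall: {sum(correct.values()) / sum(totals.values()):.2%}")
```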
## Acknowledgement
We thank the creators of the [ScanNet](https://github.com/ScanNet/ScanNet) and [MS-COCO](https://cocodataset.org/) datasets for their open-source contributions, which provided the foundational 3D scene data and visual content for our spatial annotation pipeline. We also acknowledge the developers of the [Orient Anything](https://github.com/SpatialVision/Orient-Anything) model for their valuable open-source work that supported our annotation framework development. Special thanks to the [EASI](https://github.com/EvolvingLMMs-Lab/EASI) team for their support in integrating ViewSpatial-Bench and for our shared commitment to advancing spatial intelligence research.
## Citation
```
@misc{li2025viewspatialbenchevaluatingmultiperspectivespatial,
title={ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models},
author={Dingming Li and Hongxing Li and Zixuan Wang and Yuchen Yan and Hang Zhang and Siqi Chen and Guiyang Hou and Shengpei Jiang and Wenqi Zhang and Yongliang Shen and Weiming Lu and Yueting Zhuang},
year={2025},
eprint={2505.21500},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.21500},
}
```