{"id":46189555,"url":"https://github.com/ZJU-REAL/ViewSpatial-Bench","last_synced_at":"2026-03-16T17:00:55.310Z","repository":{"id":295098529,"uuid":"989105890","full_name":"ZJU-REAL/ViewSpatial-Bench","owner":"ZJU-REAL","description":"ViewSpatial-Bench:Evaluating Multi-perspective Spatial Localization in Vision-Language Models","archived":false,"fork":false,"pushed_at":"2026-03-09T09:17:26.000Z","size":3266,"stargazers_count":71,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-09T13:57:58.574Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZJU-REAL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-23T15:05:02.000Z","updated_at":"2026-03-09T09:17:31.000Z","dependencies_parsed_at":"2025-05-23T16:55:04.670Z","dependency_job_id":"a7b34c4f-e698-4cc2-a9ff-ba11644cf72d","html_url":"https://github.com/ZJU-REAL/ViewSpatial-Bench","commit_stats":null,"previous_names":["zju-real/viewspatial-bench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ZJU-REAL/ViewSpatial-Bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJU-REAL%2FViewSpatial-Bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJU-REAL%2FViewSpatial-Bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJU-REAL%2FViewSpatial-Bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJU-REAL%2FViewSpatial-Bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZJU-REAL","download_url":"https://codeload.github.com/ZJU-REAL/ViewSpatial-Bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJU-REAL%2FViewSpatial-Bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30417673,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T06:40:58.731Z","status":"ssl_error","status_checked_at":"2026-03-12T06:40:40.296Z","response_time":114,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-03-03T00:00:52.741Z","updated_at":"2026-03-16T17:00:55.304Z","avatar_url":"https://github.com/ZJU-REAL.png","language":"Python","readme":"\u003ch1\u003e\u003cimg src=\"docs/icon/avatar.png\" 
<div align="center">
    <a href="https://arxiv.org/abs/2505.21500" target="_blank">
        <img alt="arXiv" src="https://img.shields.io/badge/arXiv-ViewSpatial_Bench-red?logo=arxiv" height="20" />
    </a>
    <a href="https://huggingface.co/datasets/lidingm/ViewSpatial-Bench" target="_blank">
        <img alt="ViewSpatial_Bench" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Benchmark-ViewSpatial_Bench-ffc107?color=ffc107&logoColor=white" height="20" />
    </a>
    <a href="https://zju-real.github.io/ViewSpatial-Page/" target="_blank">
        <img alt="Webpage" src="https://img.shields.io/badge/%F0%9F%8C%8E_Website-ViewSpatial_Bench-green.svg" height="20" />
    </a>
</div>

<img src="docs/flat_patternmaking.png" width="100%"/>
Our work presents a range of spatial localization tasks that require reasoning from both camera-centric and human-centric perspectives, revealing the challenges vision-language models (VLMs) face in multi-viewpoint spatial understanding. Current VLMs are predominantly trained on web image-text pairs that lack explicit 3D spatial annotations, which limits their cross-perspective spatial reasoning capabilities.

## 📖ViewSpatial-Bench

To address this gap, we introduce **ViewSpatial-Bench**, a comprehensive benchmark with over 5,700 question-answer pairs across 1,000+ 3D scenes from the ScanNet and MS-COCO validation sets. The benchmark evaluates VLMs' spatial localization capabilities from multiple perspectives, testing both egocentric (camera) and allocentric (human subject) viewpoints across five distinct task types. The figure below shows the construction pipeline and example demonstrations of our benchmark.

<img src="docs/pipeline_and_case.png" width="100%"/>

## 🤖Multi-View Spatial Model

We present the Multi-View Spatial Model (MVSM), developed to address limitations in perspective-dependent spatial reasoning in vision-language models. Following the ViewSpatial-Bench pipeline, we constructed a training dataset of ~43K diverse spatial-relationship samples across the five task categories, using automated spatial annotations from ScanNet and MS-COCO data, supplemented with Spatial-MM for person-perspective tasks. With consistent language templates and standardized directional classifications, we applied a Multi-Perspective Fine-Tuning strategy to Qwen2.5-VL (3B) to enhance reasoning across different observational viewpoints. This enables MVSM to develop unified 3D spatial relationship representations that robustly support both camera- and human-perspective reasoning.
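For reference, here is a minimal inference sketch for a Qwen2.5-VL-style checkpoint such as an MVSM fine-tune, using the Hugging Face `transformers` and `qwen-vl-utils` packages. The checkpoint path, example image, and question below are placeholders, not part of an official release; adapt them to your own weights and prompts.

```py
# Minimal inference sketch (placeholder checkpoint path and prompt; not an official MVSM script).
# Assumed requirements: pip install "transformers>=4.49" qwen-vl-utils torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_PATH = "path/to/your_mvsm_checkpoint"  # placeholder: your fine-tuned Qwen2.5-VL weights

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# A perspective-dependent question in the style of the benchmark (illustrative only).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "docs/pipeline_and_case.png"},
        {"type": "text", "text": "From the perspective of the person in the image, "
                                 "is the chair to their left, right, front, or back?"},
    ],
}]

# Build the chat prompt and collect the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```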
## 👁️‍🗨️Results

<img src="docs/main_result.png" width="100%"/>

Accuracy comparison across multiple VLMs on camera- and human-perspective spatial tasks. Our Multi-View Spatial Model (MVSM) significantly outperforms all baseline models across all task categories, demonstrating the effectiveness of our multi-perspective spatial fine-tuning approach. The results also reveal fundamental limitations in perspective-based spatial reasoning among current VLMs: even powerful proprietary models such as GPT-4o (34.98%) and Gemini-2.0-Flash (32.56%) perform only marginally above random chance (26.33%), confirming our hypothesis that standard VLMs struggle with perspective-dependent spatial reasoning despite their strong performance on other vision-language tasks.

## ⚒️QuickStart

```plaintext
ViewSpatial-Bench
├── data_process        # Scripts for processing the raw datasets to obtain metadata
├── eval                # Stores the raw ViewSpatial-Bench dataset
├── ViewSpatial-Bench   # Stores the source images of ViewSpatial-Bench (can be downloaded from Hugging Face)
├── README.md
├── evaluate.py         # Script for evaluating multiple VLMs on ViewSpatial-Bench
└── requirements.txt    # Dependencies for evaluation
```

**Note**: [COCO dataset](https://cocodataset.org/) processing in `data_process` uses the original dataset's annotation files (download them from the official source). Head-orientation calculations use the open-source code and model of [Orient Anything](https://github.com/SpatialVision/Orient-Anything); place `head2body_orientation_data.py` in its root directory to run.

## 👀Evaluation on Your Own Model

**I. Using EASI (Third-Party Evaluation)**

ViewSpatial-Bench is officially supported by **EASI (Holistic Evaluation of Spatial Intelligence)**, which lets you compare your model's performance on a broader leaderboard. 🎉🎉🎉

- **GitHub**: [EvolvingLMMs-Lab/EASI](https://github.com/EvolvingLMMs-Lab/EASI)
- **Leaderboard**: [EASI Hugging Face Space](https://huggingface.co/spaces/lmms-lab-si/EASI-Leaderboard)
- **Paper**: [Holistic Evaluation of Multimodal LLMs on Spatial Intelligence](https://arxiv.org/abs/2508.13142)

> **A Note of Appreciation:** We would like to express our sincere gratitude to the **EASI team** for including ViewSpatial-Bench as a supported benchmark. We share a common vision that spatial intelligence is a pivotal frontier for multimodal foundation models, and we are honored to collaborate in advancing research in this field.

**II. With the Hugging Face `datasets` library**

```py
# NOTE: pip install datasets

from datasets import load_dataset
ds = load_dataset("lidingm/ViewSpatial-Bench")
```

**III. Evaluation Using Open-Source Code**

Evaluate with our open-source evaluation code, available on GitHub (coming soon).

```bash
# Clone the repository
git clone https://github.com/ZJU-REAL/ViewSpatial-Bench.git
cd ViewSpatial-Bench

# Install dependencies
pip install -r requirements.txt

# Run evaluation
python evaluate.py --model_path your_model_path
```

You can configure the model parameters and evaluation settings according to the framework's requirements to obtain performance results on the ViewSpatial-Bench dataset.
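If you want to script a quick check yourself before the official evaluation code is released, the sketch below shows what a simple scoring loop over the Hugging Face release could look like. The column names (`image`, `question`, `answer`) and the `my_model_answer` helper are illustrative assumptions, not the actual schema or API; inspect the loaded dataset and adapt accordingly.

```py
# Rough scoring sketch. Column names ("image", "question", "answer") are assumptions;
# the real schema may differ, so inspect the dataset first. `my_model_answer` is a
# placeholder for your own VLM inference call.
from datasets import load_dataset

ds = load_dataset("lidingm/ViewSpatial-Bench")
print(ds)  # check the available splits and columns before scoring

def my_model_answer(image, question):
    """Placeholder: run your VLM on (image, question) and return its text answer."""
    raise NotImplementedError

split = next(iter(ds.values()))  # use the first available split
correct = 0
for sample in split:
    prediction = my_model_answer(sample["image"], sample["question"])
    if prediction.strip().lower() == str(sample["answer"]).strip().lower():
        correct += 1

print(f"Accuracy: {correct / len(split):.2%}")
```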
## Acknowledgement

We thank the creators of the [ScanNet](https://github.com/ScanNet/ScanNet) and [MS-COCO](https://cocodataset.org/) datasets for their open-source contributions, which provided the foundational 3D scene data and visual content for our spatial annotation pipeline. We also acknowledge the developers of the [Orient Anything](https://github.com/SpatialVision/Orient-Anything) model for their valuable open-source work that supported our annotation framework development. Special thanks to the [EASI](https://github.com/EvolvingLMMs-Lab/EASI) team for their support in integrating ViewSpatial-Bench and for our shared commitment to advancing spatial intelligence research.

## Citation

```
@misc{li2025viewspatialbenchevaluatingmultiperspectivespatial,
      title={ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models},
      author={Dingming Li and Hongxing Li and Zixuan Wang and Yuchen Yan and Hang Zhang and Siqi Chen and Guiyang Hou and Shengpei Jiang and Wenqi Zhang and Yongliang Shen and Weiming Lu and Yueting Zhuang},
      year={2025},
      eprint={2505.21500},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.21500}
}
```