[NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs
- Host: GitHub
- URL: https://github.com/jiayuww/spatialeval
- Owner: jiayuww
- Created: 2024-10-23T03:19:56.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-01-23T02:51:19.000Z (9 months ago)
- Last Synced: 2025-04-05T05:03:56.802Z (6 months ago)
- Topics: claude, foundation-models, gemini, gpt-4o, gpt-4v, large-language-models, llama3, machine-learning, multimodal-deep-learning, reasoning, spatial-reasoning, vision-language-models
- Language: Python
- Homepage:
- Size: 3.95 MB
- Stars: 23
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Metadata Files:
- Readme: README.md
# SpatialEval
Welcome to the official codebase for [Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models](https://arxiv.org/abs/2406.14852).
## Quick Links
[Project Page](https://spatialeval.github.io/)
[Paper](https://arxiv.org/pdf/2406.14852)
[Dataset](https://huggingface.co/datasets/MilaWang/SpatialEval)
[NeurIPS 2024 Poster](https://neurips.cc/virtual/2024/poster/94371)

## News
* **[2024.09.25]** SpatialEval has been accepted to **NeurIPS 2024**!
* **[2024.09.16]** SpatialEval has been included in [Eureka](https://www.microsoft.com/en-us/research/publication/eureka-evaluating-and-understanding-large-foundation-models/) from **Microsoft Research**!
* **[2024.06.21]** SpatialEval is now publicly available on [arXiv](https://arxiv.org/abs/2406.14852).
## About SpatialEval
SpatialEval is a comprehensive benchmark for evaluating spatial intelligence in LLMs and VLMs across four key dimensions:
- Spatial relationships
- Positional understanding
- Object counting
- Navigation

### Benchmark Tasks
1. **Spatial-Map**: Understanding spatial relationships between objects in map-based scenarios
2. **Maze-Nav**: Testing navigation through complex environments
3. **Spatial-Grid**: Evaluating spatial reasoning within structured environments
4. **Spatial-Real**: Assessing real-world spatial understanding

Each task supports three input modalities:
- Text-only (TQA)
- Vision-only (VQA)
- Vision-Text (VTQA)
## Quick Start
### Load Dataset
SpatialEval provides three input modalities, TQA (Text-only), VQA (Vision-only), and VTQA (Vision-text), across four tasks: Spatial-Map, Maze-Nav, Spatial-Grid, and Spatial-Real. Each modality and task is easily accessible via Hugging Face. Ensure you have installed the [packages](https://huggingface.co/docs/datasets/en/quickstart):
```python
from datasets import load_dataset

tqa = load_dataset("MilaWang/SpatialEval", "tqa", split="test")
vqa = load_dataset("MilaWang/SpatialEval", "vqa", split="test")
vtqa = load_dataset("MilaWang/SpatialEval", "vtqa", split="test")
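
# Optional sanity check: each configuration loads as a standard `datasets.Dataset`.
print(len(tqa), tqa.column_names)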
```

### Evaluate SpatialEval
SpatialEval supports any evaluation pipeline compatible with language models and vision-language models. For text-based prompts, use the `text` column with this structure:
`{text} First, provide a concise answer in one sentence. Then, elaborate on the reasoning behind your answer in a detailed, step-by-step explanation.`

The image input is in the `image` column, and the correct answers are available in the `oracle_answer`, `oracle_option`, and `oracle_full_answer` columns. Next, we provide full scripts for inference and evaluation.
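Before the full scripts, here is a minimal sketch of assembling a prompt from one example; it assumes only the column names described above:

```python
from datasets import load_dataset

# Load the text-only configuration and build a prompt for the first example.
tqa = load_dataset("MilaWang/SpatialEval", "tqa", split="test")
example = tqa[0]

prompt = (
    f"{example['text']} First, provide a concise answer in one sentence. "
    "Then, elaborate on the reasoning behind your answer in a detailed, "
    "step-by-step explanation."
)
print(prompt)
print("Reference answer:", example["oracle_answer"])
```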
#### Install
1. Clone this repository
```bash
git clone git@github.com:jiayuww/SpatialEval.git
```

2. Install dependencies
To run models like LLaVA and Bunny, install [LLaVA](https://github.com/haotian-liu/LLaVA) and [Bunny](https://github.com/BAAI-DCAI/Bunny). Install [fastchat](https://github.com/lm-sys/FastChat) for language model inference.
For Bunny variants, ensure you merge the LoRA weights into the base LLM before running inference.

#### Running Inference
For language models, for example, to run Llama-3-8B on all four tasks:
```bash
# Run on all tasks
python inference_lm.py \
--task "all" \
--mode "tqa" \
--w_reason \
--model-path "meta-llama/Meta-Llama-3-8B-Instruct" \
--output_folder outputs \
--temperature 0.2 \
--top_p 0.9 \
--repetition_penalty 1.0 \
--max_new_tokens 512 \
--device "cuda"

# For specific tasks, replace "all" with:
# - "spatialmap"
# - "mazenav"
# - "spatialgrid"
# - "spatialreal"
```

For vision-language models, for example, to run LLaVA-1.6-Mistral-7B across all tasks:
```bash
# VQA mode
python inference_vlm.py \
--mode "vqa" \
--task "all" \
--model_path "liuhaotian/llava-v1.6-mistral-7b" \
--w_reason \
--temperature 0.2 \
--top_p 0.9 \
--repetition_penalty 1.0 \
--max_new_tokens 512 \
--device "cuda"

# For VTQA mode, use --mode "vtqa"
```

Example bash scripts are available in the `scripts/` folder. For more configurations, see `configs/inference_configs.py`. VLMs support the `tqa`, `vqa`, and `vtqa` modes, while LMs support `tqa` only. The `--task` flag accepts `all` or an individual task: `spatialmap`, `mazenav`, `spatialgrid`, or `spatialreal`.
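As an illustration, here is a hypothetical sweep over every mode and task for one VLM, shelling out to `inference_vlm.py` with only the flags shown above; the provided `scripts/` already cover the common cases:

```python
# Hypothetical sweep over all mode/task combinations for one VLM.
import subprocess

MODEL = "liuhaotian/llava-v1.6-mistral-7b"
MODES = ["tqa", "vqa", "vtqa"]  # VLMs support all three modalities
TASKS = ["spatialmap", "mazenav", "spatialgrid", "spatialreal"]

for mode in MODES:
    for task in TASKS:
        subprocess.run(
            [
                "python", "inference_vlm.py",
                "--mode", mode,
                "--task", task,
                "--model_path", MODEL,
                "--w_reason",
                "--temperature", "0.2",
                "--top_p", "0.9",
                "--repetition_penalty", "1.0",
                "--max_new_tokens", "512",
                "--device", "cuda",
            ],
            check=True,
        )
```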
We can also test only the first `k` examples, for example, the first 100 samples for each question type in each task, by specifying `--first_k 100`.

#### Evaluation
We use exact match for evaluation. For example, to evaluate the Spatial-Map task on all three input modalities (TQA, VQA, and VTQA):
```bash
# For TQA on Spatial-Map
python evals/evaluation.py --mode 'tqa' --task 'spatialmap' --output_folder 'outputs/' --dataset_id 'MilaWang/SpatialEval' --eval_summary_dir 'eval_summary'
# For VQA on Spatial-Map
python evals/evaluation.py --mode 'vqa' --task 'spatialmap' --output_folder 'outputs/' --dataset_id 'MilaWang/SpatialEval' --eval_summary_dir 'eval_summary'
# For VTQA on Spatial-Map
python evals/evaluation.py --mode 'vtqa' --task 'spatialmap' --output_folder 'outputs/' --dataset_id 'MilaWang/SpatialEval' --eval_summary_dir 'eval_summary'
```

Evaluation can also be configured for the other tasks: `mazenav`, `spatialgrid`, and `spatialreal`. Further details are in `evals/evaluation.py`.
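For intuition, here is a minimal illustration of what exact-match scoring looks like; the authoritative logic lives in `evals/evaluation.py`:

```python
# Illustrative only: compare a model's short answer against the oracle answer
# after light normalization. The benchmark's own rules are in evals/evaluation.py.
def exact_match(prediction: str, oracle: str) -> bool:
    return prediction.strip().lower() == oracle.strip().lower()

def accuracy(predictions, oracles):
    pairs = list(zip(predictions, oracles))
    return sum(exact_match(p, o) for p, o in pairs) / len(pairs)
```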
### Dataset Generation Script
Stay tuned! The dataset generation script will be released in February.
## Citation
If you find our work helpful, please consider citing our paper:
```bibtex
@inproceedings{wang2024spatial,
title={Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models},
author={Wang, Jiayu and Ming, Yifei and Shi, Zhenmei and Vineet, Vibhav and Wang, Xin and Li, Yixuan and Joshi, Neel},
booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}
```

## Questions
Have questions? We're here to help!
- Open an issue in this repository
- Contact us through the channels listed on our project page