https://github.com/vchitect/shotbench
- Host: GitHub
- URL: https://github.com/vchitect/shotbench
- Owner: Vchitect
- Created: 2025-06-25T09:23:40.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-06-30T11:19:26.000Z (12 months ago)
- Last Synced: 2025-06-30T11:21:24.264Z (12 months ago)
- Size: 106 MB
- Stars: 6
- Watchers: 0
- Forks: 0
- Open Issues: 2
# ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
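Hongbo Liu<sup>1,3*</sup>,
Jingwen He<sup>2,3*</sup>,
Yi Jin<sup>1</sup>,
Dian Zheng<sup>3</sup>,
Yuhao Dong<sup>4</sup>,
Fan Zhang<sup>3</sup>,
Ziqi Huang<sup>4</sup>,
Yinan He<sup>3</sup>,
Yangguang Li<sup>3</sup>,
Weichao Chen<sup>1</sup>,
Yu Qiao<sup>3</sup>,
Wanli Ouyang<sup>2</sup>,
Shengjie Zhao<sup>1†</sup>,
Ziwei Liu<sup>4†</sup>

(<sup>*</sup> equal contributions) (<sup>†</sup> corresponding authors)

<sup>1</sup> Tongji University
<sup>2</sup> The Chinese University of Hong Kong
<sup>3</sup> Shanghai Artificial Intelligence Laboratory
<sup>4</sup> S-Lab, Nanyang Technological University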
## 🎬 Overview
- We introduce **ShotBench**, a comprehensive benchmark for evaluating VLMs’ understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.
- We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.
- To address the identified limitations and facilitate future research, we constructed **ShotQA**, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed **ShotVL**, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new **state-of-the-art** on ShotBench.
## 🔥 News
- **2025-07-07** Release **Evaluation** code.
- **2025-07-02** Release [**ShotQA-70k**](https://huggingface.co/datasets/Vchitect/ShotQA) dataset.
- **2025-06-27** Release [**ShotBench**](https://huggingface.co/datasets/Vchitect/ShotBench) **test** split.
- **2025-06-27** Release our paper: [**ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models**](https://arxiv.org/abs/2506.21356).
- **2025-06-27** Release [**ShotVL-7B**](https://huggingface.co/Vchitect/ShotVL-7B) and [**ShotVL-3B**](https://huggingface.co/Vchitect/ShotVL-3B); these models are currently the SOTA VLMs for cinematography understanding.
## Installation
```shell
conda create -n shotbench python=3.10
conda activate shotbench
pip install -r requirements.txt
```
## Evaluation
### 1. Prepare ShotBench Test Data
```shell
mkdir -p evaluation/data && cd evaluation/data
huggingface-cli download --repo-type dataset Vchitect/ShotBench --local-dir ShotBench
cd ShotBench
tar -xvf images.tar
tar -xvf videos.tar
cd ../../../
```
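The same preparation can be scripted from Python. The sketch below is a rough equivalent of the shell commands above, not code from this repository: `download_shotbench` and `extract_archives` are hypothetical helper names, and it assumes `huggingface_hub` is installed and that the dataset ships `images.tar` and `videos.tar` at its top level, as the commands above imply.

```python
import tarfile
from pathlib import Path


def extract_archives(data_dir, names=("images.tar", "videos.tar")):
    """Extract each listed archive found in data_dir; return the names extracted."""
    data_dir = Path(data_dir)
    extracted = []
    for name in names:
        archive = data_dir / name
        if not archive.is_file():
            continue  # skip archives that are not present
        with tarfile.open(archive) as tar:
            tar.extractall(path=data_dir)
        extracted.append(name)
    return extracted


def download_shotbench(local_dir="evaluation/data/ShotBench"):
    """Download the ShotBench dataset repo into local_dir (hypothetical helper)."""
    # Imported lazily so extract_archives works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="Vchitect/ShotBench",
        repo_type="dataset",
        local_dir=local_dir,
    )
    return Path(local_dir)
```

Calling `extract_archives(download_shotbench())` from the repository root should leave the data in the same place as the shell version.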
### 2. Run Evaluation Code
Evaluate ShotVL-3B with 4 GPUs:
```shell
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-3B --reasoning --output-dir eval_results
```
Evaluate ShotVL-7B with 4 GPUs:
```shell
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-7B --output-dir eval_results
```
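With `--num_processes 4`, `accelerate` starts four worker processes. How `evaluate.py` divides the test set among them is not described here; the sketch below only illustrates one common pattern, strided sharding, in which each process takes every fourth sample. The `shard` function is a hypothetical name used for illustration.

```python
def shard(items, rank, world_size):
    """Return the slice of items handled by process `rank` out of `world_size`."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided split: rank 0 takes items 0, 4, 8, ...; rank 1 takes 1, 5, 9, ...
    return items[rank::world_size]


samples = list(range(10))
shards = [shard(samples, rank, 4) for rank in range(4)]
for rank, part in enumerate(shards):
    print(rank, part)
```

Every sample lands in exactly one shard, so per-process predictions can be concatenated afterwards without duplicates.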
### 3. Calculate Metrics
```shell
OPENAI_API_KEY=YOUR_OPENAI_APIKEY python evaluation/calculate_scores.py --prediction_path OUTPUT_FILE_PATH
```
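The command above needs an OpenAI key, so the repository's scorer likely does more than plain string comparison. The sketch below is not that script: it is a minimal exact-match illustration of per-dimension accuracy, and the record fields `category`, `prediction`, and `answer` are assumed names, not the actual output schema.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy in percent} using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        total[cat] += 1
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[cat] += 1
    return {cat: 100.0 * correct[cat] / total[cat] for cat in total}


# Hypothetical records, for illustration only.
records = [
    {"category": "SS", "prediction": "a", "answer": "A"},
    {"category": "SS", "prediction": "B", "answer": "A"},
    {"category": "CM", "prediction": "C ", "answer": "C"},
]
print(accuracy_by_category(records))
```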
## Evaluation Results
Abbreviations:
SS = Shot Size,
SF = Shot Framing,
CA = Camera Angle,
LS = Lens Size,
LT = Lighting Type,
LC = Lighting Conditions,
SC = Shot Composition,
CM = Camera Movement.
Underline marks previous best in each group.
Our ShotVL models establish new SOTA.
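| Models | SS | SF | CA | LS | LT | LC | SC | CM | Avg |
|---|---|---|---|---|---|---|---|---|---|
| **Open-Sourced VLMs** | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 54.6 | 56.6 | 43.1 | 36.6 | 59.3 | 45.1 | 41.5 | 31.9 | 46.1 |
| Qwen2.5-VL-7B-Instruct | 69.1 | 73.5 | 53.2 | 47.0 | 60.5 | 47.4 | 49.9 | 30.2 | 53.8 |
| LLaVA-NeXT-Video-7B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 31.3 | 34.4 |
| LLaVA-Video-7B-Qwen2 | 56.9 | 65.4 | 45.1 | 36.0 | 63.5 | 45.4 | 37.4 | 35.3 | 48.1 |
| LLaVA-Onevision-Qwen2-7B-Ov-Chat | 58.4 | 71.0 | 52.3 | 38.7 | 59.5 | 44.9 | 50.9 | 39.7 | 51.9 |
| InternVL2.5-8B | 56.3 | 70.3 | 50.8 | 41.1 | 60.2 | 45.1 | 50.1 | 33.6 | 50.9 |
| InternVL3-2B | 56.3 | 56.0 | 44.4 | 34.6 | 56.8 | 44.6 | 43.0 | 38.1 | 46.7 |
| InternVL3-8B | 62.1 | 65.8 | 46.8 | 42.9 | 58.0 | 44.3 | 46.8 | 44.2 | 51.4 |
| InternVL3-14B | 59.6 | 82.2 | 55.4 | 40.7 | 61.7 | 44.6 | 51.1 | 38.2 | 54.2 |
| Internlm-xcomposer2d5-7B | 51.1 | 71.0 | 39.8 | 32.7 | 59.3 | 35.7 | 35.7 | 38.8 | 45.5 |
| Ovis2-8B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 35.3 | 34.9 |
| VILA1.5-3B | 33.4 | 44.9 | 32.1 | 28.6 | 50.6 | 35.7 | 28.4 | 21.5 | 34.4 |
| VILA1.5-8B | 40.6 | 44.5 | 39.1 | 29.7 | 48.9 | 32.9 | 34.4 | 36.9 | 38.4 |
| VILA1.5-13B | 36.7 | 54.6 | 40.7 | 34.8 | 52.8 | 35.4 | 34.2 | 31.3 | 40.1 |
| Instructblip-vicuna-7B | 27.0 | 27.9 | 34.5 | 29.4 | 44.4 | 29.7 | 27.1 | 25.0 | 30.6 |
| Instructblip-vicuna-13B | 26.8 | 29.2 | 27.9 | 28.0 | 39.0 | 24.0 | 27.1 | 22.0 | 28.0 |
| InternVL2.5-38B | 67.8 | 85.4 | 55.4 | 41.7 | 61.7 | 48.9 | 52.4 | 44.0 | 57.2 |
| InternVL3-38B | 68.0 | 84.0 | 51.9 | 43.6 | 64.4 | 46.9 | 54.7 | 44.6 | 57.3 |
| Qwen2.5-VL-32B-Instruct | 62.3 | 76.6 | 51.0 | 48.3 | 61.7 | 44.0 | 52.2 | 43.8 | 55.0 |
| Qwen2.5-VL-72B-Instruct | 75.1 | 82.9 | 56.7 | 46.8 | 59.0 | 49.4 | 54.1 | 48.9 | 59.1 |
| InternVL3-78B | 69.7 | 80.0 | 54.5 | 44.0 | 65.5 | 47.4 | 51.8 | 44.4 | 57.2 |
| **Proprietary VLMs** | | | | | | | | | |
| Gemini-2.0-flash | 48.9 | 75.5 | 44.6 | 31.9 | 62.2 | 48.9 | 52.4 | 47.4 | 51.5 |
| Gemini-2.5-flash-preview-04-17 | 57.7 | 82.9 | 51.4 | 43.8 | 65.2 | 45.7 | 45.9 | 43.5 | 54.5 |
| GPT-4o | 69.3 | 83.1 | 58.2 | 48.9 | 63.2 | 48.0 | 55.2 | 48.3 | 59.3 |
| **Ours** | | | | | | | | | |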
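
For the rows checked, the Avg column is consistent with an unweighted mean of the eight dimension scores, rounded to one decimal. This is an observation from the reported numbers, not a statement about how the scoring script computes it. The sketch below recomputes the average for two rows; `macro_average` is a hypothetical helper name.

```python
def macro_average(scores):
    """Unweighted mean of per-dimension scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# (SS, SF, CA, LS, LT, LC, SC, CM) scores and the reported Avg, copied from the table.
rows = {
    "Qwen2.5-VL-3B-Instruct": ([54.6, 56.6, 43.1, 36.6, 59.3, 45.1, 41.5, 31.9], 46.1),
    "GPT-4o": ([69.3, 83.1, 58.2, 48.9, 63.2, 48.0, 55.2, 48.3], 59.3),
}

for name, (dims, reported) in rows.items():
    print(name, macro_average(dims), reported)
```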
## Open-Sourcing Plan
- [ ] Release Training code.
- [x] Release Evaluation code.
- [x] Release **ShotQA-70k** dataset.
- [x] Release **ShotBench** test set.
- [x] Release **ShotVL** models.
## BibTeX
```
@misc{liu2025shotbench,
  title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
  author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
  year={2025},
  eprint={2506.21356},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.21356},
}
```