https://github.com/vchitect/shotbench

Host: GitHub
URL: https://github.com/vchitect/shotbench
Owner: Vchitect
Created: 2025-06-25T09:23:40.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-06-30T11:19:26.000Z (12 months ago)
Last Synced: 2025-06-30T11:21:24.264Z (12 months ago)
Size: 106 MB
Stars: 6
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md

# ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Hongbo Liu¹,³*,
Jingwen He²,³*,
Yi Jin¹,
Dian Zheng³,
Yuhao Dong⁴,
Fan Zhang³,
Ziqi Huang⁴,
Yinan He³,
Yangguang Li³,
Weichao Chen¹,
Yu Qiao³,
Wanli Ouyang²,
Shengjie Zhao¹†,
Ziwei Liu⁴†

(* equal contributions) († corresponding authors)

¹ Tongji University
² The Chinese University of Hong Kong

³ Shanghai Artificial Intelligence Laboratory
⁴ S-Lab, Nanyang Technological University

## 🎬 Overview
- We introduce **ShotBench**, a comprehensive benchmark for evaluating VLMs' understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.
- We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.
- To address the identified limitations and facilitate future research, we constructed **ShotQA**, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed **ShotVL**, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new **state-of-the-art** on ShotBench.

## 🔥 News
- **2025-07-07** Release **Evaluation** code.
- **2025-07-02** Release [**ShotQA-70k**](https://huggingface.co/datasets/Vchitect/ShotQA) dataset.
- **2025-06-27** Release [**ShotBench**](https://huggingface.co/datasets/Vchitect/ShotBench) **test** split.
- **2025-06-27** Release our paper: [**ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models**](https://arxiv.org/abs/2506.21356).
- **2025-06-27** Release [**ShotVL-7B**](https://huggingface.co/Vchitect/ShotVL-7B) and [**ShotVL-3B**](https://huggingface.co/Vchitect/ShotVL-3B). These models are currently the SOTA VLMs on cinematography understanding.

## Installation

```shell
conda create -n shotbench python=3.10
conda activate shotbench
pip install -r requirements.txt
```

## Evaluation

### 1. Prepare ShotBench Test Data

```shell
mkdir -p evaluation/data && cd evaluation/data
huggingface-cli download --repo-type dataset Vchitect/ShotBench --local-dir ShotBench
cd ShotBench
tar -xvf images.tar
tar -xvf videos.tar
cd ../../../
```
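For environments where the shell commands above are inconvenient, the same download-and-extract step can be sketched in Python. This is a minimal sketch, not part of the repository: the helper names `download_shotbench` and `extract_archives` are hypothetical, and it assumes the `huggingface_hub` package is installed and that the archives are named `images.tar` and `videos.tar` as in the commands above.

```python
import tarfile
from pathlib import Path


def download_shotbench(local_dir="evaluation/data/ShotBench"):
    """Download the ShotBench dataset repo (needs network access)."""
    # Imported lazily so extract_archives() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="Vchitect/ShotBench",
        repo_type="dataset",
        local_dir=local_dir,
    )
    return Path(local_dir)


def extract_archives(root, names=("images.tar", "videos.tar")):
    """Extract each archive found in `root` in place; return the names extracted."""
    root = Path(root)
    extracted = []
    for name in names:
        archive = root / name
        if not archive.is_file():
            continue  # skip a missing archive instead of failing
        with tarfile.open(archive) as tar:
            tar.extractall(path=root)
        extracted.append(name)
    return extracted
```

Calling `extract_archives(download_shotbench())` from the repository root should leave the data under `evaluation/data/ShotBench`, the same location the shell commands use.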

### 2. Run Evaluation Code

Evaluate ShotVL-3B with 4 GPUs:

```shell
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-3B --reasoning --output-dir eval_results
```

Evaluate ShotVL-7B with 4 GPUs:

```shell
accelerate launch --num_processes 4 evaluation/shotvl/evaluate.py --model ShotVL-7B --output-dir eval_results
```

### 3. Calculate Metrics

```shell
OPENAI_API_KEY=YOUR_OPENAI_APIKEY python evaluation/calculate_scores.py --prediction_path OUTPUT_FILE_PATH
```
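To illustrate what a per-dimension accuracy looks like, here is a minimal sketch that scores multiple-choice predictions by exact match. It is not the repository's `calculate_scores.py`: that command takes an OpenAI API key, which suggests an LLM is involved in scoring, whereas this sketch uses plain string comparison. The record fields `category`, `answer`, and `prediction` are hypothetical, since the prediction file format is not documented here.

```python
from collections import defaultdict


def per_dimension_accuracy(records):
    """Return {dimension: accuracy in percent} for a list of prediction records.

    Each record is assumed (hypothetically) to look like
    {"category": "shot size", "answer": "B", "prediction": "B"}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["category"]
        total[dim] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[dim] += 1
    return {dim: 100.0 * correct[dim] / total[dim] for dim in total}
```

Free-form model outputs would need an answer-extraction step before a comparison like this is meaningful.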

## Evaluation Results

Abbreviations:
SS = Shot Size,
SF = Shot Framing,
CA = Camera Angle,
LS = Lens Size,
LT = Lighting Type,
LC = Lighting Conditions,
SC = Shot Composition,
CM = Camera Movement.
Underline marks previous best in each group.

Our ShotVL models establish a new SOTA.

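| Models | SS | SF | CA | LS | LT | LC | SC | CM | Avg |
|---|---|---|---|---|---|---|---|---|---|
| **Open-Sourced VLMs** | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 54.6 | 56.6 | 43.1 | 36.6 | 59.3 | 45.1 | 41.5 | 31.9 | 46.1 |
| Qwen2.5-VL-7B-Instruct | 69.1 | 73.5 | 53.2 | 47.0 | 60.5 | 47.4 | 49.9 | 30.2 | 53.8 |
| LLaVA-NeXT-Video-7B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 31.3 | 34.4 |
| LLaVA-Video-7B-Qwen2 | 56.9 | 65.4 | 45.1 | 36.0 | 63.5 | 45.4 | 37.4 | 35.3 | 48.1 |
| LLaVA-Onevision-Qwen2-7B-Ov-Chat | 58.4 | 71.0 | 52.3 | 38.7 | 59.5 | 44.9 | 50.9 | 39.7 | 51.9 |
| InternVL2.5-8B | 56.3 | 70.3 | 50.8 | 41.1 | 60.2 | 45.1 | 50.1 | 33.6 | 50.9 |
| InternVL3-2B | 56.3 | 56.0 | 44.4 | 34.6 | 56.8 | 44.6 | 43.0 | 38.1 | 46.7 |
| InternVL3-8B | 62.1 | 65.8 | 46.8 | 42.9 | 58.0 | 44.3 | 46.8 | 44.2 | 51.4 |
| InternVL3-14B | 59.6 | 82.2 | 55.4 | 40.7 | 61.7 | 44.6 | 51.1 | 38.2 | 54.2 |
| Internlm-xcomposer2d5-7B | 51.1 | 71.0 | 39.8 | 32.7 | 59.3 | 35.7 | 35.7 | 38.8 | 45.5 |
| Ovis2-8B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 35.3 | 34.9 |
| VILA1.5-3B | 33.4 | 44.9 | 32.1 | 28.6 | 50.6 | 35.7 | 28.4 | 21.5 | 34.4 |
| VILA1.5-8B | 40.6 | 44.5 | 39.1 | 29.7 | 48.9 | 32.9 | 34.4 | 36.9 | 38.4 |
| VILA1.5-13B | 36.7 | 54.6 | 40.7 | 34.8 | 52.8 | 35.4 | 34.2 | 31.3 | 40.1 |
| Instructblip-vicuna-7B | 27.0 | 27.9 | 34.5 | 29.4 | 44.4 | 29.7 | 27.1 | 25.0 | 30.6 |
| Instructblip-vicuna-13B | 26.8 | 29.2 | 27.9 | 28.0 | 39.0 | 24.0 | 27.1 | 22.0 | 28.0 |
| InternVL2.5-38B | 67.8 | 85.4 | 55.4 | 41.7 | 61.7 | 48.9 | 52.4 | 44.0 | 57.2 |
| InternVL3-38B | 68.0 | 84.0 | 51.9 | 43.6 | 64.4 | 46.9 | 54.7 | 44.6 | 57.3 |
| Qwen2.5-VL-32B-Instruct | 62.3 | 76.6 | 51.0 | 48.3 | 61.7 | 44.0 | 52.2 | 43.8 | 55.0 |
| Qwen2.5-VL-72B-Instruct | 75.1 | 82.9 | 56.7 | 46.8 | 59.0 | 49.4 | 54.1 | 48.9 | 59.1 |
| InternVL3-78B | 69.7 | 80.0 | 54.5 | 44.0 | 65.5 | 47.4 | 51.8 | 44.4 | 57.2 |
| **Proprietary VLMs** | | | | | | | | | |
| Gemini-2.0-flash | 48.9 | 75.5 | 44.6 | 31.9 | 62.2 | 48.9 | 52.4 | 47.4 | 51.5 |
| Gemini-2.5-flash-preview-04-17 | 57.7 | 82.9 | 51.4 | 43.8 | 65.2 | 45.7 | 45.9 | 43.5 | 54.5 |
| GPT-4o | 69.3 | 83.1 | 58.2 | 48.9 | 63.2 | 48.0 | 55.2 | 48.3 | 59.3 |
| **Ours** | | | | | | | | | |
| ShotVL-3B | 77.9 | 85.6 | 68.8 | 59.3 | 65.7 | 53.1 | 57.4 | 51.7 | 65.1 |
| ShotVL-7B | 81.2 | 90.1 | 78.0 | 68.5 | 70.1 | 64.3 | 45.7 | 62.9 | 70.1 |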
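The Avg column above can be approximated from the eight per-dimension scores. The sketch below assumes Avg is the unweighted mean of the eight dimensions, which the README does not state. It reproduces the reported value for the two rows used here, but not for every row (ShotVL-3B comes out at 64.9 against the reported 65.1), so the real average may be weighted differently. The names `DIMENSIONS` and `average_accuracy` are illustrative only.

```python
# Abbreviation -> cinematography dimension, as listed under "Abbreviations".
DIMENSIONS = {
    "SS": "Shot Size",
    "SF": "Shot Framing",
    "CA": "Camera Angle",
    "LS": "Lens Size",
    "LT": "Lighting Type",
    "LC": "Lighting Conditions",
    "SC": "Shot Composition",
    "CM": "Camera Movement",
}


def average_accuracy(scores):
    """Unweighted mean over the eight dimensions, rounded to one decimal.

    Assumption: equal weight per dimension (not confirmed by the README).
    """
    missing = [key for key in DIMENSIONS if key not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return round(sum(scores[key] for key in DIMENSIONS) / len(DIMENSIONS), 1)


# Rows copied from the results table.
qwen_3b = {"SS": 54.6, "SF": 56.6, "CA": 43.1, "LS": 36.6,
           "LT": 59.3, "LC": 45.1, "SC": 41.5, "CM": 31.9}
shotvl_7b = {"SS": 81.2, "SF": 90.1, "CA": 78.0, "LS": 68.5,
             "LT": 70.1, "LC": 64.3, "SC": 45.7, "CM": 62.9}

print(average_accuracy(qwen_3b))    # 46.1, as reported
print(average_accuracy(shotvl_7b))  # 70.1, as reported
```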

## Open-Sourcing Plan

- [ ] Release Training code.
- [x] Release Evaluation code.
- [x] Release **ShotQA-70k** dataset.
- [x] Release **ShotBench** test set.
- [x] Release **ShotVL** models.

## BibTeX

```
@misc{liu2025shotbench,
  title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
  author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
  year={2025},
  eprint={2506.21356},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.21356},
}
```
