https://github.com/drive-bench/toolkit
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
autonomous-driving chatgpt driving-with-language internvl phi-3 qwen2-vl vision-language-models
- Host: GitHub
- URL: https://github.com/drive-bench/toolkit
- Owner: drive-bench
- License: apache-2.0
- Created: 2025-01-01T13:26:09.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-22T00:47:25.000Z (over 1 year ago)
- Last Synced: 2025-03-28T21:51:08.147Z (about 1 year ago)
- Topics: autonomous-driving, chatgpt, driving-with-language, internvl, phi-3, qwen2-vl, vision-language-models
- Language: Python
- Homepage: https://drive-bench.github.io
- Size: 14.4 MB
- Stars: 59
- Watchers: 7
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-LLM4AD - DriveBench
- Awesome-LVLM-Attack - Github
README
Are VLMs Ready for Autonomous Driving?
An Empirical Study from the Reliability, Data, and Metric Perspectives
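Shaoyuan Xie<sup>1</sup>, Lingdong Kong<sup>2,3</sup>, Yuhao Dong<sup>2,4</sup>, Chonghao Sima<sup>2,6</sup>, Wenwei Zhang<sup>2</sup>, Qi Alfred Chen<sup>1</sup>, Ziwei Liu<sup>4</sup>, Liang Pan<sup>2</sup>

<sup>1</sup>University of California, Irvine
<sup>2</sup>Shanghai AI Laboratory
<sup>3</sup>National University of Singapore
<sup>4</sup>S-Lab, Nanyang Technological University
<sup>5</sup>The University of Hong Kong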
## About
We introduce :blue_car: **DriveBench**, a benchmark dataset designed to evaluate VLM reliability across **17 settings** (clean, corrupted, and text-only inputs), encompassing **19,200 frames**, **20,498 question-answer pairs**, **three question types**, **four mainstream driving tasks**, and **a total of 12 popular VLMs**.

Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving.
## :memo: Updates
- \[2025.01\] - The evaluation data is available at our [HuggingFace Dataset Card](https://huggingface.co/datasets/drive-bench/arena). :hugs:
- \[2025.01\] - Introducing the :blue_car: **DriveBench** project! For more details, kindly refer to our [Project Page](https://drive-bench.github.io/) and [Preprint](https://arxiv.org/abs/2501.04003). :rocket:
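
As a rough illustration of fetching the evaluation data, the sketch below pulls a snapshot of the dataset repository with `huggingface_hub.snapshot_download`. The repository ID comes from the dataset card linked above. The helper name `download_drivebench` and the `data/drivebench` target directory are assumptions made for this example and are not part of the toolkit; the layout expected by the codebase is described in the data preparation document.

```python
from pathlib import Path

# Repository ID taken from the HuggingFace dataset card linked above.
DATASET_REPO_ID = "drive-bench/arena"


def download_drivebench(local_dir: str = "data/drivebench") -> Path:
    """Download a snapshot of the dataset repository and return its local path.

    `local_dir` is a hypothetical location chosen for this sketch.
    """
    # Imported lazily so this module can be loaded without the dependency installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=DATASET_REPO_ID,
        repo_type="dataset",  # the card is a dataset repo, not a model repo
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_drivebench()` requires network access and the `huggingface_hub` package.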
# Table of Contents
- [Benchmark Comparison](#bar_chart-benchmark-comparison)
- [Installation](#gear-installation)
- [Data Preparation](#hotsprings-data-preparation)
- [Getting Started](#rocket-getting-started)
- [Benchmark Results](#aerial_tramway-benchmark-results)
- [Benchmark Configuration](#benchmark-configuration)
- [Benchmark Study](#benchmark-study)
- [Robustness Analysis](#robustness-analysis)
- [Citation](#citation)
- [License](#license)
- [Acknowledgments](#acknowledgments)
# :bar_chart: Benchmark Comparison
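| Benchmark | Perception | Prediction | Behavior | Planning | Robustness | Frames (Test) | QA (Test) | Logic | Evaluation Metrics |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| BDD-X | ✔ | ✘ | ✘ | ✘ | ✘ | - | - | None | Language |
| BDD-OIA | ✔ | ✘ | ✔ | ✘ | ✘ | - | - | None | F1 Score |
| nuScenes-QA | ✔ | ✘ | ✘ | ✘ | ✘ | 36,114 | 83,337 | None | Acc |
| Talk2Car | ✔ | ✘ | ✘ | ✔ | ✘ | ~1.8k | 2,447 | None | - |
| nuPrompt | ✔ | ✘ | ✘ | ✘ | ✘ | ~36k | ~6k | None | AMOTA |
| DRAMA | ✔ | ✘ | ✘ | ✔ | ✘ | - | ~14k | Chain | Language |
| Rank2Tell | ✔ | ✘ | ✘ | ✔ | ✘ | - | - | Chain | Accuracy, Language |
| DriveMLLM | ✔ | ✘ | ✘ | ✘ | ✘ | 880 | - | None | Acc |
| DriveVLM | ✔ | ✘ | ✔ | ✔ | ✘ | - | - | None | GPT<sub>ctx</sub> |
| DriveLM | ✔ | ✔ | ✔ | ✔ | ✘ | 4,794 | 15,480 | Graph | Language, GPT |
| DriveBench (Ours) | ✔ | ✔ | ✔ | ✔ | ✔ | 19,200 | 20,498 | Graph | Acc, Language, GPT, GPT<sub>ctx</sub> |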
# :gear: Installation
For details related to installation and environment setups, kindly refer to [INSTALL.md](./docs/INSTALL.md).
# :hotsprings: Data Preparation
Kindly refer to [DATA_PREPAER.md](./docs/DATA_PREPAER.md) for details on preparing the datasets.
# :rocket: Getting Started
To learn more about using this codebase, kindly refer to [GET_STARTED.md](./docs/GET_STARTED.md).
# :aerial_tramway: Benchmark Results
## Benchmark Configuration
 Commercial VLMs
> - [x] **[GPT-4o]()**
 Open-Source VLMs
> - [x] **[LLaVA-1.5]()** [**`[Code]`**]()
> - [x] **[LLaVA-NeXT]()** [**`[Code]`**]()
> - [x] **[InternVL2]()** [**`[Code]`**]()
> - [x] **[Phi-3]()** [**`[Code]`**]()
> - [x] **[Phi-3.5]()** [**`[Code]`**]()
> - [x] **[Oryx]()** [**`[Code]`**]()
> - [x] **[Qwen2-VL]()** [**`[Code]`**]()
 Specialist VLMs
> - [x] **[DriveLM-Agent]()** [**`[Code]`**]()
> - [x] **[Dolphins]()** [**`[Code]`**]()
## Benchmark Study
"Corr." denotes corrupted inputs and "T.O." denotes text-only inputs.

| Model | Size | Type | Perception (Clean) | Perception (Corr.) | Perception (T.O.) | Prediction (Clean) | Prediction (Corr.) | Prediction (T.O.) | Planning (Clean) | Planning (Corr.) | Planning (T.O.) | Behavior (Clean) | Behavior (Corr.) | Behavior (T.O.) |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Human | - | - | 47.67 | 38.32 | - | - | - | - | - | - | - | 69.51 | 54.09 | - |
| GPT-4o | - | Commercial | 35.37 | 35.25 | 36.48 | 51.30 | 49.94 | 49.05 | 75.75 | 75.36 | 73.21 | 45.40 | 44.33 | 50.03 |
| LLaVA-1.5 | 7B | Open | 23.22 | 22.95 | 22.31 | 22.02 | 17.54 | 14.64 | 29.15 | 31.51 | 32.45 | 13.60 | 13.62 | 14.91 |
| LLaVA-1.5 | 13B | Open | 23.35 | 23.37 | 22.37 | 36.98 | 37.78 | 23.98 | 34.26 | 34.99 | 38.85 | 32.99 | 32.43 | 32.79 |
| LLaVA-NeXT | 7B | Open | 24.15 | 19.62 | 13.86 | 35.07 | 35.89 | 28.36 | 45.27 | 44.36 | 27.58 | 48.16 | 39.44 | 11.92 |
| InternVL2 | 8B | Open | 32.36 | 32.68 | 33.60 | 45.52 | 37.93 | 48.89 | 53.27 | 55.25 | 34.56 | 54.58 | 40.78 | 20.14 |
| Phi-3 | 4.2B | Open | 22.88 | 23.93 | 28.26 | 40.11 | 37.27 | 22.61 | 60.03 | 61.31 | 46.88 | 45.20 | 44.57 | 28.22 |
| Phi-3.5 | 4.2B | Open | 27.52 | 27.51 | 28.26 | 45.13 | 38.21 | 4.92 | 31.91 | 28.36 | 46.30 | 37.89 | 49.13 | 39.16 |
| Oryx | 7B | Open | 17.02 | 15.97 | 18.47 | 48.13 | 46.63 | 12.77 | 53.57 | 55.76 | 48.26 | 33.92 | 33.81 | 23.94 |
| Qwen2-VL | 7B | Open | 28.99 | 27.85 | 35.16 | 37.89 | 39.55 | 37.77 | 57.04 | 54.78 | 41.66 | 49.07 | 47.68 | 54.48 |
| Qwen2-VL | 72B | Open | 30.13 | 26.92 | 17.70 | 49.35 | 43.49 | 5.57 | 61.30 | 63.07 | 53.35 | 51.26 | 49.78 | 39.46 |
| DriveLM | 7B | Specialist | 16.85 | 16.00 | 8.75 | 44.33 | 39.71 | 4.70 | 68.71 | 67.60 | 65.24 | 42.78 | 40.37 | 27.83 |
| Dolphins | 7B | Specialist | 9.59 | 10.84 | 11.01 | 32.66 | 29.88 | 39.98 | 52.91 | 53.77 | 60.98 | 8.81 | 8.25 | 11.92 |
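
One way to read the Benchmark Study numbers is to look at how much a score changes when the visual input is corrupted or removed entirely. The sketch below does this arithmetic for a few rows copied from the results above. The `SCORES` dictionary and the `score_drops` helper are illustrative names introduced here, not part of the toolkit.

```python
# (clean, corrupted, text-only) scores copied from the Benchmark Study results.
SCORES = {
    "GPT-4o": {
        "Perception": (35.37, 35.25, 36.48),
        "Planning": (75.75, 75.36, 73.21),
    },
    "LLaVA-NeXT": {
        "Perception": (24.15, 19.62, 13.86),
        "Planning": (45.27, 44.36, 27.58),
    },
    "DriveLM": {
        "Perception": (16.85, 16.00, 8.75),
        "Planning": (68.71, 67.60, 65.24),
    },
}


def score_drops(clean: float, corrupted: float, text_only: float) -> tuple[float, float]:
    """Return (clean - corrupted, clean - text_only), rounded to two decimals.

    A value near zero means the score barely depends on the visual input;
    a negative value means the score went up without clean images.
    """
    return round(clean - corrupted, 2), round(clean - text_only, 2)


for model, tasks in SCORES.items():
    for task, triple in tasks.items():
        drop_corr, drop_text = score_drops(*triple)
        print(f"{model:<11} {task:<10} corr: {drop_corr:+.2f}  text-only: {drop_text:+.2f}")
```

A small or negative drop in the text-only column is consistent with the observation in the About section that responses can come from general knowledge or textual cues rather than from the image.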
## Robustness Analysis
| Model | Size | Type | Weather MCQ | Weather VQA | Weather CAP | External MCQ | External VQA | External CAP | Sensor MCQ | Sensor VQA | Sensor CAP | Motion MCQ | Motion VQA | Motion CAP | Transmission MCQ | Transmission VQA | Transmission CAP |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| GPT-4o | - | Commercial | 57.20 | 57.28 | 54.90 | 29.25 | 56.60 | 61.98 | 44.25 | 54.95 | 56.53 | 34.25 | 59.20 | 56.25 | 36.83 | 53.95 | 57.57 |
| LLaVA-1.5 | 7B | Open | 69.70 | 35.49 | 35.91 | 26.50 | 29.17 | 34.95 | 18.83 | 30.64 | 33.15 | 71.25 | 33.43 | 35.18 | 10.17 | 27.28 | 34.38 |
| LLaVA-1.5 | 13B | Open | 61.60 | 39.76 | 37.76 | 15.50 | 34.55 | 37.83 | 24.08 | 35.48 | 36.08 | 79.75 | 36.46 | 36.42 | 15.50 | 32.53 | 34.33 |
| LLaVA-NeXT | 7B | Open | 69.70 | 36.96 | 48.52 | 48.50 | 30.32 | 57.18 | 21.83 | 30.40 | 44.37 | 66.00 | 34.20 | 50.44 | 11.83 | 29.43 | 53.50 |
| InternVL2 | 8B | Open | 59.90 | 48.72 | 48.60 | 50.75 | 47.74 | 57.82 | 29.92 | 45.06 | 51.14 | 68.25 | 49.51 | 49.67 | 30.00 | 43.42 | 54.24 |
| Phi-3 | 4.2B | Open | 40.00 | 40.59 | 45.61 | 25.00 | 31.44 | 45.99 | 16.83 | 35.58 | 43.71 | 31.25 | 42.92 | 48.43 | 27.67 | 33.04 | 41.35 |
| Phi-3.5 | 4.2B | Open | 60.60 | 41.82 | 45.97 | 21.25 | 36.89 | 30.95 | 25.58 | 34.66 | 39.30 | 33.00 | 46.03 | 49.33 | 39.67 | 33.47 | 39.67 |
| Oryx | 7B | Open | 53.20 | 40.43 | 48.95 | 45.00 | 40.68 | 56.06 | 50.50 | 36.71 | 48.55 | 72.50 | 40.01 | 48.33 | 39.67 | 36.98 | 49.87 |
| Qwen2-VL | 7B | Open | 76.70 | 49.33 | 45.12 | 37.50 | 47.62 | 51.24 | 22.83 | 39.45 | 47.23 | 57.00 | 47.40 | 47.74 | 35.83 | 42.31 | 48.60 |
| Qwen2-VL | 72B | Open | 59.80 | 51.05 | 48.55 | 45.50 | 50.57 | 57.25 | 52.25 | 45.89 | 48.59 | 58.25 | 50.85 | 47.88 | 44.83 | 46.23 | 50.50 |
| DriveLM | 7B | Specialist | 21.20 | 42.86 | 20.04 | 21.25 | 37.49 | 21.92 | 9.00 | 36.68 | 15.56 | 22.25 | 42.05 | 17.07 | 17.50 | 39.56 | 10.37 |
| Dolphins | 7B | Specialist | 54.30 | 30.21 | 31.08 | 3.00 | 30.42 | 29.38 | 9.42 | 26.83 | 26.30 | 9.25 | 29.82 | 28.05 | 21.50 | 28.86 | 27.65 |
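
To compare models at a glance, the per-corruption robustness scores can be averaged by question type. The sketch below does so for two rows copied from the Robustness Analysis results, assuming each row lists MCQ, VQA, and CAP for Weather, External, Sensor, Motion, and Transmission in that order. The names `ROBUSTNESS` and `mean_by_question_type` are hypothetical helpers for this example; an unweighted mean is only one possible summary and is not a metric defined by the benchmark.

```python
from statistics import mean

CORRUPTIONS = ("Weather", "External", "Sensor", "Motion", "Transmission")
QUESTION_TYPES = ("MCQ", "VQA", "CAP")

# Rows copied from the Robustness Analysis results, in column order:
# for each corruption family, the MCQ, VQA, and CAP scores.
ROBUSTNESS = {
    "GPT-4o": [57.20, 57.28, 54.90, 29.25, 56.60, 61.98, 44.25, 54.95,
               56.53, 34.25, 59.20, 56.25, 36.83, 53.95, 57.57],
    "DriveLM": [21.20, 42.86, 20.04, 21.25, 37.49, 21.92, 9.00, 36.68,
                15.56, 22.25, 42.05, 17.07, 17.50, 39.56, 10.37],
}


def mean_by_question_type(row: list[float]) -> dict[str, float]:
    """Average a row over the corruption families, one value per question type."""
    if len(row) != len(CORRUPTIONS) * len(QUESTION_TYPES):
        raise ValueError("row must hold one score per (corruption, question type) pair")
    step = len(QUESTION_TYPES)
    # row[i::step] picks every third score, i.e. one question type across all families.
    return {q: round(mean(row[i::step]), 2) for i, q in enumerate(QUESTION_TYPES)}


for model, row in ROBUSTNESS.items():
    print(model, mean_by_question_type(row))
```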
## Qualitative Comparisons
Examples of different VLM responses under the Frame Lost condition. We observe that GPT-4o responds with visible objects, while LLaVA-NeXT and DriveLM tend to hallucinate objects that cannot be seen in the provided images.

Examples of different VLM responses under the Water Splash condition. We observe that, under severe visual corruptions, VLMs respond with ambiguous and general answers based on their learned knowledge, without referring to the visual information. Most responses mention traffic signals and pedestrians, even though they are not visible in the provided images.
# Citation
If you find this work helpful, please kindly consider citing our paper:
```bibtex
@article{xie2025drivebench,
author = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
title = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
journal = {arXiv preprint arXiv:2501.04003},
year = {2025},
}
```
# License
This work is released under the [Apache License Version 2.0](https://www.apache.org/licenses/LICENSE-2.0), although some specific implementations in this codebase may be covered by other licenses. If you are using our code for commercial purposes, kindly refer to [LICENSE.md]() for details.
# Acknowledgments
To be updated.