https://github.com/drive-bench/toolkit

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

autonomous-driving chatgpt driving-with-language internvl phi-3 qwen2-vl vision-language-models


Host: GitHub
URL: https://github.com/drive-bench/toolkit
Owner: drive-bench
License: apache-2.0
Created: 2025-01-01T13:26:09.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-22T00:47:25.000Z (over 1 year ago)
Last Synced: 2025-03-28T21:51:08.147Z (over 1 year ago)
Topics: autonomous-driving, chatgpt, driving-with-language, internvl, phi-3, qwen2-vl, vision-language-models
Language: Python
Homepage: https://drive-bench.github.io
Size: 14.4 MB
Stars: 59
Watchers: 7
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

Awesome-LLM4AD - DriveBench
Awesome-LVLM-Attack - Github

README

          



  
  

    

    Are VLMs Ready for Autonomous Driving?
An Empirical Study from the Reliability, Data, and Metric Perspectives

  


  


Shaoyuan Xie¹, Lingdong Kong²,³, Yuhao Dong²,⁴, Chonghao Sima²,⁵, Wenwei Zhang², Qi Alfred Chen¹, Ziwei Liu⁴, Liang Pan²

¹University of California, Irvine
²Shanghai AI Laboratory
³National University of Singapore
⁴S-Lab, Nanyang Technological University
⁵The University of Hong Kong


## About

| ![drivebench](./docs/figs/bench.png) |
|:-:|
| We introduce :blue_car: **DriveBench**, a benchmark dataset designed to evaluate VLM reliability across **17 settings** (clean, corrupted, and text-only inputs), encompassing **19,200 frames**, **20,498 question-answer pairs**, **three question types**, **four mainstream driving tasks**, and **a total of 12 popular VLMs**. |
| Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. |

## :memo: Updates

- \[2025.01\] - The evaluation data is available at our [HuggingFace Dataset Card](https://huggingface.co/datasets/drive-bench/arena). :hugs:

- \[2025.01\] - Introducing the :blue_car: **DriveBench** project! For more details, kindly refer to our [Project Page](https://drive-bench.github.io/) and [Preprint](https://arxiv.org/abs/2501.04003). :rocket:
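
As a minimal sketch of one way to fetch the evaluation data from the dataset card mentioned above, the snippet below uses `snapshot_download` from the `huggingface_hub` package. The repo id comes from the dataset card URL; the function name `download_drivebench` and the default `local_dir` are illustrative assumptions, not part of the toolkit, and the directory layout the toolkit expects is described in its own data preparation guide.

```python
from pathlib import Path

# Dataset repo id taken from the dataset card URL in the Updates list.
REPO_ID = "drive-bench/arena"


def download_drivebench(local_dir: str = "data/drivebench") -> Path:
    """Download a snapshot of the dataset repo and return its local path.

    `local_dir` is a hypothetical location chosen for this sketch.
    """
    # Imported lazily so this module loads even without the dependency.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repo is a dataset, not a model
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_drivebench()` requires `pip install huggingface_hub` and network access.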

# Table of Contents

- [Benchmark Comparison](#bar_chart-benchmark-comparison)

- [Installation](#gear-installation)

- [Data Preparation](#hotsprings-data-preparation)

- [Getting Started](#rocket-getting-started)

- [Benchmark Results](#aerial_tramway-benchmark-results)

  - [Benchmark Configuration](#benchmark-configuration)

  - [Benchmark Study](#benchmark-study)

  - [Robustness Analysis](#robustness-analysis)

- [Citation](#citation)

- [License](#license)

- [Acknowledgments](#acknowledgments)

# :bar_chart: Benchmark Comparison

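| Benchmark | Perception | Prediction | Behavior | Planning | Robustness | Frames (Test) | QA (Test) | Logic | Evaluation Metrics |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| BDD-X | ✔ | ✘ | ✘ | ✘ | ✘ | - | - | None | Language |
| BDD-OIA | ✔ | ✘ | ✔ | ✘ | ✘ | - | - | None | F1 Score |
| nuScenes-QA | ✔ | ✘ | ✘ | ✘ | ✘ | 36,114 | 83,337 | None | Acc |
| Talk2Car | ✔ | ✘ | ✘ | ✔ | ✘ | ~1.8k | 2,447 | None | - |
| nuPrompt | ✔ | ✘ | ✘ | ✘ | ✘ | ~36k | ~6k | None | AMOTA |
| DRAMA | ✔ | ✘ | ✘ | ✔ | ✘ | - | ~14k | Chain | Language |
| Rank2Tell | ✔ | ✘ | ✘ | ✔ | ✘ | - | - | Chain | Accuracy, Language |
| DriveMLLM | ✔ | ✘ | ✘ | ✘ | ✘ | 880 | - | None | Acc |
| DriveVLM | ✔ | ✘ | ✔ | ✔ | ✘ | - | - | None | GPT_ctx |
| DriveLM | ✔ | ✔ | ✔ | ✔ | ✘ | 4,794 | 15,480 | Graph | Language, GPT |
| DriveBench (Ours) | ✔ | ✔ | ✔ | ✔ | ✔ | 19,200 | 20,498 | Graph | Acc, Language, GPT, GPT_ctx |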

     

# :gear: Installation

For installation and environment setup, kindly refer to [INSTALL.md](./docs/INSTALL.md).

# :hotsprings: Data Preparation

Kindly refer to [DATA_PREPAER.md](./docs/DATA_PREPAER.md) for details on preparing the datasets.

# :rocket: Getting Started

To learn more about using this codebase, kindly refer to [GET_STARTED.md](./docs/GET_STARTED.md).

# :aerial_tramway: Benchmark Results

## Benchmark Configuration

Commercial VLMs

> - [x] **GPT-4o**

Open-Source VLMs

> - [x] **LLaVA-1.5**
> - [x] **LLaVA-NeXT**
> - [x] **InternVL2**
> - [x] **Phi-3**
> - [x] **Phi-3.5**
> - [x] **Oryx**
> - [x] **Qwen2-VL**

Specialist VLMs

> - [x] **DriveLM-Agent**
> - [x] **Dolphins**

## Benchmark Study

Corr. = corrupted inputs; T.O. = text-only inputs.

| Model | Size | Type | Perception (Clean) | Perception (Corr.) | Perception (T.O.) | Prediction (Clean) | Prediction (Corr.) | Prediction (T.O.) | Planning (Clean) | Planning (Corr.) | Planning (T.O.) | Behavior (Clean) | Behavior (Corr.) | Behavior (T.O.) |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Human | - | - | 47.67 | 38.32 | - | - | - | - | - | - | - | 69.51 | 54.09 | - |
| GPT-4o | - | Commercial | 35.37 | 35.25 | 36.48 | 51.30 | 49.94 | 49.05 | 75.75 | 75.36 | 73.21 | 45.40 | 44.33 | 50.03 |
| LLaVA-1.5 | 7B | Open | 23.22 | 22.95 | 22.31 | 22.02 | 17.54 | 14.64 | 29.15 | 31.51 | 32.45 | 13.60 | 13.62 | 14.91 |
| LLaVA-1.5 | 13B | Open | 23.35 | 23.37 | 22.37 | 36.98 | 37.78 | 23.98 | 34.26 | 34.99 | 38.85 | 32.99 | 32.43 | 32.79 |
| LLaVA-NeXT | 7B | Open | 24.15 | 19.62 | 13.86 | 35.07 | 35.89 | 28.36 | 45.27 | 44.36 | 27.58 | 48.16 | 39.44 | 11.92 |
| InternVL2 | 8B | Open | 32.36 | 32.68 | 33.60 | 45.52 | 37.93 | 48.89 | 53.27 | 55.25 | 34.56 | 54.58 | 40.78 | 20.14 |
| Phi-3 | 4.2B | Open | 22.88 | 23.93 | 28.26 | 40.11 | 37.27 | 22.61 | 60.03 | 61.31 | 46.88 | 45.20 | 44.57 | 28.22 |
| Phi-3.5 | 4.2B | Open | 27.52 | 27.51 | 28.26 | 45.13 | 38.21 | 4.92 | 31.91 | 28.36 | 46.30 | 37.89 | 49.13 | 39.16 |
| Oryx | 7B | Open | 17.02 | 15.97 | 18.47 | 48.13 | 46.63 | 12.77 | 53.57 | 55.76 | 48.26 | 33.92 | 33.81 | 23.94 |
| Qwen2-VL | 7B | Open | 28.99 | 27.85 | 35.16 | 37.89 | 39.55 | 37.77 | 57.04 | 54.78 | 41.66 | 49.07 | 47.68 | 54.48 |
| Qwen2-VL | 72B | Open | 30.13 | 26.92 | 17.70 | 49.35 | 43.49 | 5.57 | 61.30 | 63.07 | 53.35 | 51.26 | 49.78 | 39.46 |
| DriveLM | 7B | Specialist | 16.85 | 16.00 | 8.75 | 44.33 | 39.71 | 4.70 | 68.71 | 67.60 | 65.24 | 42.78 | 40.37 | 27.83 |
| Dolphins | 7B | Specialist | 9.59 | 10.84 | 11.01 | 32.66 | 29.88 | 39.98 | 52.91 | 53.77 | 60.98 | 8.81 | 8.25 | 11.92 |
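
The gap between clean, corrupted, and text-only scores is the reliability signal this table is meant to expose: a model whose text-only score matches its clean score is likely not relying on the image. As a minimal sketch (not part of the toolkit), the snippet below computes those point differences for a few perception rows copied from the table above. The helper names `score_gaps` and `gap_table` and the model labels are illustrative.

```python
# Perception scores copied from the Benchmark Study table:
# (clean, corrupted, text-only).
PERCEPTION = {
    "GPT-4o": (35.37, 35.25, 36.48),
    "InternVL2-8B": (32.36, 32.68, 33.60),
    "Qwen2-VL-72B": (30.13, 26.92, 17.70),
    "DriveLM-7B": (16.85, 16.00, 8.75),
}


def score_gaps(clean, corrupted, text_only):
    """Return (clean - corrupted, clean - text_only) in score points."""
    return round(clean - corrupted, 2), round(clean - text_only, 2)


def gap_table(scores):
    """Map each model name to its pair of score gaps."""
    return {model: score_gaps(*triple) for model, triple in scores.items()}


if __name__ == "__main__":
    for model, (d_corr, d_text) in gap_table(PERCEPTION).items():
        print(f"{model:>13}: clean-corr = {d_corr:+.2f}, "
              f"clean-text-only = {d_text:+.2f}")
```

A gap near zero (or negative) in the text-only column means the score barely changes when the image is removed, which is the behavior the About section warns about.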

          

## Robustness Analysis

| Model | Size | Type | Weather (MCQ) | Weather (VQA) | Weather (CAP) | External (MCQ) | External (VQA) | External (CAP) | Sensor (MCQ) | Sensor (VQA) | Sensor (CAP) | Motion (MCQ) | Motion (VQA) | Motion (CAP) | Transmission (MCQ) | Transmission (VQA) | Transmission (CAP) |
|:-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| GPT-4o | - | Commercial | 57.20 | 57.28 | 54.90 | 29.25 | 56.60 | 61.98 | 44.25 | 54.95 | 56.53 | 34.25 | 59.20 | 56.25 | 36.83 | 53.95 | 57.57 |
| LLaVA-1.5 | 7B | Open | 69.70 | 35.49 | 35.91 | 26.50 | 29.17 | 34.95 | 18.83 | 30.64 | 33.15 | 71.25 | 33.43 | 35.18 | 10.17 | 27.28 | 34.38 |
| LLaVA-1.5 | 13B | Open | 61.60 | 39.76 | 37.76 | 15.50 | 34.55 | 37.83 | 24.08 | 35.48 | 36.08 | 79.75 | 36.46 | 36.42 | 15.50 | 32.53 | 34.33 |
| LLaVA-NeXT | 7B | Open | 69.70 | 36.96 | 48.52 | 48.50 | 30.32 | 57.18 | 21.83 | 30.40 | 44.37 | 66.00 | 34.20 | 50.44 | 11.83 | 29.43 | 53.50 |
| InternVL2 | 8B | Open | 59.90 | 48.72 | 48.60 | 50.75 | 47.74 | 57.82 | 29.92 | 45.06 | 51.14 | 68.25 | 49.51 | 49.67 | 30.00 | 43.42 | 54.24 |
| Phi-3 | 4.2B | Open | 40.00 | 40.59 | 45.61 | 25.00 | 31.44 | 45.99 | 16.83 | 35.58 | 43.71 | 31.25 | 42.92 | 48.43 | 27.67 | 33.04 | 41.35 |
| Phi-3.5 | 4.2B | Open | 60.60 | 41.82 | 45.97 | 21.25 | 36.89 | 30.95 | 25.58 | 34.66 | 39.30 | 33.00 | 46.03 | 49.33 | 39.67 | 33.47 | 39.67 |
| Oryx | 7B | Open | 53.20 | 40.43 | 48.95 | 45.00 | 40.68 | 56.06 | 50.50 | 36.71 | 48.55 | 72.50 | 40.01 | 48.33 | 39.67 | 36.98 | 49.87 |
| Qwen2-VL | 7B | Open | 76.70 | 49.33 | 45.12 | 37.50 | 47.62 | 51.24 | 22.83 | 39.45 | 47.23 | 57.00 | 47.40 | 47.74 | 35.83 | 42.31 | 48.60 |
| Qwen2-VL | 72B | Open | 59.80 | 51.05 | 48.55 | 45.50 | 50.57 | 57.25 | 52.25 | 45.89 | 48.59 | 58.25 | 50.85 | 47.88 | 44.83 | 46.23 | 50.50 |
| DriveLM | 7B | Specialist | 21.20 | 42.86 | 20.04 | 21.25 | 37.49 | 21.92 | 9.00 | 36.68 | 15.56 | 22.25 | 42.05 | 17.07 | 17.50 | 39.56 | 10.37 |
| Dolphins | 7B | Specialist | 54.30 | 30.21 | 31.08 | 3.00 | 30.42 | 29.38 | 9.42 | 26.83 | 26.30 | 9.25 | 29.82 | 28.05 | 21.50 | 28.86 | 27.65 |

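
MCQ scores in the robustness table vary widely from one corruption category to another. As a rough sketch (not part of the toolkit), the snippet below summarizes that variation for three rows copied from the table: the mean MCQ score, the spread between best and worst category, and the worst category. The function name `summarize` and the model labels are illustrative.

```python
# MCQ scores per corruption category, copied from the Robustness Analysis table.
MCQ = {
    "GPT-4o": {"Weather": 57.20, "External": 29.25, "Sensor": 44.25,
               "Motion": 34.25, "Transmission": 36.83},
    "LLaVA-1.5-7B": {"Weather": 69.70, "External": 26.50, "Sensor": 18.83,
                     "Motion": 71.25, "Transmission": 10.17},
    "Qwen2-VL-72B": {"Weather": 59.80, "External": 45.50, "Sensor": 52.25,
                     "Motion": 58.25, "Transmission": 44.83},
}


def summarize(by_corruption):
    """Return (mean, max - min, worst category) for one model's scores."""
    values = list(by_corruption.values())
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    worst = min(by_corruption, key=by_corruption.get)
    return round(mean, 2), round(spread, 2), worst


if __name__ == "__main__":
    for model, scores in MCQ.items():
        mean, spread, worst = summarize(scores)
        print(f"{model:>13}: mean = {mean:.2f}, spread = {spread:.2f}, "
              f"worst = {worst}")
```

A large spread alongside a similar mean suggests that an averaged score alone can hide category-specific failures.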

## Qualitative Comparisons

| ![example](./docs/figs/examples_benchmark_3.png) |
|:-:|
| Examples of different VLM responses under the Frame Lost condition. We observe that GPT-4o responds with visible objects, while LLaVA-NeXT and DriveLM tend to hallucinate objects that cannot be seen in the provided images. |

| ![example](./docs/figs/examples_benchmark_4.png) |
|:-:|
| Examples of different VLM responses under the Water Splash condition. We observe that, under severe visual corruptions, VLMs respond with ambiguous and general answers based on their learned knowledge, without referring to the visual information. Most responses include traffic signals and pedestrians, even though they are not visible in the provided images. |

# Citation

If you find this work helpful, please kindly consider citing our paper:

```bibtex
@article{xie2025drivebench,
  author  = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
  title   = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
  journal = {arXiv preprint arXiv:2501.04003},
  year    = {2025},
}
```

# License

This work is released under the [Apache License Version 2.0](https://www.apache.org/licenses/LICENSE-2.0), while some specific implementations in this codebase might carry other licenses. If you are using our code for commercial purposes, kindly refer to [LICENSE.md]() for a more careful check.

# Acknowledgments

To be updated.