https://github.com/eric-ai-lab/ProbMed

"Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA"
evaluation large-multimodal-models llms medical-diagnosis medical-vqa vision-and-language

Host: GitHub
URL: https://github.com/eric-ai-lab/ProbMed
Owner: eric-ai-lab
Created: 2024-05-30T17:25:12.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-21T23:33:07.000Z (8 months ago)
Last Synced: 2025-02-22T00:24:52.177Z (8 months ago)
Topics: evaluation, large-multimodal-models, llms, medical-diagnosis, medical-vqa, vision-and-language
Language: Python
Homepage: https://jackie-2000.github.io/probmed.github.io/
Size: 263 KB
Stars: 15
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-multimodal-in-medical-imaging - ProbMed

README

# ProbMed

[**🌐 Homepage**](https://jackie-2000.github.io/probmed.github.io/) | [**🤗 Dataset**](https://huggingface.co/datasets/rippleripple/ProbMed) | [**🤗 Paper**](https://arxiv.org/pdf/2405.20421) | [**📖 arXiv**](https://arxiv.org/abs/2405.20421) | [**GitHub**](https://github.com/eric-ai-lab/ProbMed/)

This repo contains the evaluation code for the paper "[Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA](https://arxiv.org/abs/2405.20421)".

## Introduction

We introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess the performance of large multimodal models (LMMs) in medical imaging through probing evaluation and procedural diagnosis. Probing evaluation pairs each original question with a negation question that contains hallucinated attributes, while procedural diagnosis requires reasoning across several diagnostic dimensions for each image: modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. ProbMed draws on two comprehensive biomedical datasets, MedICaT and ChestX-ray14, to compile a diverse set of 6,303 images spanning three modalities (X-ray, MRI, and CT scan) and four organs (abdomen, brain, chest, and spine). After preprocessing, we generated high-quality questions for each image across these diagnostic dimensions, for a total of 57,132 question-answer pairs, or about 9 pairs per image.

![Alt text](image.png)
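The pairing of original and negation questions described in the introduction can be illustrated with a minimal scoring sketch. It assumes that a pair counts as correct only when both the original question and its adversarial counterpart are answered correctly; this README does not state the scoring rule, so treat it as an assumption rather than the repository's implementation. The `ProbePair` fields and the `paired_accuracy` helper are hypothetical names, not the dataset's schema.

```python
from dataclasses import dataclass


@dataclass
class ProbePair:
    """One original question and its adversarial (negation) counterpart.

    Field names are hypothetical; they are not the dataset's schema.
    """
    original_answer: str        # ground truth for the original question
    adversarial_answer: str     # ground truth for the hallucinated-attribute question
    predicted_original: str     # model output for the original question
    predicted_adversarial: str  # model output for the adversarial question


def normalize(answer: str) -> str:
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return answer.strip().lower()


def paired_accuracy(pairs):
    """Fraction of pairs where BOTH answers match ground truth (assumed rule)."""
    if not pairs:
        return 0.0
    correct = sum(
        normalize(p.predicted_original) == normalize(p.original_answer)
        and normalize(p.predicted_adversarial) == normalize(p.adversarial_answer)
        for p in pairs
    )
    return correct / len(pairs)


pairs = [
    ProbePair("yes", "no", "yes", "no"),   # both correct: counts
    ProbePair("yes", "no", "yes", "yes"),  # accepts the hallucinated attribute: does not count
    ProbePair("yes", "no", "Yes", "No "),  # both correct after normalization: counts
    ProbePair("yes", "no", "no", "no"),    # misses the original question: does not count
]
print(paired_accuracy(pairs))  # 0.5
```

Under this assumed rule, uniform yes/no guessing gets a pair right with probability 0.5 × 0.5 = 0.25, which is consistent with the 25.00 random-choice scores reported for Modality and Organ in the leaderboard, though that alone does not confirm how the repository scores answers.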

## Dataset Creation

ProbMed was created to rigorously evaluate LMMs’ readiness for real-life diagnostic tasks, particularly under adversarial conditions. Please refer to our Hugging Face [**🤗 Dataset**](https://huggingface.co/datasets/rippleripple/ProbMed) page for more details.

## Evaluation

Please refer to our [eval](eval) folder for more details.

## 🏆 Leaderboard

| Model           | Modality  | Organ     | Abnormality | Condition/Finding | Position  | Overall   |
|-----------------|:---------:|:---------:|:-----------:|:-----------------:|:---------:|:---------:|
| Random Choice   | 25.00     | 25.00     | 50.00       | **35.67**         | **36.48** | 32.13     |
| GPT-4o          | **97.42** | 69.46     | 61.79       | 29.30             | 24.06     | **55.60** |
| GPT-4V          | 92.51     | 71.73     | 53.30       | 35.19             | 22.40     | 55.28     |
| Gemini 1.5 Pro  | 96.47     | 75.69     | 62.59       | 27.93             | 17.54     | 55.08     |
| Med-Flamingo    | 44.15     | 61.39     | 50.00       | 26.33             | 5.65      | 35.66     |
| CheXagent       | 37.25     | 33.95     | **73.31**   | 28.52             | 7.48      | 30.61     |
| BiomedGPT       | 60.25     | 46.81     | 50.31       | 14.13             | 6.11      | 33.34     |
| LLaVA-Med       | 5.48      | 32.96     | 38.76       | 20.38             | 5.33      | 17.90     |
| MiniGPT-v2      | 3.25      | 76.26     | 50.08       | 15.23             | 7.96      | 27.67     |
| LLaVA-v1.6 (7B) | 6.77      | **80.70** | 46.18       | 3.56              | 1.21      | 24.96     |
| LLaVA-v1 (7B)   | 25.27     | 40.53     | 50.00       | 0.34              | 0.11      | 19.30     |
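
One way to read the leaderboard against its Random Choice row is to list, for each model, the columns where it scores below random. The sketch below does this with the scores copied from the table above; `COLUMNS`, `RANDOM`, `MODELS`, and `below_random` are illustrative names and are not part of this repository.

```python
# Scores copied from the leaderboard table; column order matches COLUMNS.
COLUMNS = ["Modality", "Organ", "Abnormality", "Condition/Finding", "Position", "Overall"]
RANDOM = [25.00, 25.00, 50.00, 35.67, 36.48, 32.13]
MODELS = {
    "GPT-4o": [97.42, 69.46, 61.79, 29.30, 24.06, 55.60],
    "GPT-4V": [92.51, 71.73, 53.30, 35.19, 22.40, 55.28],
    "Gemini 1.5 Pro": [96.47, 75.69, 62.59, 27.93, 17.54, 55.08],
    "Med-Flamingo": [44.15, 61.39, 50.00, 26.33, 5.65, 35.66],
    "CheXagent": [37.25, 33.95, 73.31, 28.52, 7.48, 30.61],
    "BiomedGPT": [60.25, 46.81, 50.31, 14.13, 6.11, 33.34],
    "LLaVA-Med": [5.48, 32.96, 38.76, 20.38, 5.33, 17.90],
    "MiniGPT-v2": [3.25, 76.26, 50.08, 15.23, 7.96, 27.67],
    "LLaVA-v1.6 (7B)": [6.77, 80.70, 46.18, 3.56, 1.21, 24.96],
    "LLaVA-v1 (7B)": [25.27, 40.53, 50.00, 0.34, 0.11, 19.30],
}


def below_random(scores):
    """Return the column names where a model scores strictly below random choice."""
    return [col for col, s, r in zip(COLUMNS, scores, RANDOM) if s < r]


for model, scores in MODELS.items():
    print(f"{model}: {', '.join(below_random(scores)) or 'none'}")
```

With the values in the table, every listed model falls below random choice on both Condition/Finding and Position, and five of the ten fall below it overall.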

## Contact

- Qianqi Yan: qyan79@ucsc.edu

- Xin Eric Wang: xwang366@ucsc.edu

## Citation

**BibTeX:**

```bibtex
@misc{yan2024worse,
      title={Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA},
      author={Qianqi Yan and Xuehai He and Xiang Yue and Xin Eric Wang},
      year={2024},
      eprint={2405.20421},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
