# GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?


If you like our project, please give us a star ⭐ on GitHub for the latest updates.

[![arXiv](https://img.shields.io/badge/Arxiv-2311.15732-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.15732)
[![zhihu](https://img.shields.io/badge/-知乎-000000?logo=zhihu&logoColor=0084FF)](https://zhuanlan.zhihu.com/p/669758735)

[Wenhao Wu](https://whwu95.github.io/)1,2, [Huanjin Yao](https://openreview.net/profile?id=~Huanjin_Yao1)2,3, [Mengxi Zhang](https://scholar.google.com/citations?user=73tAoEAAAAAJ&hl=en)2,4, [Yuxin Song](https://openreview.net/profile?id=~YuXin_Song1)2, [Wanli Ouyang](https://wlouyang.github.io/)5, [Jingdong Wang](https://jingdongwang2017.github.io/)2


1[The University of Sydney](https://www.sydney.edu.au/), 2[Baidu](https://vis.baidu.com/#/), 3[Tsinghua University](https://www.tsinghua.edu.cn/en/), 4[Tianjin University](https://www.tju.edu.cn/english/index.htm/), 5[The Chinese University of Hong Kong](https://www.cuhk.edu.hk/english/#)


***
This work delves into an essential, must-know baseline in light of the latest advances in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We focus on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we conducted experiments across three modalities (images, videos, and point clouds), spanning a total of 16 popular academic benchmarks.



📣 I also have other cross-modal projects that may interest you ✨.

> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)

> Wenhao Wu, Zhun Sun, Wanli Ouyang

> [![Conference](http://img.shields.io/badge/AAAI-2023-f9f107.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [![Journal](http://img.shields.io/badge/IJCV-2023-Bf107.svg)](https://link.springer.com/article/10.1007/s11263-023-01876-w) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Text4Vis)

> [**Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models**](https://arxiv.org/abs/2301.00182)

> Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

> [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/BIKE)

> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)

> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang

> Accepted by CVPR 2023 as 🌟Highlight🌟 | [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Cap4Video)

## News

- [x] **[Mar 7, 2024]**
Due to the recent removal of RPD (requests per day) limits on the GPT-4V API, we've updated our predictions for all datasets using standard single testing (one sample per request). Check out the [**GPT4V Results**](./GPT4V_ZS_Results), [**Ground Truth**](./annotations), and [**Datasets**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) we've shared for you! **As a heads-up, 😭running all tests once costs around 💰$4000+💰.**
- [x] **[Nov 28, 2023]** We release our [report](https://arxiv.org/abs/2311.15732) in Arxiv.
- [x] **[Nov 27, 2023]** Our prompts have been released. Thanks for your star 😝.

## Overview


*Figure: An overview of the 16 evaluated popular benchmark datasets, comprising images, videos, and point clouds.*

*Figure: Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.*

## Generated Descriptions from GPT-4



- We have pre-generated descriptive sentences for all categories across the datasets, which you can find in the [**GPT4_generated_prompts**](./GPT4_generated_prompts) folder. Enjoy exploring!

- We've also provided an example script, [generate_prompt.py](./generate_prompt.py), to help you generate descriptions with GPT-4. See the [**config**](./config) folder for detailed information on all datasets used in our project. Happy coding!
- Execute the following command to generate descriptions with GPT-4.
```sh
# To run the script for a specific dataset, update the following line in the script with the name of the dataset you're working with:
# dataset_name = ["Dataset Name Here"] # e.g., dtd
python generate_prompt.py
```
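For reference, here is a minimal sketch of what such a description-generation loop might look like, assuming the OpenAI Python SDK (v1) and a small, illustrative subset of DTD category names; the actual prompts and logic live in [generate_prompt.py](./generate_prompt.py) and may differ.

```python
# Minimal sketch of GPT-4 description generation (hypothetical; see generate_prompt.py
# for the actual implementation). Assumes the OpenAI Python SDK v1 and OPENAI_API_KEY set.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dataset_name = "dtd"
class_names = ["banded", "blotchy", "braided"]  # illustrative subset of DTD categories

descriptions = {}
for name in class_names:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Describe the visual appearance of a '{name}' texture in one sentence.",
        }],
    )
    descriptions[name] = response.choices[0].message.content

# Save one description per category for later use as a text prompt.
with open(f"{dataset_name}_descriptions.json", "w") as f:
    json.dump(descriptions, f, indent=2)
```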

## GPT-4V(ision) for Visual Recognition



- We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the [GPT4V_ZS.py](./GPT4V_ZS.py) file for a step-by-step guide on implementing this. We hope it helps you get started with ease!

```sh
# GPT4V zero-shot recognition script.
# dataset_name = ["Dataset Name Here"] # e.g., dtd
python GPT4V_ZS.py
```
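As a rough illustration, a single GPT-4V zero-shot request might look like the sketch below, assuming the OpenAI Python SDK (v1), the `gpt-4-vision-preview` chat-completions interface, and an illustrative image path; the prompt design and request handling in [GPT4V_ZS.py](./GPT4V_ZS.py) may differ.

```python
# Minimal sketch of a single GPT-4V zero-shot prediction (hypothetical; see GPT4V_ZS.py
# for the actual implementation). Assumes the OpenAI Python SDK v1 and OPENAI_API_KEY set.
import base64
from openai import OpenAI

client = OpenAI()

def classify_image(image_path, class_names):
    # Encode the image as base64 so it can be embedded in the request payload.
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "Classify this image into exactly one of the following categories "
        f"and answer with the category name only: {', '.join(class_names)}"
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        }],
        max_tokens=20,
    )
    return response.choices[0].message.content.strip()

# Example call; the image path and category list are illustrative.
print(classify_image("dtd/images/banded/banded_0002.jpg", ["banded", "blotchy", "braided"]))
```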

- All results are available in the [**GPT4V_ZS_Results**](./GPT4V_ZS_Results) folder! In addition, we've provided the [**Datasets link**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) along with the corresponding ground truths ([**annotations**](./annotations) folder) to help readers replicate the results. *Note: For certain datasets, we may have removed prefixes from the sample IDs. For instance, in the case of ImageNet, "ILSVRC2012_val_00031094.JPEG" was modified to "00031094.JPEG".*

| DTD | EuroSAT | SUN397 | RAF-DB | Caltech101 | ImageNet-1K | FGVC-Aircraft | Flower102 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [57.7](./GPT4V_ZS_Results/dtd.json) | [46.8](./GPT4V_ZS_Results/eurosat.json) | [59.2](./GPT4V_ZS_Results/sun397.json) | [68.7](./GPT4V_ZS_Results/rafdb.json) | [93.7](./GPT4V_ZS_Results/caltech101.json) | [63.1](./GPT4V_ZS_Results/imagenet.json) | [56.6](./GPT4V_ZS_Results/aircraft.json) | [69.1](./GPT4V_ZS_Results/flower102.json) |
| [Label](./annotations/dtd_gt.json) | [Label](./annotations/eurosat_gt.json) | [Label](./annotations/sun397_gt.json) | [Label](./annotations/rafdb_gt.json) | [Label](./annotations/caltech101_gt.json) | [Label](./annotations/imagenet_gt.json) | [Label](./annotations/aircraft_gt.json) | [Label](./annotations/flower102_gt.json) |

| Stanford Cars | Food101| Oxford Pets | UCF-101 | HMDB-51 | Kinetics-400 | ModelNet-10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [62.7](./GPT4V_ZS_Results/car.json) | [86.2](./GPT4V_ZS_Results/food101.json) | [90.8](./GPT4V_ZS_Results/pets.json) | [83.7](./GPT4V_ZS_Results/ucf_8frame.json) | [58.8](./GPT4V_ZS_Results/hmdb_8frame.json) | [58.8](./GPT4V_ZS_Results/k400.json) | [66.9](./GPT4V_ZS_Results/modelnet10_front.json) |
| [Label](./annotations/stanford_cars_gt.json) | [Label](./annotations/food_gt.json) | [Label](./annotations/pets_gt.json) | [Label](./annotations/ucf_gt.json) | [Label](./annotations/hmdb_gt.json) | [Label](./annotations/k400_gt.json) | [Label](./annotations/modelnet10_gt.json) |

- With the provided prediction and annotation files, you can reproduce our top-1/top-5 accuracy results with the [calculate_acc.py](./calculate_acc.py) script.

```sh
# pred_json_path = 'GPT4V_ZS_Results/imagenet.json'
# gt_json_path = 'annotations/imagenet_gt.json'
python calculate_acc.py
```
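For intuition, a minimal top-1/top-5 computation might look like the sketch below. It assumes the prediction file maps each sample ID to a ranked list of labels and the ground-truth file maps each sample ID to a single label; the actual JSON schema is defined by [calculate_acc.py](./calculate_acc.py) and may differ.

```python
# Minimal sketch of a top-1/top-5 accuracy computation (hypothetical; the real JSON
# schema may differ -- see calculate_acc.py for the actual implementation).
import json

def top_k_accuracy(pred_json_path, gt_json_path, k=5):
    with open(pred_json_path) as f:
        preds = json.load(f)   # assumed: {sample_id: [ranked predicted labels]}
    with open(gt_json_path) as f:
        gts = json.load(f)     # assumed: {sample_id: ground-truth label}

    top1 = topk = 0
    for sample_id, gt_label in gts.items():
        ranked = preds.get(sample_id, [])
        if ranked and ranked[0] == gt_label:
            top1 += 1
        if gt_label in ranked[:k]:
            topk += 1
    n = len(gts)
    return top1 / n, topk / n

t1, t5 = top_k_accuracy("GPT4V_ZS_Results/imagenet.json", "annotations/imagenet_gt.json")
print(f"top-1: {t1:.3f}, top-5: {t5:.3f}")
```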

## Requirements
For guidance on setting up and running the GPT-4 API, we recommend checking out the official OpenAI Quickstart documentation available at: [OpenAI Quickstart Guide](https://platform.openai.com/docs/quickstart).
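Once your API key is configured (e.g., via `export OPENAI_API_KEY=...`), a minimal sanity check, assuming the OpenAI Python SDK v1 (`pip install openai`), could look like:

```python
# Minimal sanity check that the OpenAI API key is configured (assumes the OpenAI
# Python SDK v1; not part of this repo's scripts).
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```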


## 📌 BibTeX & Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.

```bibtex
@article{GPT4Vis,
  title={GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author={Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal={arXiv preprint arXiv:2311.15732},
  year={2023}
}
```


## 🎗️ Acknowledgement
This evaluation builds on the following excellent works:
- [CLIP](https://github.com/openai/CLIP): Learning Transferable Visual Models From Natural Language Supervision
- [GPT-4](https://platform.openai.com/docs/guides/vision)
- [Text4Vis](https://github.com/whwu95/Text4Vis): Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

We extend our sincere gratitude to these contributors.

## 👫 Contact
For any questions, please feel free to file an issue.