GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
- Host: GitHub
- URL: https://github.com/whwu95/GPT4Vis
- Owner: whwu95
- License: mit
- Created: 2023-11-17T10:21:31.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-22T04:58:13.000Z (over 1 year ago)
- Last Synced: 2024-11-06T02:38:07.671Z (12 months ago)
- Topics: gpt-4-vision-preview, point-cloud-classification, prompt-engineering, video-recognition, visual-recognition
- Language: Python
- Homepage: https://arxiv.org/abs/2311.15732
- Size: 16 MB
- Stars: 207
- Watchers: 11
- Forks: 26
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
[arXiv Paper](https://arxiv.org/abs/2311.15732)
[Zhihu Article](https://zhuanlan.zhihu.com/p/669758735)
[Wenhao Wu](https://whwu95.github.io/)<sup>1,2</sup>, [Huanjin Yao](https://openreview.net/profile?id=~Huanjin_Yao1)<sup>2,3</sup>, [Mengxi Zhang](https://scholar.google.com/citations?user=73tAoEAAAAAJ&hl=en)<sup>2,4</sup>, [Yuxin Song](https://openreview.net/profile?id=~YuXin_Song1)<sup>2</sup>, [Wanli Ouyang](https://wlouyang.github.io/)<sup>5</sup>, [Jingdong Wang](https://jingdongwang2017.github.io/)<sup>2</sup>
<sup>1</sup>[The University of Sydney](https://www.sydney.edu.au/), <sup>2</sup>[Baidu](https://vis.baidu.com/#/), <sup>3</sup>[Tsinghua University](https://www.tsinghua.edu.cn/en/), <sup>4</sup>[Tianjin University](https://www.tju.edu.cn/english/index.htm/), <sup>5</sup>[The Chinese University of Hong Kong](https://www.cuhk.edu.hk/english/#)
***
This work delves into an essential and must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We center our study on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we have conducted experiments across three modalities (images, videos, and point clouds), spanning a total of 16 popular academic benchmarks.
📣 I also have other cross-modal projects that may interest you ✨.
> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)
> Wenhao Wu, Zhun Sun, Wanli Ouyang
> [AAAI Paper](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [IJCV Paper](https://link.springer.com/article/10.1007/s11263-023-01876-w) [Code](https://github.com/whwu95/Text4Vis)
> [**Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models**](https://arxiv.org/abs/2301.00182)
> Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
> [CVPR 2023 Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [Code](https://github.com/whwu95/BIKE)
> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)
> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
> Accepted by CVPR 2023 as 🌟Highlight🌟 | [Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [Code](https://github.com/whwu95/Cap4Video)
## News
- [x] **[Mar 7, 2024]**
Due to the recent removal of RPD (requests per day) limits on the GPT-4V API, we've updated our predictions for all datasets using standard single-sample testing (one sample per request). Check out the [**GPT4V Results**](./GPT4V_ZS_Results), [**Ground Truth**](./annotations), and [**Datasets**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) we've shared for you! **As a heads-up, 😭 running all tests once costs around 💰$4,000+💰.**
- [x] **[Nov 28, 2023]** We released our [report](https://arxiv.org/abs/2311.15732) on arXiv.
- [x] **[Nov 27, 2023]** Our prompts have been released. Thanks for your star 😝.
## Overview
An overview of 16 evaluated popular benchmark datasets, comprising images, videos, and point clouds.

Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.
## Generated Descriptions from GPT-4
- We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the [**GPT4_generated_prompts**](./GPT4_generated_prompts) folder. Enjoy exploring!
- We've also provided an example script to help you generate descriptions using GPT-4; for guidance, please refer to the [generate_prompt.py](./generate_prompt.py) file. Happy coding! See the [**config**](./config) folder for detailed information on all datasets used in our project.
- Execute the following command to generate descriptions with GPT-4 (a rough sketch of the underlying API call is shown after the command below).
```sh
# To run the script for a specific dataset, simply update the following line with the name of the dataset you're working with:
# dataset_name = ["Dataset Name Here"] # e.g., dtd
python generate_prompt.py
```
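To give a rough sense of what such a script does, here is a minimal sketch of asking GPT-4 for per-category descriptions via the OpenAI Python SDK. It is not the repository's `generate_prompt.py`: the model name, prompt wording, class list, and output path are illustrative placeholders.

```python
# Minimal sketch (not the repository's generate_prompt.py): ask GPT-4 for
# descriptive sentences per category with the OpenAI Python SDK (>= 1.0).
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dataset_name = "dtd"                              # placeholder dataset name
classnames = ["banded", "blotchy", "braided"]     # placeholder class list

descriptions = {}
for name in classnames:
    # The prompt wording here is illustrative; the paper's template may differ.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Describe in a few short sentences the visual "
                       f"appearance of a '{name}' texture.",
        }],
        temperature=0.7,
    )
    descriptions[name] = resp.choices[0].message.content

with open(f"{dataset_name}_descriptions.json", "w") as f:
    json.dump(descriptions, f, indent=2)
```

For the exact prompt templates and per-dataset settings used in the paper, rely on the shipped [generate_prompt.py](./generate_prompt.py) and the [config](./config) folder rather than this sketch.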
## GPT-4V(ision) for Visual Recognition
- We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the [GPT4V_ZS.py](./GPT4V_ZS.py) file for a step-by-step guide on implementing this. We hope it helps you get started with ease! (An illustrative sketch of the core API call follows the command below.)
```sh
# GPT4V zero-shot recognition script.
# dataset_name = ["Dataset Name Here"] # e.g., dtd
python GPT4V_ZS.py
```
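At its core, such a pipeline issues one multimodal chat request per test sample. The snippet below is a hedged sketch, not the repository's `GPT4V_ZS.py`: the prompt wording, category list, image path, and response parsing are assumptions.

```python
# Minimal sketch (not the repository's GPT4V_ZS.py): zero-shot recognition of
# one image with the GPT-4V API. Prompt, classes, and parsing are placeholders.
import base64

from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Return the image at `path` as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

classnames = ["banded", "blotchy", "braided"]   # placeholder DTD-style classes
image_b64 = encode_image("example.jpg")         # placeholder test image

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify the image. Candidate categories: "
                     + ", ".join(classnames)
                     + ". Answer with only the most likely category name."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=20,
)
print(resp.choices[0].message.content)  # e.g., "banded"
```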
- All results are available in the [**GPT4V_ZS_Results**](./GPT4V_ZS_Results) folder! In addition, we've provided the [**Datasets link**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) along with the corresponding ground truths ([**annotations**](./annotations) folder) to help readers replicate the results. *Note: For certain datasets, we may have removed prefixes from the sample IDs. For instance, in the case of ImageNet, "ILSVRC2012_val_00031094.JPEG" was modified to "00031094.JPEG".*
| DTD | EuroSAT | SUN397 | RAF-DB | Caltech101 | ImageNet-1K | FGVC-Aircraft | Flower102 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [57.7](./GPT4V_ZS_Results/dtd.json) | [46.8](./GPT4V_ZS_Results/eurosat.json) | [59.2](./GPT4V_ZS_Results/sun397.json) | [68.7](./GPT4V_ZS_Results/rafdb.json) | [93.7](./GPT4V_ZS_Results/caltech101.json) | [63.1](./GPT4V_ZS_Results/imagenet.json) | [56.6](./GPT4V_ZS_Results/aircraft.json) | [69.1](./GPT4V_ZS_Results/flower102.json) |
| [Label](./annotations/dtd_gt.json) | [Label](./annotations/eurosat_gt.json) | [Label](./annotations/sun397_gt.json) | [Label](./annotations/rafdb_gt.json) | [Label](./annotations/caltech101_gt.json) | [Label](./annotations/imagenet_gt.json) | [Label](./annotations/aircraft_gt.json) | [Label](./annotations/flower102_gt.json) |
| Stanford Cars | Food101| Oxford Pets | UCF-101 | HMDB-51 | Kinetics-400 | ModelNet-10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [62.7](./GPT4V_ZS_Results/car.json) | [86.2](./GPT4V_ZS_Results/food101.json) | [90.8](./GPT4V_ZS_Results/pets.json) | [83.7](./GPT4V_ZS_Results/ucf_8frame.json) | [58.8](./GPT4V_ZS_Results/hmdb_8frame.json) | [58.8](./GPT4V_ZS_Results/k400.json) | [66.9](./GPT4V_ZS_Results/modelnet10_front.json) |
| [Label](./annotations/stanford_cars_gt.json) | [Label](./annotations/food_gt.json) | [Label](./annotations/pets_gt.json) | [Label](./annotations/ucf_gt.json) | [Label](./annotations/hmdb_gt.json) | [Label](./annotations/k400_gt.json) | [Label](./annotations/modelnet10_gt.json) |
- With the provided prediction and annotation files, you can reproduce our top-1/top-5 accuracy results with the [calculate_acc.py](./calculate_acc.py) script (a rough sketch of the metric computation follows the command below).
```sh
# Set the prediction and ground-truth paths before running, e.g.:
# pred_json_path = 'GPT4V_ZS_Results/imagenet.json'
# gt_json_path = 'annotations/imagenet_gt.json'
python calculate_acc.py
```
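For orientation, the sketch below computes top-1/top-5 accuracy from two JSON files. It is not the repository's `calculate_acc.py` and assumes a simple schema (predictions map a sample ID to a ranked list of class names, ground truth maps a sample ID to one class name), which may differ from the released files.

```python
# Minimal accuracy sketch (not the repository's calculate_acc.py).
# Assumed schema: predictions map sample_id -> ranked list of class names,
# ground truth maps sample_id -> a single class name.
import json

pred_json_path = "GPT4V_ZS_Results/imagenet.json"
gt_json_path = "annotations/imagenet_gt.json"

with open(pred_json_path) as f:
    preds = json.load(f)
with open(gt_json_path) as f:
    gts = json.load(f)

top1 = top5 = total = 0
for sample_id, label in gts.items():
    ranked = preds.get(sample_id, [])
    if isinstance(ranked, str):      # tolerate single-string predictions
        ranked = [ranked]
    total += 1
    top1 += int(bool(ranked) and ranked[0] == label)
    top5 += int(label in ranked[:5])

print(f"Top-1: {100 * top1 / total:.1f}%   Top-5: {100 * top5 / total:.1f}%")
```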
## Requirements
For guidance on setting up and running the GPT-4 API, we recommend checking out the official OpenAI Quickstart documentation available at: [OpenAI Quickstart Guide](https://platform.openai.com/docs/quickstart).
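As a quick sanity check that your API key and the `openai` Python package are configured (a generic example, not part of this repository), you can run something like:

```python
# Quick check that the OpenAI SDK and API key are configured correctly.
import os

from openai import OpenAI

assert os.environ.get("OPENAI_API_KEY"), "Set the OPENAI_API_KEY environment variable first."

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```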
## Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.
```bibtex
@article{GPT4Vis,
  title   = {GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author  = {Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal = {arXiv preprint arXiv:2311.15732},
  year    = {2023}
}
```
## 🎗️ Acknowledgement
This evaluation builds on the following excellent works:
- [CLIP](https://github.com/openai/CLIP): Learning Transferable Visual Models From Natural Language Supervision
- [GPT-4](https://platform.openai.com/docs/guides/vision)
- [Text4Vis](https://github.com/whwu95/Text4Vis): Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
We extend our sincere gratitude to these contributors.
## 👫 Contact
For any questions, please feel free to file an issue.