GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
- Host: GitHub
- URL: https://github.com/whwu95/GPT4Vis
- Owner: whwu95
- License: mit
- Created: 2023-11-17T10:21:31.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-22T04:58:13.000Z (over 1 year ago)
- Last Synced: 2024-11-06T02:38:07.671Z (12 months ago)
- Topics: gpt-4-vision-preview, point-cloud-classification, prompt-engineering, video-recognition, visual-recognition
- Language: Python
- Homepage: https://arxiv.org/abs/2311.15732
- Size: 16 MB
- Stars: 207
- Watchers: 11
- Forks: 26
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
If you like our project, please give us a star ⭐ on GitHub for the latest updates.
[arXiv Paper](https://arxiv.org/abs/2311.15732)
[Zhihu Article](https://zhuanlan.zhihu.com/p/669758735)
[Wenhao Wu](https://whwu95.github.io/)<sup>1,2</sup>, [Huanjin Yao](https://openreview.net/profile?id=~Huanjin_Yao1)<sup>2,3</sup>, [Mengxi Zhang](https://scholar.google.com/citations?user=73tAoEAAAAAJ&hl=en)<sup>2,4</sup>, [Yuxin Song](https://openreview.net/profile?id=~YuXin_Song1)<sup>2</sup>, [Wanli Ouyang](https://wlouyang.github.io/)<sup>5</sup>, [Jingdong Wang](https://jingdongwang2017.github.io/)<sup>2</sup>
<sup>1</sup>[The University of Sydney](https://www.sydney.edu.au/), <sup>2</sup>[Baidu](https://vis.baidu.com/#/), <sup>3</sup>[Tsinghua University](https://www.tsinghua.edu.cn/en/), <sup>4</sup>[Tianjin University](https://www.tju.edu.cn/english/index.htm/), <sup>5</sup>[The Chinese University of Hong Kong](https://www.cuhk.edu.hk/english/#)
***
This work delves into an essential and must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We center our study on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we have conducted experiments across three modalities (images, videos, and point clouds), spanning a total of 16 popular academic benchmarks.
📣 I also have other cross-modal projects that may interest you ✨.
> [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)
> Wenhao Wu, Zhun Sun, Wanli Ouyang
> [AAAI Paper](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [IJCV Paper](https://link.springer.com/article/10.1007/s11263-023-01876-w) [Code](https://github.com/whwu95/Text4Vis)
> [**Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models**](https://arxiv.org/abs/2301.00182)
> Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
> [CVPR 2023 Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [Code](https://github.com/whwu95/BIKE)
> [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)
> Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
> Accepted by CVPR 2023 as 🌟Highlight🌟 | [Paper](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [Code](https://github.com/whwu95/Cap4Video)
## News
- [x] **[Mar 7, 2024]**
Due to the recent removal of RPD (requests per day) limits on the GPT-4V API, we've updated our predictions for all datasets using standard single-sample testing (one sample per request). Check out the [**GPT4V Results**](./GPT4V_ZS_Results), [**Ground Truth**](./annotations), and [**Datasets**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) we've shared for you! **As a heads-up, 😭 running all tests once costs around 💰$4,000+💰.**
- [x] **[Nov 28, 2023]** We released our [report](https://arxiv.org/abs/2311.15732) on arXiv.
- [x] **[Nov 27, 2023]** Our prompts have been released. Thanks for your star 😝.
## Overview
An overview of 16 evaluated popular benchmark datasets, comprising images, videos, and point clouds.

Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.
## Generated Descriptions from GPT-4
- We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the [**GPT4_generated_prompts**](./GPT4_generated_prompts) folder. Enjoy exploring!
- We've also provided an example script to help you generate descriptions using GPT-4; for guidance, please refer to the [generate_prompt.py](./generate_prompt.py) file. Happy coding! See the [**config**](./config) folder for detailed information on all datasets used in our project.
- Execute the following command to generate descriptions with GPT-4 (a rough sketch of the underlying API call is shown after the command below).
```sh
# To run the script for a specific dataset, simply update the following line with the name of the dataset you're working with:
# dataset_name = ["Dataset Name Here"] # e.g., dtd
python generate_prompt.py
```
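To give a rough sense of what such a script does, here is a minimal sketch of asking GPT-4 for per-category descriptions via the OpenAI Python SDK. It is not the repository's `generate_prompt.py`: the model name, prompt wording, class list, and output path are illustrative placeholders.

```python
# Minimal sketch (not the repository's generate_prompt.py): ask GPT-4 for
# descriptive sentences per category with the OpenAI Python SDK (>= 1.0).
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dataset_name = "dtd"                              # placeholder dataset name
classnames = ["banded", "blotchy", "braided"]     # placeholder class list

descriptions = {}
for name in classnames:
    # The prompt wording here is illustrative; the paper's template may differ.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Describe in a few short sentences the visual "
                       f"appearance of a '{name}' texture.",
        }],
        temperature=0.7,
    )
    descriptions[name] = resp.choices[0].message.content

with open(f"{dataset_name}_descriptions.json", "w") as f:
    json.dump(descriptions, f, indent=2)
```

For the exact prompt templates and per-dataset settings used in the paper, rely on the shipped [generate_prompt.py](./generate_prompt.py) and the [config](./config) folder rather than this sketch.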
## GPT-4V(ision) for Visual Recognition
- We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the [GPT4V_ZS.py](./GPT4V_ZS.py) file for a step-by-step guide on implementing this. We hope it helps you get started with ease! (An illustrative sketch of the core API call follows the command below.)
```sh
# GPT4V zero-shot recognition script.
# dataset_name = ["Dataset Name Here"] # e.g., dtd
python GPT4V_ZS.py
```
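At its core, such a pipeline issues one multimodal chat request per test sample. The snippet below is a hedged sketch, not the repository's `GPT4V_ZS.py`: the prompt wording, category list, image path, and response parsing are assumptions.

```python
# Minimal sketch (not the repository's GPT4V_ZS.py): zero-shot recognition of
# one image with the GPT-4V API. Prompt, classes, and parsing are placeholders.
import base64

from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Return the image at `path` as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

classnames = ["banded", "blotchy", "braided"]   # placeholder DTD-style classes
image_b64 = encode_image("example.jpg")         # placeholder test image

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify the image. Candidate categories: "
                     + ", ".join(classnames)
                     + ". Answer with only the most likely category name."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=20,
)
print(resp.choices[0].message.content)  # e.g., "banded"
```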
- All results are available in the [**GPT4V_ZS_Results**](./GPT4V_ZS_Results) folder! In addition, we've provided the [**Datasets link**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) along with the corresponding ground truths ([**annotations**](./annotations) folder) to help readers replicate the results. *Note: For certain datasets, we may have removed prefixes from the sample IDs. For instance, in the case of ImageNet, "ILSVRC2012_val_00031094.JPEG" was modified to "00031094.JPEG".*
| DTD | EuroSAT | SUN397 | RAF-DB | Caltech101 | ImageNet-1K | FGVC-Aircraft | Flower102 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [57.7](./GPT4V_ZS_Results/dtd.json) | [46.8](./GPT4V_ZS_Results/eurosat.json) | [59.2](./GPT4V_ZS_Results/sun397.json) | [68.7](./GPT4V_ZS_Results/rafdb.json) | [93.7](./GPT4V_ZS_Results/caltech101.json) | [63.1](./GPT4V_ZS_Results/imagenet.json) | [56.6](./GPT4V_ZS_Results/aircraft.json) | [69.1](./GPT4V_ZS_Results/flower102.json) |
| [Label](./annotations/dtd_gt.json) | [Label](./annotations/eurosat_gt.json) | [Label](./annotations/sun397_gt.json) | [Label](./annotations/rafdb_gt.json) | [Label](./annotations/caltech101_gt.json) | [Label](./annotations/imagenet_gt.json) | [Label](./annotations/aircraft_gt.json) | [Label](./annotations/flower102_gt.json) |
| Stanford Cars | Food101| Oxford Pets | UCF-101 | HMDB-51 | Kinetics-400 | ModelNet-10 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [62.7](./GPT4V_ZS_Results/car.json) | [86.2](./GPT4V_ZS_Results/food101.json) | [90.8](./GPT4V_ZS_Results/pets.json) | [83.7](./GPT4V_ZS_Results/ucf_8frame.json) | [58.8](./GPT4V_ZS_Results/hmdb_8frame.json) | [58.8](./GPT4V_ZS_Results/k400.json) | [66.9](./GPT4V_ZS_Results/modelnet10_front.json) |
| [Label](./annotations/stanford_cars_gt.json) | [Label](./annotations/food_gt.json) | [Label](./annotations/pets_gt.json) | [Label](./annotations/ucf_gt.json) | [Label](./annotations/hmdb_gt.json) | [Label](./annotations/k400_gt.json) | [Label](./annotations/modelnet10_gt.json) |
- With the provided prediction and annotation files, you can reproduce our top-1/top-5 accuracy results with the [calculate_acc.py](./calculate_acc.py) script (a rough sketch of the metric computation follows the command below).
```sh
# Set the prediction and ground-truth paths before running, e.g.:
# pred_json_path = 'GPT4V_ZS_Results/imagenet.json'
# gt_json_path = 'annotations/imagenet_gt.json'
python calculate_acc.py
```
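For orientation, the sketch below computes top-1/top-5 accuracy from two JSON files. It is not the repository's `calculate_acc.py` and assumes a simple schema (predictions map a sample ID to a ranked list of class names, ground truth maps a sample ID to one class name), which may differ from the released files.

```python
# Minimal accuracy sketch (not the repository's calculate_acc.py).
# Assumed schema: predictions map sample_id -> ranked list of class names,
# ground truth maps sample_id -> a single class name.
import json

pred_json_path = "GPT4V_ZS_Results/imagenet.json"
gt_json_path = "annotations/imagenet_gt.json"

with open(pred_json_path) as f:
    preds = json.load(f)
with open(gt_json_path) as f:
    gts = json.load(f)

top1 = top5 = total = 0
for sample_id, label in gts.items():
    ranked = preds.get(sample_id, [])
    if isinstance(ranked, str):      # tolerate single-string predictions
        ranked = [ranked]
    total += 1
    top1 += int(bool(ranked) and ranked[0] == label)
    top5 += int(label in ranked[:5])

print(f"Top-1: {100 * top1 / total:.1f}%   Top-5: {100 * top5 / total:.1f}%")
```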
## Requirements
For guidance on setting up and running the GPT-4 API, we recommend checking out the official OpenAI Quickstart documentation available at: [OpenAI Quickstart Guide](https://platform.openai.com/docs/quickstart).
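As a quick sanity check that your API key and the `openai` Python package are configured (a generic example, not part of this repository), you can run something like:

```python
# Quick check that the OpenAI SDK and API key are configured correctly.
import os

from openai import OpenAI

assert os.environ.get("OPENAI_API_KEY"), "Set the OPENAI_API_KEY environment variable first."

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```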
## Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.
```bibtex
@article{GPT4Vis,
  title   = {GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author  = {Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal = {arXiv preprint arXiv:2311.15732},
  year    = {2023}
}
```
## 🎗️ Acknowledgement
This evaluation builds on the following excellent works:
- [CLIP](https://github.com/openai/CLIP): Learning Transferable Visual Models From Natural Language Supervision
- [GPT-4](https://platform.openai.com/docs/guides/vision)
- [Text4Vis](https://github.com/whwu95/Text4Vis): Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
We extend our sincere gratitude to these contributors.
## 👫 Contact
For any questions, please feel free to file an issue.