{"id":13584959,"url":"https://github.com/whwu95/GPT4Vis","last_synced_at":"2025-04-07T06:32:08.051Z","repository":{"id":209487603,"uuid":"719983011","full_name":"whwu95/GPT4Vis","owner":"whwu95","description":"GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?","archived":false,"fork":false,"pushed_at":"2024-05-22T04:58:13.000Z","size":16805,"stargazers_count":207,"open_issues_count":3,"forks_count":26,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-11-06T02:38:07.671Z","etag":null,"topics":["gpt-4-vision-preview","point-cloud-classification","prompt-engineering","video-recognition","visual-recognition"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2311.15732","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/whwu95.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-17T10:21:31.000Z","updated_at":"2024-10-28T02:14:31.000Z","dependencies_parsed_at":"2024-11-06T03:03:04.856Z","dependency_job_id":"95b8f271-fb5e-4076-9056-8c00984f8919","html_url":"https://github.com/whwu95/GPT4Vis","commit_stats":null,"previous_names":["whwu95/gpt4vis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whwu95%2FGPT4Vis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whwu95%2FGPT4Vis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whwu95%2FGPT4Vis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/whwu95%2FGPT4V
is/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/whwu95","download_url":"https://codeload.github.com/whwu95/GPT4Vis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247607203,"owners_count":20965937,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt-4-vision-preview","point-cloud-classification","prompt-engineering","video-recognition","visual-recognition"],"created_at":"2024-08-01T15:04:37.667Z","updated_at":"2025-04-07T06:32:08.044Z","avatar_url":"https://github.com/whwu95.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\n\n\u003ch2 align=\"center\"\u003e \u003ca href=\"https://arxiv.org/abs/2311.15732\"\u003eGPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?\u003c/a\u003e\u003c/h2\u003e\n\u003ch5 align=\"center\"\u003e If you like our project, please give us a star ⭐ on GitHub for the latest updates.  
\u003c/h5\u003e\n\n\n[![arXiv](https://img.shields.io/badge/Arxiv-2311.15732-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.15732) \n[![zhihu](https://img.shields.io/badge/-知乎-000000?logo=zhihu\u0026logoColor=0084FF)](https://zhuanlan.zhihu.com/p/669758735)\n\n\n[Wenhao Wu](https://whwu95.github.io/)\u003csup\u003e1,2\u003c/sup\u003e, [Huanjin Yao](https://openreview.net/profile?id=~Huanjin_Yao1)\u003csup\u003e2,3\u003c/sup\u003e, [Mengxi Zhang](https://scholar.google.com/citations?user=73tAoEAAAAAJ\u0026hl=en)\u003csup\u003e2,4\u003c/sup\u003e, [Yuxin Song](https://openreview.net/profile?id=~YuXin_Song1)\u003csup\u003e2\u003c/sup\u003e, [Wanli Ouyang](https://wlouyang.github.io/)\u003csup\u003e5\u003c/sup\u003e, [Jingdong Wang](https://jingdongwang2017.github.io/)\u003csup\u003e2\u003c/sup\u003e\n\n \n\u003csup\u003e1\u003c/sup\u003e[The University of Sydney](https://www.sydney.edu.au/), \u003csup\u003e2\u003c/sup\u003e[Baidu](https://vis.baidu.com/#/), \u003csup\u003e3\u003c/sup\u003e[Tsinghua University](https://www.tsinghua.edu.cn/en/), \u003csup\u003e4\u003c/sup\u003e[Tianjin University](https://www.tju.edu.cn/english/index.htm/), \u003csup\u003e5\u003c/sup\u003e[The Chinese University of Hong Kong](https://www.cuhk.edu.hk/english/#)\n\n\n\u003c/div\u003e\n\n***\nThis work examines an essential, must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We center on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we have conducted experiments across three modalities (images, videos, and point clouds), spanning a total of 16 popular academic benchmarks. \n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/method.png\" width=\"800\" /\u003e\n\u003c/div\u003e\n\n\u003cdetails open\u003e\u003csummary\u003e📣 I also have other cross-modal projects that may interest you ✨. 
\u003c/summary\u003e\u003cp\u003e\n\n\n\u003e [**Revisiting Classifier: Transferring Vision-Language Models for Video Recognition**](https://arxiv.org/abs/2207.01297)\u003cbr\u003e\n\u003e Wenhao Wu, Zhun Sun, Wanli Ouyang \u003cbr\u003e\n\u003e [![Conference](http://img.shields.io/badge/AAAI-2023-f9f107.svg)](https://ojs.aaai.org/index.php/AAAI/article/view/25386/25158) [![Journal](http://img.shields.io/badge/IJCV-2023-Bf107.svg)](https://link.springer.com/article/10.1007/s11263-023-01876-w) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Text4Vis) \n\n\n\u003e [**Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models**](https://arxiv.org/abs/2301.00182)\u003cbr\u003e\n\u003e Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang \u003cbr\u003e\n\u003e [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/BIKE) \n\n\n\u003e [**Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?**](https://arxiv.org/abs/2301.00184)\u003cbr\u003e\n\u003e Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang \u003cbr\u003e\n\u003e Accepted by CVPR 2023 as 🌟Highlight🌟 | [![Conference](http://img.shields.io/badge/CVPR-2023-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.html) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/whwu95/Cap4Video)\u003cbr\u003e\n\n\n\u003c/p\u003e\u003c/details\u003e\n\n\n\n\n## News\n\u003c!-- - [x] **[Mar 10, 2024]** We have updated all results in our [report](https://arxiv.org/abs/2311.15732). 
For accurate predictions, we strongly recommend using single testing with GPT-4V and have accordingly eliminated scripts related to batch testing. --\u003e\n- [x] **[Mar 7, 2024]** \nDue to the recent removal of RPD (request per day) limits on the GPT-4V API, we've updated our predictions for all datasets using standard single testing (one sample per request). Check out the [**GPT4V Results**](./GPT4V_ZS_Results), [**Ground Truth**](./annotations) and [**Datasets**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) we've shared for you! **As a heads-up, 😭running all tests once costs around 💰$4000+💰.**\n- [x] **[Nov 28, 2023]** We released our [report](https://arxiv.org/abs/2311.15732) on arXiv.\n- [x] **[Nov 27, 2023]** Our prompts have been released. Thanks for your star 😝.\n\n\n## Overview\n\n\u003c!-- \u003ch3 style=\"text-align: center;\"\u003eZero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.\u003c/h3\u003e --\u003e\n\n\n\n\n\u003cdiv align=\"center\"\u003e\nAn overview of 16 evaluated popular benchmark datasets, comprising images, videos, and point clouds.\n\n\u003cimg src=\"docs/datasets.png\" width=\"800\"\u003e\n\nZero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.\n\u003cimg src=\"docs/results.jpg\" width=\"800\"\u003e\n\u003c/div\u003e\n\n\n## Generated Descriptions from GPT-4\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/generated_sentences.png\" width=\"800\" /\u003e\n\u003c/div\u003e\n\n- We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the [**GPT4_generated_prompts**](./GPT4_generated_prompts) folder. Enjoy exploring!\n\n- We've also provided an example script to help you generate descriptions using GPT-4. For guidance on this, please refer to the [generate_prompt.py](./generate_prompt.py) file. Happy coding! 
Please refer to the [**config**](./config) folder for detailed information on all datasets used in our project. \n- Execute the following command to generate descriptions with GPT-4.\n  ```sh\n  # To run the script for a specific dataset, simply update the following line with the name of the dataset you're working with: \n  # dataset_name = [\"Dataset Name Here\"]   # e.g., dtd\n  python generate_prompt.py\n  ```\n\n## GPT-4V(ision) for Visual Recognition\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/gpt4v_prompt.png\" width=\"800\" /\u003e\n\u003c/div\u003e\n\n- We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the [GPT4V_ZS.py](./GPT4V_ZS.py) file for a step-by-step guide on implementing this. We hope it helps you get started with ease!\n\n  ```sh\n  # GPT4V zero-shot recognition script. \n  # dataset_name = [\"Dataset Name Here\"]   # e.g., dtd\n  python GPT4V_ZS.py\n  ```\n\n- All results are available in the [**GPT4V_ZS_Results**](./GPT4V_ZS_Results) folder! In addition, we've provided the [**Datasets link**](https://unisyd-my.sharepoint.com/:f:/g/personal/wenhao_wu_sydney_edu_au/EmoNoASH2b1JqQXb14fx0tMBkj4VU3nOUrKyt9ZT1aIw2Q?e=jNL0CL) along with their corresponding ground truths ([**annotations**](./annotations) folder) to help readers replicate the results. *Note: For certain datasets, we may have removed prefixes from the sample IDs. 
For instance, in the case of ImageNet, \"ILSVRC2012_val_00031094.JPEG\" was modified to \"00031094.JPEG\".*\n\n\u003cdiv align=\"center\"\u003e\n\n| DTD |  EuroSAT |  SUN397 |  RAF-DB |  Caltech101  | ImageNet-1K | FGVC-Aircraft | Flower102 |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| [57.7](./GPT4V_ZS_Results/dtd.json)  | [46.8](./GPT4V_ZS_Results/eurosat.json) |  [59.2](./GPT4V_ZS_Results/sun397.json) |  [68.7](./GPT4V_ZS_Results/rafdb.json) | [93.7](./GPT4V_ZS_Results/caltech101.json)  |  [63.1](./GPT4V_ZS_Results/imagenet.json) | [56.6](./GPT4V_ZS_Results/aircraft.json) |  [69.1](./GPT4V_ZS_Results/flower102.json) | \n|  [Label](./annotations/dtd_gt.json)  |  [Label](./annotations/eurosat_gt.json)   | [Label](./annotations/sun397_gt.json)    | [Label](./annotations/rafdb_gt.json)    |  [Label](./annotations/caltech101_gt.json)  | [Label](./annotations/imagenet_gt.json)  |  [Label](./annotations/aircraft_gt.json)   |  [Label](./annotations/flower102_gt.json)    |\n   \n\n\n| Stanford Cars | Food101| Oxford Pets | UCF-101 | HMDB-51 | Kinetics-400 | ModelNet-10 |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| [62.7](./GPT4V_ZS_Results/car.json)  |  [86.2](./GPT4V_ZS_Results/food101.json) | [90.8](./GPT4V_ZS_Results/pets.json) | [83.7](./GPT4V_ZS_Results/ucf_8frame.json) | [58.8](./GPT4V_ZS_Results/hmdb_8frame.json) | [58.8](./GPT4V_ZS_Results/k400.json) | [66.9](./GPT4V_ZS_Results/modelnet10_front.json) |\n|  [Label](./annotations/stanford_cars_gt.json)   |  [Label](./annotations/food_gt.json)   | [Label](./annotations/pets_gt.json)    | [Label](./annotations/ucf_gt.json)   |  [Label](./annotations/hmdb_gt.json)  |  [Label](./annotations/k400_gt.json)  |  [Label](./annotations/modelnet10_gt.json)   |\n\n\u003c/div\u003e\n\n\n- With the provided prediction and annotation files, you can reproduce our top-1/top-5 accuracy results with the [calculate_acc.py](./calculate_acc.py) script.\n\n  ```sh\n  # pred_json_path = 'GPT4V_ZS_Results/imagenet.json'\n  # 
gt_json_path = 'annotations/imagenet_gt.json'\n  python calculate_acc.py\n  ```\n\n## Requirement\nFor guidance on setting up and running the GPT-4 API, we recommend checking out the official OpenAI Quickstart documentation available at: [OpenAI Quickstart Guide](https://platform.openai.com/docs/quickstart).\n\n\n\n\n\n\u003ca name=\"bibtex\"\u003e\u003c/a\u003e\n## 📌 BibTeX \u0026 Citation\n\nIf you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.\n\n```bibtex\n@article{GPT4Vis,\n  title={GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},\n  author={Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},\n  journal={arXiv preprint arXiv:2311.15732},\n  year={2023}\n}\n```\n\n\u003ca name=\"acknowledgment\"\u003e\u003c/a\u003e\n## 🎗️ Acknowledgement\nThis evaluation builds on the following excellent works:\n- [CLIP](https://github.com/openai/CLIP): Learning Transferable Visual Models From Natural Language Supervision\n- [GPT-4](https://platform.openai.com/docs/guides/vision)\n- [Text4Vis](https://github.com/whwu95/Text4Vis): Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective\n  \nWe extend our sincere gratitude to these contributors.\n\n\n\n## 👫 Contact\nFor any questions, please feel free to file an issue.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhwu95%2FGPT4Vis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwhwu95%2FGPT4Vis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhwu95%2FGPT4Vis/lists"}