{"id":13517998,"url":"https://github.com/ttengwang/Awesome_Prompting_Papers_in_Computer_Vision","last_synced_at":"2025-03-31T09:31:02.037Z","repository":{"id":37459049,"uuid":"448496515","full_name":"ttengwang/Awesome_Prompting_Papers_in_Computer_Vision","owner":"ttengwang","description":"A curated list of prompt-based paper in computer vision and vision-language learning.","archived":false,"fork":false,"pushed_at":"2023-12-18T05:42:47.000Z","size":74,"stargazers_count":862,"open_issues_count":4,"forks_count":68,"subscribers_count":37,"default_branch":"main","last_synced_at":"2024-05-19T21:38:13.289Z","etag":null,"topics":["adapter","few-shot-learning","parameter-efficient-tuning","prompt-learning","prompt-tuning","visual-prompt","zero-shot-learning"],"latest_commit_sha":null,"homepage":"https://visualprompting.github.io/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ttengwang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-01-16T08:22:26.000Z","updated_at":"2024-05-17T09:29:56.000Z","dependencies_parsed_at":"2023-02-09T10:16:10.582Z","dependency_job_id":"76677d4f-1de4-4236-8951-d5e08e30d159","html_url":"https://github.com/ttengwang/Awesome_Prompting_Papers_in_Computer_Vision","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ttengwang","download_url":"https://codeload.github.com/ttengwang/Awesome_Prompting_Papers_in_Computer_Vision/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246365641,"owners_count":20765546,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adapter","few-shot-learning","parameter-efficient-tuning","prompt-learning","prompt-tuning","visual-prompt","zero-shot-learning"],"created_at":"2024-08-01T05:01:39.567Z","updated_at":"2025-03-31T09:31:02.015Z","avatar_url":"https://github.com/ttengwang.png","language":null,"funding_links":[],"categories":["Technical","Computer Vision","Other Lists"],"sub_categories":["awesome-*","TeX Lists"],"readme":"# Awesome Prompting Papers in Computer Vision\nA curated list of prompt-based papers in computer vision and vision-language learning. 
\n\n\n- [Awesome Prompting Papers in Computer Vision](#awesome-prompting-papers-in-computer-vision)\n    - [Keywords](#keywords)\n  - [Vision Prompt](#vision-prompt)\n  - [Vision-Language Prompt](#vision-language-prompt)\n    - [Language-Interactable Prompt](#language-interactable-prompt)\n    - [Vision-Language Instruction Tuning](#vision-language-instruction-tuning)\n  - [More Resources](#more-resources)\n\n### Keywords\n* Task tag, e.g., ![](https://img.shields.io/badge/object--detection-759CBC?style=flat-square) ![](https://img.shields.io/badge/VQA-759CBC?style=flat-square)\n* Abbreviation tag, e.g., ![](https://img.shields.io/badge/CLIP-CD6155?style=flat-square)\n* Characteristic tag: a characteristic that makes the paper unique, e.g., ![](https://img.shields.io/badge/NAS-BC9575?style=flat-square) ![](https://img.shields.io/badge/unsupervised-BC9575?style=flat-square)\n* **Bold font**: We highlight pilot work that may have contributed to the prevalence of visual prompting.\n\n\n## Vision Prompt\nThis section collects papers that prompt pretrained vision foundation models (e.g., ViT) for parameter-efficient adaptation.\n\n- **Learning to Prompt for Continual Learning** [[paper]](https://arxiv.org/abs/2112.08654) [[code]](https://github.com/google-research/l2p)\n\n  `CVPR 2022`  ![](https://img.shields.io/badge/continual--learning-759CBC?style=flat-square)\n\n- **Visual Prompt Tuning** [[paper]](https://arxiv.org/pdf/2203.12119.pdf) [[code]](https://github.com/KMnP/vpt)\n   \n  `ECCV 2022` ![](https://img.shields.io/badge/VPT-CD6155?style=flat-square)  \n \n- DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning [[paper]](https://arxiv.org/pdf/2204.04799.pdf) [[code]](https://github.com/google-research/l2p)\n\n  `ECCV 2022` ![](https://img.shields.io/badge/continual--learning-759CBC?style=flat-square)\n  \n- AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [[paper]](https://arxiv.org/abs/2205.13535) 
[[code]](https://github.com/ShoufaChen/AdaptFormer)\n  \n  `NeurIPS 2022`  ![](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square)\n\n- Scaling \u0026 Shifting Your Features: A New Baseline for Efficient Model Tuning [[paper]](https://arxiv.org/abs/2210.08823) [[code]](https://github.com/dongzelian/SSF)\n\n  `NeurIPS 2022` \n\n- P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting [[paper]](https://arxiv.org/abs/2208.02812) [[code]](https://github.com/wangzy22/P2P)\n\n  `NeurIPS 2022` ![](https://img.shields.io/badge/3D--point--cloud--tasks-759CBC?) \n\n- Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models [[paper]](https://arxiv.org/abs/2209.06970) [[code]](https://github.com/ChenWu98/Generative-Visual-Prompt)\n\n  `NeurIPS 2022` ![](https://img.shields.io/badge/image--generation-759CBC?) \n\n- Visual Prompting via Image Inpainting [[paper]]() [[code]](https://github.com/amirbar/visual_prompting)\n\n  `NeurIPS 2022` ![](https://img.shields.io/badge/visual--in--context--learning-759CBC?) ![](https://img.shields.io/badge/image--generation-759CBC?) \n\n- Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation [[paper]](https://arxiv.org/abs/2212.04145)\n\n  `AAAI 2023` \n\n- LPT: Long-tailed Prompt Tuning for Image Classification [[paper]](https://openreview.net/forum?id=8pOVAeo8ie) \n\n  `ICLR 2023` \n\n- Diversity-Aware Meta Visual Prompting [[paper]](https://arxiv.org/abs/2303.08138) [[code]](https://github.com/shikiw/DAM-VP)\n\n  `CVPR 2023` \n\n- Semantic Prompt for Few-Shot Image Recognition [[paper]](https://arxiv.org/abs/2303.14123) \n\n  `CVPR 2023` ![](https://img.shields.io/badge/few--shot--learning-759CBC?) 
\n\n\n- Visual Prompt Tuning for Generative Transfer Learning [[paper]](https://arxiv.org/abs/2210.00990) [[code]](https://github.com/google-research/generative_transfer)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/image--generative--tasks-759CBC?) \n\n- CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [[paper]](https://arxiv.org/abs/2303.13076) [[code]](https://github.com/tgxs002/CORA)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/open--vocabulary--detection-759CBC?) \n\n- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [[paper]](https://arxiv.org/abs/2212.02499) [[code]](https://arxiv.org/abs/2212.02499)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/image--generation-759CBC?)  ![](https://img.shields.io/badge/in--context--learning-759CBC?)\n\n- PIVOT: Prompting for Video Continual Learning [[paper]](https://arxiv.org/abs/2212.04842)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/continual--learning-759CBC?) \n\n- Learning Expressive Prompting With Residuals for Vision Transformers [[paper]](https://arxiv.org/abs/2303.15591)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/semantic--segmentation-759CBC?) \n\n- BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning [[paper]](https://arxiv.org/abs/2303.14773) [[code]](https://github.com/changdaeoh/BlackVIP)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/black--box--optimization-759CBC?) \n\n- Visual Prompt Multi-Modal Tracking [[paper]](https://arxiv.org/abs/2303.10826) [[code]](https://github.com/jiawen-zhu/ViPT)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/object--tracking-759CBC?) \n\n- A-La-Carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting [[paper]](https://arxiv.org/abs/2302.07994) \n\n  `CVPR 2023` ![](https://img.shields.io/badge/continual--learning-759CBC?) 
\n\n- Understanding and Improving Visual Prompting: A Label-Mapping Perspective [[paper]](https://arxiv.org/abs/2211.11635) [[code]](https://github.com/OPTML-Group/ILM-VP)\n\n  `CVPR 2023` \n\n- Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning [[paper]](https://arxiv.org/abs/2212.03220) [[code]](https://github.com/andytu28/VQT)\n\n  `CVPR 2023` \n\n- Explicit Visual Prompting for Low-Level Structure Segmentations [[paper]](https://arxiv.org/abs/2303.10883) [[code]](https://github.com/NiFangBaAGe/Explicit-Visual-Prompt)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/low--level--segmentation-759CBC?) \n\n**ArXiv Papers**\n\n- **Exploring Visual Prompts for Adapting Large-Scale Models** [[paper]](https://arxiv.org/pdf/2203.17274.pdf) [[code]](https://github.com/hjbahng/visual_prompting)\n\n  `arXiv 2022/03` \n\n- Vision Transformer Adapter for Dense Predictions [[paper]](https://arxiv.org/pdf/2205.08534.pdf) [[code]](https://github.com/czczup/ViT-Adapter)\n  \n  `arXiv 2022/05` ![](https://img.shields.io/badge/object--detection-759CBC?style=flat-square) ![](https://img.shields.io/badge/instance--segmentation-759CBC?style=flat-square)\n\n- Neural Prompt Search [[paper]](https://arxiv.org/abs/2206.04673) [[code]](https://github.com/Davidzhangyuanhan/NOAH)\n\n  `arXiv 2022/06` ![](https://img.shields.io/badge/NOAH-CD6155?style=flat-square)  ![](https://img.shields.io/badge/NAS-BC9575?style=flat-square)\n\n- Convolutional Bypasses Are Better Vision Transformer Adapters [[paper]](https://arxiv.org/abs/2207.07039) [[code]](https://github.com/JieShibo/PETL-ViT)\n\n  `arXiv 2022/07`   \n\n- Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets 
[[paper]](https://arxiv.org/pdf/2208.07463.pdf) \n\n  `arXiv 2022/08`   \n\n- Prompt Vision Transformer for Domain Generalization [[paper]](https://arxiv.org/pdf/2208.08914.pdf) \n\n  `arXiv 2022/08`  ![](https://img.shields.io/badge/domain--generalization-759CBC?style=flat-square) \n\n- Prompt-Matched Semantic Segmentation [[paper]](https://arxiv.org/abs/2208.10159) \n\n  `arXiv 2022/08`  ![](https://img.shields.io/badge/segmentation-759CBC?style=flat-square) \n  \n- Visual Prompt Tuning for Test-time Domain Adaptation [[paper]](https://arxiv.org/abs/2210.04831)\n\n  `arXiv 2022/10`   \n  \n- Visual Prompting for Adversarial Robustness [[paper]](https://arxiv.org/abs/2210.06284)\n\n  `arXiv 2022/10`  ![](https://img.shields.io/badge/adversarial--robustness-759CBC?style=flat-square) \n\n- Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers [[paper]](https://arxiv.org/abs/2210.06466) [[code]](https://github.com/jochemloedeman/PGN)\n\n  `arXiv 2022/10`   \n\n- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [[paper]](https://arxiv.org/abs/2210.00788) [[code]](https://github.com/bruceyo/V-PETL)\n  \n  `arXiv 2022/10` ![](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square)\n  \n- Multitask Vision-Language Prompt Tuning [[paper]](https://arxiv.org/abs/2211.11720) [[code]](https://github.com/sIncerass/MVLPT)\n\n  `arXiv 2022/11` ![](https://img.shields.io/badge/MVLPT-CD6155?style=flat-square)   ![](https://img.shields.io/badge/multitask--learning-BC9575?style=flat-square)\n\n\n## Vision-Language Prompt\nThis section collects papers prompting pretrained vision-language foundation models (e.g., CLIP) for parameter-efficient adaptation.\n\n- **Learning Transferable Visual Models From Natural Language Supervision** [[paper]](https://arxiv.org/abs/2103.00020) [[code]](https://github.com/OpenAI/CLIP) \n\n  `ICML 2021` ![](https://img.shields.io/badge/CLIP-CD6155?style=flat-square) \n\n- **Learning to 
Prompt for Vision-Language Models** [[paper]](https://arxiv.org/abs/2109.01134) [[code]](https://github.com/KaiyangZhou/CoOp)\n\n  `IJCV 2022`  ![](https://img.shields.io/badge/CoOP-CD6155?style=flat-square) \n\n- Prompt Distribution Learning [[paper]](https://arxiv.org/pdf/2205.03340.pdf)\n\n  `CVPR 2022` \n\n- Conditional Prompt Learning for Vision-Language Models [[paper]](https://arxiv.org/pdf/2203.05557.pdf) [[code]](https://github.com/KaiyangZhou/CoOp)\n\n  `CVPR 2022`  ![](https://img.shields.io/badge/CoCoOP-CD6155?style=flat-square) \n\n- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting\t[[paper]](https://arxiv.org/pdf/2112.01518.pdf) [[code]](https://github.com/raoyongming/denseclip) \n\n  `CVPR 2022` ![](https://img.shields.io/badge/detection-759CBC?style=flat-square) ![](https://img.shields.io/badge/segmentation-759CBC?style=flat-square) \n\n- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [[paper]](https://arxiv.org/pdf/2203.14104.pdf) [[code]](https://github.com/ttlmh/Bridge-Prompt) \n\n  `CVPR 2022` ![](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square) ![](https://img.shields.io/badge/action--segmentation-759CBC?style=flat-square)\n\n- PointCLIP: Point Cloud Understanding by CLIP [[paper]](https://arxiv.org/pdf/2112.02413.pdf) [[code]](https://github.com/ZrrSkywalker/PointCLIP)\n\n  `CVPR 2022`  ![](https://img.shields.io/badge/point--cloud-759CBC?style=flat-square)\n\n- VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [[paper]](https://arxiv.org/pdf/2112.06825.pdf) [[code]](https://github.com/ylsung/VL_adapter)\n  \n    `CVPR 2022`  ![](https://img.shields.io/badge/VQA-759CBC?style=flat-square) ![](https://img.shields.io/badge/VideoQA-759CBC?style=flat-square) ![](https://img.shields.io/badge/captioning-759CBC?style=flat-square)\n\n- A Good Prompt Is Worth Millions of Parameters? 
Low-resource Prompt-based Learning for Vision-Language Models\t[[paper]](https://arxiv.org/abs/2110.08484)\n\n  `ACL 2022`  ![](https://img.shields.io/badge/VQA-759CBC?)  ![](https://img.shields.io/badge/captioning-759CBC?)\n\n- Can Language Understand Depth? [[paper]](https://arxiv.org/pdf/2207.01077.pdf) [[code]](https://github.com/Adonis-galaxy/DepthCLIP)\n\n  `ACM MM 2022` ![](https://img.shields.io/badge/depthclip-CD6155?style=flat-square)  ![](https://img.shields.io/badge/depth--estimation-759CBC?)\n\n- Expanding Language-Image Pretrained Models for General Video Recognition [[paper]](https://arxiv.org/abs/2208.02816) [[code]](https://aka.ms/X-CLIP)\n\n  `ECCV 2022` ![](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square)\n  \n- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [[paper]](https://arxiv.org/pdf/2207.09519.pdf) [[code]](https://github.com/gaopengcuhk/tip-adapter)\n\n  `ECCV 2022` \n  \n- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [[paper]](https://arxiv.org/abs/2206.02338)\n\n  `NeurIPS 2022` ![](https://img.shields.io/badge/ordinal--regression-759CBC?style=flat-square)\n\n- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [[paper]](https://arxiv.org/pdf/2209.07511.pdf) [[code]](https://azshue.github.io/TPT/)\n\n  `NeurIPS 2022` \n \n\n\n- Learning to Decompose Visual Features with Latent Textual Prompts [[paper]](https://openreview.net/forum?id=wtcud6HroZr)\n\n  `ICLR 2023` \n\n- PLOT: Prompt Learning with Optimal Transport for Vision-Language Models [[paper]](https://openreview.net/forum?id=zqwryBoXYnh) [[code]](https://github.com/CHENGY12/PLOT)\n\n  `ICLR 2023` \n\n- Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [[paper]](https://arxiv.org/abs/2303.13283) [[code]](https://github.com/htyao89/KgCoOp)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/image--classification-759CBC?)\n\n- Open-Set Fine-Grained 
Retrieval Via Prompting Vision-Language Evaluator [[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Wang_Open-Set_Fine-Grained_Retrieval_via_Prompting_Vision-Language_Evaluator_CVPR_2023_paper.pdf) \n\n  `CVPR 2023` ![](https://img.shields.io/badge/open--set--retrieval-759CBC?) \n\n- Multimodal Prompting With Missing Modalities for Visual Recognition [[paper]](https://arxiv.org/abs/2303.03369) [[code]](https://github.com/YiLunLee/Missing_aware_prompts)\n\n  `CVPR 2023` \n\n- Efficient Multimodal Fusion Via Interactive Prompting [[paper]](https://arxiv.org/abs/2304.06306)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/multimodal--classification-759CBC?) \n\n- Hierarchical Prompt Learning for Multi-Task Learning [[paper]](https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_Hierarchical_Prompt_Learning_for_Multi-Task_Learning_CVPR_2023_paper.pdf) [[code]](https://github.com/lynlynlyn/hipro)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/multitask--learning-759CBC?) \n\n- Text-Visual Prompting for Efficient 2D Temporal Video Grounding [[paper]](https://arxiv.org/abs/2303.04995)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/video--grounding-759CBC?) \n\n- VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval [[paper]](https://arxiv.org/abs/2211.12764) [[code]](https://github.com/bighuang624/VoP)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/text--video--retrieval-759CBC?) \n\n- MaPLe: Multi-modal Prompt Learning [[paper]](https://arxiv.org/abs/2210.03117) [[code]](https://github.com/muzairkhattak/multimodal-prompt-learning)\n\n\n  `CVPR 2023` \n\n- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [[paper]](https://arxiv.org/abs/2211.12739) [[code]](https://github.com/guozix/TaI-DPT)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/multi--label--recognition-759CBC?) 
\n\n\n- Vita-CLIP: Video and Text Adaptive CLIP Via Multimodal Prompting [[paper]](https://arxiv.org/abs/2304.03307) [[code]](https://github.com/TalalWasim/Vita-CLIP)\n\n  `CVPR 2023` ![](https://img.shields.io/badge/action--recognition-759CBC?) \n\n\n- LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision \u0026 Language Models [[paper]](https://arxiv.org/abs/2210.01115) [[code]](https://www.adrianbulat.com/lasp)\n\n  `CVPR 2023` \n\n- $\\pi$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation [[paper]](https://arxiv.org/abs/2304.14381) [[code]](https://github.com/TencentARC/pi-Tuning)\n\n  `ICML 2023` \n\n- POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models [[paper]](https://arxiv.org/abs/2305.00350) [[code]](https://github.com/korawat-tanwisuth/POUF)\n\n  `ICML 2023` \n  \n- Rethinking the Openness of CLIP [[paper]](https://arxiv.org/abs/2206.01986) [[code]](https://github.com/lancopku/clip-openness)\n\n  `ACL 2023` ![](https://img.shields.io/badge/REPE-CD6155?style=flat-square)\n\n- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization [[paper]](https://arxiv.org/abs/2307.15199) [[code]](https://promptstyler.github.io/)\n\n  `ICCV 2023` ![](https://img.shields.io/badge/PromptStyler-CD6155?style=flat-square)  \n\n\u003cbar\u003e\n\n**ArXiv Papers**\n\n- **Colorful Prompt Tuning for Pre-trained Vision-Language Models** [[paper]](https://arxiv.org/abs/2109.11797) \n\n  `arXiv 2021/08` ![](https://img.shields.io/badge/CPT-CD6155?style=flat-square) ![](https://img.shields.io/badge/grounding-759CBC?style=flat-square) \n\n\n- ActionCLIP: A New Paradigm for Video Action Recognition [[paper]](https://arxiv.org/abs/2109.08472) [[code]](https://github.com/sallymmx/ActionCLIP)\n\n  `arXiv 2021/09` ![](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square)\n\n- CLIP-Adapter: Better Vision-Language Models with Feature Adapters 
[[paper]](https://arxiv.org/abs/2110.04544) [[code]](https://github.com/gaopengcuhk/clip-adapter)\n\n  `arXiv 2021/10` \n\n- Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization [[paper]](https://arxiv.org/abs/2111.12853)\n\n  `arXiv 2021/11` ![](https://img.shields.io/badge/domain--generalization-BC9575?style=flat-square)\n\n- Prompting Visual-Language Models for Efficient Video Understanding [[paper]](https://arxiv.org/abs/2112.04478) [[code]](https://github.com/ju-chen/Efficient-Prompt)\n\n  `arXiv 2021/12` ![task](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square) ![task](https://img.shields.io/badge/action--localization-759CBC?style=flat-square) ![task](https://img.shields.io/badge/retrieval-759CBC?style=flat-square)\n\n- Unsupervised Prompt Learning for Vision-Language Models [[paper]](https://arxiv.org/pdf/2204.03649.pdf) [[code]](https://github.com/tonyhuang2022/UPL)\n\n  `arXiv 2022/04` ![](https://img.shields.io/badge/UPL-CD6155?style=flat-square) ![](https://img.shields.io/badge/unsupervised-BC9575?style=flat-square)\n\n- Prompt-aligned Gradient for Prompt Tuning [[paper]](https://arxiv.org/abs/2205.14865) [[code]](https://github.com/BeierZhu/Prompt-align)\n\n  `arXiv 2022/05` \n\n\n- Parameter-Efficient Image-to-Video Transfer Learning [[paper]](https://arxiv.org/pdf/2206.13559.pdf)\n\n  `arXiv 2022/06`  ![](https://img.shields.io/badge/ST--adapter-CD6155?style=flat-square) ![task](https://img.shields.io/badge/action--recognition-759CBC?style=flat-square)\n\n- DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [[paper]](https://arxiv.org/abs/2206.09541)\n\n  `arXiv 2022/06` ![task](https://img.shields.io/badge/multilabel--recognition-759CBC?style=flat-square)\n\n- Prompt Tuning for Generative Multimodal Pretrained Models [[paper]](https://arxiv.org/abs/2208.02532) [[code]](https://github.com/OFA-Sys/OFA)\n\n  `arXiv 2022/06` 
![](https://img.shields.io/badge/VQA-759CBC?style=flat-square) ![](https://img.shields.io/badge/captioning-759CBC?style=flat-square)  ![](https://img.shields.io/badge/referring--expression-759CBC?style=flat-square)  ![](https://img.shields.io/badge/visual--entailment-759CBC?style=flat-square) \n  \n- Prompt Tuning with Soft Context Sharing for Vision-Language Models [[paper]](https://arxiv.org/pdf/2208.13474.pdf)\n\n  `arXiv 2022/08`   ![](https://img.shields.io/badge/multi--task--learning-759CBC?style=flat-square) \n \n\n  \n- CPL: Counterfactual Prompt Learning for Vision and Language Models [[paper]](https://arxiv.org/abs/2210.10362) [[code]](https://github.com/eric-ai-lab/CPL)\n\n  `arXiv 2022/10`   ![](https://img.shields.io/badge/retrieval-759CBC?style=flat-square) \n![](https://img.shields.io/badge/VQA-759CBC?style=flat-square) \n\n\n- Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [[paper]](https://arxiv.org/abs/2211.02219) [[code]](https://github.com/machengcheng2016/Subspace-Prompt-Learning)\n\n  `arXiv 2022/10`  ![](https://img.shields.io/badge/object--detection-759CBC?style=flat-square) \n![](https://img.shields.io/badge/semantic--segmentation-759CBC?style=flat-square) \n\n\n- Unified Vision and Language Prompt Learning [[paper]](https://arxiv.org/abs/2210.07225)\n\n  `arXiv 2022/10` \n    \n\n- Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation [[paper]](https://arxiv.org/abs/2209.15210)\n\n  `arXiv 2022/10` ![](https://img.shields.io/badge/domain--adaptation-759CBC?style=flat-square)\n\n- Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition [[paper]](https://arxiv.org/abs/2304.04704) [[code]](https://github.com/amazon-science/prompt-pretraining)\n    \n  `arXiv 2023/04`\n    \n### Language-Interactable Prompt\nA language-interactable prompter develops zero/few-shot capabilities by prompting **several independent foundation models** (VLMs, LLMs, VMs, etc.) 
through a language interface. One of the most attractive applications is the [multimodal chatbot](https://github.com/zjr2000/Awesome-Multimodal-Assistant).\n\n\n- **Multimodal Few-Shot Learning with Frozen Language Models** [[paper]](https://arxiv.org/abs/2106.13884)\n\n  `NeurIPS 2021` ![](https://img.shields.io/badge/VQA-759CBC?)\n\n- An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [[paper]](https://arxiv.org/pdf/2109.05014.pdf) [[code]](https://github.com/microsoft/PICa) \n\n  `AAAI 2022` ![](https://img.shields.io/badge/VQA-759CBC?)\n\n- VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning [[paper]](https://arxiv.org/pdf/2102.10407.pdf) [[code]](https://github.com/Vision-CAIR/VisualGPT)\n\n  `CVPR 2022` ![](https://img.shields.io/badge/captioning-759CBC?)\n\n- **Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language** [[paper]](https://arxiv.org/pdf/2204.00598.pdf) [[code]](https://socraticmodels.github.io/#code)\n\n  `ICLR 2023` ![](https://img.shields.io/badge/captioning-759CBC?style=flat-square) ![](https://img.shields.io/badge/retrieval-759CBC?style=flat-square) ![](https://img.shields.io/badge/visual--dialog-759CBC?style=flat-square) \n\n\u003cbar\u003e\n\n**ArXiv Papers**\n\n- **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models** [[paper]](https://arxiv.org/abs/2303.04671) [[code]](https://github.com/microsoft/TaskMatrix) [[demo]](https://huggingface.co/spaces/microsoft/visual_chatgpt)\n\n  `arXiv 2023/03`  ![](https://img.shields.io/badge/Visual--ChatGPT-CD6155?style=flat-square) ![](https://img.shields.io/badge/multimodal--chatbot-759CBC?) ![](https://img.shields.io/badge/LLMs-(chatGPT)-759CBC?) 
\n\n- **Chameleon: Plug-and-play compositional reasoning with large language models** [[paper]](https://arxiv.org/abs/2304.09842) [[code]](https://github.com/lupantech/chameleon-llm)\n\n  `arXiv 2023/04` ![](https://img.shields.io/badge/Chameleon-CD6155?style=flat-square) ![](https://img.shields.io/badge/multimodal--chatbot-759CBC?) ![](https://img.shields.io/badge/LLMs-(GPT4)-759CBC?) \n\n- ClipCap: CLIP Prefix for Image Captioning [[paper]](https://arxiv.org/abs/2111.09734) [[code]](https://github.com/rmokady/CLIP_prefix_caption)\n\n  `arXiv 2021/11` ![](https://img.shields.io/badge/captioning-759CBC?)\n\n- Flamingo: a Visual Language Model for Few-Shot Learning [[paper]](https://arxiv.org/abs/2204.14198) \n\n  `arXiv 2022/04` ![](https://img.shields.io/badge/VQA-759CBC?) ![](https://img.shields.io/badge/captioning-759CBC?)\n\n- Language Models Can See: Plugging Visual Controls in Text Generation [[paper]](https://arxiv.org/pdf/2205.02655.pdf) [[code]](https://github.com/yxuansu/MAGIC)\n\n  `arXiv 2022/05` ![](https://img.shields.io/badge/MAGIC-CD6155?style=flat-square) ![](https://img.shields.io/badge/captioning-759CBC?)\n\n- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [[paper]](https://arxiv.org/pdf/2206.08155.pdf) \n\n  `arXiv 2022/06` ![](https://img.shields.io/badge/VideoQA-759CBC?)\n\n- Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [[paper]](https://arxiv.org/pdf/2206.01843.pdf) \n  \n  `arXiv 2022/06` ![](https://img.shields.io/badge/captioning-759CBC?)\n\n\n### Vision-Language Instruction Tuning\n\nThe goal of vision-language instruction tuning is to train a model that can effectively understand instructions for general-purpose multimodal tasks. 
\n\n- Visual Instruction Tuning [[paper]](https://arxiv.org/abs/2304.08485) [[code]](https://github.com/haotian-liu/LLaVA) [[demo]](https://llava.hliu.cc/)\n\n  `arXiv 2023/04` \n\n- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [[paper]](https://arxiv.org/abs/2304.10592) [[code]](https://github.com/Vision-CAIR/MiniGPT-4) [[demo]](minigpt-4.github.io)\n\n  `arXiv 2023/04` \n\n- Otter: A Multi-Modal Model with In-Context Instruction Tuning [[paper]](https://arxiv.org/abs/2305.03726) [[code]](https://github.com/Luodian/Otter) [[demo]](otter.cliangyu.com/)\n\n  `arXiv 2023/05` \n\n- MultiModal-GPT: A Vision and Language Model for Dialogue with Humans [[paper]](https://arxiv.org/abs/2305.04790) [[code]](https://github.com/open-mmlab/Multimodal-GPT)\n\n  `arXiv 2023/05` \n\n- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [[paper]](https://arxiv.org/abs/2305.06500) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) \n\n  `arXiv 2023/05` \n\n\n- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [[paper]](https://arxiv.org/abs/2310.00390) [[code]](https://github.com/AlaaLab/InstructCV) [[demo]](https://huggingface.co/spaces/alaa-lab/InstructCV)\n\n  `arXiv 2023/09`\n\n\n\n\n## More Resources \n* [PromptPapers](https://github.com/thunlp/PromptPapers): A comprehensive curated list for prompting papers (mainly in natural language processing)\n* [Awesome Multimodal Assistant](https://github.com/zjr2000/Awesome-Multimodal-Assistant): a curated list for vision-language instruction tuning and LLM-based 
chatbots.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fttengwang%2FAwesome_Prompting_Papers_in_Computer_Vision/lists"}