# Awesome-Prompting-on-Vision-Language-Model
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
https://github.com/JindongGu/Awesome-Prompting-on-Vision-Language-Model
# :nerd_face: What is Prompting on Vision-Language Models?
Prompting adapts a pre-trained vision-language foundation model to downstream tasks by changing its *input* rather than its weights: a hand-written (hard) text prompt, a learnable (soft) prompt vector, or a visual prompt applied to the image. The papers below are organized by the kind of model being prompted: multimodal-to-text generation models (*e.g.*, Flamingo), image-text matching models (*e.g.*, CLIP), and text-to-image generation models (*e.g.*, Stable Diffusion).
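As a minimal illustration of hard text prompting, the sketch below performs zero-shot classification with CLIP through the Hugging Face `transformers` API; the model id, image path, and label set are illustrative placeholders, not part of the survey.

```python
# Minimal sketch: hard text prompts turn CLIP into a zero-shot classifier.
# Assumes `transformers` and `Pillow` are installed; the model id, image
# path, and labels are illustrative placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]  # the prompt template

image = Image.open("example.jpg")  # any local image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text match scores
print(dict(zip(labels, probs[0].tolist())))
```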
# :paperclips: Awesome Papers
## Prompting Models in Multimodal-to-Text Generation (*e.g.*, on Flamingo)

*In the tables below, `---` marks entries listed without code, and `…` marks code links truncated in the source.*
| Title | Code | Comment |
|---|---|---|
| Unifying Vision-and-Language Tasks via Text Generation | …min/VL-T5 | Encoder-decoder fusion; text prefixes as prompt |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | | Encoder-decoder fusion; text prefixes as prompt |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | …Sys/OFA | Encoder-decoder fusion; text prefixes as prompt |
| Learning How to Ask: Querying LMs with Mixtures of Soft Prompts (NAACL-HLT 2021) | [Github](https://github.com/hiaoxui/soft-prompts) | Prompt tuning |
| Prefix-Tuning: Optimizing Continuous Prompts for Generation | | |
| Prompt Tuning for Generative Multimodal Pretrained Models | …Sys/OFA | Prompt tuning on OFA |
| Language Is Not All You Need: Aligning Perception with Language Models | | |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | | |
| Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models | | |
| Towards Robust Prompts on Vision-Language Models | --- | Robustness of prompt tuning on VLMs |
| Visual Instruction Tuning | …liu/LLaVA | |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | | Prompt tuning |
| MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning | …Alpha/magma | Decoder-only fusion; image conditional prefix tuning |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | | Decoder-only fusion; image conditional prefix tuning |
| Language Models are Unsupervised Multitask Learners | | Task instruction prompt |
| Flamingo: a Visual Language Model for Few-Shot Learning | | Decoder-only fusion; text prompts |
| The Turking Test: Can Language Models Understand Instructions? | --- | Task instruction prompt |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | --- | Encoder-decoder fusion; instruction prompt |
| Language Models are Few-Shot Learners | --- | In-context learning |
| Learning To Retrieve Prompts for In-Context Learning (NAACL-HLT 2022) | [Github](https://github.com/OhadRubin/EPR) | Retrieval-based prompting |
| Multimodal Few-Shot Learning with Frozen Language Models | | Decoder-only fusion; image conditional prefix tuning |
| Unified Demonstration Retriever for In-Context Learning | | Retrieval-based prompting |
| Compositional Exemplars for In-context Learning | …ceil | Retrieval-based prompting |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | --- | Chain-of-thought prompting |
| The Power of Scale for Parameter-Efficient Prompt Tuning | --- | Prompt tuning |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | | |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | [Project](https://minigpt-4.github.io/) | Prompt tuning |
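Several entries above (Prefix-Tuning, The Power of Scale for Parameter-Efficient Prompt Tuning, Prompt Tuning for Generative Multimodal Pretrained Models) share one mechanism: a small set of learnable prompt vectors is prepended to the input embeddings while the pre-trained model stays frozen. Below is a minimal PyTorch sketch of that mechanism, with a toy frozen encoder standing in for a real VLM; all names and sizes are illustrative.

```python
# Minimal sketch of soft prompt tuning: only `soft_prompt` is trained; the
# backbone is frozen. The tiny TransformerEncoder stands in for a real VLM.
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, n_prompt: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the pre-trained weights fixed
        # the only trainable parameters: n_prompt learnable "virtual tokens"
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # prepend the shared soft prompt to every sequence in the batch
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, input_embeds], dim=1))

hidden = 64
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)
model = SoftPromptModel(backbone, hidden_dim=hidden)
x = torch.randn(2, 10, hidden)  # stand-in for token/patch embeddings
out = model(x)                  # shape: (2, 10 + 8, 64)
print(out.shape, sum(p.numel() for p in model.parameters() if p.requires_grad))
```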
## Prompting Models in Image-Text Matching (*e.g.*, on CLIP)
### Applications & Responsible AI
| Title | Code | Comment |
|---|---|---|
| Visual Prompt Tuning for Few-Shot Text Classification | --- | Visual prompts for text classification |
| Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | | |
| LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition | …peng-xia/LMPT | Prompts for long-tailed multi-label image classification |
| LPT: Long-tailed Prompt Tuning for Image Classification | | Prompts for long-tailed image classification |
| Texts as Images in Prompt Tuning for Multi-Label Image Recognition | …DPT | Prompts for multi-label image classification and detection |
| Debiasing Vision-Language Models via Biased Prompts | | |
| DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations | | Prompts for multi-label image classification and recognition |
| Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning | --- | Soft prompts for visual relation detection |
| Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection | …LX/OpenVoc-VidVRD | Relation prompts for open-vocabulary video relation detection |
| DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | | …-conditioned text prompts for semantic segmentation |
| Segment Anything | [Project](https://segment-anything.com/) | Promptable queries for semantic segmentation |
| Domain Adaptation via Prompt Learning | | Domain-specific textual prompts for domain adaptation |
| Learning to Prompt for Continual Learning | …research/l2p | Prompts for continual learning |
| DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning | …research/l2p | Prompts for continual learning |
| Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | | |
| PromptDet: Towards Open-vocabulary Detection using Uncurated Images | | |
| Optimizing Continuous Prompts for Visual Relationship Detection by Affix-Tuning | --- | Soft prompts for visual relation detection |
| Understanding Zero-Shot Adversarial Robustness for Large-Scale Models | …columbia/ZSRobust4FoundationModel | Visual prompt tuning under adversarial attack |
| Visual Prompting for Adversarial Robustness | …for-adversarial-robustness | Visual prompting to improve adversarial robustness |
| Exploring the Universal Vulnerability of Prompt-based Learning Paradigm | …universal-vulnerability | Vulnerability of prompt-based learning |
| Poisoning and Backdooring Contrastive Learning | --- | Backdoor and poisoning attacks on CLIP |
| BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning | | |
| CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning | --- | Defense against backdoor attacks on CLIP |
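The adversarial-robustness entries above and the pixel-level visual-prompt methods in the next subsection both optimize a perturbation in input-pixel space while the model itself stays frozen. A minimal sketch of padding-style visual prompting follows, assuming a frozen torchvision classifier as a stand-in for a real vision-language backbone; all names and sizes are illustrative.

```python
# Minimal sketch of pixel-level visual prompting: a learnable padding frame is
# blended into every input image while the pre-trained model stays frozen.
# The ResNet here is a stand-in (weights=None keeps the example offline; a
# real use would load pre-trained weights).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PaddedVisualPrompt(nn.Module):
    def __init__(self, pad: int = 16, size: int = 224):
        super().__init__()
        # learnable prompt covering a `pad`-wide frame around the image
        self.prompt = nn.Parameter(torch.zeros(3, size, size))
        mask = torch.zeros(1, size, size)
        mask[:, pad:-pad, pad:-pad] = 1.0  # 1 = keep original pixels
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # original pixels in the center, learnable pixels in the border frame
        return x * self.mask + self.prompt * (1 - self.mask)

model = resnet18(weights=None).eval()  # stand-in frozen backbone
for p in model.parameters():
    p.requires_grad = False

prompt = PaddedVisualPrompt()
x = torch.randn(4, 3, 224, 224)
logits = model(prompt(x))  # only `prompt.prompt` receives gradients
print(logits.shape)
```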
### Prompting Methods
| Title | Code | Comment |
|---|---|---|
| Unsupervised Prompt Learning for Vision-Language Models | | |
| Prompt Pre-Training with Over Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | …science/prompt-pretraining | Prompt pre-training |
| Consistency-guided Prompt Learning for Vision-Language Models | --- | Decoupled unified prompting |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | | |
| Learning to Prompt for Vision-Language Models | | |
| Prompting Visual-Language Models for Efficient Video Understanding | …chen/Efficient-Prompt | Soft text prompts |
| Learning Transferable Visual Models From Natural Language Supervision | | |
| Delving into the Openness of CLIP | …openness | Hard text prompts for understanding |
| Multitask Vision-Language Prompt Tuning | | |
| Unleashing the Power of Visual Prompting At the Pixel Level | …VLAA/EVP | Visual patch-wise prompts |
| Diversity-Aware Meta Visual Prompting | | Visual patch-wise prompts |
| CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | | |
| What does CLIP know about a red circle? Visual prompt engineering for VLMs | --- | Visual annotation prompts |
| Visual Prompting via Image Inpainting | | |
| Conditional Prompt Learning for Vision-Language Models | | |
| Visual Prompt Tuning | | Visual patch-wise prompts |
| Exploring Visual Prompts for Adapting Large-Scale Models | | Visual patch-wise prompts |
| MaPLe: Multi-modal Prompt Learning | …prompt-learning | Decoupled unified prompting |
| Unified Vision and Language Prompt Learning | | |
| Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models | --- | Learnable prompt |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | | Image-text matching model |
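Several entries above (Learning to Prompt for Vision-Language Models, Conditional Prompt Learning, Test-Time Prompt Tuning) share a common core: the hand-written template is replaced by learnable context vectors optimized against frozen CLIP encoders. Below is a minimal sketch of that core; the encoders are toy stand-ins, and all names and dimensions are illustrative.

```python
# Minimal CoOp-style sketch: learnable context vectors are prepended to each
# class-name embedding; everything else stays frozen. The "text encoder" here
# is a toy pooling stand-in for CLIP's text tower.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTextContext(nn.Module):
    def __init__(self, class_embeds: torch.Tensor, n_ctx: int = 4):
        super().__init__()
        dim = class_embeds.size(-1)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # trainable
        self.register_buffer("class_embeds", class_embeds)       # frozen

    def forward(self) -> torch.Tensor:
        # (n_classes, n_ctx + 1, dim): shared context + class-name embedding
        n_cls = self.class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.class_embeds.unsqueeze(1)], dim=1)

dim, n_cls = 32, 5
prompt = LearnableTextContext(torch.randn(n_cls, dim))
text_feats = prompt().mean(dim=1)   # toy "text encoder": mean pooling
img_feats = torch.randn(8, dim)     # frozen image features from the backbone
logits = F.normalize(img_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
loss = F.cross_entropy(logits / 0.07, torch.randint(0, n_cls, (8,)))
loss.backward()                      # gradients flow only into `ctx`
print(prompt.ctx.grad.shape)
```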
## Prompting Models in Text-to-Image Generation (*e.g.*, on Stable Diffusion)

### Applications & Responsible AI
| Title | Code | Comment |
|---|---|---|
| Diffusion Models Beat GANs on Image Synthesis | …diffusion | Diffusion models for image generation |
| Denoising Diffusion Probabilistic Models | | |
| ImaginaryNet: Learning Object Detectors without Real Images and Annotations | | |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | | Diffusion models for image generation |
| Investigating Prompt Engineering in Diffusion Models | --- | Semantic prompt design |
| DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models | | |
| Is synthetic data from generative models ready for image recognition? | …Lab/SyntheticData | Prompts for synthetic data generation |
| An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | [Project](https://textual-inversion.github.io/) | Complex control of synthesis results via prompts |
| DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | | |
| Multi-Concept Customization of Text-to-Image Diffusion | …research/custom-diffusion | Complex control of synthesis results via prompts |
| Prompt-to-Prompt Image Editing with Cross Attention Control | …to-prompt | Complex control of synthesis results via prompts |
| Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis | …free-structured-diffusion-guidance | Controllable text-to-image generation |
| Diffusion Self-Guidance for Controllable Image Generation | | Controllable text-to-image generation |
| Imagic: Text-Based Real Image Editing with Diffusion Models | [Project](https://imagic-editing.github.io/) | Controllable text-to-image generation |
| Adding Conditional Control to Text-to-Image Diffusion Models | | Controllable text-to-image generation |
| T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation | --- | Prompts for probing biases in text-to-image models |
| Stable Bias: Analyzing Societal Representations in Diffusion Models | --- | Prompts for probing biases in text-to-image models |
| A Pilot Study of Query-Free Adversarial Attack Against Stable Diffusion | --- | Adversarial robustness of text-to-image models |
| Diffusion Models for Adversarial Purification | | Adversarial robustness of text-to-image models |
| Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis | --- | Backdoor attacks on text-to-image models |
| Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning | --- | Backdoor attacks on text-to-image models |
| Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models | --- | Backdoor attacks on text-to-image models |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | | Prompts for text-to-video generation |
| Imagen Video: High Definition Video Generation with Diffusion Models | | Prompts for text-to-video generation |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing | | Prompts for text-to-video generation |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | …A-Video | Prompts for text-to-video generation |
| DiffRF: Rendering-Guided 3D Radiance Field Diffusion | | Prompts for text-to-3D generation |
| DreamFusion: Text-to-3D using 2D Diffusion | | Prompts for text-to-3D generation |
| Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models | | Prompts for text-to-3D generation |
| Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models | --- | Prompts for complex tasks |
| MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model | …zhang.github.io/projects/MotionDiffuse.html | Prompts for text-to-motion generation |
| FLAME: Free-form Language-based Motion Synthesis & Editing | | Prompts for text-to-motion generation |
| Prompt Stealing Attacks Against Text-to-Image Generation Models | --- | Prompts for responsible AI |
| Membership Inference Attacks Against Text-to-image Generation Models | --- | Membership attacks against text-to-image models |
| Are Diffusion Models Vulnerable to Membership Inference Attacks? | | Membership attacks against text-to-image models |
| A Reproducible Extraction of Training Images from Diffusion Models | …extraction | Membership attacks against text-to-image models |
| Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness | …research/Fair-Diffusion | Prompts for fairness in text-to-image models |
| Social Biases through the Text-to-Image Generation Lens | --- | Prompts for probing biases in text-to-image models |
| Multimodal Procedural Planning via Dual Text-Image Prompting | | |
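On the generation side, the prompt interface is the conditioning text itself, optionally paired with a negative prompt and a guidance scale. Below is a minimal sketch using the Hugging Face `diffusers` API; the model id and prompt strings are illustrative choices, not prescribed by the papers above.

```python
# Minimal sketch of text-to-image prompting with Hugging Face `diffusers`.
# Requires a GPU for reasonable speed; the model id is an illustrative choice.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a watercolor painting of a lighthouse at dawn",
    negative_prompt="blurry, low quality",  # steer generation away from this
    num_inference_steps=30,
    guidance_scale=7.5,                     # strength of prompt conditioning
).images[0]
image.save("lighthouse.png")
```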