

Collection of AWESOME vision-language models for vision tasks

clip computer-vision deep-learning knowledge-distillation multi-modal-model survey transfer-learning vision-language-model





## Awesome Vision-Language Models

This is the repository of **Vision-Language Models for Vision Tasks: A Survey**, a systematic survey of VLM studies on visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:

**Vision-Language Models for Vision Tasks: A Survey** [[Paper](]

*IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024*


*Feel free to open a pull request or contact us if you find related papers that are not included here.*

The process to submit a pull request is as follows:
- a. Fork the project into your own repository.
- b. Add the Title, Paper link, Conference, Project/Code link in `` using the following format:
|[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|
- c. Submit the pull request to this branch.
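
The row format in step b can be produced with a small helper. This is only a sketch: `format_entry` is a hypothetical name that is not part of this repository, the URLs are placeholders, and using `-` for a missing code link is an assumption taken from how the tables in this list mark papers without code.

```python
def format_entry(title, paper_url, venue, code_url=None):
    """Build one table row: |[Title](Paper Link)|Conference|[Code/Project](link)|.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Assumption: papers without a code or project link use "-" in the last cell,
    # as the existing tables in this list do.
    code_cell = f"[Code]({code_url})" if code_url else "-"
    return f"|[{title}]({paper_url})|{venue}|{code_cell}|"


if __name__ == "__main__":
    # Placeholder values only.
    print(format_entry(
        "Example Paper Title",
        "https://example.com/paper",
        "CVPR 2024",
        "https://example.com/code",
    ))
```

Paste the printed row into the table of the matching category before submitting the pull request.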

## 🔥 News
Last update on 2024/07/24

#### VLM Pre-training Methods
* [CVPR 2024] Do Vision and Language Encoders Represent the World Similarly? [[Paper](][[Code](]
* [CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking [[Paper](][[Code](]
* [CVPR 2024] Towards Better Vision-Inspired Vision-Language Models [[Paper](]
* [CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models [[Paper](]
* [CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era [[Paper](][[Code](]
* [CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models [[Paper](]
* [CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning [[Paper](][[Code](]
* [CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [[Paper](][[Code](]
* [CVPR 2024] VILA: On Pre-training for Visual Language Models [[Paper](]
* [CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection [[Paper](][[Code](]
* [CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions [[Paper](]
* [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [[Paper](][[Code](]
* [ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [[Paper](][[Code](]
* [ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models [[Paper](]

#### VLM Transfer Learning Methods
* [ECCV 2024] CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [[Paper](][[Code](]
* [ECCV 2024] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [[Paper](][[Code](]
* [ECCV 2024] GalLoP: Learning Global and Local Prompts for Vision-Language Models [[Paper](]
* [ECCV 2024] Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [[Paper](][[Code](]
* [CVPR 2024] One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models [[Paper](][[Code](]
* [CVPR 2024] Any-Shift Prompting for Generalization over Distributions [[Paper](]
* [CVPR 2024] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models [[Paper](][[Code](]
* [CVPR 2024] Anchor-based Robust Finetuning of Vision-Language Models [[Paper](]
* [CVPR 2024] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners [[Paper](][[Code](]
* [CVPR 2024] Visual In-Context Prompting [[Paper](][[Code](]
* [CVPR 2024] TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model [[Paper](][[Code](]
* [CVPR 2024] Efficient Test-Time Adaptation of Vision-Language Models [[Paper](][[Code](]
* [CVPR 2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [[Paper](][[Code](]
* [ICLR 2024] DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning [[Paper](][[Code](]
* [ICLR 2024] Nemesis: Normalizing the soft-prompt vectors of vision-language models [[Paper](]
* [ICLR 2024] Prompt Gradient Projection for Continual Learning [[Paper](]
* [ICLR 2024] An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [[Paper](]
* [ICLR 2024] Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [[Paper](][[Code](]
* [ICLR 2024] Text-driven Prompt Generation for Vision-Language Models in Federated Learning [[Paper](]
* [ICLR 2024] Consistency-guided Prompt Learning for Vision-Language Models [[Paper](]
* [ICLR 2024] C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion [[Paper](]
* [arXiv 2024] Learning to Prompt Segment Anything Models [[Paper](]

#### VLM Knowledge Distillation for Detection
* [CVPR 2024] RegionGPT: Towards Region Understanding Vision Language Model [[Paper](][[Code](]
* [ICLR 2024] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [[Paper](]
* [ICLR 2024] Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction [[Paper](]

#### VLM Knowledge Distillation for Segmentation
* [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [[Paper](]

#### VLM Knowledge Distillation for Other Vision Tasks
* [ICLR 2024] FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition [[Paper](][[Project](]
* [ICLR 2024] AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection [[Paper](][[Code](]

## Abstract

Most visual recognition studies rely heavily on crowd-labelled data in deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
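
To make the zero-shot prediction mentioned above concrete, here is a minimal sketch of how a CLIP-style VLM is commonly used for classification: the image embedding is compared with one text embedding per class name, and the most similar class wins. It is an illustration under assumptions, not the method of any specific paper in this list. The function name `zero_shot_predict`, the `temperature` default, and the toy vectors are hypothetical, and the embeddings stand in for the outputs of real image and text encoders.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs, temperature=0.01):
    """Return (best_class_index, class_probabilities).

    Assumption: image_emb and class_text_embs come from the image and text
    encoders of a pre-trained VLM; here they are plain lists of floats.
    """
    img = _normalize(image_emb)
    # Cosine similarity between the image and each class prompt embedding.
    sims = [sum(a * b for a, b in zip(img, _normalize(t))) for t in class_text_embs]
    # Temperature-scaled softmax, shifted by the max logit for numerical stability.
    logits = [s / temperature for s in sims]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    # Toy embeddings standing in for prompts such as "a photo of a {class}".
    class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    index, probs = zero_shot_predict([0.1, 0.9, 0.2], class_embs)
    print(index, probs)
```

No task-specific training happens in this sketch: changing the set of class prompts changes the classifier, which is why a single pre-trained VLM can serve many recognition tasks.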

## Citation
If you find our work useful in your research, please consider citing:
title={Vision-language models for vision tasks: A survey},
author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},

## Menu
- [Datasets](#datasets)
  - [Datasets for VLM Pre-training](#datasets-for-vlm-pre-training)
  - [Datasets for VLM Evaluation](#datasets-for-vlm-evaluation)
- [Vision-Language Pre-training Methods](#vision-language-pre-training-methods)
  - [Pre-training with Contrastive Objective](#pre-training-with-contrastive-objective)
  - [Pre-training with Generative Objective](#pre-training-with-generative-objective)
  - [Pre-training with Alignment Objective](#pre-training-with-alignment-objective)
- [Vision-Language Model Transfer Learning Methods](#vision-language-model-transfer-learning-methods)
  - [Transfer with Prompt Tuning](#transfer-with-prompt-tuning)
    - [Transfer with Text Prompt Tuning](#transfer-with-text-prompt-tuning)
    - [Transfer with Visual Prompt Tuning](#transfer-with-visual-prompt-tuning)
    - [Transfer with Text and Visual Prompt Tuning](#transfer-with-text-and-visual-prompt-tuning)
  - [Transfer with Feature Adapter](#transfer-with-feature-adapter)
  - [Transfer with Other Methods](#transfer-with-other-methods)
- [Vision-Language Model Knowledge Distillation Methods](#vision-language-model-knowledge-distillation-methods)
  - [Knowledge Distillation for Object Detection](#knowledge-distillation-for-object-detection)
  - [Knowledge Distillation for Semantic Segmentation](#knowledge-distillation-for-semantic-segmentation)

## Datasets

### Datasets for VLM Pre-training

| Dataset | Year | Num of Image-Text Pairs | Language | Project |
|-----|-----|-----|-----|-----|
|[SBU Caption](|2011|1M|English|[Project](|
|[COCO Caption](|2016|1.5M|English|[Project](|
|[Yahoo Flickr Creative Commons 100 Million](|2016|100M|English|[Project](|
|[Visual Genome](|2017|5.4M|English|[Project](|
|[Conceptual Captions 3M](|2018|3.3M|English|[Project](|
|[Localized Narratives](|2020|0.87M|English|[Project](|
|[Conceptual 12M](|2021|12M|English|[Project](|
|[Wikipedia-based Image Text](|2021|37.6M|108 Languages|[Project](|
|[Red Caps](|2021|12M|English|[Project](|
|[LAION5B](|2022|5B|Over 100 Languages|[Project](|

### Datasets for VLM Evaluation

#### Image Classification

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|-----|-----|-----|-----|-----|-----|-----|
|Caltech-101|2004|102|3,060|6,085|Mean Per Class|[Project](|
|PASCAL VOC 2007|2007|20|5,011|4,952|11-point mAP|[Project](|
|Oxford 102 Flowers|2008|102|2,040|6,149|Mean Per Class|[Project](|
|KITTI Distance|2012|4|6,770|711|Accuracy|[Project](|
|Oxford-IIIT PETS|2012|37|3,680|3,669|Mean Per Class|[Project](|
|Stanford Cars|2013|196|8,144|8,041|Accuracy|[Project](|
|FGVC Aircraft|2013|100|6,667|3,333|Mean Per Class|[Project](|
|Facial Emotion|2013|8|32,140|3,574|Accuracy|[Project](|
|Rendered SST2|2013|2|7,792|1,821|Accuracy|[Project](|
|Describable Textures|2014|47|3,760|1,880|Accuracy|[Project](|
|CLEVR Counts|2017|8|2,000|500|Accuracy|[Project](|
|Hateful Memes|2020|2|8,500|500|ROC AUC|[Project](|
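
Several datasets above report "Mean Per Class" rather than plain accuracy. As a rough illustration of the difference, the sketch below averages the accuracy computed separately for each class, so that frequent classes do not dominate the score. The function name `mean_per_class_accuracy` and the toy labels are hypothetical, and each benchmark's official evaluation protocol may differ in details.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    # Each class contributes equally, regardless of how many samples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Three samples of class 0 (all correct) and one of class 1 (wrong):
    # plain accuracy would be 0.75, mean per-class accuracy is 0.5.
    print(mean_per_class_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))
```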

#### Image-Text Retrieval

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|-----|-----|-----|-----|-----|-----|-----|
|COCO Caption|2015|-|82,783|5,000|Recall|[Project](|

#### Action Recognition

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|-----|-----|-----|-----|-----|-----|-----|
|Kinetics700|2019|700|494,801|31,669|Mean (top1, top5)|[Project](|
|RareAct|2020|122|7,607|-|mWAP, mSAP|[Project](|

#### Object Detection

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|-----|-----|-----|-----|-----|-----|-----|
|COCO 2014 Detection|2014|80|83,000|41,000|Box mAP|[Project](|
|COCO 2017 Detection|2017|80|118,000|5,000|Box mAP|[Project](|
|LVIS|2019|1203|118,000|5,000|Box mAP|[Project](|
|ODinW|2022|314|132,413|20,070|Box mAP|[Project](|

#### Semantic Segmentation

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|-----|-----|-----|-----|-----|-----|-----|
|PASCAL VOC 2012|2012|20|1,464|1,449|mIoU|[Project](|
|PASCAL Context|2014|459|4,998|5,105|mIoU|[Project](|

## Vision-Language Pre-training Methods

### Pre-training with Contrastive Objective

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[CLIP: Learning Transferable Visual Models From Natural Language Supervision](|ICML 2021|[Code](|
|[ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](|ICML 2021|-|
|[OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation](|arXiv 2021|[Code](|
|[Florence: A New Foundation Model for Computer Vision](|arXiv 2021|-|
|[RegionCLIP: Region-based Language-Image Pretraining](|arXiv 2021|[Code](|
|[DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](|ICLR 2022|[Code](|
|[FILIP: Fine-grained Interactive Language-Image Pre-Training](|ICLR 2022|-|
|[KELIP: Large-scale Bilingual Language-Image Contrastive Learning](|ICLRW 2022|[Code](|
|[ZeroVL: Contrastive Vision-Language Pre-training with Limited Resources](|ECCV 2022|[Code](|
|[SLIP: Self-supervision meets Language-Image Pre-training](|ECCV 2022|[Code](|
|[UniCL: Unified Contrastive Learning in Image-Text-Label Space](|CVPR 2022|[Code](|
|[LiT: Zero-Shot Transfer with Locked-image text Tuning](|CVPR 2022|[Code](|
|[GroupViT: Semantic Segmentation Emerges from Text Supervision](|CVPR 2022|[Code](|
|[PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining](|NeurIPS 2022|-|
|[UniCLIP: Unified Framework for Contrastive Language-Image Pre-training](|NeurIPS 2022|-|
|[K-LITE: Learning Transferable Visual Models with External Knowledge](|NeurIPS 2022|[Code](|
|[FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone](|NeurIPS 2022|[Code](|
|[Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](|arXiv 2022|[Code](|
|[AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](|arXiv 2022|[Code](|
|[SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation](|arXiv 2022|[Code](|
|[NLIP: Noise-robust Language-Image Pre-training](|AAAI 2023|-|
|[PaLI: A Jointly-Scaled Multilingual Language-Image Model](|ICLR 2023|[Project](|
|[HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention](|ICLR 2023|[Code](|
|[CLIPPO: Image-and-Language Understanding from Pixels Only](|CVPR 2023|[Code](|
|[RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training](|CVPR 2023|-|
|[DeAR: Debiasing Vision-Language Models with Additive Residuals](|CVPR 2023|-|
|[Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training](|CVPR 2023|[Code](|
|[LaCLIP: Improving CLIP Training with Language Rewrites](|NeurIPS 2023|[Code](|
|[ALIP: Adaptive Language-Image Pre-training with Synthetic Caption](|ICCV 2023|[Code](|
|[GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training](|ICCV 2023|-|
|[CLIPpy: Perceptual Grouping in Contrastive Vision-Language Models](|ICCV 2023|-|

### Pre-training with Generative Objective

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[FLAVA: A Foundational Language And Vision Alignment Model](|CVPR 2022|[Code](|
|[CoCa: Contrastive Captioners are Image-Text Foundation Models](|arXiv 2022|[Code](|
|[Too Large; Data Reduction for Vision-Language Pre-Training](|arXiv 2023|[Code](|
|[SAM: Segment Anything](|arXiv 2023|[Code](|
|[SEEM: Segment Everything Everywhere All at Once](|arXiv 2023|[Code](|
|[Semantic-SAM: Segment and Recognize Anything at Any Granularity](|arXiv 2023|[Code](|

### Pre-training with Alignment Objective

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[GLIP: Grounded Language-Image Pre-training](|CVPR 2022|[Code](|
|[DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection](|NeurIPS 2022|-|
|[nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training](|CVPR 2023|[Code](|

## Vision-Language Model Transfer Learning Methods

### Transfer with Prompt Tuning

#### Transfer with Text Prompt Tuning

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[CoOp: Learning to Prompt for Vision-Language Models](|IJCV 2022|[Code](|
|[CoCoOp: Conditional Prompt Learning for Vision-Language Models](|CVPR 2022|[Code](|
|[ProDA: Prompt Distribution Learning](|CVPR 2022|-|
|[DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](|CVPR 2022|[Code](|
|[TPT: Test-time prompt tuning for zero-shot generalization in vision-language models](|NeurIPS 2022|[Code](|
|[DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations](|NeurIPS 2022|[Code](|
|[CPL: Counterfactual Prompt Learning for Vision and Language Models](|EMNLP 2022|[Code](|
|[Bayesian Prompt Learning for Image-Language Model Generalization](|arXiv 2022|-|
|[UPL: Unsupervised Prompt Learning for Vision-Language Models](|arXiv 2022|[Code](|
|[ProGrad: Prompt-aligned Gradient for Prompt Tuning](|arXiv 2022|[Code](|
|[SoftCPT: Prompt Tuning with Soft Context Sharing for Vision-Language Models](|arXiv 2022|[Code](|
|[SubPT: Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models](|TCSVT 2023|[Code](|
|[LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models](|CVPR 2023|[Code](|
|[LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition](|arXiv 2023|[Code](|
|[Texts as Images in Prompt Tuning for Multi-Label Image Recognition](|CVPR 2023|[Code](|
|[Visual-Language Prompt Tuning with Knowledge-guided Context Optimization](|CVPR 2023|[Code](|
|[Learning to Name Classes for Vision and Language Models](|CVPR 2023|-|
|[PLOT: Prompt Learning with Optimal Transport for Vision-Language Models](|ICLR 2023|[Code](|
|[CuPL: What does a platypus look like? Generating customized prompts for zero-shot image classification](|ICCV 2023|[Code](|
|[ProTeCt: Prompt Tuning for Hierarchical Consistency](|arXiv 2023|-|
|[Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning](|arXiv 2023|[Code](|
|[Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?](|ICCV 2023|[Code](|
|[Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models](|ICCV 2023|-|
|[Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models](|ICCV 2023|-|
|[Read-only Prompt Optimization for Vision-Language Few-shot Learning](|ICCV 2023|[Code](|
|[Bayesian Prompt Learning for Image-Language Model Generalization](|ICCV 2023|[Code](|
|[Distribution-Aware Prompt Tuning for Vision-Language Models](|ICCV 2023|[Code](|
|[LPT: Long-Tailed Prompt Tuning For Image Classification](|ICCV 2023|[Code](|
|[Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning](|ICCV 2023|[Code](|
|[CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts](|ECCV 2024|[Code](|

#### Transfer with Visual Prompt Tuning

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[Exploring Visual Prompts for Adapting Large-Scale Models](|arXiv 2022|[Code](|
|[Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification](|arXiv 2023|-|
|[Fine-Grained Visual Prompting](|arXiv 2023|-|
|[LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models](|ICCV 2023|[Code](|

#### Transfer with Text and Visual Prompt Tuning

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[UPT: Unified Vision and Language Prompt Learning](|arXiv 2022|[Code](|
|[MVLPT: Multitask Vision-Language Prompt Tuning](|arXiv 2022|[Code](|
|[CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model](|arXiv 2022|[Code](|
|[MaPLe: Multi-modal Prompt Learning](|CVPR 2023|[Code](|

### Transfer with Feature Adapter

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[CLIP-Adapter: Better Vision-Language Models with Feature Adapters](|arXiv 2021|[Code](|
|[Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification](|ECCV 2022|[Code](|
|[SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models](|BMVC 2022|[Code](|
|[CLIPPR: Improving Zero-Shot Models with Label Distribution Priors](|arXiv 2022|[Code](|
|[SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification](|arXiv 2022|-|
|[SuS-X: Training-Free Name-Only Transfer of Vision-Language Models](|ICCV 2023|[Code](|
|[VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control](|ICCV 2023|[Code](|
|[SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More](|arXiv 2023|[Code](|
|[Segment Anything in High Quality](|arXiv 2023|[Code](|
|[HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding](|arXiv 2023|[Code](|
|[CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models](|arXiv 2023|-|

### Transfer with Other Methods

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts](|arXiv 2021|-|
|[Wise-FT: Robust fine-tuning of zero-shot models](|CVPR 2022|[Code](|
|[MaskCLIP: Extract Free Dense Labels from CLIP](|ECCV 2022|[Code](|
|[MUST: Masked Unsupervised Self-training for Label-free Image Classification](|ICLR 2023|[Code](|
|[CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention](|AAAI 2023|[Code](|
|[Semantic Prompt for Few-Shot Image Recognition](|CVPR 2023|-|
|[Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners](|CVPR 2023|[Code](|
|[Task Residual for Tuning Vision-Language Models](|CVPR 2023|[Code](|
|[Deeply Coupled Cross-Modal Prompt Learning](|ACL 2023|[Code](|
|[Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation](|arXiv 2023|-|
|[Personalize Segment Anything Model with One Shot](|arXiv 2023|[Code](|
|[CHiLS: Zero-shot image classification with hierarchical label sets](|ICML 2023|[Code](|
|[Improving Zero-shot Generalization and Robustness of Multi-modal Models](|CVPR 2023|[Code](|
|[Exploiting Category Names for Few-Shot Classification with Vision-Language Models](|ICLRW 2023|-|
|[Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models](|arXiv 2023|[Code](|
|[Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models](|ICCV 2023|[Code](|
|[PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization](|ICCV 2023|[Code](|
|[PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation](|ICCV 2023|-|
|[Black Box Few-Shot Adaptation for Vision-Language models](|ICCV 2023|[Code](|
|[AD-CLIP: Adapting Domains in Prompt Space Using CLIP](|ICCVW 2023|-|
|[Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning](|arXiv 2023|[Code](|
|[Language Models as Black-Box Optimizers for Vision-Language Models](|arXiv 2023|-|

## Vision-Language Model Knowledge Distillation Methods

### Knowledge Distillation for Object Detection
| Paper | Published in | Code/Project |
|-----|-----|-----|
|[ViLD: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation](|ICLR 2022|[Code](|
|[DetPro: Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model](|CVPR 2022|[Code](|
|[XPM: Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling](|CVPR 2022|[Code](|
|[Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection](|NeurIPS 2022|[Code](|
|[PromptDet: Towards Open-vocabulary Detection using Uncurated Images](|ECCV 2022|[Code](|
|[PB-OVD: Open Vocabulary Object Detection with Pseudo Bounding-Box Labels](|ECCV 2022|[Code](|
|[OV-DETR: Open-Vocabulary DETR with Conditional Matching](|ECCV 2022|[Code](|
|[Detic: Detecting Twenty-thousand Classes using Image-level Supervision](|ECCV 2022|[Code](|
|[OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers](|ECCV 2022|[Code](|
|[VL-PLM: Exploiting Unlabeled Data with Vision and Language Models for Object Detection](|ECCV 2022|[Code](|
|[ZSD-YOLO: Zero-shot Object Detection Through Vision-Language Embedding Alignment](|arXiv 2022|[Code](|
|[HierKD: Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation](|arXiv 2022|[Code](|
|[VLDet: Learning Object-Language Alignments for Open-Vocabulary Object Detection](|ICLR 2023|[Code](|
|[F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models](|ICLR 2023|[Code](|
|[CondHead: Learning to Detect and Segment for Open Vocabulary Object Detection](|CVPR 2023|-|
|[Aligning Bag of Regions for Open-Vocabulary Object Detection](|CVPR 2023|[Code](|
|[Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers](|CVPR 2023|[Code](|
|[Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection](|CVPR 2023|[Code](|
|[CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching](|CVPR 2023|[Code](|
|[DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment](|CVPR 2023|-|
|[Detecting Everything in the Open World: Towards Universal Object Detection](|CVPR 2023|[Code](|
|[CapDet: Unifying Dense Captioning and Open-World Detection Pretraining](|CVPR 2023|-|
|[Contextual Object Detection with Multimodal Large Language Models](|arXiv 2023|[Code](|
|[Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models](|arXiv 2023|[Code](|
|[EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment](|ICCV 2023|[Code](|
|[Improving Pseudo Labels for Open-Vocabulary Object Detection](|arXiv 2023|-|

### Knowledge Distillation for Semantic Segmentation

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[SSIW: Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples](|arXiv 2021|-|
|[ReCo: Retrieve and Co-segment for Zero-shot Transfer](|NeurIPS 2022|[Code](|
|[CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation](|CVPR 2022|[Code](|
|[CLIPSeg: Image Segmentation Using Text and Image Prompts](|CVPR 2022|[Code](|
|[ZegFormer: Decoupling Zero-Shot Semantic Segmentation](|CVPR 2022|[Code](|
|[LSeg: Language-driven Semantic Segmentation](|ICLR 2022|[Code](|
|[ZSSeg: A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model](|ECCV 2022|[Code](|
|[OpenSeg: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels](|ECCV 2022|[Code](|
|[Fusioner: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models](|BMVC 2022|[Code](|
|[OVSeg: Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP](|CVPR 2023|[Code](|
|[ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation](|CVPR 2023|[Code](|
|[CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation](|CVPR 2023|[Code](|
|[FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation](|CVPR 2023|[Code](|
|[Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations](|CVPR 2023|[Code](|
|[Exploring Open-Vocabulary Semantic Segmentation without Human Labels](|arXiv 2023|-|
|[OpenVIS: Open-vocabulary Video Instance Segmentation](|arXiv 2023|-|
|[Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation](|arXiv 2023|-|
|[Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation](|arXiv 2023|[Code](|
|[Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models](|arXiv 2023|-|
|[SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning](|ICCV 2023|[Code](|
|[ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation](|arXiv 2023|-|
|[Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP](|arXiv 2023|[Code](|

### Knowledge Distillation for Other Tasks

| Paper | Published in | Code/Project |
|-----|-----|-----|
|[Controlling Vision-Language Models for Universal Image Restoration](|arXiv 2023|[Code](|