{"id":15025271,"url":"https://github.com/jingyi0000/vlm_survey","last_synced_at":"2026-02-19T10:04:09.995Z","repository":{"id":172636411,"uuid":"621170910","full_name":"jingyi0000/VLM_survey","owner":"jingyi0000","description":"Collection of AWESOME vision-language models for vision tasks","archived":false,"fork":false,"pushed_at":"2025-05-19T01:06:38.000Z","size":401,"stargazers_count":2725,"open_issues_count":1,"forks_count":209,"subscribers_count":97,"default_branch":"main","last_synced_at":"2025-05-19T02:24:11.456Z","etag":null,"topics":["clip","computer-vision","deep-learning","knowledge-distillation","multi-modal-model","survey","transfer-learning","vision-language-model"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jingyi0000.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-03-30T06:06:59.000Z","updated_at":"2025-05-19T01:06:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"1a655449-6fe9-40d5-b4b1-3fded7410353","html_url":"https://github.com/jingyi0000/VLM_survey","commit_stats":{"total_commits":80,"total_committers":5,"mean_commits":16.0,"dds":0.08750000000000002,"last_synced_commit":"b9cb6659abb54a2f13acc48d7b8c59a308ef51c0"},"previous_names":["jingyi0000/vlm_survey"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jingyi0000/VLM_survey","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jingyi0000%2FVLM_
survey","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jingyi0000%2FVLM_survey/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jingyi0000%2FVLM_survey/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jingyi0000%2FVLM_survey/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jingyi0000","download_url":"https://codeload.github.com/jingyi0000/VLM_survey/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jingyi0000%2FVLM_survey/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279017959,"owners_count":26086237,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clip","computer-vision","deep-learning","knowledge-distillation","multi-modal-model","survey","transfer-learning","vision-language-model"],"created_at":"2024-09-24T20:01:56.346Z","updated_at":"2026-02-19T10:04:09.985Z","avatar_url":"https://github.com/jingyi0000.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"## Awesome Vision-Language Models [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)\n\n\u003cimg src=\"./images/overview.png\" width=\"96%\" height=\"96%\"\u003e\n\nThis is the 
repository of **Vision-Language Models for Vision Tasks: A Survey**, a systematic survey of VLM studies in various visual recognition tasks including image classification, object detection, semantic segmentation, etc. For details, please refer to:\n\n**Vision-Language Models for Vision Tasks: A Survey**  [[Paper](https://arxiv.org/abs/2304.00685)]\n\n*IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024*\n\n🤩 Our paper has been selected for the **TPAMI Top 50 Popular Paper List**!\n \n[![arXiv](https://img.shields.io/badge/arXiv-2304.00685-b31b1b.svg)](https://arxiv.org/abs/2304.00685) \n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity) \n[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](http://makeapullrequest.com)\n\u003c!-- [![made-with-Markdown](https://img.shields.io/badge/Made%20with-Markdown-1f425f.svg)](http://commonmark.org) --\u003e\n\u003c!-- [![Documentation Status](https://readthedocs.org/projects/ansicolortags/badge/?version=latest)](http://ansicolortags.readthedocs.io/?badge=latest) --\u003e\n\n*Feel free to open a pull request or contact us if you find any related papers that are not included here.*\n\nThe process to submit a pull request is as follows:\n- a. Fork the project into your own repository.\n- b. Add the Title, Paper link, Conference, Project/Code link in `README.md` using the following format:\n```\n  |[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|\n```\n- c. Submit the pull request to this branch.\n\n**We plan to update the arXiv version of our survey paper soon. If your paper is missing from this repository, feel free to contact us or open an issue!**\n\n## 🔥 News\n\n\u003cdetails open\u003e\u003csummary\u003e📣 We also have a collection on agentic MLLMs that may interest you ✨. 
\u003c/summary\u003e\u003cp\u003e\n  \n\u003e [**Awesome-Agentic-MLLMs**](https://github.com/HJYao00/Awesome-Agentic-MLLMs) \u003cbr\u003e \n\u003e  [![arXiv](https://img.shields.io/badge/arXiv-2510.10991-b31b1b.svg)](https://arxiv.org/abs/2510.10991) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/HJYao00/Awesome-Agentic-MLLMs)\n\n\u003c/p \u003e\u003c/details\u003e\n\n📅 Last update on 2025/10/14\n\n#### VLMs and Synthetic Data\n\n* [CVPR 2025] Synthetic Data is an Elegant GIFT for Continual Vision-Language Models [[Paper](https://arxiv.org/pdf/2503.04229v1)][[Code](https://github.com/Luo-Jiaming/GIFT_CL)]\n* [CVPR 2025] Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Li_Enhancing_Vision-Language_Compositional_Understanding_with_Multimodal_Synthetic_Data_CVPR_2025_paper.pdf)]\n* [OpenReview 2025] A Survey on Bridging VLMs and Synthetic Data [[Paper](https://openreview.net/pdf?id=ThjDCZOljE)][[Code](https://github.com/mghiasvand1/Awesome-VLM-Synthetic-Data)]\n\n#### VLM Pre-training Methods\n\n* [NeurIPS 2024] PLIP: Language-Image Pre-training for Person Representation Learning [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/510ad3018bbdc5b6e3b10646e2e35771-Paper-Conference.pdf)][[Code](https://github.com/Zplusdragon/PLIP)]\n* [NeurIPS 2024] LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/77828623211df05497ce3658300dafd9-Paper-Conference.pdf)][[Code](https://wuw2019.github.io/lot-lip)]\n* [NeurIPS 2024] Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/8a54a80ffc2834689ffdd0920202018e-Paper-Conference.pdf)][[Code](https://chain-of-sight.github.io/)]\n\n#### VLM Transfer Learning Methods\n* [ICCV 2025] One Last Attention for Your Vision-Language Model 
[[Paper](https://arxiv.org/pdf/2507.15480v1)][[Code](https://github.com/khufia/RAda/tree/main)]\n* [ICCV 2025] Hierarchical Cross-modal Prompt Learning for Vision-Language Models [[Paper](https://arxiv.org/pdf/2507.14976v1)][[Code](https://github.com/zzeoZheng/HiCroPL)]\n* [CVPR 2025] DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models [[Paper](https://arxiv.org/pdf/2503.13443v1)][[Code](https://github.com/JREion/DPC)]\n* [CVPR 2025] O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models [[Paper](https://arxiv.org/pdf/2503.12096v1)][[Code](https://github.com/ashshaksharifdeen/O-TPT)]\n* [CVPR 2025] R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Sheng_R-TPT_Improving_Adversarial_Robustness_of_Vision-Language_Models_through_Test-Time_Prompt_CVPR_2025_paper.pdf)][[Code](https://github.com/TomSheng21/R-TPT)]\n* [CVPR 2025] TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_TAPT_Test-Time_Adversarial_Prompt_Tuning_for_Robust_Inference_in_Vision-Language_CVPR_2025_paper.pdf)][[Code](https://github.com/xinwong/TAPT)]\n* [CVPR 2025] Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Wu_Skip_Tuning_Pre-trained_Vision-Language_Models_are_Effective_and_Efficient_Adapters_CVPR_2025_paper.pdf)][[Code](https://github.com/Koorye/SkipTuning)]\n* [CVPR 2025] Adaptive Parameter Selection for Tuning Vision-Language Models [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhang_Adaptive_Parameter_Selection_for_Tuning_Vision-Language_Models_CVPR_2025_paper.pdf)]\n* [CVPR 2025] Task-Aware Clustering for Prompting Vision-Language Models 
[[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Hao_Task-Aware_Clustering_for_Prompting_Vision-Language_Models_CVPR_2025_paper.pdf)][[Code](https://github.com/FushengHao/TAC)]\n* [CVPR 2025] Bayesian Test-Time Adaptation for Vision-Language Models [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhou_Bayesian_Test-Time_Adaptation_for_Vision-Language_Models_CVPR_2025_paper.pdf)]\n* [CVPR 2025] NLPrompt: Noise-Label Prompt Learning for Vision-Language Models [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Pan_NLPrompt_Noise-Label_Prompt_Learning_for_Vision-Language_Models_CVPR_2025_paper.pdf)][[Code](https://github.com/qunovo/NLPrompt)]\n* [CVPR 2025] Realistic Test-Time Adaptation of Vision-Language Models [[Paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Zanella_Realistic_Test-Time_Adaptation_of_Vision-Language_Models_CVPR_2025_paper.pdf)][[Code](https://github.com/MaxZanella/StatA)]\n* [NeurIPS 2024] ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/4fd96b997454b5b02698595df70fccaf-Paper-Conference.pdf)][[Code](https://github.com/mrwu-mac/ControlMLLM)]\n* [NeurIPS 2024] Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/c1e1ad233411e25b54bb5df3a0576c2c-Paper-Conference.pdf)][[Code](https://lwpyh.github.io/ProMaC/)]\n* [NeurIPS 2024] Visual Fourier Prompt Tuning [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/0a0eba34ab2ff40ca2d2843324dcc4ab-Paper-Conference.pdf)][[Code](https://github.com/runtsang/VFPT)]\n* [NeurIPS 2024] Improving Visual Prompt Tuning by Gaussian Neighborhood Minimization for Long-Tailed Visual Recognition [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/bc667ac84ef58f2b5022da97a465cbab-Paper-Conference.pdf)][[Code](https://github.com/Keke921/GNM-PT)]\n* [NeurIPS 2024] Few-Shot 
Adversarial Prompt Learning on Vision-Language Models [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/05aedcaf4bc6e78a5e22b4cf9114c5e8-Paper-Conference.pdf)][[Code](https://github.com/lionel-w2/FAP)]\n* [NeurIPS 2024] Visual Prompt Tuning in Null Space for Continual Learning [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/0f06be0008bc568c88d76206aa17954f-Paper-Conference.pdf)][[Code](https://github.com/zugexiaodui/VPTinNSforCL)]\n* [NeurIPS 2024] IPO: Interpretable Prompt Optimization for Vision-Language Models [[Paper](https://papers.nips.cc/paper_files/paper/2024/file/e52e4de8689a9955b6d3ff421d019387-Paper-Conference.pdf)]\n* [NeurIPS 2023] LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning [[Paper](https://papers.nips.cc/paper_files/paper/2023/file/f0606b882692637835e8ac981089eccd-Paper-Conference.pdf)][[Code](https://github.com/AtsuMiyai/LoCoOp)]\n\n\n#### VLM Knowledge Distillation for Detection\n\n* [NeurIPS 2023] Scaling Open-Vocabulary Object Detection [[Paper](https://papers.nips.cc/paper_files/paper/2023/file/e6d58fc68c0f3c36ae6e0e64478a69c0-Paper-Conference.pdf)][[Code](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)]\n* [NeurIPS 2022] Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [[Paper](https://papers.nips.cc/paper_files/paper/2022/file/dabf612543b97ea9c8f46d058d33cf74-Paper-Conference.pdf)][[Code](https://github.com/hanoonaR/object-centric-ovd)]\n* [NeurIPS 2023] CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [[Paper](https://papers.nips.cc/paper_files/paper/2023/file/e10a6a906ef323efaf708f76cf3c1d1e-Paper-Conference.pdf)][[Code](https://github.com/CVMI-Lab/CoDet)]\n\n#### VLM Knowledge Distillation for Segmentation\n\n* [NeurIPS 2024] Relationship Prompt Learning is Enough for Open-Vocabulary Semantic Segmentation 
[[Paper](https://papers.nips.cc/paper_files/paper/2024/file/8773cdaf02c5af3528e05f1cee816129-Paper-Conference.pdf)]\n\n#### VLM Knowledge Distillation for Other Vision Tasks\n\n\n\n## Abstract\n\nMost visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, which makes the paradigm laborious and time-consuming. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. They learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.\n\n## Citation\nIf you find our work useful in your research, please consider citing:\n```\n@article{zhang2024vision,\n  title={Vision-language models for vision tasks: A survey},\n  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},\n  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},\n  year={2024},\n  publisher={IEEE}\n}\n```\n\n## Menu\n- [Datasets](#datasets)\n  - [Datasets for VLM 
Pre-training](#datasets-for-vlm-pre-training)\n  - [Datasets for VLM Evaluation](#datasets-for-vlm-evaluation)\n- [Vision-Language Pre-training Methods](#vision-language-pre-training-methods)\n  - [Pre-training with Contrastive Objective](#pre-training-with-contrastive-objective)\n  - [Pre-training with Generative Objective](#pre-training-with-generative-objective)\n  - [Pre-training with Alignment Objective](#pre-training-with-alignment-objective)\n- [Vision-Language Model Transfer Learning Methods](#vision-language-model-transfer-learning-methods)\n  - [Transfer with Prompt Tuning](#transfer-with-prompt-tuning)\n    - [Transfer with Text Prompt Tuning](#transfer-with-text-prompt-tuning)\n    - [Transfer with Visual Prompt Tuning](#transfer-with-visual-prompt-tuning)\n    - [Transfer with Text and Visual Prompt Tuning](#transfer-with-text-and-visual-prompt-tuning)\n  - [Transfer with Feature Adapter](#transfer-with-feature-adapter)\n  - [Transfer with Other Methods](#transfer-with-other-methods)\n- [Vision-Language Model Knowledge Distillation Methods](#vision-language-model-knowledge-distillation-methods)\n  - [Knowledge Distillation for Object Detection](#knowledge-distillation-for-object-detection)\n  - [Knowledge Distillation for Semantic Segmentation](#knowledge-distillation-for-semantic-segmentation)\n\n## Datasets\n\n### Datasets for VLM Pre-training\n\n\n| Dataset                                             |  Year  |     Num of Image-Text Pairs     |     Language     | Project |                                  \n|-----------------------------------------------------|:------:|:-------------------------------:|:----------------:|:------------:|\n|[SBU Caption](https://proceedings.neurips.cc/paper_files/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf)|2011|1M|English|[Project](https://www.cs.rice.edu/~vo9/sbucaptions/)|\n|[COCO 
Caption](https://arxiv.org/pdf/1504.00325v2.pdf)|2016|1.5M|English|[Project](https://github.com/tylin/coco-caption)|\n|[Yahoo Flickr Creative Commons 100 Million](https://arxiv.org/pdf/1503.01817v2.pdf)|2016|100M|English|[Project](http://projects.dfki.uni-kl.de/yfcc100m/)|\n|[Visual Genome](https://arxiv.org/pdf/1602.07332v1.pdf)|2017|5.4M|English|[Project](http://visualgenome.org/)|\n|[Conceptual Captions 3M](https://aclanthology.org/P18-1238.pdf)|2018|3.3M|English|[Project](https://ai.google.com/research/ConceptualCaptions/)|\n|[Localized Narratives](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123500630.pdf)|2020|0.87M|English|[Project](https://google.github.io/localized-narratives/)|\n|[Conceptual 12M](https://openaccess.thecvf.com/content/CVPR2021/papers/Changpinyo_Conceptual_12M_Pushing_Web-Scale_Image-Text_Pre-Training_To_Recognize_Long-Tail_Visual_CVPR_2021_paper.pdf)|2021|12M|English|[Project](https://github.com/google-research-datasets/conceptual-12m)|\n|[Wikipedia-based Image Text](https://arxiv.org/pdf/2103.01913v2.pdf)|2021|37.6M|108 Languages|[Project](https://github.com/google-research-datasets/wit)|\n|[Red Caps](https://arxiv.org/pdf/2111.11431v1.pdf)|2021|12M|English|[Project](https://redcaps.xyz/)|\n|[LAION400M](https://arxiv.org/pdf/2111.02114v1.pdf)|2021|400M|English|[Project](https://laion.ai/blog/laion-400-open-dataset/)|\n|[LAION5B](https://arxiv.org/pdf/2210.08402.pdf)|2022|5B|Over 100 Languages|[Project](https://laion.ai/blog/laion-5b/)|\n|[WuKong](https://arxiv.org/pdf/2202.06767.pdf)|2022|100M|Chinese|[Project](https://wukong-dataset.github.io/wukong-dataset/)|\n|[CLIP](https://arxiv.org/pdf/2103.00020.pdf)|2021|400M|English|-|\n|[ALIGN](https://arxiv.org/pdf/2102.05918.pdf)|2021|1.8B|English|-|\n|[FILIP](https://arxiv.org/pdf/2111.07783.pdf)|2021|300M|English|-|\n|[WebLI](https://arxiv.org/pdf/2209.06794.pdf)|2022|12B|English|-|\n\n\n\n### Datasets for VLM Evaluation\n\n#### Image Classification\n\n| Dataset                   
                          |  Year  | Classes | Training | Testing |Evaluation Metric| Project|                                  \n|-----------------------------------------------------|:------:|:-------:|:--------:|:-------:|:------:|:-----------:|\n|MNIST|1998|10|60,000|10,000|Accuracy|[Project](http://yann.lecun.com/exdb/mnist/)|\n|Caltech-101|2004|102|3,060|6,085|Mean Per Class|[Project](https://data.caltech.edu/records/mzrjq-6wc02)|\n|PASCAL VOC 2007|2007|20|5,011|4,952|11-point mAP|[Project](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/)|\n|Oxford 102 Flowers|2008|102|2,040|6,149|Mean Per Class|[Project](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/)|\n|CIFAR-10|2009|10|50,000|10,000|Accuracy|[Project](https://www.cs.toronto.edu/~kriz/cifar.html)|\n|CIFAR-100|2009|100|50,000|10,000|Accuracy|[Project](https://www.cs.toronto.edu/~kriz/cifar.html)|\n|ImageNet-1k|2009|1000|1,281,167|50,000|Accuracy|[Project](https://www.image-net.org/)|\n|SUN397|2010|397|19,850|19,850|Accuracy|[Project](https://vision.princeton.edu/projects/2010/SUN/)|\n|SVHN|2011|10|73,257|26,032|Accuracy|[Project](http://ufldl.stanford.edu/housenumbers/)|\n|STL-10|2011|10|1,000|8,000|Accuracy|[Project](https://cs.stanford.edu/~acoates/stl10/)|\n|GTSRB|2011|43|26,640|12,630|Accuracy|[Project](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign)|\n|KITTI Distance|2012|4|6,770|711|Accuracy|[Project](https://github.com/harshilpatel312/KITTI-distance-estimation)|\n|IIIT5k|2012|36|2,000|3,000|Accuracy|[Project](https://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset)|\n|Oxford-IIIT PETS|2012|37|3,680|3,669|Mean Per Class|[Project](https://www.robots.ox.ac.uk/~vgg/data/pets/)|\n|Stanford Cars|2013|196|8,144|8,041|Accuracy|[Project](http://ai.stanford.edu/~jkrause/cars/car_dataset.html)|\n|FGVC Aircraft|2013|100|6,667|3,333|Mean Per Class|[Project](https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/)|\n|Facial 
Emotion|2013|8|32,140|3,574|Accuracy|[Project](https://www.kaggle.com/competitions/challenges-in-representation-learning-facial-expression-recognition-challenge/data)|\n|Rendered SST2|2013|2|7,792|1,821|Accuracy|[Project](https://github.com/openai/CLIP/blob/main/data/rendered-sst2.md)|\n|Describable Textures|2014|47|3,760|1,880|Accuracy|[Project](https://www.robots.ox.ac.uk/~vgg/data/dtd/)|\n|Food-101|2014|101|75,750|25,250|Accuracy|[Project](https://www.kaggle.com/datasets/dansbecker/food-101)|\n|Birdsnap|2014|500|42,283|2,149|Accuracy|[Project](https://thomasberg.org/)|\n|RESISC45|2017|45|3,150|25,200|Accuracy|[Project](https://pan.baidu.com/s/1mifR6tU?_at_=1679281159364#list/path=%2F)|\n|CLEVR Counts|2017|8|2,000|500|Accuracy|[Project](https://cs.stanford.edu/people/jcjohns/clevr/)|\n|PatchCamelyon|2018|2|294,912|32,768|Accuracy|[Project](https://github.com/basveeling/pcam)|\n|EuroSAT|2019|10|10,000|5,000|Accuracy|[Project](https://github.com/phelber/eurosat)|\n|Hateful Memes|2020|2|8,500|500|ROC AUC|[Project](https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/)|\n|Country211|2021|211|43,200|21,100|Accuracy|[Project](https://github.com/openai/CLIP/blob/main/data/country211.md)|\n\n#### Image-Text Retrieval\n\n| Dataset                                             |  Year  | Classes | Training | Testing |Evaluation Metric| Project|                                  \n|-----------------------------------------------------|:------:|:-------:|:--------:|:-------:|:------:|:-----------:|\n|Flickr30k|2014|-|31,783|-|Recall|[Project](https://shannon.cs.illinois.edu/DenotationGraph/)|\n|COCO Caption|2015|-|82,783|5,000|Recall|[Project](https://github.com/tylin/coco-caption)|\n\n\n#### Action Recognition\n\n| Dataset                                             |  Year  | Classes | Training | Testing |Evaluation Metric| Project|                                  
\n|-----------------------------------------------------|:------:|:-------:|:--------:|:-------:|:------:|:-----------:|\n|UCF101|2012|101|9,537|1,794|Accuracy|[Project](https://www.crcv.ucf.edu/data/UCF101.php)|\n|Kinetics700|2019|700|494,801|31,669|Mean (top1, top5)|[Project](https://www.deepmind.com/open-source/kinetics)|\n|RareAct|2020|122|7,607|-|mWAP, mSAP|[Project](https://github.com/antoine77340/RareAct)|\n\n#### Object Detection\n\n| Dataset                                             |  Year  | Classes | Training | Testing |Evaluation Metric| Project|                                  \n|-----------------------------------------------------|:------:|:-------:|:--------:|:-------:|:------:|:-----------:|\n|COCO 2014 Detection|2014|80|83,000|41,000|Box mAP|[Project](https://www.kaggle.com/datasets/jeffaudi/coco-2014-dataset-for-yolov3)|\n|COCO 2017 Detection|2017|80|118,000|5,000|Box mAP|[Project](https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset)|\n|LVIS|2019|1203|118,000|5,000|Box mAP|[Project](https://www.lvisdataset.org/)|\n|ODinW|2022|314|132,413|20,070|Box mAP|[Project](https://eval.ai/web/challenges/challenge-page/1839/overview)|\n\n#### Semantic Segmentation\n\n| Dataset                                             |  Year  | Classes | Training | Testing |Evaluation Metric| Project|                                  \n|-----------------------------------------------------|:------:|:-------:|:--------:|:-------:|:------:|:-----------:|\n|PASCAL VOC 2012|2012|20|1,464|1,449|mIoU|[Project](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/)|\n|PASCAL Context|2014|459|4,998|5,105|mIoU|[Project](https://www.cs.stanford.edu/~roozbeh/pascal-context/)|\n|Cityscapes|2016|19|2,975|500|mIoU|[Project](https://www.cityscapes-dataset.com/)|\n|ADE20k|2017|150|25,574|2,000|mIoU|[Project](https://groups.csail.mit.edu/vision/datasets/ADE20K/)|\n\n## Vision-Language Pre-training Methods\n\n\n\n\n\n\n### Pre-training with Contrastive Objective\n\n| Paper              
                               |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[CLIP: Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)|ICML 2021|[Code](https://github.com/openai/CLIP)|\n|[ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/pdf/2102.05918.pdf)|ICML 2021|-|\n|[OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation](https://github.com/facebookresearch/OTTER)|arXiv 2021|[Code](https://github.com/facebookresearch/OTTER)|\n|[Florence: A New Foundation Model for Computer Vision](https://arxiv.org/abs/2111.11432)|arXiv 2021|-|\n|[RegionClip: Region-based Language-Image Pretraining](https://arxiv.org/abs/2112.09106)|arXiv 2021|[Code](https://github.com/microsoft/RegionCLIP)|\n|[DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](https://arxiv.org/abs/2110.05208)|ICLR 2022|[Code](https://github.com/Sense-GVT/DeCLIP)|\n|[FILIP: Fine-grained Interactive Language-Image Pre-Training](https://arxiv.org/abs/2111.07783)|ICLR 2022|-|\n|[KELIP: Large-scale Bilingual Language-Image Contrastive Learning](https://arxiv.org/abs/2203.14463)|ICLRW 2022|[Code](https://github.com/navervision/KELIP)|\n|[ZeroVL: Contrastive Vision-Language Pre-training with Limited Resources](https://arxiv.org/abs/2112.09331)|ECCV 2022|[Code](https://github.com/zerovl/ZeroVL)|\n|[SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750)|ECCV 2022|[Code](https://github.com/facebookresearch/SLIP)|\n|[UniCL: Unified Contrastive Learning in Image-Text-Label Space](https://arxiv.org/abs/2204.03610)|CVPR 2022|[Code](https://github.com/microsoft/UniCL)|\n|[LiT: Zero-Shot Transfer with Locked-image text 
Tuning](https://arxiv.org/abs/2111.07991)|CVPR 2022|[Code](https://google-research.github.io/vision_transformer/lit/)|\n|[GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)|CVPR 2022|[Code](https://github.com/NVlabs/GroupViT)|\n|[PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining](https://arxiv.org/abs/2204.14095)|NeurIPS 2022|-|\n|[UniCLIP: Unified Framework for Contrastive Language-Image Pre-training](https://arxiv.org/abs/2209.13430)|NeurIPS 2022|-|\n|[K-LITE: Learning Transferable Visual Models with External Knowledge](https://arxiv.org/abs/2204.09222)|NeurIPS 2022|[Code](https://github.com/microsoft/klite)|\n|[FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone](https://arxiv.org/abs/2206.07643)|NeurIPS 2022|[Code](https://github.com/microsoft/FIBER)|\n|[Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335)|arXiv 2022|[Code](https://github.com/OFA-Sys/Chinese-CLIP)|\n|[AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679)|arXiv 2022|[Code](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/AltCLIP)|\n|[SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation](https://arxiv.org/abs/2211.14813)|arXiv 2022|[Code](https://github.com/ArrowLuo/SegCLIP)|\n|[NLIP: Noise-robust Language-Image Pre-training](https://arxiv.org/abs/2212.07086)|AAAI 2023|-|\n|[PaLI: A Jointly-Scaled Multilingual Language-Image Model](https://arxiv.org/abs/2209.06794)|ICLR 2023|[Project](https://ai.googleblog.com/2022/09/pali-scaling-language-image-learning-in.html)|\n|[HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention](https://arxiv.org/abs/2303.02995)|ICLR 2023|[Code](https://github.com/jeykigung/hiclip)|\n|[CLIPPO: Image-and-Language Understanding from Pixels Only](https://arxiv.org/abs/2212.08045)|CVPR 
2023|[Code](https://github.com/google-research/big_vision)|\n|[RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training](https://openaccess.thecvf.com/content/CVPR2023/papers/Xie_RA-CLIP_Retrieval_Augmented_Contrastive_Language-Image_Pre-Training_CVPR_2023_paper.pdf)|CVPR 2023|-|\n|[DeAR: Debiasing Vision-Language Models with Additive Residuals](https://arxiv.org/abs/2303.10431)|CVPR 2023|-|\n|[Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training](https://arxiv.org/abs/2301.02280)|CVPR 2023|[Code](https://github.com/facebookresearch/diht)|\n|[LaCLIP: Improving CLIP Training with Language Rewrites](https://arxiv.org/abs/2305.20088)|NeurIPS 2023|[Code](https://github.com/LijieFan/LaCLIP)|\n|[ALIP: Adaptive Language-Image Pre-training with Synthetic Caption](https://arxiv.org/pdf/2308.08428.pdf)|ICCV 2023|[Code](https://github.com/deepglint/ALIP)|\n|[GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training](https://arxiv.org/pdf/2308.11331v1.pdf)|ICCV 2023|-|\n|[CLIPpy: Perceptual Grouping in Contrastive Vision-Language Models](https://arxiv.org/abs/2210.09996)|ICCV 2023|-|\n|[ViTamin: Designing Scalable Vision Models in the Vision-Language Era](https://arxiv.org/abs/2404.02132v1)|CVPR 2024|[Code](https://github.com/Beckschen/ViTamin)|\n|[Iterated Learning Improves Compositionality in Large Vision-Language Models](https://arxiv.org/abs/2404.02145v1)|CVPR 2024|-|\n|[FairCLIP: Harnessing Fairness in Vision-Language Learning](https://arxiv.org/abs/2403.19949v1)|CVPR 2024|[Code](https://ophai.hms.harvard.edu/datasets/fairvlmed10k)|\n|[Retrieval-Enhanced Contrastive Vision-Text Models](https://arxiv.org/abs/2306.07196)|ICLR 2024|-|\n|[CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions](https://arxiv.org/abs/2411.16828)|arXiv 2024|[Code](https://github.com/UCSC-VLAA/CLIPS)|\n|[Sigmoid Loss for Language Image 
Pre-Training](https://openaccess.thecvf.com/content/ICCV2023/papers/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.pdf)|ICCV 2023|[Code](https://github.com/google-research/big_vision)|\n\n\n\n\n\n\n\n\n\n\n\n### Pre-training with Generative Objective\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482)|CVPR 2022|[Code](https://github.com/facebookresearch/multimodal/tree/main/examples/flava)|\n|[CoCa: Contrastive Captioners are Image-Text Foundation Models](https://arxiv.org/abs/2205.01917)|arXiv 2022|[Code](https://github.com/lucidrains/CoCa-pytorch)|\n|[Too Large; Data Reduction for Vision-Language Pre-Training](https://arxiv.org/abs/2305.20087)|arXiv 2023|[Code](https://github.com/showlab/data-centric.vlp)|\n|[SAM: Segment Anything](https://arxiv.org/abs/2304.02643)|arXiv 2023|[Code](https://github.com/facebookresearch/segment-anything)|\n|[SEEM: Segment Everything Everywhere All at Once](https://arxiv.org/pdf/2304.06718.pdf)|arXiv 2023|[Code](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)|\n|[Semantic-SAM: Segment and Recognize Anything at Any Granularity](https://arxiv.org/pdf/2307.04767.pdf)|arXiv 2023|[Code](https://github.com/UX-Decoder/Semantic-SAM)|\n|[Generative Region-Language Pretraining for Open-Ended Object Detection](https://arxiv.org/pdf/2403.10191v1.pdf)|CVPR 2024|[Code](https://github.com/FoundationVision/GenerateU)|\n|[InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks](https://arxiv.org/abs/2312.14238)|CVPR 2024|[Code](https://github.com/OpenGVLab/InternVL)|\n|[VILA: On Pre-training for Visual Language Models](https://arxiv.org/abs/2312.07533)|CVPR 2024|-|\n|[Enhancing Vision-Language Pre-training with Rich 
Supervisions](https://arxiv.org/pdf/2403.03346v1.pdf)|CVPR 2024|-|\n|[Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization](https://arxiv.org/abs/2309.04669)|ICLR 2024|[Code](https://github.com/jy0205/LaVIT)|\n|[MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning](https://arxiv.org/abs/2309.07915)|ICLR 2024|[Code](https://github.com/PKUnlp-icler/MIC)|\n|[RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness](https://arxiv.org/abs/2405.17220)|arXiv 2024|[Code](https://github.com/RLHF-V/RLAIF-V)|\n|[RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback](https://openaccess.thecvf.com/content/CVPR2024/papers/Yu_RLHF-V_Towards_Trustworthy_MLLMs_via_Behavior_Alignment_from_Fine-grained_Correctional_CVPR_2024_paper.pdf)|CVPR 2024|[Code](https://github.com/RLHF-V/RLHF-V)|\n|[Efficient Vision-Language Pre-training by Cluster Masking](https://arxiv.org/pdf/2405.08815)|CVPR 2024|[Code](https://github.com/Zi-hao-Wei/Efficient-Vision-Language-Pre-training-by-Cluster-Masking)|\n\n\n\n\n### Pre-training with Alignment Objective\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[GLIP: Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)|CVPR 2022|[Code](https://github.com/microsoft/GLIP)|\n|[DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection](https://arxiv.org/abs/2209.09407)|NeurIPS 2022|-|\n|[nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training](https://arxiv.org/abs/2210.09304)|CVPR 2023|[Code](https://github.com/shallowtoil/xclip)|\n|[Do Vision and Language Encoders Represent the World 
Similarly?](https://openaccess.thecvf.com/content/CVPR2024/papers/Maniparambil_Do_Vision_and_Language_Encoders_Represent_the_World_Similarly_CVPR_2024_paper.pdf)|CVPR 2024|[Code](https://github.com/mayug/0-shot-llm-vision)|\n|[Non-autoregressive Sequence-to-Sequence Vision-Language Models](https://arxiv.org/abs/2403.02249v1)|CVPR 2024|-|\n|[MMRL: Multi-Modal Representation Learning for Vision-Language Models](https://arxiv.org/abs/2503.08497)|CVPR 2025|[Code](https://github.com/yunncheng/MMRL)|\n\n\n\n\n## Vision-Language Model Transfer Learning Methods\n\n### Transfer with Prompt Tuning\n\n#### Transfer with Text Prompt Tuning\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[CoOp: Learning to Prompt for Vision-Language Models](https://arxiv.org/abs/2109.01134)|IJCV 2022|[Code](https://github.com/KaiyangZhou/CoOp)|\n|[CoCoOp: Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557)|CVPR 2022|[Code](https://github.com/KaiyangZhou/CoOp)|\n|[ProDA: Prompt Distribution Learning](https://arxiv.org/abs/2205.03340)|CVPR 2022|-|\n|[DenseClip: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/abs/2112.01518)|CVPR 2022|[Code](https://github.com/raoyongming/DenseCLIP)|\n|[TPT: Test-time prompt tuning for zero-shot generalization in vision-language models](https://arxiv.org/abs/2209.07511)|NeurIPS 2022|[Code](https://github.com/azshue/TPT)|\n|[DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations](https://arxiv.org/abs/2206.09541)|NeurIPS 2022|[Code](https://github.com/sunxm2357/DualCoOp)|\n|[CPL: Counterfactual Prompt Learning for Vision and Language Models](https://arxiv.org/abs/2210.10362)|EMNLP 2022|[Code](https://github.com/eric-ai-lab/CPL)|\n|[Bayesian Prompt Learning for Image-Language Model 
Generalization](https://arxiv.org/abs/2210.02390v2)|arXiv 2022|-|\n|[UPL: Unsupervised Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2204.03649)|arXiv 2022|[Code](https://github.com/tonyhuang2022/UPL)|\n|[ProGrad: Prompt-aligned Gradient for Prompt Tuning](https://arxiv.org/abs/2205.14865)|arXiv 2022|[Code](https://github.com/BeierZhu/Prompt-align)|\n|[SoftCPT: Prompt Tuning with Soft Context Sharing for Vision-Language Models](https://arxiv.org/abs/2208.13474)|arXiv 2022|[Code](https://github.com/kding1225/softcpt)|\n|[SubPT: Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models](https://arxiv.org/abs/2211.02219)|TCSVT 2023|[Code](https://github.com/machengcheng2016/Subspace-Prompt-Learning)|\n|[LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision \u0026 Language Models](https://arxiv.org/abs/2210.01115)|CVPR 2023|[Code](https://www.adrianbulat.com/lasp)|\n|[LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition](https://arxiv.org/abs/2305.04536)|ACLW 2024|[Code](https://github.com/richard-peng-xia/LMPT)|\n|[Texts as Images in Prompt Tuning for Multi-Label Image Recognition](https://arxiv.org/abs/2211.12739)|CVPR 2023|[Code](https://github.com/guozix/TaI-DPT)|\n|[Visual-Language Prompt Tuning with Knowledge-guided Context Optimization](https://arxiv.org/abs/2303.13283)|CVPR 2023|[Code](https://github.com/htyao89/KgCoOp)|\n|[Learning to Name Classes for Vision and Language Models](https://arxiv.org/abs/2304.01830v1)|CVPR 2023|-|\n|[PLOT: Prompt Learning with Optimal Transport for Vision-Language Models](https://arxiv.org/abs/2210.01253)|ICLR 2023|[Code](https://github.com/CHENGY12/PLOT)|\n|[CuPL: What does a platypus look like? 
Generating customized prompts for zero-shot image classification](https://arxiv.org/abs/2209.03320)|ICCV 2023|[Code](https://github.com/sarahpratt/CuPL)|\n|[ProTeCt: Prompt Tuning for Hierarchical Consistency](https://arxiv.org/abs/2306.02240)|arXiv 2023|-|\n|[Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning](https://arxiv.org/abs/2306.01669)|arXiv 2023|[Code](http://github.com/BatsResearch/menghini-enhanceCLIPwithCLIP-code)|\n|[Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?](https://arxiv.org/pdf/2307.11978v1.pdf)|ICCV 2023|[Code](https://github.com/CEWu/PTNL)|\n|[Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models](https://arxiv.org/pdf/2303.06571.pdf)|ICCV 2023|-|\n|[Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models](https://arxiv.org/pdf/2308.11186v1.pdf)|ICCV 2023|-|\n|[Read-only Prompt Optimization for Vision-Language Few-shot Learning](https://arxiv.org/pdf/2308.14960.pdf)|ICCV 2023|[Code](https://github.com/mlvlab/RPO)|\n|[Bayesian Prompt Learning for Image-Language Model Generalization](https://arxiv.org/pdf/2210.02390.pdf)|ICCV 2023|[Code](https://github.com/saic-fi/Bayesian-Prompt-Learning)|\n|[Distribution-Aware Prompt Tuning for Vision-Language Models](https://arxiv.org/pdf/2309.03406.pdf)|ICCV 2023|[Code](https://github.com/mlvlab/DAPT)|\n|[LPT: Long-Tailed Prompt Tuning For Image Classification](https://arxiv.org/pdf/2210.01033.pdf)|ICCV 2023|[Code](https://github.com/DongSky/LPT)|\n|[Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning](https://openaccess.thecvf.com/content/ICCV2023/papers/Feng_Diverse_Data_Augmentation_with_Diffusions_for_Effective_Test-time_Prompt_Tuning_ICCV_2023_paper.pdf)|ICCV 2023|[Code](https://github.com/chunmeifeng/DiffTPT)|\n|[Efficient Test-Time Prompt Tuning for Vision-Language Models](https://arxiv.org/abs/2408.05775)|arXiv 2024|-|\n|[Text-driven Prompt Generation for Vision-Language 
Models in Federated Learning](https://arxiv.org/abs/2310.06123)|ICLR 2024|-|\n|[C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion](https://openreview.net/pdf?id=jzzEHTBFOT)|ICLR 2024|-|\n|[Prompt Gradient Projection for Continual Learning](https://openreview.net/pdf?id=EH2O3h7sBI)|ICLR 2024|-|\n|[Nemesis: Normalizing the soft-prompt vectors of vision-language models](https://openreview.net/pdf?id=zmJDzPh1Dm)|ICLR 2024|[Code](https://github.com/ShyFoo/Nemesis)|\n|[DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning](https://arxiv.org/abs/2309.05173)|ICLR 2024|[Code](https://github.com/ZhengxiangShi/DePT)|\n|[TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model](https://arxiv.org/abs/2311.18231)|CVPR 2024|[Code](https://github.com/htyao89/Textual-based_Class-aware_prompt_tuning/)|\n|[One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models](https://arxiv.org/abs/2403.01849v1)|CVPR 2024|[Code](https://github.com/TreeLLi/APT)|\n|[Any-Shift Prompting for Generalization over Distributions](https://arxiv.org/abs/2402.10099)|CVPR 2024|-|\n|[Towards Better Vision-Inspired Vision-Language Models](https://www.lamda.nju.edu.cn/caoyh/files/VIVL.pdf)|CVPR 2024|-|\n|[Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models](https://arxiv.org/pdf/2407.05342v1)|ECCV 2024|[Code](https://github.com/lloongx/DIKI)|\n|[Historical Test-time Prompt Tuning for Vision Foundation Models](https://arxiv.org/pdf/2410.20346)|NeurIPS 2024|-|\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n#### Transfer with Visual Prompt Tuning\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[Exploring Visual Prompts for Adapting Large-Scale 
Models](https://arxiv.org/abs/2203.17274)|arXiv 2022|[Code](https://github.com/hjbahng/visual_prompting)|\n|[Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification](https://arxiv.org/abs/2306.02243)|arXiv 2023|-|\n|[Fine-Grained Visual Prompting](https://arxiv.org/abs/2306.04356)|arXiv 2023|-|\n|[LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models](https://arxiv.org/pdf/2309.01155v1.pdf)|ICCV 2023|[Code](https://chengshiest.github.io/logo/)|\n|[Progressive Visual Prompt Learning with Contrastive Feature Re-formation](https://arxiv.org/abs/2304.08386)|IJCV 2024|[Code](https://github.com/MCG-NJU/ProVP)|\n|[Visual In-Context Prompting](https://arxiv.org/abs/2311.13601)|CVPR 2024|[Code](https://github.com/UX-Decoder/DINOv)|\n|[FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance](https://arxiv.org/abs/2407.05578v1)|ECCV 2024|[Code](https://pumpkin805.github.io/FALIP/)|\n\n\n\n\n#### Transfer with Text and Visual Prompt Tuning\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[UPT: Unified Vision and Language Prompt Learning](https://arxiv.org/abs/2210.07225)|arXiv 2022|[Code](https://github.com/yuhangzang/upt)|\n|[MVLPT: Multitask Vision-Language Prompt Tuning](https://arxiv.org/abs/2211.11720)|arXiv 2022|[Code](https://github.com/facebookresearch/vilbert-multi-task)|\n|[CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model](https://arxiv.org/abs/2208.08340)|arXiv 2022|[Code](https://github.com/fanrena/DPT)|\n|[MaPLe: Multi-modal Prompt Learning](https://arxiv.org/abs/2210.03117)|CVPR 2023|[Code](https://github.com/muzairkhattak/multimodal-prompt-learning)|\n|[Learning to Prompt Segment Anything Models](https://arxiv.org/pdf/2401.04651.pdf)|arXiv 2024|-|\n|[An Image Is Worth 1000 Lies: Transferability of Adversarial 
Images across Prompts on Vision-Language Models](https://openreview.net/pdf?id=nc5GgFAvtk)|ICLR 2024|-|\n|[GalLoP: Learning Global and Local Prompts for Vision-Language Models](https://arxiv.org/pdf/2407.01400)|ECCV 2024|-|\n|[CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts](https://arxiv.org/abs/2311.16445)|ECCV 2024|[Code](https://github.com/YichaoCai1/CLAP)|\n\n\n\n\n\n### Transfer with Feature Adapter\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[Clip-Adapter: Better Vision-Language Models with Feature Adapters](https://arxiv.org/abs/2110.04544)|arXiv 2021|[Code](https://github.com/gaopengcuhk/CLIP-Adapter)|\n|[Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification](https://arxiv.org/abs/2207.09519)|ECCV 2022|[Code](https://github.com/gaopengcuhk/Tip-Adapter)|\n|[SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models](https://arxiv.org/abs/2210.03794)|BMVC 2022|[Code](https://github.com/omipan/svl_adapter)|\n|[CLIPPR: Improving Zero-Shot Models with Label Distribution Priors](https://arxiv.org/abs/2212.00784)|arXiv 2022|[Code](https://github.com/jonkahana/CLIPPR)|\n|[SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification](https://arxiv.org/abs/2211.16191)|arXiv 2022|-|\n|[SuS-X: Training-Free Name-Only Transfer of Vision-Language Models](https://arxiv.org/abs/2211.16198)|ICCV 2023|[Code](https://github.com/vishaal27/SuS-X)|\n|[VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control](https://arxiv.org/abs/2308.09804)|ICCV 2023|[Code](https://github.com/HenryHZY/VL-PET)|\n|[SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image 
Segmentation, and More](https://arxiv.org/abs/2304.09148)|arXiv 2023|[Code](http://tianrun-chen.github.io/SAM-Adaptor/)|\n|[Segment Anything in High Quality](https://arxiv.org/abs/2306.01567)|arXiv 2023|[Code](https://github.com/SysCV/SAM-HQ)|\n|[HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding](https://arxiv.org/abs/2311.14064)|COLING 2025|[Code](https://github.com/richard-peng-xia/HGCLIP)|\n|[CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models](https://arxiv.org/abs/2311.16445)|arXiv 2023|-|\n|[AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation](https://arxiv.org/abs/2407.04603)|NeurIPS 2024|[Code](https://github.com/MCG-NJU/AWT)|\n|[A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models](https://arxiv.org/abs/2312.12730)|CVPR 2024|[Code](https://github.com/jusiro/CLAP)|\n|[Efficient Test-Time Adaptation of Vision-Language Models](https://arxiv.org/abs/2403.18293v1)|CVPR 2024|[Code](https://kdiaaa.github.io/tda/)|\n|[Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models](https://arxiv.org/abs/2403.17589v1)|CVPR 2024|[Code](https://github.com/YBZh/DMN)|\n\n\n\n### Transfer with Other Methods\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[VT-Clip: Enhancing Vision-Language Models with Visual-guided Texts](https://arxiv.org/abs/2112.02399)|arXiv 2021|-|\n|[Wise-FT: Robust fine-tuning of zero-shot models](https://arxiv.org/abs/2109.01903)|CVPR 2022|[Code](https://github.com/mlfoundations/wise-ft)|\n|[MaskCLIP: Extract Free Dense Labels from CLIP](https://arxiv.org/abs/2112.01071)|ECCV 2022|[Code](https://github.com/chongzhou96/MaskCLIP)|\n|[MUST: Masked Unsupervised Self-training for Label-free Image 
Classification](https://arxiv.org/abs/2206.02967)|ICLR 2023| [Code](https://github.com/salesforce/MUST)|\n|[CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention](https://arxiv.org/abs/2209.14169)|AAAI 2023|[Code](https://github.com/ziyuguo99/calip)|\n|[Semantic Prompt for Few-Shot Image Recognition](https://arxiv.org/abs/2303.14123v1)|CVPR 2023|-|\n|[Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners](https://arxiv.org/abs/2303.02151)|CVPR 2023|[Code](https://github.com/ZrrSkywalker/CaFo)|\n|[Task Residual for Tuning Vision-Language Models](https://arxiv.org/abs/2211.10277)|CVPR 2023|[Code](https://github.com/geekyutao/TaskRes)|\n|[Deeply Coupled Cross-Modal Prompt Learning](https://arxiv.org/abs/2305.17903)|ACL 2023|[Code](https://github.com/GingL/CMPA)|\n|[Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation](https://arxiv.org/abs/2306.16658)|arXiv 2023|-|\n|[Personalize Segment Anything Model with One Shot](https://arxiv.org/abs/2305.03048)|arXiv 2023|[Code](https://github.com/ZrrSkywalker/Personalize-SAM)|\n|[Chils: Zero-shot image classification with hierarchical label sets](https://proceedings.mlr.press/v202/novack23a/novack23a.pdf)|ICML 2023|[Code](https://github.com/acmi-lab/CHILS)|\n|[Improving Zero-shot Generalization and Robustness of Multi-modal Models](https://openaccess.thecvf.com/content/CVPR2023/papers/Ge_Improving_Zero-Shot_Generalization_and_Robustness_of_Multi-Modal_Models_CVPR_2023_paper.pdf)|CVPR 2023|[Code](https://github.com/gyhandy/Hierarchy-CLIP)|\n|[Exploiting Category Names for Few-Shot Classification with Vision-Language Models](https://openreview.net/pdf?id=w25Q9Ttjrs)|ICLR W 2023|-|\n|[Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models](https://arxiv.org/abs/2311.17091)|arXiv 2023|[Code](https://github.com/zhiheLu/Ensemble_VLM)|\n|[Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language 
Models](https://arxiv.org/pdf/2307.15049v1.pdf)|ICCV 2023|[Code](https://wuw2019.github.io/RMT/)|\n|[PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization](https://arxiv.org/pdf/2307.15199v1.pdf)|ICCV 2023|[Code](https://promptstyler.github.io/)|\n|[PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation](https://assets.amazon.science/ff/08/64f27eb54b82a0c59c95dc138af4/padclip-pseudo-labeling-with-adaptive-debiasing-in-clip.pdf)|ICCV 2023|-|\n|[Black Box Few-Shot Adaptation for Vision-Language models](https://arxiv.org/pdf/2304.01752.pdf)|ICCV 2023|[Code](https://github.com/saic-fi/LFA)|\n|[AD-CLIP: Adapting Domains in Prompt Space Using CLIP](https://arxiv.org/pdf/2308.05659.pdf)|ICCVW 2023|-|\n|[Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning](https://arxiv.org/pdf/2306.14565.pdf)|arXiv 2023|[Code](https://fuxiaoliu.github.io/LRV/)|\n|[Language Models as Black-Box Optimizers for Vision-Language Models](https://arxiv.org/abs/2309.05950)|arXiv 2023|-|\n|[Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching](https://arxiv.org/abs/2305.13310)|ICLR 2024|[Code](https://github.com/aim-uofa/Matcher)|\n|[Consistency-guided Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2306.01195)|ICLR 2024|-|\n|[Efficient Test-Time Adaptation of Vision-Language Models](https://arxiv.org/abs/2403.18293v1)|CVPR 2024|[Code](https://kdiaaa.github.io/tda/)|\n|[Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models](https://arxiv.org/abs/2403.17589v1)|CVPR 2024|[Code](https://github.com/YBZh/DMN)|\n|[A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models](https://arxiv.org/abs/2312.12730)|CVPR 2024|[Code](https://github.com/jusiro/CLAP)|\n|[Anchor-based Robust Finetuning of Vision-Language Models](https://arxiv.org/abs/2404.06244)|CVPR 2024|-|\n|[Pre-trained Vision and Language Transformers Are Few-Shot 
Incremental Learners](https://arxiv.org/abs/2404.02117v1)|CVPR 2024|[Code](https://github.com/KHU-AGI/PriViLege)|\n\n\n\n\n\n\n\n\n\n\n\n\n\n## Vision-Language Model Knowledge Distillation Methods\n\n### Knowledge Distillation for Object Detection\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[ViLD: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921)|ICLR 2022|[Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)|\n|[DetPro: Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model](https://arxiv.org/abs/2203.14940)|CVPR 2022|[Code](https://github.com/dyabel/detpro)|\n|[XPM: Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling](https://arxiv.org/abs/2111.12698)|CVPR 2022|[Code](https://github.com/hbdat/cvpr22_cross_modal_pseudo_labeling)|\n|[Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection](https://arxiv.org/abs/2207.03482)|NeurIPS 2022|[Code](https://github.com/hanoonaR/object-centric-ovd)|\n|[PromptDet: Towards Open-vocabulary Detection using Uncurated Images](https://arxiv.org/abs/2203.16513)|ECCV 2022|[Code](https://github.com/fcjian/PromptDet)|\n|[PB-OVD: Open Vocabulary Object Detection with Pseudo Bounding-Box Labels](https://arxiv.org/abs/2111.09452)|ECCV 2022|[Code](https://github.com/salesforce/PB-OVD)|\n|[OV-DETR: Open-Vocabulary DETR with Conditional Matching](https://arxiv.org/abs/2203.11876)|ECCV 2022|[Code](https://github.com/yuhangzang/OV-DETR)|\n|[Detic: Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605)|ECCV 2022|[Code](https://github.com/facebookresearch/Detic)|\n|[OWL-ViT: Simple Open-Vocabulary Object Detection with Vision 
Transformers](https://arxiv.org/abs/2205.06230)|ECCV 2022|[Code](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit)|\n|[VL-PLM: Exploiting Unlabeled Data with Vision and Language Models for Object Detection](https://arxiv.org/abs/2207.08954)|ECCV 2022|[Code](https://github.com/xiaofeng94/VL-PLM)|\n|[ZSD-YOLO: Zero-shot Object Detection Through Vision-Language Embedding Alignment](https://arxiv.org/abs/2109.12066)|arXiv 2022|[Code](https://github.com/Johnathan-Xie/ZSD-YOLO)|\n|[HierKD: Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation](https://arxiv.org/abs/2203.10593)|arXiv 2022|[Code](https://github.com/mengqiDyangge/HierKD)|\n|[VLDet: Learning Object-Language Alignments for Open-Vocabulary Object Detection](https://arxiv.org/abs/2211.14843)|ICLR 2023|[Code](https://github.com/clin1223/VLDet)|\n|[F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models](https://arxiv.org/abs/2209.15639)|ICLR 2023|[Code](https://github.com/google-research/google-research/tree/master/fvlm)|\n|[CondHead: Learning to Detect and Segment for Open Vocabulary Object Detection](https://arxiv.org/abs/2212.12130)|CVPR 2023|-|\n|[Aligning Bag of Regions for Open-Vocabulary Object Detection](https://arxiv.org/abs/2302.13996)|CVPR 2023|[Code](https://github.com/wusize/ovdet)|\n|[Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2305.07011v1)|CVPR 2023|[Code](https://github.com/mcahny/rovit)|\n|[Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection](https://arxiv.org/abs/2303.05892)|CVPR 2023|[Code](https://github.com/LutingWang/OADP)|\n|[CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching](https://arxiv.org/abs/2303.13076v1)|CVPR 2023|[Code](https://github.com/tgxs002/CORA)|\n|[DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region 
Alignment](https://arxiv.org/abs/2304.04514v1)|CVPR 2023|-|\n|[Detecting Everything in the Open World: Towards Universal Object Detection](https://arxiv.org/abs/2303.11749)|CVPR 2023|[Code](https://github.com/zhenyuw16/UniDetector)|\n|[CapDet: Unifying Dense Captioning and Open-World Detection Pretraining](https://arxiv.org/abs/2303.02489)|CVPR 2023|-|\n|[Contextual Object Detection with Multimodal Large Language Models](https://arxiv.org/abs/2305.18279)|arXiv 2023|[Code](https://github.com/yuhangzang/ContextDET)|\n|[Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models](https://arxiv.org/abs/2305.17207)|arXiv 2023|[Code](https://github.com/gyhandy/One-Class-Anything)|\n|[EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment](https://arxiv.org/pdf/2309.01151v1.pdf)|ICCV 2023|[Code](https://chengshiest.github.io/edadet)|\n|[Improving Pseudo Labels for Open-Vocabulary Object Detection](https://arxiv.org/pdf/2308.06412.pdf)|arXiv 2023|-|\n|[RegionGPT: Towards Region Understanding Vision Language Model](https://arxiv.org/pdf/2403.02330v1.pdf)|CVPR 2024|[Code](https://guoqiushan.github.io/regiongpt.github.io/)|\n|[LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors](https://arxiv.org/pdf/2402.04630.pdf)|ICLR 2024|-|\n|[Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction](https://openreview.net/pdf?id=M0MF4t3hE9)|ICLR 2024|-|\n|[Open-Vocabulary Object Detection via Language Hierarchy](https://arxiv.org/pdf/2410.20371)|NeurIPS 2024|–|\n\n\n\n\n\n### Knowledge Distillation for Semantic Segmentation\n\n| Paper                                             |  Published in | Code/Project |                                  \n|---------------------------------------------------|:-------------:|:------------:|\n|[SSIW: Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples](https://arxiv.org/abs/2112.03185)|arXiv 2021|-|\n|[ReCo: Retrieve 
and Co-segment for Zero-shot Transfer](https://arxiv.org/abs/2206.07045)|NeurIPS 2022|[Code](https://github.com/NoelShin/reco)|\n|[CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2203.02668)|CVPR 2022|[Code](https://github.com/CVI-SZU/CLIMS)|\n|[CLIPSeg: Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003)|CVPR 2022|[Code](https://github.com/timojl/clipseg)|\n|[ZegFormer: Decoupling Zero-Shot Semantic Segmentation](https://arxiv.org/abs/2112.07910)|CVPR 2022|[Code](https://github.com/dingjiansw101/ZegFormer)|\n|[LSeg: Language-driven Semantic Segmentation](https://arxiv.org/abs/2201.03546)|ICLR 2022|[Code](https://github.com/isl-org/lang-seg)|\n|[ZSSeg: A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model](https://arxiv.org/abs/2112.14757)|ECCV 2022|[Code](https://github.com/MendelXu/zsseg.baseline)|\n|[OpenSeg: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels](https://arxiv.org/abs/2112.12143)|ECCV 2022|[Code](https://github.com/tensorflow/tpu/tree/641c1ac6e26ed788327b973582cbfa297d7d31e7/models/official/detection/projects/openseg)|\n|[Fusioner: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models](https://arxiv.org/abs/2210.15138)|BMVC 2022|[Code](https://github.com/chaofanma/Fusioner)|\n|[OVSeg: Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP](https://arxiv.org/abs/2210.04150)|CVPR 2023|[Code](https://github.com/facebookresearch/ov-seg)|\n|[ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation](https://arxiv.org/abs/2212.03588)|CVPR 2023|[Code](https://github.com/ZiqinZhou66/ZegCLIP)|\n|[CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2212.09506)|CVPR 2023|[Code](https://github.com/linyq2117/CLIP-ES)|\n|[FreeSeg: Unified, Universal and Open-Vocabulary Image 
Segmentation](https://arxiv.org/abs/2303.17225v1)|CVPR 2023|[Code](https://freeseg.github.io/)|\n|[Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations](https://arxiv.org/abs/2303.16891v1)|CVPR 2023|[Code](https://vibashan.github.io/ovis-web/)|\n|[Exploring Open-Vocabulary Semantic Segmentation without Human Labels](https://arxiv.org/abs/2306.00450)|arXiv 2023|-|\n|[OpenVIS: Open-vocabulary Video Instance Segmentation](https://arxiv.org/abs/2305.16835)|arXiv 2023|-|\n|[Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2305.01275)|arXiv 2023|-|\n|[Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2305.05803)|arXiv 2023|[Code](https://github.com/cskyl/SAM_WSSS)|\n|[Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models](https://arxiv.org/abs/2311.17095)|arXiv 2023|-|\n|[SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning](https://arxiv.org/pdf/2308.06531v1.pdf)|ICCV 2023|[Code](https://github.com/aim-uofa/SegPrompt)|\n|[ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation](https://arxiv.org/pdf/2308.07078.pdf)|arXiv 2023|-|\n|[Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP](https://arxiv.org/pdf/2308.02487.pdf)|arXiv 2023|[Code](https://github.com/bytedance/fc-clip)|\n|[CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction](https://arxiv.org/abs/2310.01403)|ICLR 2024|-|\n\n\n\n### Knowledge Distillation for Other Tasks\n\n| Paper                                             |  Published in | Code/Project |                                  
\n|---------------------------------------------------|:-------------:|:------------:|\n|[Controlling Vision-Language Models for Universal Image Restoration](https://arxiv.org/abs/2310.01018)|arXiv 2023|[Code](https://github.com/Algolzw/daclip-uir)|\n|[FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition](https://arxiv.org/pdf/2402.03241.pdf)|ICLR 2024|[Project](https://visual-ai.github.io/froster)|\n|[AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection](https://arxiv.org/pdf/2310.18961.pdf)|ICLR 2024|[Code](https://github.com/zqhang/AnomalyCLIP)|\n|[EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata](https://arxiv.org/abs/2301.04647)|CVPR 2023|[Code](https://hellomuffin.github.io/exif-as-language/)|\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjingyi0000%2Fvlm_survey","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjingyi0000%2Fvlm_survey","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjingyi0000%2Fvlm_survey/lists"}