# Awesome Vision-Language Pretraining [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

Pretraining is blowing up the field of Vision-Language research! Let's keep track of all the works before it gets too late! Papers not based on pretraining can be found in the other awesome lists linked at the end of the repo.

If you find any overlooked papers, please open an issue or pull request and provide the paper(s) in this format:
```
- **[]** Paper Name [[pdf]]() [[code]]()
```

Note: most pretrained models can be found on [hf models](https://huggingface.co/models)
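
As a quick start, many of the checkpoints listed below can be loaded directly through the `transformers` library. The snippet below is a minimal captioning sketch; the BLIP checkpoint id and the image URL are just illustrative examples, swap in any model from the list.

```python
# Minimal sketch: image captioning with a pretrained BLIP checkpoint from the HF Hub.
# The checkpoint id and image URL are illustrative placeholders.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Encode the image and generate a caption
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```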

## Papers

- **[ViLBERT]** Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks [[pdf]](https://arxiv.org/pdf/1908.02265.pdf) [[code]](https://github.com/facebookresearch/vilbert-multi-task) [[code]](https://github.com/jiasenlu/vilbert_beta)
- **[Unified-VLP]** Unified Vision-Language Pre-Training for Image Captioning and VQA [[pdf]](https://arxiv.org/pdf/1909.11059.pdf) [[code]](https://github.com/LuoweiZhou/VLP)
- **[ImageBERT]** Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [[pdf]](https://arxiv.org/pdf/2001.07966.pdf)
- **[SimVLM]** Simple Visual Language Model Pretraining with Weak Supervision [[pdf]](https://arxiv.org/pdf/2108.10904.pdf)
- **[ALBEF]** Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [[pdf]](https://arxiv.org/pdf/2107.07651.pdf) [[code]](https://github.com/salesforce/ALBEF)
- **[LXMERT]** Learning Cross-Modality Encoder Representations from Transformers [[pdf]](https://arxiv.org/pdf/1908.07490.pdf) [[code]](https://github.com/airsplay/lxmert) [[code]](https://huggingface.co/docs/transformers/model_doc/lxmert)
- **[X-LXMERT]** Paint, Caption and Answer Questions with Multi-Modal Transformers [[pdf]](https://arxiv.org/pdf/2009.11278.pdf) [[code]](https://github.com/allenai/x-lxmert)
- **[VisualBERT]** A Simple and Performant Baseline for Vision and Language [[pdf]](https://arxiv.org/pdf/1908.03557.pdf)
- **[UNIMO]** Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [[pdf]](https://arxiv.org/pdf/2012.15409.pdf) [[code]](https://github.com/PaddlePaddle/Research/tree/master/NLP/UNIMO)
- **[UNIMO-2]** End-to-End Unified Vision-Language Grounded Learning [[pdf]](https://arxiv.org/pdf/2203.09067.pdf) [[code]](https://unimo-ptm.github.io/)
- **[BLIP]** Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [[pdf]](https://arxiv.org/pdf/2201.12086.pdf) [[code]](https://github.com/salesforce/BLIP) [[code]](https://huggingface.co/docs/transformers/model_doc/blip) [[video]](https://www.youtube.com/watch?v=X2k7n4FuI7c&ab_channel=YannicKilcher) [[demo]](https://huggingface.co/spaces/nielsr/comparing-VQA-models)
- **[BLIP-2]** Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [[pdf]](https://arxiv.org/pdf/2301.12597.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BLIP-2) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/blip-2) [[finetuning colab]](https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing) [[blog]](https://huggingface.co/blog/blip-2)
- **[Uni-EDEN]** Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training [[pdf]](https://arxiv.org/pdf/2201.04026.pdf)
- **[VisualGPT]** Data-efficient Adaptation of Pretrained Language Models for Image Captioning [[pdf]](https://arxiv.org/pdf/2102.10407.pdf) [[code]](https://github.com/Vision-CAIR/VisualGPT)
- **[MiniVLM]** A Smaller and Faster Vision-Language Model [[pdf]](https://arxiv.org/pdf/2012.06946.pdf)
- **[XGPT]** Cross-modal Generative Pre-Training for Image Captioning [[pdf]](https://arxiv.org/pdf/2003.01473.pdf)
- **[ViTCAP]** Injecting Semantic Concepts into End-to-End Image Captioning [[pdf]](https://arxiv.org/pdf/2112.05230.pdf)
- **[LEMON]** Scaling Up Vision-Language Pre-training for Image Captioning [[pdf]](https://arxiv.org/pdf/2111.12233.pdf)
- **[IC3]** Image Captioning by Committee Consensus [[pdf]](https://arxiv.org/pdf/2302.01328.pdf) [[code]](https://github.com/DavidMChan/caption-by-committee)
- **[TAP]** Text-Aware Pre-training for Text-VQA and Text-Caption [[pdf]](https://arxiv.org/pdf/2012.04638.pdf)
- **[PICa]** An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [[pdf]](https://arxiv.org/pdf/2109.05014.pdf)
- **[Prophet]** Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering [[pdf]](https://openaccess.thecvf.com/content/CVPR2023/papers/Shao_Prompting_Large_Language_Models_With_Answer_Heuristics_for_Knowledge-Based_Visual_CVPR_2023_paper.pdf) [[code]](https://github.com/MILVLG/prophet)
- **[CVLP]** Contrastive Visual-Linguistic Pretraining [[pdf]](https://arxiv.org/pdf/2007.13135.pdf) [[code]](https://github.com/ArcherYunDong/CVLP)
- **[UniT]** Multimodal Multitask Learning with a Unified Transformer [[pdf]](https://arxiv.org/pdf/2102.10772.pdf) [[website]](https://mmf.sh/)
- **[VL-BERT]** Pre-training of Generic Visual-Linguistic Representations [[pdf]](https://arxiv.org/pdf/1908.08530.pdf) [[code]](https://github.com/jackroos/VL-BERT)
- **[Unicoder-VL]** A Universal Encoder for Vision and Language by Cross-modal Pre-training [[pdf]](https://arxiv.org/pdf/1908.06066.pdf)
- **[UNITER]** UNiversal Image-TExt Representation Learning [[pdf]](https://arxiv.org/pdf/1909.11740.pdf) [[code]](https://github.com/ChenRocks/UNITER)
- **[ViLT]** Vision-and-Language Transformer Without Convolution or Region Supervision [[pdf]](https://arxiv.org/pdf/2102.03334.pdf) [[code]](https://github.com/dandelin/ViLT) [[code]](https://huggingface.co/docs/transformers/model_doc/vilt) [[demo]](https://huggingface.co/spaces/nielsr/vilt-vqa)
- **[GLIP]** Grounded Language-Image Pre-training [[pdf]](https://arxiv.org/pdf/2112.03857.pdf) [[code]](https://github.com/microsoft/GLIP)
- **[GLIPv2]** Unifying Localization and Vision-Language Understanding [[pdf]](https://arxiv.org/pdf/2206.05836.pdf) [[code]](https://github.com/microsoft/GLIP)
- **[VLMo]** Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [[pdf]](https://arxiv.org/pdf/2111.02358.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/vlmo)
- **[METER]** An Empirical Study of Training End-to-End Vision-and-Language Transformers [[pdf]](https://arxiv.org/pdf/2111.02387.pdf) [[code]](https://github.com/zdou0830/METER)
- **[WenLan]** Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [[pdf]](https://arxiv.org/pdf/2103.06561.pdf)
- **[InterBERT]** Vision-and-Language Interaction for Multi-modal Pretraining [[pdf]](https://arxiv.org/pdf/2003.13198.pdf)
- **[SemVLP]** Vision-Language Pre-training by Aligning Semantics at Multiple Levels [[pdf]](https://arxiv.org/pdf/2103.07829.pdf)
- **[E2E-VLP]** End-to-End Vision-Language Pre-training Enhanced by Visual Learning [[pdf]](https://arxiv.org/pdf/2106.01804.pdf)
- **[VinVL]** Revisiting Visual Representations in Vision-Language Models [[pdf]](https://arxiv.org/pdf/2101.00529.pdf) [[code]](https://github.com/microsoft/Oscar) [[code]](https://github.com/pzzhang/VinVL)
- **[UFO]** A UniFied TransfOrmer for Vision-Language Representation Learning [[pdf]](https://arxiv.org/pdf/2111.10023.pdf)
- **[Florence]** A New Foundation Model for Computer Vision [[pdf]](https://arxiv.org/pdf/2111.11432.pdf)
- **[VILLA]** Large-Scale Adversarial Training for Vision-and-Language Representation Learning [[pdf]](https://arxiv.org/pdf/2006.06195.pdf) [[code]](https://github.com/zhegan27/VILLA)
- **[TDEN]** Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network [[pdf]](https://arxiv.org/pdf/2101.11562.pdf) [[code]](https://github.com/YehLi/TDEN)
- **[ERNIE-ViL]** Knowledge Enhanced Vision-Language Representations Through Scene Graph [[pdf]](https://arxiv.org/pdf/2006.16934.pdf)
- **[Vokenization]** Improving Language Understanding with Contextualized, Visual-Grounded Supervision [[pdf]](https://arxiv.org/pdf/2010.06775.pdf) [[code]](https://github.com/airsplay/vokenization)
- **[12-in-1]** Multi-Task Vision and Language Representation Learning [[pdf]](https://arxiv.org/pdf/1912.02315.pdf) [[code]](https://github.com/facebookresearch/vilbert-multi-task)
- **[KVL-BERT]** Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning [[pdf]](https://arxiv.org/pdf/2012.07000.pdf)
- **[Oscar]** Object-Semantics Aligned Pre-training for Vision-Language Tasks [[pdf]](https://arxiv.org/pdf/2004.06165.pdf) [[code]](https://github.com/microsoft/Oscar)
- **[VIVO]** Visual Vocabulary Pre-Training for Novel Object Captioning [[pdf]](https://arxiv.org/pdf/2009.13682.pdf)
- **[SOHO]** End-to-End Pre-training for Vision-Language Representation Learning [[pdf]](https://arxiv.org/pdf/2104.03135.pdf) [[code]](https://github.com/researchmm/soho)
- **[Pixel-BERT]** Aligning Image Pixels with Text by Deep Multi-Modal Transformers [[pdf]](https://arxiv.org/pdf/2004.00849.pdf)
- **[LightningDOT]** Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval [[pdf]](https://arxiv.org/pdf/2103.08784.pdf) [[code]](https://github.com/intersun/LightningDOT)
- **[VirTex]** Learning Visual Representations from Textual Annotations [[pdf]](https://arxiv.org/pdf/2006.06666.pdf) [[code]](https://github.com/kdexd/virtex)
- **[Uni-Perceiver]** Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [[pdf]](https://arxiv.org/pdf/2112.01522.pdf) [[code]](https://github.com/fundamentalvision/Uni-Perceiver)
- **[Uni-Perceiver v2]** A Generalist Model for Large-Scale Vision and Vision-Language Tasks [[pdf]](https://arxiv.org/pdf/2211.09808.pdf) [[code]](https://github.com/fundamentalvision/Uni-Perceiver)
- **[CoCa]** Contrastive Captioners are Image-Text Foundation Models [[pdf]](https://arxiv.org/pdf/2205.01917.pdf) [[code]](https://github.com/lucidrains/CoCa-pytorch) [[code]](https://github.com/mlfoundations/open_clip) [[colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb)
- **[Flamingo]** A Visual Language Model for Few-Shot Learning [[pdf]](https://arxiv.org/pdf/2204.14198.pdf) [[code]](https://github.com/lucidrains/flamingo-pytorch) [[code]](https://github.com/mlfoundations/open_flamingo) [[code]](https://github.com/dhansmair/flamingo-mini) [[website]](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) [[blog]](https://wandb.ai/gladiator/Flamingo%20VLM/reports/DeepMind-Flamingo-A-Visual-Language-Model-for-Few-Shot-Learning--VmlldzoyOTgzMDI2) [[blog]](https://laion.ai/blog/open-flamingo/) [[blog]](https://laion.ai/blog/open-flamingo-v2/)
- **[BEiT-3]** Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [[pdf]](https://arxiv.org/pdf/2208.10442.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/beit3)
- **[UniCL]** Unified Contrastive Learning in Image-Text-Label Space [[pdf]](https://arxiv.org/pdf/2204.03610.pdf) [[code]](https://github.com/microsoft/UniCL)
- **[UVLP]** Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [[pdf]](https://arxiv.org/pdf/2203.00242.pdf)
- **[OFA]** Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework [[pdf]](https://arxiv.org/pdf/2202.03052.pdf) [[code]](https://github.com/OFA-Sys/OFA) [[models and demos]](https://huggingface.co/OFA-Sys)
- **[GPV-1]** Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture [[pdf]](https://arxiv.org/pdf/2104.00743.pdf) [[code]](https://github.com/allenai/gpv-1/) [[website]](https://prior.allenai.org/projects/gpv)
- **[GPV-2]** Webly Supervised Concept Expansion for General Purpose Vision Models [[pdf]](https://arxiv.org/pdf/2202.02317.pdf) [[code]](https://github.com/allenai/gpv2/) [[website]](https://prior.allenai.org/projects/gpv2)
- **[TCL]** Vision-Language Pre-Training with Triple Contrastive Learning [[pdf]](https://arxiv.org/pdf/2202.10401.pdf) [[code]](https://github.com/uta-smile/TCL)
- **[L-Verse]** Bidirectional Generation Between Image and Text [[pdf]](https://arxiv.org/pdf/2111.11133.pdf)
- **[FLAVA]** A Foundational Language And Vision Alignment Model [[pdf]](https://arxiv.org/pdf/2112.04482.pdf) [[code]](https://github.com/facebookresearch/multimodal/tree/main/examples/flava) [[code]](https://huggingface.co/docs/transformers/model_doc/flava) [[website]](https://flava-model.github.io/) [[tutorial]](https://pytorch.org/tutorials/beginner/flava_finetuning_tutorial.html)
- **[COTS]** Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [[pdf]](https://arxiv.org/pdf/2204.07441.pdf)
- **[VL-ADAPTER]** Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [[pdf]](https://arxiv.org/pdf/2112.06825.pdf) [[code]](https://github.com/ylsung/VL_adapter)
- **[Unified-IO]** A Unified Model for Vision, Language, and Multi-Modal Tasks [[pdf]](https://arxiv.org/pdf/2206.08916.pdf) [[website]](https://unified-io.allenai.org/)
- **[ViLTA]** Enhancing Vision-Language Pre-training through Textual Augmentation [[pdf]](https://arxiv.org/pdf/2308.16689.pdf)
- **[CapDet]** Unifying Dense Captioning and Open-World Detection Pretraining [[pdf]](https://arxiv.org/pdf/2303.02489.pdf)
- **[PTP]** Position-guided Text Prompt for Vision-Language Pre-training [[pdf]](https://arxiv.org/pdf/2212.09737.pdf) [[code]](https://github.com/sail-sg/ptp)
- **[X-VLM]** Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts [[pdf]](https://arxiv.org/pdf/2111.08276v3.pdf) [[code]](https://github.com/zengyan-97/x-vlm)
- **[FewVLM]** A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2110.08484.pdf) [[code]](https://github.com/woojeongjin/FewVLM)
- **[M3AE]** Multimodal Masked Autoencoders Learn Transferable Representations [[pdf]](https://arxiv.org/pdf/2205.14204.pdf) [[code]](https://github.com/young-geng/m3ae_public)
- **[CFM-ViT]** Contrastive Feature Masking Open-Vocabulary Vision Transformer [[pdf]](https://arxiv.org/pdf/2309.00775.pdf)
- **[mPLUG]** Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [[pdf]](https://arxiv.org/pdf/2205.12005v2.pdf)
- **[PaLI]** A Jointly-Scaled Multilingual Language-Image Model [[pdf]](https://arxiv.org/pdf/2209.06794.pdf) [[blog]](https://ai.googleblog.com/2022/09/pali-scaling-language-image-learning-in.html) [[code]](https://github.com/kyegomez/PALI3)
- **[GIT]** A Generative Image-to-text Transformer for Vision and Language [[pdf]](https://arxiv.org/pdf/2205.14100.pdf) [[code]](https://github.com/microsoft/GenerativeImage2Text) [[code]](https://huggingface.co/docs/transformers/model_doc/git) [[demo]](https://huggingface.co/spaces/nielsr/comparing-VQA-models)
- **[MaskVLM]** Masked Vision and Language Modeling for Multi-modal Representation Learning [[pdf]](https://arxiv.org/pdf/2208.02131.pdf)
- **[DALL-E]** Zero-Shot Text-to-Image Generation [[pdf]](https://arxiv.org/pdf/2102.12092.pdf) [[code]](https://github.com/openai/DALL-E) [[code]](https://github.com/borisdayma/dalle-mini) [[code]](https://github.com/lucidrains/DALLE-pytorch) [[code]](https://github.com/kuprel/min-dalle) [[code]](https://github.com/robvanvolt/DALLE-models) [[code]](https://github.com/kakaobrain/minDALL-E) [[website]](https://openai.com/blog/dall-e/) [[video]](https://www.youtube.com/watch?v=j4xgkjWlfL4&t=1432s&ab_channel=YannicKilcher) [[video]](https://www.youtube.com/watch?v=jMqLTPcA9CQ&t=1034s&ab_channel=TheAIEpiphany) [[video]](https://www.youtube.com/watch?v=x_8uHX5KngE&ab_channel=TheAIEpiphany) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/Building-efficient-image-input-pipelines--VmlldzoyMjMxOTQw) [[blog]](https://ml.berkeley.edu/blog/posts/vq-vae/) [[blog]](https://ml.berkeley.edu/blog/posts/dalle2/) [[blog]](https://towardsdatascience.com/understanding-how-dall-e-mini-works-114048912b3b)
- **[DALL-E-2]** Hierarchical Text-Conditional Image Generation with CLIP Latents [[pdf]](https://arxiv.org/pdf/2204.06125.pdf) [[code]](https://github.com/lucidrains/DALLE2-pytorch) [[website]](https://openai.com/dall-e-2/) [[blog]](http://adityaramesh.com/posts/dalle2/dalle2.html) [[blog]](https://www.assemblyai.com/blog/how-dall-e-2-actually-works/) [[blog]](https://medium.com/augmented-startups/how-does-dall-e-2-work-e6d492a2667f)
- **[DALL-E 3]** Improving Image Generation with Better Captions [[pdf]](https://cdn.openai.com/papers/dall-e-3.pdf) [[consistency decoder]](https://github.com/openai/consistencydecoder) [[website]](https://openai.com/dall-e-3)
- **[GigaGAN]** Scaling up GANs for Text-to-Image Synthesis [[pdf]](https://arxiv.org/pdf/2303.05511.pdf) [[code]](https://github.com/lucidrains/gigagan-pytorch) [[code]](https://github.com/jianzhnie/GigaGAN) [[website]](https://mingukkang.github.io/GigaGAN/)
- **[Parti]** Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [[pdf]](https://arxiv.org/pdf/2206.10789.pdf) [[code]](https://github.com/google-research/parti) [[code]](https://github.com/lucidrains/parti-pytorch) [[video]](https://www.youtube.com/watch?v=qS-iYnp00uc&ab_channel=YannicKilcher) [[blog]](https://parti.research.google/)
- **[Paella]** Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces [[pdf]](https://arxiv.org/pdf/2211.07292.pdf) [[code]](https://github.com/dome272/Paella) [[video]](https://www.youtube.com/watch?v=6zeLSANd41k&ab_channel=AICoffeeBreakwithLetitia)
- **[Make-A-Scene]** Scene-Based Text-to-Image Generation with Human Priors [[pdf]](https://arxiv.org/pdf/2203.13131.pdf)
- **[FIBER]** Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [[pdf]](https://arxiv.org/pdf/2206.07643.pdf) [[code]](https://github.com/microsoft/FIBER)
- **[VL-BEiT]** Generative Vision-Language Pretraining [[pdf]](https://arxiv.org/pdf/2206.01127.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/vl-beit)
- **[MetaLM]** Language Models are General-Purpose Interfaces [[pdf]](https://arxiv.org/pdf/2206.06336.pdf) [[code]](https://github.com/microsoft/unilm/tree/master/metalm)
- **[VL-T5]** Unifying Vision-and-Language Tasks via Text Generation [[pdf]](https://arxiv.org/pdf/2102.02779.pdf) [[code]](https://github.com/j-min/VL-T5)
- **[UNICORN]** Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling [[pdf]](https://arxiv.org/pdf/2111.12085.pdf)
- **[MI2P]** Expanding Large Pre-trained Unimodal Models with Multimodal Information Injection for Image-Text Multimodal Classification [[pdf]](https://openaccess.thecvf.com/content/CVPR2022/papers/Liang_Expanding_Large_Pre-Trained_Unimodal_Models_With_Multimodal_Information_Injection_for_CVPR_2022_paper.pdf)
- **[MDETR]** Modulated Detection for End-to-End Multi-Modal Understanding [[pdf]](https://arxiv.org/pdf/2104.12763.pdf) [[code]](https://github.com/ashkamath/mdetr)
- **[VLMixer]** Unpaired Vision-Language Pre-training via Cross-Modal CutMix [[pdf]](https://arxiv.org/pdf/2206.08919.pdf) [[code]](https://github.com/ttengwang/VLMixer)
- **[ViCHA]** Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment [[pdf]](https://arxiv.org/pdf/2208.13628.pdf) [[code]](https://github.com/mshukor/ViCHA)
- **[Img2LLM]** Zero-shot Visual Question Answering with Frozen Large Language Models [[pdf]](https://arxiv.org/pdf/2212.10846.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa) [[colab]](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
- **[PNP-VQA]** Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training [[pdf]](https://arxiv.org/pdf/2210.08773.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa) [[colab]](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/pnp-vqa/pnp_vqa.ipynb)
- **[StoryDALL-E]** Adapting Pretrained Text-to-Image Transformers for Story Continuation [[pdf]](https://arxiv.org/pdf/2209.06192v1.pdf) [[code]](https://github.com/adymaharana/storydalle)
- **[VLMAE]** Vision-Language Masked Autoencoder [[pdf]](https://arxiv.org/pdf/2208.09374.pdf)
- **[MLIM]** Vision-and-Language Model Pre-training with Masked Language and Image Modeling [[pdf]](https://arxiv.org/pdf/2109.12178.pdf)
- **[MOFI]** Learning Image Representations from Noisy Entity Annotated Images [[pdf]](https://arxiv.org/pdf/2306.07952.pdf)
- **[GILL]** Generating Images with Multimodal Language Models [[pdf]](https://arxiv.org/pdf/2305.17216.pdf) [[code]](https://github.com/kohjingyu/gill) [[website]](https://jykoh.com/gill)
- **[Language Pivoting]** Unpaired Image Captioning by Language Pivoting [[pdf]](https://arxiv.org/pdf/1803.05526.pdf)
- **[Graph-Align]** Unpaired Image Captioning via Scene Graph Alignments [[pdf]](https://arxiv.org/pdf/1903.10658.pdf)
- **[PL-UIC]** Prompt-based Learning for Unpaired Image Captioning [[pdf]](https://arxiv.org/pdf/2205.13125.pdf)
- **[SCL]** Vision-Language Pre-training with Semantic Completion Learning [[pdf]](https://arxiv.org/pdf/2211.13437.pdf)
- **[TaskRes]** Task Residual for Tuning Vision-Language Models [[pdf]](https://arxiv.org/pdf/2211.10277.pdf) [[code]](https://github.com/geekyutao/TaskRes)
- **[EPIC]** Leveraging per Image-Token Consistency for Vision-Language Pre-training [[pdf]](https://arxiv.org/pdf/2211.15398.pdf)
- **[HAAV]** Hierarchical Aggregation of Augmented Views for Image Captioning [[pdf]](https://arxiv.org/pdf/2305.16295.pdf)
- **[FLM]** Accelerating Vision-Language Pretraining with Free Language Modeling [[pdf]](https://arxiv.org/pdf/2303.14038.pdf) [[code]](https://github.com/TencentARC/FLM)
- **[DiHT]** Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training [[pdf]](https://arxiv.org/pdf/2301.02280.pdf) [[code]](https://github.com/facebookresearch/diht)
- **[VL-Match]** Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching [[pdf]](https://openaccess.thecvf.com/content/ICCV2023/papers/Bi_VL-Match_Enhancing_Vision-Language_Pretraining_with_Token-Level_and_Instance-Level_Matching_ICCV_2023_paper.pdf)
- **[Prismer]** A Vision-Language Model with Multi-Modal Experts [[pdf]](https://arxiv.org/pdf/2303.02506.pdf) [[code]](https://github.com/NVlabs/prismer) [[website]](https://shikun.io/projects/prismer) [[demo]](https://huggingface.co/spaces/lorenmt/prismer)
- **[PaLM-E]** An Embodied Multimodal Language Model [[pdf]](https://arxiv.org/pdf/2303.03378.pdf) [[website]](https://palm-e.github.io/) [[blog]](https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html)
- **[X-Decoder]** Generalized Decoding for Pixel, Image, and Language [[pdf]](https://arxiv.org/pdf/2212.11270.pdf) [[code]](https://github.com/microsoft/X-Decoder/tree/main) [[xgpt code]](https://github.com/microsoft/X-Decoder/tree/xgpt) [[website]](https://x-decoder-vl.github.io/) [[demo]](https://huggingface.co/spaces/xdecoder/Demo) [[demo]](https://huggingface.co/spaces/xdecoder/Instruct-X-Decoder)
- **[PerVL]** Personalizing frozen vision-language representations [[pdf]](https://arxiv.org/pdf/2204.01694.pdf) [[code]](https://github.com/NVlabs/PALAVRA)
- **[TextManiA]** Enriching Visual Feature by Text-driven Manifold Augmentation [[pdf]](https://arxiv.org/pdf/2307.14611.pdf) [[code]](https://github.com/postech-ami/TextManiA) [[website]](https://moon-yb.github.io/TextManiA.github.io/) [[GAN Inversion]](https://arxiv.org/pdf/2004.00049.pdf)
- **[Cola]** Language Models are Visual Reasoning Coordinators [[pdf]](https://openreview.net/pdf?id=kdHpWogtX6Y) [[code]](https://github.com/cliangyu/Cola)
- **[K-LITE]** Learning Transferable Visual Models with External Knowledge [[pdf]](https://arxiv.org/pdf/2204.09222.pdf) [[code]](https://github.com/microsoft/klite)
- **[SINC]** Self-Supervised In-Context Learning for Vision-Language Tasks [[pdf]](https://arxiv.org/pdf/2307.07742.pdf)
- **[Visual ChatGPT]** Talking, Drawing and Editing with Visual Foundation Models [[pdf]](https://arxiv.org/pdf/2303.04671.pdf) [[code]](https://github.com/microsoft/visual-chatgpt)
- **[CM3Leon]** Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [[pdf]](https://scontent-bru2-1.xx.fbcdn.net/v/t39.2365-6/358725877_789390529544546_1176484804732743296_n.pdf?_nc_cat=108&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=6UJxCrFyo1kAX8C0-if&_nc_ht=scontent-bru2-1.xx&oh=00_AfAhpsoexYSfiwS5xjTkgm08RWW8EB9mLvcCXDwWBUI3AA&oe=64C33272)
- **[GRiT]** A Generative Region-to-text Transformer for Object Understanding [[pdf]](https://arxiv.org/pdf/2212.00280) [[code]](https://github.com/JialianW/GRiT)
- **[KOSMOS-1]** Language Is Not All You Need: Aligning Perception with Language Models [[pdf]](https://arxiv.org/pdf/2302.14045.pdf) [[code]](https://github.com/microsoft/unilm)
- **[KOSMOS-2]** Grounding Multimodal Large Language Models to the World [[pdf]](https://openreview.net/pdf?id=lLmqxkfSIw) [[code]](https://github.com/microsoft/unilm/tree/master/kosmos-2) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/kosmos-2) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/KOSMOS-2/Inference_with_KOSMOS_2_for_multimodal_grounding.ipynb) [[demo]](https://huggingface.co/spaces/ydshieh/Kosmos-2)
- **[MultiModal-GPT]** A Vision and Language Model for Dialogue with Humans [[pdf]](https://arxiv.org/pdf/2305.04790.pdf) [[code]](https://github.com/open-mmlab/Multimodal-GPT)
- **[LLaVA Series]** All LLaVA Models [[LLaVA]](https://github.com/haotian-liu/LLaVA) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llava) [[tutorial]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) [[LLaVA-NeXT]](https://github.com/LLaVA-VL/LLaVA-NeXT/) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llava_next) [[hf docs]](https://huggingface.co/docs/transformers/main/en/model_doc/llava_onevision) [[demo]](https://huggingface.co/spaces/merve/llava-next) [[hf card]](https://huggingface.co/llava-hf) [[LLaVA-CoT]](https://github.com/PKU-YuanGroup/LLaVA-CoT)
- **[InternVL]** A Pioneering Open-Source Alternative to GPT-4o [[github]](https://github.com/OpenGVLab/InternVL)
- **[MiniCPM-V]** A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone [[pdf]](https://arxiv.org/pdf/2408.01800) [[code]](https://github.com/OpenBMB/MiniCPM-V)
- **[LLaVA-MORE]** Enhancing Visual Instruction Tuning with LLaMA 3.1 [[github]](https://github.com/aimagelab/LLaVA-MORE)
- **[ViP-LLaVA]** Making Large Multimodal Models Understand Arbitrary Visual Prompts [[pdf]](https://arxiv.org/pdf/2312.00784.pdf) [[code]](https://github.com/mu-cai/vip-llava) [[demo]](https://pages.cs.wisc.edu/~mucai/vip-llava.html) [[website]](https://vip-llava.github.io/) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViP-LLaVa/Inference_with_ViP_LLaVa_for_fine_grained_VQA.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/vipllava)
- **[Qwen-VL]** A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [[pdf]](https://arxiv.org/pdf/2308.12966) [[code]](https://github.com/QwenLM/Qwen-VL) [[tutorial]](https://github.com/QwenLM/Qwen-VL/blob/master/TUTORIAL.md) [[blog]](https://qwenlm.github.io/blog/qwen-vl/) [[blog]](https://qwenlm.github.io/blog/qwen2-vl/)
- **[Qwen2-VL]** Enhancing Vision-Language Model’s Perception of the World at Any Resolution [[pdf]](https://arxiv.org/pdf/2409.12191) [[code]](https://github.com/QwenLM/Qwen2-VL) [[hf card]](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) [[demo]](https://huggingface.co/spaces/Qwen/Qwen2-VL)
- **[VILA]** On Pre-training for Visual Language Models [[pdf]](https://arxiv.org/pdf/2312.07533.pdf) [[code]](https://github.com/Efficient-Large-Model/VILA) [[hf page]](https://huggingface.co/Efficient-Large-Model)
- **[NExT-Chat]** An LMM for Chat, Detection and Segmentation [[pdf]](https://github.com/NExT-ChatV/NExT-Chat/blob/main/NExT_Chat.pdf) [[code]](https://github.com/NExT-ChatV/NExT-Chat) [[demo]](https://516398b33beb3e8b9f.gradio.live/) [[website]](https://next-chatv.github.io/)
- **[MiniGPT-4]** Enhancing Vision-Language Understanding with Advanced Large Language Models [[pdf]](https://openreview.net/attachment?id=1tZbq88f27&name=pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4) [[website]](https://minigpt-4.github.io/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/minigpt4)
- **[MiniGPT-v2]** Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [[pdf]](https://arxiv.org/pdf/2310.09478.pdf) [[code]](https://github.com/Vision-CAIR/MiniGPT-4) [[website]](https://minigpt-v2.github.io/) [[demo]](https://876a8d3e814b8c3a8b.gradio.live/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/MiniGPT-v2)
- **[LLaMA-Adapter]** Efficient Fine-tuning of Language Models with Zero-init Attention [[pdf]](https://openreview.net/attachment?id=d4UiXAHN2W&name=pdf) [[code]](https://github.com/OpenGVLab/LLaMA-Adapter)
- **[LLaMA-Adapter V2]** Parameter-Efficient Visual Instruction Model [[pdf]](https://arxiv.org/pdf/2304.15010.pdf) [[code]](https://github.com/OpenGVLab/LLaMA-Adapter) [[demo]](http://llama-adapter.opengvlab.com/)
- **[LaVIN]** Efficient Vision-Language Instruction Tuning for Large Language Models [[pdf]](https://arxiv.org/pdf/2305.15023.pdf) [[code]](https://github.com/luogen1996/LaVIN) [[website]](https://luogen1996.github.io/lavin/)
- **[InstructBLIP]** Towards General-purpose Vision-Language Models with Instruction Tuning [[pdf]](https://arxiv.org/pdf/2305.06500.pdf) [[code]](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) [[code]](https://huggingface.co/docs/transformers/model_doc/instructblip) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/InstructBLIP/Inference_with_InstructBLIP.ipynb)
- **[Otter/MIMIC-IT]** A Multi-Modal Model with In-Context Instruction Tuning [[pdf]](https://arxiv.org/pdf/2305.03726.pdf) [[pdf]](https://arxiv.org/pdf/2306.05425.pdf) [[code]](https://github.com/Luodian/Otter) [[website]](https://otter-ntu.github.io/)
- **[CogVLM]** Visual Expert for Pretrained Language Models [[pdf]](https://arxiv.org/pdf/2311.03079.pdf) [[code]](https://github.com/THUDM/CogVLM) [[demo]](http://36.103.203.44:7861/) [[hf card]](https://huggingface.co/THUDM)
- **[ImageBind]** One Embedding Space To Bind Them All [[pdf]](https://arxiv.org/pdf/2305.05665.pdf) [[code]](https://github.com/facebookresearch/ImageBind) [[website]](https://imagebind.metademolab.com/)
- **[TextBind]** Multi-turn Interleaved Multimodal Instruction-following in the Wild [[pdf]](https://arxiv.org/pdf/2309.08637.pdf) [[code]](https://github.com/SihengLi99/TextBind) [[website]](https://textbind.github.io/) [[demo]](https://ailabnlp.tencent.com/research_demos/textbind/) [[models]](https://huggingface.co/SihengLi/TextBind)
- **[MetaVL]** Transferring In-Context Learning Ability From Language Models to Vision-Language Models [[pdf]](https://arxiv.org/pdf/2306.01311.pdf)
- **[M³IT]** A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning [[pdf]](https://arxiv.org/pdf/2306.04387.pdf) [[website]](https://m3-it.github.io/)
- **[Instruction-ViT]** Multi-Modal Prompts for Instruction Learning in ViT [[pdf]](https://arxiv.org/pdf/2305.00201.pdf)
- **[MultiInstruct]** Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [[pdf]](https://arxiv.org/pdf/2212.10773.pdf) [[code]](https://github.com/VT-NLP/MultiInstruct)
- **[VisIT-Bench]** A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use [[pdf]](https://arxiv.org/pdf/2308.06595.pdf) [[code]](https://github.com/mlfoundations/VisIT-Bench/) [[website]](https://visit-bench.github.io/) [[blog]](https://laion.ai/blog/visit_bench/) [[dataset]](https://huggingface.co/datasets/mlfoundations/VisIT-Bench) [[leaderboard]](https://huggingface.co/spaces/mlfoundations/VisIT-Bench-Leaderboard)
- **[GPT4RoI]** Instruction Tuning Large Language Model on Region-of-Interest [[pdf]](https://arxiv.org/pdf/2307.03601.pdf) [[code]](https://github.com/jshilong/GPT4RoI) [[demo]](http://139.196.83.164:7000/)
- **[PandaGPT]** One Model To Instruction-Follow Them All [[pdf]](https://arxiv.org/pdf/2305.16355.pdf) [[code]](https://github.com/yxuansu/PandaGPT) [[demo]](https://huggingface.co/spaces/GMFTBY/PandaGPT) [[website]](https://panda-gpt.github.io/)
- **[ChatBridge]** Bridging Modalities with Large Language Model as a Language Catalyst [[pdf]](https://arxiv.org/pdf/2305.16103.pdf) [[code]](https://github.com/joez17/ChatBridge) [[website]](https://iva-chatbridge.github.io/)
- **[Video-LLaMA]** An Instruction-tuned Audio-Visual Language Model for Video Understanding [[pdf]](https://arxiv.org/pdf/2306.02858.pdf) [[code]](https://github.com/DAMO-NLP-SG/Video-LLaMA) [[demo]](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA)
- **[VideoChat]** Chat-Centric Video Understanding [[pdf]](https://arxiv.org/pdf/2305.06355.pdf) [[code]](https://github.com/OpenGVLab/Ask-Anything)
- **[InternGPT]** Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language [[pdf]](https://arxiv.org/pdf/2305.05662.pdf) [[code]](https://github.com/OpenGVLab/InternGPT)
- **[mPLUG-Owl]** Modularization Empowers Large Language Models with Multimodality [[pdf]](https://arxiv.org/pdf/2304.14178.pdf) [[code]](https://github.com/X-PLUG/mPLUG-Owl) [[demo]](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl)
- **[VisionLLM]** Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [[pdf]](https://arxiv.org/pdf/2305.11175.pdf) [[code]](https://github.com/OpenGVLab/VisionLLM) [[blog]](https://wandb.ai/byyoung3/ml-news/reports/Introducing-VisionLLM-A-New-Method-for-Multi-Modal-LLM-s--Vmlldzo0NTMzNzIz)
- **[X-LLM]** Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages [[pdf]](https://arxiv.org/pdf/2305.04160.pdf) [[code]](https://github.com/phellonchen/X-LLM) [[website]](https://x-llm.github.io/)
- **[OBELICS]** An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [[pdf]](https://arxiv.org/pdf/2306.16527.pdf) [[code]](https://github.com/huggingface/OBELICS) [[blog]](https://huggingface.co/blog/idefics) [[models]](https://huggingface.co/HuggingFaceM4/idefics-80b) [[instruct model 9b]](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct) [[instruct model 80b]](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) [[demo]](https://huggingface.co/spaces/HuggingFaceM4/idefics_playground) [[dataset]](https://huggingface.co/datasets/HuggingFaceM4/OBELICS)
- **[EvALign-ICL]** Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning [[pdf]](https://arxiv.org/pdf/2310.00647.pdf) [[code]](https://github.com/mshukor/EvALign-ICL) [[website]](https://evalign-icl.github.io/)
- **[Plug and Pray]** Exploiting off-the-shelf components of Multi-Modal Models [[pdf]](https://arxiv.org/pdf/2307.14539.pdf)
- **[VL-PET]** Vision-and-Language Parameter-Efficient Tuning via Granularity Control [[pdf]](https://arxiv.org/pdf/2308.09804v1.pdf) [[code]](https://github.com/HenryHZY/VL-PET)
- **[ICIS]** Image-free Classifier Injection for Zero-Shot Classification [[pdf]](https://arxiv.org/pdf/2308.10599.pdf) [[code]](https://github.com/ExplainableML/ImageFreeZSL)
- **[NExT-GPT]** Any-to-Any Multimodal LLM [[pdf]](https://arxiv.org/pdf/2309.05519.pdf) [[code]](https://github.com/NExT-GPT/NExT-GPT) [[website]](https://next-gpt.github.io/) [[demo]](https://4271670c463565f1a4.gradio.live/)
- **[UnIVAL]** Unified Model for Image, Video, Audio and Language Tasks [[pdf]](https://arxiv.org/pdf/2307.16184.pdf) [[code]](https://github.com/mshukor/UnIVAL)
- **[BUS]** Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization [[pdf]](https://arxiv.org/pdf/2307.08504.pdf)
- **[VPGTrans]** Transfer Visual Prompt Generator across LLMs [[pdf]](https://arxiv.org/pdf/2305.01278.pdf) [[code]](https://github.com/VPGTrans/VPGTrans) [[website]](https://vpgtrans.github.io/) [[demo]](https://7116ee36952d2ef342.gradio.live/)
- **[PromptCap]** Prompt-Guided Task-Aware Image Captioning [[pdf]](https://arxiv.org/pdf/2211.09699.pdf) [[code]](https://github.com/Yushi-Hu/PromptCap) [[website]](https://yushi-hu.github.io/promptcap_demo/) [[hf checkpoint]](https://huggingface.co/tifa-benchmark/promptcap-coco-vqa)
- **[P-Former]** Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [[pdf]](https://arxiv.org/pdf/2307.07063.pdf)
- **[TL;DR]** Too Large; Data Reduction for Vision-Language Pre-Training [[pdf]](https://arxiv.org/pdf/2305.20087.pdf) [[code]](https://github.com/showlab/datacentric.vlp)
- **[PMA-Net]** Prototypical Memory Networks for Image Captioning [[pdf]](https://arxiv.org/pdf/2308.12383.pdf) [[code]](https://github.com/aimagelab/PMA-Net)
- **[Encyclopedic VQA]** Visual questions about detailed properties of fine-grained categories [[pdf]](https://arxiv.org/pdf/2306.09224.pdf) [[code]](https://github.com/google-research/google-research/tree/master/encyclopedic_vqa)
- **[CMOTA]** Story Visualization by Online Text Augmentation with Context Memory [[pdf]](https://arxiv.org/pdf/2308.07575.pdf) [[code]](https://github.com/yonseivnl/cmota)
- **[CPT]** Colorful Prompt Tuning for Pre-trained Vision-Language Models [[pdf]](https://arxiv.org/pdf/2109.11797.pdf) [[code]](https://github.com/thunlp/CPT)
- **[TeS]** Improved Visual Fine-tuning with Natural Language Supervision [[pdf]](https://arxiv.org/pdf/2304.01489.pdf) [[code]](https://github.com/idstcv/TeS)
- **[MP]** Tuning Pre-trained Model via Moment Probing [[pdf]](https://arxiv.org/pdf/2307.11342.pdf) [[code]](https://github.com/mingzeG/Moment-Probing)
- **[SmolVLM]** Small yet mighty Vision Language Model [[blog]](https://huggingface.co/blog/smolvlm) [[code]](https://github.com/huggingface/smollm) [[model]](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct) [[finetune demo]](https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb) [[demo]](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
- **[AIMV2]** Multimodal Autoregressive Pre-training of Large Vision Encoders [[pdf]](https://arxiv.org/pdf/2411.14402) [[code]](https://github.com/apple/ml-aim) [[hf collection]](https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c)
- Multimodal Contrastive Training for Visual Representation Learning [[pdf]](https://arxiv.org/pdf/2104.12836.pdf)
- Learning Visual Representations via Language-Guided Sampling [[pdf]](https://arxiv.org/pdf/2302.12248.pdf)
- Image Captioners Are Scalable Vision Learners Too [[pdf]](https://arxiv.org/pdf/2306.07915.pdf)
- Masked Autoencoding Does Not Help Natural Language Supervision at Scale [[pdf]](https://arxiv.org/pdf/2301.07836.pdf)
- Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [[pdf]](https://arxiv.org/pdf/2112.07074.pdf)
- Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers [[pdf]](https://arxiv.org/pdf/2109.04448.pdf)
- A Closer Look at the Robustness of Vision-and-Language Pre-trained Models [[pdf]](https://arxiv.org/pdf/2012.08673.pdf)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [[pdf]](https://arxiv.org/pdf/2106.13488.pdf)
- Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs [[pdf]](https://arxiv.org/pdf/2011.15124.pdf)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [[pdf]](https://arxiv.org/pdf/2010.12831.pdf)
- Cross-Modal Textual and Visual Context for Image Captioning [[pdf]](https://arxiv.org/pdf/2205.04363.pdf)
- Multi-modal Alignment using Representation Codebook [[pdf]](https://arxiv.org/pdf/2203.00048.pdf)
- Multimodal Few-Shot Learning with Frozen Language Models [[pdf]](https://arxiv.org/pdf/2106.13884.pdf)
- On Guiding Visual Attention with Language Specification [[pdf]](https://arxiv.org/pdf/2202.08926.pdf)
- Compressing Visual-linguistic Model via Knowledge Distillation [[pdf]](https://arxiv.org/pdf/2104.02096.pdf)
- Playing Lottery Tickets with Vision and Language [[pdf]](https://arxiv.org/pdf/2104.11832.pdf)
- Do DALL-E and Flamingo Understand Each Other? [[pdf]](https://arxiv.org/pdf/2212.12249.pdf)
- Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning [[pdf]](https://arxiv.org/pdf/2212.13563.pdf) [[code]](https://github.com/kakaobrain/noc)
- Towards an Exhaustive Evaluation of Vision-Language Foundation Models [[pdf]](https://openaccess.thecvf.com/content/ICCV2023W/MMFM/papers/Salin_Towards_an_Exhaustive_Evaluation_of_Vision-Language_Foundation_Models_ICCVW_2023_paper.pdf)

## CLIP-related
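The recipe shared by most of the works in this section is: embed the image and a set of text prompts with the two encoders, then compare the embeddings by similarity. Below is a minimal zero-shot classification sketch using the `transformers` CLIP API; the checkpoint id, image URL, and candidate labels are only examples.

```python
# Minimal sketch of CLIP-style zero-shot classification: score an image against text prompts.
# Checkpoint id, image URL, and candidate labels are illustrative placeholders.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode image and text prompts jointly, then turn image-text similarities into probabilities
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```
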
- **[CLIP]** Learning Transferable Visual Models From Natural Language Supervision [[pdf]](https://arxiv.org/pdf/2103.00020.pdf) [[code]](https://github.com/openai/CLIP) [[code]](https://github.com/Zasder3/train-CLIP) [[code]](https://github.com/moein-shariatnia/OpenAI-CLIP) [[code]](https://github.com/lucidrains/x-clip) [[code]](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/clip) [[website]](https://openai.com/blog/clip/) [[video]](https://www.youtube.com/watch?v=T9XSU0pKX2E&t=1455s&ab_channel=YannicKilcher) [[video]](https://www.youtube.com/watch?v=fQyHEXZB-nM&ab_channel=AleksaGordi%C4%87-TheAIEpiphany) [[video code]](https://www.youtube.com/watch?v=jwZQD0Cqz4o&t=4610s&ab_channel=TheAIEpiphany) [[CLIP_benchmark]](https://github.com/LAION-AI/CLIP_benchmark) [[clip-retrieval]](https://github.com/rom1504/clip-retrieval) [[clip-retrieval blog]](https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c)
- **[OpenCLIP]** Reproducible scaling laws for contrastive language-image learning [[pdf]](https://arxiv.org/pdf/2212.07143.pdf) [[code]](https://github.com/mlfoundations/open_clip) [[code]](https://github.com/LAION-AI/scaling-laws-openclip) [[clip colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb) [[clip benchmark]](https://github.com/LAION-AI/CLIP_benchmark) [[hf models]](https://huggingface.co/models?library=open_clip)
- **[CLIPScore]** A Reference-free Evaluation Metric for Image Captioning [[pdf]](https://arxiv.org/pdf/2104.08718.pdf) [[code]](https://github.com/jmhessel/clipscore)
- **[LiT]** Zero-Shot Transfer with Locked-image text Tuning [[pdf]](https://arxiv.org/pdf/2111.07991.pdf) [[code]](https://github.com/google-research/vision_transformer) [[website]](https://google-research.github.io/vision_transformer/lit/)
- **[SigLIP]** Sigmoid Loss for Language Image Pre-Training [[pdf]](https://arxiv.org/pdf/2303.15343.pdf) [[code]](https://github.com/google-research/big_vision) [[colab demo]](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/siglip) [[hf models]](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba)
- **[Alpha-CLIP]** A CLIP Model Focusing on Wherever You Want [[pdf]](https://arxiv.org/pdf/2312.03818) [[code]](https://github.com/SunzeY/AlphaCLIP) [[website]](https://aleafy.github.io/alpha-clip/)
- **[FGVP]** Fine-Grained Visual Prompting [[pdf]](https://arxiv.org/pdf/2306.04356) [[code]](https://github.com/ylingfeng/FGVP)
- **[ALIGN]** Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [[pdf]](https://arxiv.org/pdf/2102.05918.pdf) [[code]](https://huggingface.co/docs/transformers/model_doc/align)
- **[DeCLIP]** Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [[pdf]](https://arxiv.org/pdf/2110.05208.pdf) [[code]](https://github.com/Sense-GVT/DeCLIP)
- **[FLIP]** Scaling Language-Image Pre-training via Masking [[pdf]](https://arxiv.org/pdf/2212.00794.pdf) [[code]](https://github.com/facebookresearch/flip)
- **[Counting-aware CLIP]** Teaching CLIP to Count to Ten [[pdf]](https://arxiv.org/pdf/2302.12066.pdf) [[website]](https://teaching-clip-to-count.github.io/)
- **[ALIP]** Adaptive Language-Image Pre-training with Synthetic Caption [[pdf]](https://arxiv.org/pdf/2308.08428.pdf) [[code]](https://github.com/deepglint/ALIP)
- **[FILIP]** Fine-grained Interactive Language-Image Pre-Training [[pdf]](https://arxiv.org/pdf/2111.07783.pdf)
- **[SLIP]** Self-supervision meets Language-Image Pre-training [[pdf]](https://arxiv.org/pdf/2112.12750.pdf) [[code]](https://github.com/facebookresearch/SLIP)
- **[WiSE-FT]** Robust fine-tuning of zero-shot models [[pdf]](https://arxiv.org/pdf/2109.01903.pdf) [[code]](https://github.com/mlfoundations/wise-ft)
- **[FLYP]** Finetune like you pretrain: Improved finetuning of zero-shot vision models [[pdf]](https://arxiv.org/pdf/2212.00638.pdf) [[code]](https://github.com/locuslab/FLYP)
- **[MAGIC]** Plugging Visual Controls in Text Generation [[pdf]](https://arxiv.org/pdf/2205.02655.pdf) [[code]](https://github.com/yxuansu/MAGIC)
- **[ZeroCap]** Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [[pdf]](https://arxiv.org/pdf/2111.14447.pdf) [[code]](https://github.com/YoadTew/zero-shot-image-to-text)
- **[CapDec]** Text-Only Training for Image Captioning using Noise-Injected CLIP [[pdf]](https://arxiv.org/pdf/2211.00575.pdf) [[code]](https://github.com/DavidHuji/CapDec) [[colab]](https://colab.research.google.com/drive/1Jgj0uaALtile2iyqlN1r72UYRe9SZw-H?usp=sharing)
- **[DeCap]** Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [[pdf]](https://arxiv.org/pdf/2303.03032.pdf) [[code]](https://github.com/dhg-wei/DeCap)
- **[ViECap]** Transferable Decoding with Visual Entities for Zero-Shot Image Captioning [[pdf]](https://arxiv.org/pdf/2307.16525.pdf) [[code]](https://github.com/FeiElysia/ViECap)
- **[CLOSE]** Learning Visual Tasks Using Only Language Supervision [[pdf]](https://arxiv.org/pdf/2211.09778.pdf) [[code]](https://github.com/allenai/close)
- **[xCLIP]** Non-Contrastive Learning Meets Language-Image Pre-Training [[pdf]](https://arxiv.org/pdf/2210.09304.pdf)
- **[EVA]** Visual Representation Fantasies from BAAI [[series]](https://github.com/baaivision/EVA)
- **[VT-CLIP]** Enhancing Vision-Language Models with Visual-guided Texts [[pdf]](https://arxiv.org/pdf/2112.02399.pdf)
- **[CLIP-ViL]** How Much Can CLIP Benefit Vision-and-Language Tasks? [[pdf]](https://arxiv.org/pdf/2107.06383.pdf) [[code]](https://github.com/clip-vil/CLIP-ViL)
- **[RegionCLIP]** Region-based Language-Image Pretraining [[pdf]](https://arxiv.org/pdf/2112.09106.pdf) [[code]](https://github.com/microsoft/RegionCLIP)
- **[DenseCLIP]** Language-Guided Dense Prediction with Context-Aware Prompting [[pdf]](https://arxiv.org/pdf/2112.01518.pdf) [[code]](https://github.com/raoyongming/DenseCLIP)
- **[E-CLIP]** Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling [[pdf]](https://arxiv.org/pdf/2109.04699.pdf)
- **[MaskCLIP]** Masked Self-Distillation Advances Contrastive Language-Image Pretraining [[pdf]](https://arxiv.org/pdf/2208.12262.pdf)
- **[CLIPSeg]** Image Segmentation Using Text and Image Prompts [[pdf]](https://arxiv.org/pdf/2112.10003.pdf) [[code]](https://github.com/timojl/clipseg) [[code]](https://huggingface.co/docs/transformers/model_doc/clipseg) [[colab]](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/123_clipseg-zero-shot.ipynb) [[demo]](https://huggingface.co/spaces/nielsr/CLIPSeg) [[demo]](https://huggingface.co/spaces/Sijuade/CLIPSegmentation) [[demo]](https://huggingface.co/spaces/taesiri/CLIPSeg) [[demo]](https://huggingface.co/spaces/aryswisnu/CLIPSeg2) [[demo]](https://huggingface.co/spaces/aryswisnu/CLIPSeg) [[blog]](https://huggingface.co/blog/clipseg-zero-shot)
- **[OWL-ViT]** Simple Open-Vocabulary Object Detection with Vision Transformers [[pdf]](https://arxiv.org/pdf/2205.06230.pdf) [[code]](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/owlvit) [[tutorial]](https://huggingface.co/docs/transformers/tasks/zero_shot_object_detection) [[colab]](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb) [[demo]](https://huggingface.co/spaces/adirik/OWL-ViT) [[demo]](https://huggingface.co/spaces/adirik/image-guided-owlvit) [[demo]](https://huggingface.co/spaces/johko/OWL-ViT) [[demo]](https://huggingface.co/spaces/kellyxiaowei/OWL-ViT) [[demo]](https://huggingface.co/spaces/wendys-llc/OWL-ViT)
- **[ClipCap]** CLIP Prefix for Image Captioning [[pdf]](https://arxiv.org/pdf/2111.09734.pdf) [[code]](https://github.com/rmokady/CLIP_prefix_caption) [[code]](https://github.com/TheoCoombes/ClipCap)
- **[VQGAN-CLIP]** Open Domain Image Generation and Editing with Natural Language Guidance [[pdf]](https://arxiv.org/pdf/2204.08583.pdf) [[code]](https://github.com/nerdyrodent/VQGAN-CLIP) [[code]](https://github.com/EleutherAI/vqgan-clip) [[code]](https://github.com/justinjohn0306/VQGAN-CLIP) [[code]](https://www.kaggle.com/code/basu369victor/playing-with-vqgan-clip/notebook) [[colab]](https://colab.research.google.com/github/dribnet/clipit/blob/master/demos/Moar_Settings.ipynb) [[colab]](https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN) [[colab]](https://colab.research.google.com/github/justinjohn0306/VQGAN-CLIP/blob/main/VQGAN%2BCLIP(Updated).ipynb)
- **[AltCLIP]** Altering the Language Encoder in CLIP for Extended Language Capabilities [[pdf]](https://arxiv.org/pdf/2211.06679v2.pdf) [[code]](https://github.com/FlagAI-Open/FlagAI) [[code]](https://huggingface.co/docs/transformers/model_doc/altclip)
- **[CLIPPO]** Image-and-Language Understanding from Pixels Only [[pdf]](https://arxiv.org/pdf/2212.08045.pdf)
- **[FDT]** Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [[pdf]](https://arxiv.org/pdf/2303.14865.pdf) [[code]](https://github.com/yuxiaochen1103/FDT)
- **[DIME-FM]** DIstilling Multimodal and Efficient Foundation Models [[pdf]](https://arxiv.org/pdf/2303.18232.pdf) [[code]](https://github.com/sunxm2357/DIME-FM) [[website]](https://cs-people.bu.edu/sunxm/DIME-FM/)
- **[ViLLA]** Fine-Grained Vision-Language Representation Learning from Real-World Data [[pdf]](https://arxiv.org/pdf/2308.11194v1.pdf) [[code]](https://github.com/StanfordMIMI/villa)
- **[BASIC]** Combined Scaling for Zero-shot Transfer Learning [[pdf]](https://arxiv.org/pdf/2111.10050v3.pdf)
- **[CoOp]** Learning to Prompt for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2109.01134.pdf) [[code]](https://github.com/KaiyangZhou/CoOp)
- **[CoCoOp]** Conditional Prompt Learning for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2203.05557.pdf) [[code]](https://github.com/KaiyangZhou/CoOp)
- **[RPO]** Read-only Prompt Optimization for Vision-Language Few-shot Learning [[pdf]](https://arxiv.org/pdf/2308.14960.pdf) [[code]](https://github.com/mlvlab/RPO)
- **[KgCoOp]** Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [[pdf]](https://arxiv.org/pdf/2303.13283.pdf) [[code]](https://github.com/htyao89/KgCoOp)
- **[ECO]** Ensembling Context Optimization for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2307.14063.pdf)
- **[UPT]** Unified Vision and Language Prompt Learning [[pdf]](https://arxiv.org/pdf/2210.07225.pdf) [[code]](https://github.com/yuhangzang/UPT)
- **[UPL]** Unsupervised Prompt Learning for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2204.03649.pdf) [[code]](https://github.com/tonyhuang2022/UPL)
- **[ProDA]** Prompt Distribution Learning [[pdf]](https://arxiv.org/pdf/2205.03340.pdf)
- **[CTP]** Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2303.17169.pdf)
- **[MVLPT]** Multitask Vision-Language Prompt Tuning [[pdf]](https://arxiv.org/pdf/2211.11720.pdf) [[code]](https://github.com/sIncerass/MVLPT)
- **[DAPT]** Distribution-Aware Prompt Tuning for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2309.03406v1.pdf) [[code]](https://github.com/mlvlab/DAPT)
- **[LFA]** Black Box Few-Shot Adaptation for Vision-Language models [[pdf]](https://arxiv.org/pdf/2304.01752.pdf) [[code]](https://github.com/saic-fi/LFA)
- **[LaFTer]** Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections [[pdf]](https://arxiv.org/pdf/2305.18287.pdf) [[code]](https://github.com/jmiemirza/LaFTer)
- **[TAP]** Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification [[pdf]](https://arxiv.org/pdf/2309.06809.pdf)
- **[CLIP-Adapter]** Better Vision-Language Models with Feature Adapters [[pdf]](https://arxiv.org/pdf/2110.04544.pdf) [[code]](https://github.com/gaopengcuhk/CLIP-Adapter)
- **[Tip-Adapter]** Training-free Adaption of CLIP for Few-shot Classification [[pdf]](https://arxiv.org/pdf/2207.09519.pdf) [[code]](https://github.com/gaopengcuhk/Tip-Adapter)
- **[CALIP]** Zero-Shot Enhancement of CLIP with Parameter-free Attention [[pdf]](https://arxiv.org/pdf/2209.14169.pdf) [[code]](https://github.com/ZiyuGuo99/CALIP)
- **[SHIP]** Improving Zero-Shot Generalization for CLIP with Synthesized Prompts [[pdf]](https://arxiv.org/pdf/2307.07397.pdf) [[code]](https://github.com/mrflogs/SHIP)
- **[LoGoPrompt]** Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2309.01155.pdf) [[website]](https://chengshiest.github.io/logo/)
- **[GRAM]** Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models [[pdf]](https://arxiv.org/pdf/2303.06571.pdf)
- **[MaPLe]** Multi-modal Prompt Learning [[pdf]](https://arxiv.org/pdf/2210.03117.pdf) [[code]](https://github.com/muzairkhattak/multimodal-prompt-learning)
- **[PromptSR]** Self-regulating Prompts: Foundational Model Adaptation without Forgetting [[pdf]](https://arxiv.org/pdf/2307.06948.pdf) [[code]](https://github.com/muzairkhattak/PromptSRC)
- **[ProGrad]** Prompt-aligned Gradient for Prompt Tuning [[pdf]](https://arxiv.org/pdf/2205.14865.pdf) [[code]](https://github.com/BeierZhu/Prompt-align)
- **[Prompt Align]** Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [[pdf]](https://arxiv.org/pdf/2311.01459.pdf) [[code]](https://github.com/jameelhassan/PromptAlign) [[website]](https://jameelhassan.github.io/promptalign/)
- **[APE]** Enhancing Few-shot CLIP with Adaptive Prior Refinement [[pdf]](https://arxiv.org/pdf/2304.01195.pdf) [[code]](https://github.com/yangyangyang127/APE)
- **[CuPL]** Generating customized prompts for zero-shot image classification [[pdf]](https://arxiv.org/pdf/2209.03320.pdf) [[code]](https://github.com/sarahpratt/CuPL)
- **[WaffleCLIP]** Visual Classification with Random Words and Broad Concepts [[pdf]](https://arxiv.org/pdf/2306.07282.pdf) [[code]](https://github.com/ExplainableML/WaffleCLIP)
- **[R-AMT]** Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models [[pdf]](https://arxiv.org/pdf/2307.15049.pdf) [[code]](https://github.com/wuw2019/R-AMT) [[website]](https://wuw2019.github.io/R-AMT/)
- **[SVL-Adapter]** Self-Supervised Adapter for Vision-Language Pretrained Models [[pdf]](https://arxiv.org/pdf/2210.03794.pdf) [[code]](https://github.com/omipan/svl_adapter)
- **[KAPT]** Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [[pdf]](https://arxiv.org/pdf/2308.11186.pdf)
- **[SuS-X]** Training-Free Name-Only Transfer of Vision-Language Models [[pdf]](https://arxiv.org/pdf/2211.16198.pdf) [[code]](https://github.com/vishaal27/SuS-X)
- **[Internet Explorer]** Targeted Representation Learning on the Open Web [[pdf]](https://arxiv.org/pdf/2302.14051.pdf) [[code]](https://github.com/internet-explorer-ssl/internet-explorer) [[website]](https://internet-explorer-ssl.github.io/)
- **[REACT]** Learning Customized Visual Models with Retrieval-Augmented Knowledge [[pdf]](https://arxiv.org/pdf/2301.07094v1.pdf) [[code]](https://github.com/microsoft/react) [[website]](https://react-vl.github.io/)
- **[SEARLE]** Zero-Shot Composed Image Retrieval with Textual Inversion [[pdf]](https://arxiv.org/pdf/2303.15247.pdf) [[code]](https://github.com/miccunifi/SEARLE)
- **[CLIPpy]** Perceptual Grouping in Contrastive Vision-Language Models [[pdf]](https://arxiv.org/pdf/2210.09996.pdf)
- **[CDUL]** CLIP-Driven Unsupervised Learning for Multi-Label Image Classification [[pdf]](https://arxiv.org/pdf/2307.16634.pdf)
- **[DN]** Test-Time Distribution Normalization for Contrastively Learned Vision-language Models [[pdf]](https://arxiv.org/pdf/2302.11084.pdf) [[code]](https://github.com/fengyuli-dev/distribution-normalization) [[website]](https://fengyuli-dev.github.io/dn-website/)
- **[TPT]** Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [[pdf]](https://arxiv.org/pdf/2209.07511.pdf) [[code]](https://github.com/azshue/TPT) [[website]](https://azshue.github.io/TPT/)
- **[ReCLIP]** A Strong Zero-Shot Baseline for Referring Expression Comprehension [[pdf]](https://arxiv.org/pdf/2204.05991.pdf)
- **[PLOT]** Prompt Learning with Optimal Transport for Vision-Language Models [[pdf]](https://arxiv.org/pdf/2210.01253.pdf) [[code]](https://github.com/CHENGY12/PLOT)
- **[GEM]** Emerging Localization Properties in Vision-Language Transformers [[pdf]](https://arxiv.org/pdf/2312.00878.pdf) [[code]](https://github.com/WalBouss/GEM) [[demo]](https://huggingface.co/spaces/WalidBouss/GEM)
- **[SynthCLIP]** Are We Ready for a Fully Synthetic CLIP Training? [[pdf]](https://arxiv.org/pdf/2402.01832.pdf) [[code]](https://github.com/hammoudhasan/SynthCLIP) [[dataset]](https://huggingface.co/datasets/hammh0a/SynthCLIP)
- **[MirrorCLIP]** Disentangling text from visual images through reflection [[pdf]](https://openreview.net/pdf?id=FYm8coxdiR) [[code]](https://github.com/tcwangbuaa/MirrorCLIP)
- **[WATT]** Weight Average Test-Time Adaptation of CLIP [[pdf]](https://openreview.net/pdf?id=4D7hnJ9oM6) [[code]](https://github.com/Mehrdad-Noori/WATT)
- Visual Classification via Description from Large Language Models [[pdf]](https://arxiv.org/pdf/2210.07183.pdf) [[code]](https://github.com/sachit-menon/classify_by_description_release) [[website]](https://cv.cs.columbia.edu/sachit/classviadescr/)
- Understanding the Modality Gap in Multi-modal Contrastive Representation Learning [[pdf]](https://arxiv.org/pdf/2203.02053.pdf) [[code]](https://github.com/Weixin-Liang/Modality-Gap) [[website]](https://modalitygap.readthedocs.io/en/latest/)
- Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts [[pdf]](https://arxiv.org/pdf/2307.11661.pdf) [[code]](https://github.com/mayug/VDT-Adapter)
- Improving CLIP Fine-tuning Performance [[pdf]](https://openaccess.thecvf.com/content/ICCV2023/papers/Wei_Improving_CLIP_Fine-tuning_Performance_ICCV_2023_paper.pdf) [[supp]](https://openaccess.thecvf.com/content/ICCV2023/supplemental/Wei_Improving_CLIP_Fine-tuning_ICCV_2023_supplemental.pdf) [[code]](https://github.com/SwinTransformer/Feature-Distillation)
- The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis [[pdf]](https://openaccess.thecvf.com/content/CVPR2022W/MULA/papers/Barraco_The_Unreasonable_Effectiveness_of_CLIP_Features_for_Image_Captioning_An_CVPRW_2022_paper.pdf)
- Prompting Visual-Language Models for Efficient Video Understanding [[pdf]](https://arxiv.org/pdf/2112.04478.pdf)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [[pdf]](https://arxiv.org/pdf/2204.04588.pdf)
- Disentangling visual and written concepts in CLIP [[pdf]](https://arxiv.org/pdf/2206.07835.pdf) [[code]](https://github.com/joaanna/disentangling_spelling_in_clip) [[website]](https://joaanna.github.io/disentangling_spelling_in_clip/)
- CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [[pdf]](https://arxiv.org/pdf/2203.07190.pdf)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [[pdf]](https://arxiv.org/pdf/2212.06138.pdf) [[code]](https://github.com/LightDXY/FT-CLIP)
- Zero-Shot Video Captioning with Evolving Pseudo-Tokens [[pdf]](https://arxiv.org/pdf/2207.11100.pdf) [[code]](https://github.com/YoadTew/zero-shot-video-to-text)
- What does CLIP know about a red circle? Visual prompt engineering for VLMs [[pdf]](https://arxiv.org/pdf/2304.06712.pdf)
- Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? [[pdf]](https://arxiv.org/pdf/2307.11978.pdf) [[code]](https://github.com/CEWu/PTNL)
- Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks [[pdf]](https://arxiv.org/pdf/2307.06795.pdf) [[code]](https://github.com/FactoDeepLearning/MultitaskVLFM)
- Image-based CLIP-Guided Essence Transfer [[pdf]](https://arxiv.org/pdf/2110.12427.pdf) [[code]](https://github.com/hila-chefer/TargetCLIP)
- Unsupervised Semantic Image Segmentation with Stylegan and CLIP [[pdf]](https://arxiv.org/pdf/2107.12518.pdf) [[code]](https://github.com/warmspringwinds/segmentation_in_style)
- Exploring Visual Prompts for Adapting Large-Scale Models [[pdf]](https://arxiv.org/pdf/2203.17274.pdf) [[code]](https://github.com/hjbahng/visual_prompting) [[website]](https://hjbahng.github.io/visual_prompting/)
- CLIP-Guided Diffusion [[code]](https://github.com/openai/guided-diffusion) [[code]](https://github.com/afiaka87/clip-guided-diffusion) [[code]](https://github.com/nerdyrodent/CLIP-Guided-Diffusion) [[code]](https://github.com/crowsonkb/v-diffusion-pytorch)
- Exploring the Visual Shortcomings of Multimodal LLMs [[pdf]](https://arxiv.org/pdf/2401.06209)
- Fast Zero Shot Object Detection with OpenAI CLIP [[video]](https://www.youtube.com/watch?v=i3OYlaoj-BM&ab_channel=JamesBriggs)
- [CLIP Variants](https://github.com/lucidrains/x-clip)
- [awesome-clip](https://github.com/yzhuoning/Awesome-CLIP)
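
For a concrete sense of how training-free adaptation of CLIP (in the spirit of Tip-Adapter above) works, here is a minimal PyTorch sketch. It assumes image and text features have already been extracted and L2-normalized with your CLIP implementation of choice; the tensor names and the `alpha`/`beta` hyperparameters are illustrative, not taken from the official code.

```python
# Minimal sketch of a Tip-Adapter-style, training-free cache model.
# Assumes CLIP image/text features are precomputed and L2-normalized;
# all tensor names and the alpha/beta values below are illustrative.
import torch
import torch.nn.functional as F

def tip_adapter_logits(image_features, text_features, cache_keys, cache_values,
                       alpha=1.0, beta=5.5):
    """Combine zero-shot CLIP logits with a few-shot feature cache.

    image_features: (B, D)  query image embeddings (L2-normalized)
    text_features:  (C, D)  class text embeddings (L2-normalized)
    cache_keys:     (N, D)  few-shot training embeddings (L2-normalized)
    cache_values:   (N, C)  one-hot labels of the few-shot examples
    """
    # Zero-shot logits: cosine similarity between images and class prompts.
    zero_shot = 100.0 * image_features @ text_features.t()

    # Cache logits: similarity to the few-shot examples, sharpened and
    # mapped onto their one-hot labels.
    affinity = image_features @ cache_keys.t()                  # (B, N)
    cache = torch.exp(-beta * (1.0 - affinity)) @ cache_values  # (B, C)

    return zero_shot + alpha * cache

# Toy usage with random features (stand-ins for real CLIP embeddings).
B, C, D, shots = 4, 10, 512, 16
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(C, D), dim=-1)
keys = F.normalize(torch.randn(C * shots, D), dim=-1)
values = F.one_hot(torch.arange(C).repeat_interleave(shots), C).float()
print(tip_adapter_logits(img, txt, keys, values).argmax(dim=-1))
```

The key design choice is that nothing is trained: the few-shot examples act as a soft nearest-neighbour cache whose predictions are blended with the zero-shot logits.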

## Policy Gradients with Image Captioning
- Self-critical Sequence Training for Image Captioning [[pdf]](https://arxiv.org/pdf/1612.00563.pdf) [[code]](https://github.com/ruotianluo/self-critical.pytorch) (a minimal sketch of the self-critical loss follows after this list)
- A Better Variant of Self-Critical Sequence Training [[pdf]](https://arxiv.org/pdf/2003.09971.pdf) [[code]](https://github.com/ruotianluo/self-critical.pytorch)
- Fine-grained Image Captioning with CLIP Reward [[pdf]](https://arxiv.org/pdf/2205.13115.pdf) [[code]](https://github.com/j-min/CLIP-Caption-Reward)
- Distinctive Image Captioning via CLIP Guided Group Optimization [[pdf]](https://arxiv.org/pdf/2208.04254.pdf)
- Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization [[pdf]](https://www.arxiv.org/pdf/2408.14547)
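
The self-critical idea above boils down to a REINFORCE loss whose baseline is the model's own greedy decode. A minimal, hedged sketch is below; `reward_fn` stands in for CIDEr or a CLIP-based reward, and all names are illustrative rather than taken from any of the repos above.

```python
# Minimal sketch of a self-critical (SCST-style) policy-gradient loss.
# `sample_logprobs` are the log-probabilities of a sampled caption's tokens,
# `reward_fn` scores a caption against references (e.g. CIDEr, or a CLIP
# reward as in CLIP-Caption-Reward); all names here are illustrative.
import torch

def scst_loss(sample_logprobs, sample_caption, greedy_caption, refs, reward_fn):
    """sample_logprobs: (T,) log-probs of the sampled tokens (requires grad)."""
    with torch.no_grad():
        # Advantage = reward of the sample minus the greedy-decoding baseline.
        advantage = reward_fn(sample_caption, refs) - reward_fn(greedy_caption, refs)
    # REINFORCE: raise the sample's log-likelihood when it beats the baseline.
    return -(advantage * sample_logprobs.sum())

# Toy usage with a dummy reward (token overlap with the reference).
def overlap_reward(caption, refs):
    ref = set(refs[0].split())
    return torch.tensor(len(set(caption.split()) & ref) / max(len(ref), 1))

logprobs = torch.log(torch.rand(7)).requires_grad_()
loss = scst_loss(logprobs, "a dog runs on the beach",
                 "a dog on beach", ["a dog runs along the beach"], overlap_reward)
loss.backward()
```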

## Segmentation + Vision-Language
- **[Semantic-SAM]** Segment and Recognize Anything at Any Granularity [[github]](https://github.com/UX-Decoder/Semantic-SAM)
- **[SEEM]** Segment Everything Everywhere All at Once [[github]](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
- **[Grounding DINO]** Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [[github]](https://github.com/IDEA-Research/GroundingDINO)
- **[DINO-X]** A Unified Vision Model for Open-World Object Detection and Understanding [[github]](https://github.com/IDEA-Research/DINO-X-API)

## Miscellaneous
- SetFit: Efficient Few-Shot Learning Without Prompts [[pdf]](https://arxiv.org/pdf/2209.11055.pdf) [[code]](https://github.com/huggingface/setfit) [[blog]](https://huggingface.co/blog/setfit)
- LoRA: Low-Rank Adaptation of Large Language Models (a minimal sketch of the low-rank update follows at the end of this list) [[pdf]](https://arxiv.org/pdf/2106.09685.pdf) [[code]](https://github.com/microsoft/LoRA) [[code]](https://github.com/huggingface/peft) [[colab]](https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing) [[video]](https://www.youtube.com/watch?v=KEv-F5UkhxU&ab_channel=AICoffeeBreakwithLetitia) [[blog]](https://huggingface.co/blog/peft) [[blog]](https://www.ml6.eu/blogpost/low-rank-adaptation-a-technical-deep-dive) [[blog]](https://medium.com/@abdullahsamilguser/lora-low-rank-adaptation-of-large-language-models-7af929391fee) [[blog]](https://towardsdatascience.com/understanding-lora-low-rank-adaptation-for-finetuning-large-models-936bce1a07c6) [[hf docs]](https://huggingface.co/docs/diffusers/training/lora) [[library]](https://huggingface.co/docs/peft/index) [[library tutorial]](https://huggingface.co/learn/cookbook/prompt_tuning_peft)
- QLoRA: Efficient Finetuning of Quantized LLMs [[pdf]](https://arxiv.org/pdf/2305.14314.pdf) [[code]](https://github.com/artidoro/qlora) [[demo]](https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi) [[blog]](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- LLaMA Family: [[hf card]](https://huggingface.co/meta-llama) [[github]](https://github.com/meta-llama) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llama) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llama2) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llama3) [[llama report]](https://arxiv.org/pdf/2302.13971v1) [[llama2 report]](https://arxiv.org/pdf/2307.09288.pdf) [[llama3 report]](https://arxiv.org/pdf/2407.21783) [[llama3 report summary]](https://x.com/A_K_Nain/status/1815942598944547074) [[llama3.1 hf blog]](https://huggingface.co/blog/llama31) [[code llama report]](https://arxiv.org/pdf/2308.12950.pdf) Other Resources: [[llama2.c]](https://github.com/karpathy/llama2.c) [[llama.cpp]](https://github.com/ggerganov/llama.cpp) [[finetune script]](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da) [[finetune script]](https://www.philschmid.de/sagemaker-llama2-qlora) [[finetune script]](https://www.philschmid.de/instruction-tune-llama-2) [[finetune script]](https://www.philschmid.de/fsdp-qlora-llama3) [[tutorials]](https://github.com/amitsangani/Llama-2) [[yarn]](https://github.com/jquesnelle/yarn) [[openbio]](https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B) [[huggingface-llama-recipes]](https://github.com/huggingface/huggingface-llama-recipes/tree/main)
- GPT-3: Language Models are Few-Shot Learners [[pdf]](https://arxiv.org/pdf/2005.14165.pdf) [[minGPT]](https://github.com/karpathy/minGPT) [[nanoGPT]](https://github.com/karpathy/nanoGPT)
- ChatGPT [[blog]](https://openai.com/blog/chatgpt/) [[RLHF]](https://arxiv.org/pdf/2009.01325.pdf) [[InstructGPT]](https://arxiv.org/pdf/2203.02155.pdf) [[InstructGPT blog]](https://openai.com/blog/instruction-following/) [[rlhf hf]](https://huggingface.co/blog/rlhf) [[rlhf wandb]](https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx) [[code]](https://github.com/lucidrains/PaLM-rlhf-pytorch)
- SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions [[pdf]](https://arxiv.org/pdf/2212.10560.pdf) [[code]](https://github.com/yizhongw/self-instruct)
- Mistral [[models]](https://huggingface.co/mistralai)
- Qwen [[models]](https://huggingface.co/Qwen)
- SmolLM2 [[github]](https://github.com/huggingface/smollm)
- Mixture of Experts LLM [[hf collections]](https://huggingface.co/collections/mlabonne/mixture-of-experts-65980c40330942d1282b76f5) [[video]](https://www.youtube.com/watch?v=mwO6v4BlgZQ) [[notebook]](https://colab.research.google.com/drive/1k6C_oJfEKUq0mtuWKisvoeMHxTcIxWRa?usp=sharing)
- Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation [[pdf]](https://arxiv.org/pdf/2402.18334.pdf) [[code]](https://github.com/BatsResearch/bonito)
- Large Language Models Are Reasoning Teachers [[pdf]](https://arxiv.org/pdf/2212.10071.pdf) [[code]](https://github.com/itsnamgyu/reasoning-teacher)
- Cramming: Training a Language Model on a Single GPU in One Day [[pdf]](https://arxiv.org/pdf/2212.14034.pdf)
- Downstream Datasets Make Surprisingly Good Pretraining Corpora [[pdf]](https://arxiv.org/pdf/2209.14389.pdf)
- FCM: Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models [[pdf]](https://arxiv.org/pdf/2210.13432.pdf) [[code]](https://github.com/lucidrains/x-transformers)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [[pdf]](https://arxiv.org/pdf/2305.02301.pdf)
- Deep Learning Tuning Playbook [[github]](https://github.com/google-research/tuning_playbook)
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [[pdf]](https://arxiv.org/pdf/2203.05482.pdf) [[code]](https://github.com/mlfoundations/model-soups)
- Neural Priming for Sample-Efficient Adaptation [[pdf]](https://arxiv.org/pdf/2306.10191.pdf) [[code]](https://github.com/RAIVNLab/neural-priming)
- ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [[pdf]](https://arxiv.org/pdf/2108.12409.pdf)
- Fine-tuning can cripple your foundation model; preserving features may be the solution [[pdf]](https://arxiv.org/pdf/2308.13320.pdf)
- Can LLMs learn from a single example? [[article]](https://www.fast.ai/posts/2023-09-04-learning-jumps/)
- Finetuned Language Models Are Zero-Shot Learners [[pdf]](https://arxiv.org/pdf/2109.01652.pdf)
- [Falcon LLM](https://huggingface.co/tiiuae)
- [x-transformers](https://github.com/lucidrains/x-transformers)
- LLM Resources [[mlabonne]](https://github.com/mlabonne/llm-course) [[wandb]](https://www.wandb.courses/courses/training-fine-tuning-LLMs) [[train and deploy]](https://www.youtube.com/watch?v=Ma4clS-IdhA&t=1709s) [[supervised FT]](https://www.youtube.com/watch?v=NXevvEF3QVI&t=1s) [[how LLM Chatbots work]](https://www.youtube.com/watch?v=C6ZszXYPDDw&t=26s) [[finetuning tutorial]](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl) [[pytorch finetuning tutorial]](https://pytorch.org/blog/finetune-llms/?utm_content=278057355&utm_medium=social&utm_source=linkedin&hss_channel=lcp-78618366) [[finetuning tutorial]](https://huggingface.co/learn/cookbook/fine_tuning_code_llm_on_single_gpu) [[hf youtube tutorial]](https://www.youtube.com/watch?v=2-SPH9hIKT8) [[hf slides]](https://docs.google.com/presentation/d/1uFd95VFSefD_Pom12kZ6q7ZppBJuT-T1vSGMUojDaBQ/edit#slide=id.p) [[andrej karpathy tutorials]](https://www.youtube.com/@AndrejKarpathy/videos)
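
Several entries above (LoRA, QLoRA, the PEFT library) revolve around the same low-rank update, h = Wx + (alpha/r)·BAx, applied to a frozen weight matrix. Here is a minimal sketch of that update on a single linear layer; in practice you would use the `peft` library linked above, and the class and parameter names below are illustrative only.

```python
# Minimal sketch of a LoRA update on a frozen linear layer:
# h = W x + (alpha / r) * B (A x), with only A and B trainable.
# Illustrative only; use the `peft` library for real fine-tuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r           # B starts at zero, so training
                                           # begins from the original model

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

# Toy usage: wrap one projection of a frozen model.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # torch.Size([2, 768]) 12288
```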

**Prompting**

- The Power of Scale for Parameter-Efficient Prompt Tuning (a minimal sketch of soft prompts follows after this list) [[pdf]](https://arxiv.org/pdf/2104.08691.pdf) [[code]](https://github.com/google-research/prompt-tuning) [[code]](https://huggingface.co/docs/peft/task_guides/clm-prompt-tuning) [[blog]](https://ai.googleblog.com/2022/02/guiding-frozen-language-models-with.html?m=1) [[blog]](https://heidloff.net/article/introduction-to-prompt-tuning/)
- Prefix-Tuning: Optimizing Continuous Prompts for Generation [[pdf]](https://arxiv.org/pdf/2101.00190.pdf) [[code]](https://github.com/XiangLi1999/PrefixTuning)
- Visual Prompt Tuning [[pdf]](https://arxiv.org/pdf/2203.12119.pdf) [[code]](https://github.com/kmnp/vpt)
- GPT Understands, Too [[pdf]](https://arxiv.org/pdf/2103.10385.pdf) [[code]](https://github.com/THUDM/P-tuning)
- Making Pre-trained Language Models Better Few-shot Learners [[pdf]](https://arxiv.org/pdf/2012.15723.pdf) [[code]](https://github.com/princeton-nlp/LM-BFF)
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts [[pdf]](https://arxiv.org/pdf/2010.15980.pdf)
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [[pdf]](https://arxiv.org/pdf/2201.11903.pdf) [[blog]](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html?m=1)
- Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models [[pdf]](https://arxiv.org/pdf/2305.04091.pdf) [[code]](https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting)
- Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [[pdf]](https://arxiv.org/pdf/2311.04205.pdf) [[code]](https://github.com/uclaml/Rephrase-and-Respond) [[website]](https://uclaml.github.io/Rephrase-and-Respond/)
- System 2 Attention (is something you might need too) [[pdf]](https://arxiv.org/pdf/2311.11829.pdf)
- Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting [[pdf]](https://arxiv.org/pdf/2305.04388.pdf)
- Large Language Models Can Self-Improve [[pdf]](https://arxiv.org/pdf/2210.11610v2.pdf)
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion [[pdf]](https://arxiv.org/pdf/2208.01618.pdf) [[code]](https://github.com/rinongal/textual_inversion) [[code]](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) [[website]](https://textual-inversion.github.io/) [[hf docs]](https://huggingface.co/docs/diffusers/training/text_inversion) [[blog]](https://medium.com/@onkarmishra/how-textual-inversion-works-and-its-applications-5e3fda4aa0bc)
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [[pdf]](https://arxiv.org/pdf/2208.12242.pdf) [[code]](https://github.com/XavierXiao/Dreambooth-Stable-Diffusion) [[code]](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) [[website]](https://dreambooth.github.io/) [[hf docs]](https://huggingface.co/docs/diffusers/training/dreambooth)
- Visual Prompting via Image Inpainting [[pdf]](https://arxiv.org/pdf/2209.00647.pdf) [[code]](https://github.com/amirbar/visual_prompting) [[website]](https://yossigandelsman.github.io/visual_prompt/)
- What Makes Good Examples for Visual In-Context Learning? [[pdf]](https://arxiv.org/pdf/2301.13670.pdf) [[code]](https://github.com/ZhangYuanhan-AI/visual_prompt_retrieval)
- Understanding and Improving Visual Prompting: A Label-Mapping Perspective [[pdf]](https://arxiv.org/pdf/2211.11635.pdf) [[code]](https://github.com/OPTML-Group/ILM-VP)
- Exploring Demonstration Ensembling for In-context Learning [[pdf]](https://openreview.net/pdf?id=9kK4R_8nAsD)
- Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations [[pdf]](https://arxiv.org/pdf/2212.09865.pdf) [[code]](https://github.com/alrope123/z-icl)
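
Many of the prompting papers above (prompt tuning, prefix tuning, visual prompt tuning) share one mechanism: a small set of learnable "virtual token" embeddings is prepended to a frozen model's inputs, and only those embeddings are trained. A minimal sketch under that assumption, with all names illustrative:

```python
# Minimal sketch of soft prompt tuning: learn a few "virtual token"
# embeddings and prepend them to the frozen model's input embeddings.
# The embedding table and dimensions below are placeholders.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (B, T, D) embeddings of the real input tokens.
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)   # (B, P, D)
        return torch.cat([prompt, token_embeds], dim=1)           # (B, P+T, D)

# Toy usage: only the prompt (and perhaps a task head) is trained;
# the backbone and its embedding table stay frozen.
embed = nn.Embedding(30522, 768)       # frozen embedding table (placeholder)
embed.weight.requires_grad = False
soft_prompt = SoftPrompt(n_tokens=20, embed_dim=768)

input_ids = torch.randint(0, 30522, (2, 16))
inputs_embeds = soft_prompt(embed(input_ids))   # would be fed to a frozen
print(inputs_embeds.shape)                      # transformer as input embeds
```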

## New Large-Scale Datasets
- **[VisualCOMET]** Reasoning about the Dynamic Context of a Still Image [[pdf]](https://arxiv.org/pdf/2004.10796.pdf) [[website]](https://visualcomet.xyz/)
- **[LAION]** [[website]](https://laion.ai/) [[hf website]](https://huggingface.co/laion) [[paper]](https://openreview.net/pdf?id=M3Y74vmsMcY) [[paper]](https://arxiv.org/pdf/2111.02114.pdf) [[img2dataset]](https://github.com/rom1504/img2dataset)
- **[Conceptual 12M]** Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts [[pdf]](https://arxiv.org/pdf/2102.08981.pdf) [[code]](https://github.com/google-research-datasets/conceptual-12m)
- **[Winoground]** Probing Vision and Language Models for Visio-Linguistic Compositionality [[pdf]](https://arxiv.org/pdf/2204.03162.pdf) [[dataset]](https://huggingface.co/datasets/facebook/winoground)

### Libraries
- LAVIS [[github]](https://github.com/salesforce/LAVIS) [[docs]](https://opensource.salesforce.com/LAVIS//latest/index.html#)
- Diffusers [[github]](https://github.com/huggingface/diffusers) [[docs]](https://huggingface.co/docs/diffusers/index)
- X-modaler [[github]](https://github.com/YehLi/xmodaler) [[docs]](https://xmodaler.readthedocs.io/en/latest/index.html)
- MMF [[github]](https://github.com/facebookresearch/mmf) [[docs]](https://mmf.sh/docs/)
- TorchMultimodal [[github]](https://github.com/facebookresearch/multimodal) [[blog]](https://pytorch.org/blog/introducing-torchmultimodal/) [[blog]](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/)
- [Transformers-VQA](https://github.com/YIKUAN8/Transformers-VQA)
- [MMT-Retrieval](https://github.com/UKPLab/MMT-Retrieval)

### Evaluation
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)
- [MMMU](https://mmmu-benchmark.github.io/)

### Projects
- [Florence-VL](https://www.microsoft.com/en-us/research/project/project-florence-vl/)
- [unilm](https://github.com/microsoft/unilm)

### Other Awesomes
- [awesome-Vision-and-Language-Pre-training](https://github.com/phellonchen/awesome-Vision-and-Language-Pre-training)
- [awesome-vision-language-pretraining-papers](https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers)
- [vqa](https://github.com/jokieleung/awesome-visual-question-answering)
- [image captioning](https://github.com/forence/Awesome-Visual-Captioning)
- [image captioning](https://github.com/zhjohnchan/awesome-image-captioning)
- [scene graphs](https://github.com/huoxingmeishi/Awesome-Scene-Graphs)

### Survey Papers
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [[pdf]](https://arxiv.org/pdf/2403.09611.pdf)
- The (R)Evolution of Multimodal Large Language Models: A Survey [[pdf]](https://arxiv.org/pdf/2402.12451.pdf)
- Vision-Language Pre-training: Basics, Recent Advances, and Future Trends [[pdf]](https://arxiv.org/pdf/2210.09263.pdf)
- A Survey of Vision-Language Pre-Trained Models [[pdf]](https://arxiv.org/pdf/2202.10936.pdf)
- VLP: A Survey on Vision-Language Pre-training [[pdf]](https://arxiv.org/pdf/2202.09061.pdf)
- Vision Language models: towards multi-modal deep learning [[blog]](https://theaisummer.com/vision-language-models/)
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants [[pdf]](https://arxiv.org/pdf/2309.10020.pdf)
- Instruction Tuning for Large Language Models: A Survey [[pdf]](https://arxiv.org/pdf/2308.10792.pdf)

### Resources
- [Comparing image captioning models](https://huggingface.co/spaces/nielsr/comparing-captioning-models)
- [Comparing visual question answering (VQA) models](https://huggingface.co/spaces/nielsr/comparing-VQA-models)
- [Generalized Visual Language Models](https://lilianweng.github.io/posts/2022-06-09-vlm/)
- [Prompting in Vision CVPR23 Tutorial](https://prompting-in-vision.github.io/)
- [CVPR23 Tutorial](https://vlp-tutorial.github.io/2023/)
- [CVPR22 Tutorial](https://vlp-tutorial.github.io/2022/)
- [CVPR21 Tutorial](https://vqa2vln-tutorial.github.io/)
- [CVPR20 Tutorial](https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research/)