Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-vision-language-pretraining
Awesome Vision-Language Pretraining Papers
https://github.com/fawazsammani/awesome-vision-language-pretraining
- ViLBERT [[code]](https://github.com/jiasenlu/vilbert_beta)
- Unified-VLP
- ImageBERT
- SimVLM
- ALBEF
- LXMERT
- X-LXMERT
- VisualBERT
- UNIMO
- UNIMO-2
- BLIP
- BLIP-2 [[hf docs]](https://huggingface.co/docs/transformers/model_doc/blip-2) [[finetuning colab]](https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing) [[blog]](https://huggingface.co/blog/blip-2)
- Uni-EDEN
- VisualGPT
- MiniVLM
- XGPT
- ViTCAP
- LEMON
- IC3
- TAP
- PICa
- CVLP
- UniT
- VL-BERT
- Unicoder-VL
- UNITER
- ViLT
- GLIP
- GLIPv2
- VLMo
- METER
- WenLan
- InterBERT
- SemVLP
- E2E-VLP
- VinVL
- UFO
- Florence
- VILLA
- TDEN
- ERNIE-ViL
- Vokenization
- 12-in-1
- KVL-BERT
- Oscar
- VIVO
- SOHO
- Pixel-BERT
- LightningDOT
- VirTex
- Uni-Perceiver
- Uni-Perceiver v2
- CoCa [[code]](https://github.com/mlfoundations/open_clip) [[colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb)
- Flamingo [[code]](https://github.com/mlfoundations/open_flamingo) [[code]](https://github.com/dhansmair/flamingo-mini) [[website]](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) [[blog]](https://wandb.ai/gladiator/Flamingo%20VLM/reports/DeepMind-Flamingo-A-Visual-Language-Model-for-Few-Shot-Learning--VmlldzoyOTgzMDI2) [[blog]](https://laion.ai/blog/open-flamingo/) [[blog]](https://laion.ai/blog/open-flamingo-v2/)
- BEiT-3
- UniCL
- UVLP
- OFA [[models and demos]](https://huggingface.co/OFA-Sys)
- GPV-1 [[website]](https://prior.allenai.org/projects/gpv)
- GPV-2
- TCL
- L-Verse
- FLAVA [[tutorial]](https://pytorch.org/tutorials/beginner/flava_finetuning_tutorial.html)
- COTS
- VL-ADAPTER
- Unified-IO
- ViLTA
- CapDet
- PTP
- X-VLM
- FewVLM
- M3AE
- CFM-ViT
- mPLUG
- PaLI
- GIT
- MaskVLM
- DALL-E [[code]](https://github.com/borisdayma/dalle-mini) [[code]](https://github.com/lucidrains/DALLE-pytorch) [[code]](https://github.com/kuprel/min-dalle) [[code]](https://github.com/robvanvolt/DALLE-models) [[code]](https://github.com/kakaobrain/minDALL-E) [[website]](https://openai.com/blog/dall-e/) [[video]](https://www.youtube.com/watch?v=j4xgkjWlfL4&t=1432s&ab_channel=YannicKilcher) [[video]](https://www.youtube.com/watch?v=jMqLTPcA9CQ&t=1034s&ab_channel=TheAIEpiphany) [[video]](https://www.youtube.com/watch?v=x_8uHX5KngE&ab_channel=TheAIEpiphany) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/Building-efficient-image-input-pipelines--VmlldzoyMjMxOTQw) [[blog]](https://ml.berkeley.edu/blog/posts/vq-vae/) [[blog]](https://ml.berkeley.edu/blog/posts/dalle2/) [[blog]](https://towardsdatascience.com/understanding-how-dall-e-mini-works-114048912b3b)
- DALL-E 2 [[website]](https://openai.com/dall-e-2/) [[blog]](http://adityaramesh.com/posts/dalle2/dalle2.html) [[blog]](https://www.assemblyai.com/blog/how-dall-e-2-actually-works/) [[blog]](https://medium.com/augmented-startups/how-does-dall-e-2-work-e6d492a2667f)
- DALL-E 3
- GigaGAN [[code]](https://github.com/jianzhnie/GigaGAN) [[website]](https://mingukkang.github.io/GigaGAN/)
- Parti [[code]](https://github.com/lucidrains/parti-pytorch) [[video]](https://www.youtube.com/watch?v=qS-iYnp00uc&ab_channel=YannicKilcher) [[blog]](https://parti.research.google/)
- Paella
- Make-A-Scene
- FIBER
- VL-BEiT
- MetaLM
- VL-T5
- UNICORN
- MI2P
- MDETR
- VLMixer
- ViCHA
- Img2LLM [[colab]](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
- PNP-VQA [[colab]](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/pnp-vqa/pnp_vqa.ipynb)
- StoryDALL-E
- VLMAE
- MLIM
- MOFI
- GILL
- Language Pivoting
- Graph-Align
- PL-UIC
- SCL
- TaskRes
- EPIC
- HAAV
- FLM
- DiHT
- VL-Match
- Prismer
- PaLM-E [[blog]](https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html)
- X-Decoder [[xgpt code]](https://github.com/microsoft/X-Decoder/tree/xgpt) [[website]](https://x-decoder-vl.github.io/) [[demo]](https://huggingface.co/spaces/xdecoder/Demo) [[demo]](https://huggingface.co/spaces/xdecoder/Instruct-X-Decoder)
- PerVL
- TextManiA [[website]](https://moon-yb.github.io/TextManiA.github.io/) [[GAN Inversion]](https://arxiv.org/pdf/2004.00049.pdf)
- Cola
- K-LITE
- SINC
- Visual ChatGPT
- CM3Leon
- KOSMOS-1
- KOSMOS-2 [[code]](https://huggingface.co/docs/transformers/model_doc/kosmos-2) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/KOSMOS-2/Inference_with_KOSMOS_2_for_multimodal_grounding.ipynb) [[demo]](https://huggingface.co/spaces/ydshieh/Kosmos-2)
- MultiModal-GPT
- LLaVA [[website]](https://llava-vl.github.io/) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/llava)
- ViP-LLaVA [[demo]](https://pages.cs.wisc.edu/~mucai/vip-llava.html) [[website]](https://vip-llava.github.io/) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViP-LLaVa/Inference_with_ViP_LLaVa_for_fine_grained_VQA.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/vipllava)
- VILA [[hf page]](https://huggingface.co/Efficient-Large-Model)
- NExT-Chat [[demo]](https://516398b33beb3e8b9f.gradio.live/) [[website]](https://next-chatv.github.io/)
- MiniGPT-4 [[website]](https://minigpt-4.github.io/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/minigpt4)
- MiniGPT-v2 [[website]](https://minigpt-v2.github.io/) [[demo]](https://876a8d3e814b8c3a8b.gradio.live/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/MiniGPT-v2)
- LLaMA-Adapter
- LLaMA-Adapter V2 [[demo]](http://llama-adapter.opengvlab.com/)
- LaVIN
- InstructBLIP
- Otter/MIMIC-IT
- CogVLM
- ImageBind
- TextBind
- MetaVL
- M³IT
- Instruction-ViT
- MultiInstruct
- VisIT-Bench [[website]](https://visit-bench.github.io/) [[blog]](https://laion.ai/blog/visit_bench/) [[dataset]](https://huggingface.co/datasets/mlfoundations/VisIT-Bench) [[leaderboard]](https://huggingface.co/spaces/mlfoundations/VisIT-Bench-Leaderboard)
- GPT4RoI
- PandaGPT
- ChatBridge
- Video-LLaMA [[demo]](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA)
- VideoChat
- InternGPT
- mPLUG-Owl [[demo]](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl)
- VisionLLM
- X-LLM [[website]](https://x-llm.github.io/)
- OBELICS [[instruct model 9b]](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct) [[instruct model 80b]](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) [[demo]](https://huggingface.co/spaces/HuggingFaceM4/idefics_playground) [[dataset]](https://huggingface.co/datasets/HuggingFaceM4/OBELICS)
- EvALign-ICL [[website]](https://evalign-icl.github.io/)
- Plug and Pray
- VL-PET
- ICIS
- NExT-GPT [[website]](https://next-gpt.github.io/) [[demo]](https://4271670c463565f1a4.gradio.live/)
- UnIVAL
- BUS
- VPGTrans
- PromptCap [[website]](https://yushi-hu.github.io/promptcap_demo/) [[hf checkpoint]](https://huggingface.co/tifa-benchmark/promptcap-coco-vqa)
- P-Former
- TL;DR
- PMA-Net
- Encyclopedic VQA
- CMOTA
- CPT
- TeS
- MP
- CLIP [[code]](https://github.com/moein-shariatnia/OpenAI-CLIP) [[code]](https://github.com/lucidrains/x-clip) [[code]](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/clip) [[website]](https://openai.com/blog/clip/) [[video]](https://www.youtube.com/watch?v=T9XSU0pKX2E&t=1455s&ab_channel=YannicKilcher) [[video]](https://www.youtube.com/watch?v=fQyHEXZB-nM&ab_channel=AleksaGordi%C4%87-TheAIEpiphany) [[video code]](https://www.youtube.com/watch?v=jwZQD0Cqz4o&t=4610s&ab_channel=TheAIEpiphany) [[CLIP_benchmark]](https://github.com/LAION-AI/CLIP_benchmark) [[clip-retrieval]](https://github.com/rom1504/clip-retrieval) [[clip-retrieval blog]](https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c)
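The CLIP entry above underpins most of the items in this section. Its training objective is a symmetric contrastive (InfoNCE) loss over a batch of image-text pairs: matched pairs sit on the diagonal of the similarity matrix, and cross-entropy is applied in both the image-to-text and text-to-image directions. A minimal NumPy sketch of that loss; the batch size, embedding dimension, and fixed temperature here are illustrative choices, not the paper's exact training setup:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image-text embeddings."""
    # L2-normalise both embedding sets so the dot product is a cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def xent(l):
        # cross-entropy with the diagonal (matched pair) as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 64))
txt = img + 0.01 * rng.normal(size=(4, 64))  # nearly aligned pairs
loss_aligned = clip_contrastive_loss(img, txt)
loss_random = clip_contrastive_loss(img, rng.normal(size=(4, 64)))
```

As expected, a batch of well-aligned pairs scores a much lower loss than a batch of random pairings.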
- OpenCLIP [[clip colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb) [[clip benchmark]](https://github.com/LAION-AI/CLIP_benchmark) [[hf models]](https://huggingface.co/models?library=open_clip)
- CLIPScore
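CLIPScore, listed above, is a reference-free captioning metric: a rescaled, clipped cosine similarity between the CLIP image embedding and the caption embedding, CLIPScore = w · max(cos(E_I, E_C), 0) with w = 2.5. A minimal sketch on raw embedding vectors:

```python
import numpy as np

def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore: w * max(cosine similarity, 0) between CLIP embeddings."""
    cos = float(image_emb @ caption_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(caption_emb))
    return w * max(cos, 0.0)

v = np.array([1.0, 2.0, 3.0])
perfect = clip_score(v, v)    # identical embeddings: cosine 1 -> score w
opposite = clip_score(v, -v)  # negative similarity is clipped to 0
```

The max(·, 0) clamp means captions that CLIP considers anti-correlated with the image score zero rather than negative.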
- LiT [[website]](https://google-research.github.io/vision_transformer/lit/)
- SigLIP [[colab demo]](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/siglip) [[hf models]](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba)
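SigLIP, just above, swaps CLIP's batch-wide softmax for an independent sigmoid loss on every image-text pair, which removes the need for a global normalisation across the batch. A minimal NumPy sketch; the paper learns the temperature t and bias b, but they are fixed here for illustration:

```python
import numpy as np

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: label +1 for matched (diagonal) pairs,
    -1 for all mismatched pairs; each pair is an independent binary problem."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T * t + b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0  # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit), averaged over all n^2 pairs
    return float(np.mean(np.log1p(np.exp(-labels * logits))))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 64))
matched = img + 0.01 * rng.normal(size=(4, 64))  # near-duplicate pairs
loss_matched = siglip_loss(img, matched)
loss_mismatched = siglip_loss(img, rng.normal(size=(4, 64)))
```

Because no row-wise softmax is involved, the loss decomposes over pairs, which is what lets SigLIP scale batch size without the all-gather that CLIP's objective requires.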
- ALIGN
- DeCLIP
- FLIP
- Counting-aware CLIP
- ALIP
- FILIP
- SLIP
- WiSE-FT
- FLYP
- MAGIC
- ZeroCap
- CapDec
- DeCap
- ViECap
- CLOSE
- xCLIP
- EVA [[EVA-CLIP]](https://arxiv.org/pdf/2303.15389.pdf) [[EVA-02]](https://arxiv.org/pdf/2303.11331.pdf) [[code]](https://github.com/baaivision/EVA)
- VT-CLIP
- CLIP-ViL
- RegionCLIP
- DenseCLIP
- E-CLIP
- MaskCLIP
- CLIPSeg [[demo]](https://huggingface.co/spaces/nielsr/CLIPSeg) [[demo]](https://huggingface.co/spaces/Sijuade/CLIPSegmentation) [[demo]](https://huggingface.co/spaces/taesiri/CLIPSeg) [[demo]](https://huggingface.co/spaces/aryswisnu/CLIPSeg2) [[demo]](https://huggingface.co/spaces/aryswisnu/CLIPSeg) [[blog]](https://huggingface.co/blog/clipseg-zero-shot)
- OWL-ViT [[code]](https://huggingface.co/docs/transformers/model_doc/owlvit) [[code]](https://huggingface.co/docs/transformers/tasks/zero_shot_object_detection) [[colab]](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb) [[demo]](https://huggingface.co/spaces/adirik/OWL-ViT) [[demo]](https://huggingface.co/spaces/adirik/image-guided-owlvit) [[demo]](https://huggingface.co/spaces/johko/OWL-ViT) [[demo]](https://huggingface.co/spaces/kellyxiaowei/OWL-ViT) [[demo]](https://huggingface.co/spaces/wendys-llc/OWL-ViT)
- ClipCap
- VQGAN-CLIP [[code]](https://github.com/EleutherAI/vqgan-clip) [[code]](https://github.com/justinjohn0306/VQGAN-CLIP) [[code]](https://www.kaggle.com/code/basu369victor/playing-with-vqgan-clip/notebook) [[colab]](https://colab.research.google.com/github/dribnet/clipit/blob/master/demos/Moar_Settings.ipynb) [[colab]](https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN) [[colab]](https://colab.research.google.com/github/justinjohn0306/VQGAN-CLIP/blob/main/VQGAN%2BCLIP(Updated).ipynb)
- AltCLIP [[code]](https://huggingface.co/docs/transformers/model_doc/altclip)
- CLIPPO
- FDT
- DIME-FM [[website]](https://cs-people.bu.edu/sunxm/DIME-FM/)
- ViLLA
- BASIC
- CoOp
- CoCoOp
- RPO
- KgCoOp
- ECO
- UPT
- UPL
- ProDA
- CTP
- MVLPT
- DAPT
- LFA
- LaFTer
- TAP
- CLIP-Adapter
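The CLIP-Adapter entry above keeps the CLIP backbone frozen and learns only a small bottleneck MLP on top of the extracted feature, mixed back with the original feature through a residual ratio. A minimal NumPy sketch of that forward pass; the layer sizes and ratio α are illustrative, not the paper's tuned values:

```python
import numpy as np

def clip_adapter(feat, w_down, w_up, alpha=0.2):
    """Residual feature adapter: a two-layer bottleneck MLP refines the frozen
    CLIP feature; alpha controls how much adapted feature is mixed back in
    (alpha=0 returns the original feature unchanged)."""
    adapted = np.maximum(feat @ w_down, 0.0) @ w_up  # Linear -> ReLU -> Linear
    return alpha * adapted + (1.0 - alpha) * feat

rng = np.random.default_rng(0)
d, bottleneck = 512, 128  # bottleneck shrinks the feature 4x
w_down = rng.normal(size=(d, bottleneck)) * 0.02
w_up = rng.normal(size=(bottleneck, d)) * 0.02
feat = rng.normal(size=(1, d))
out = clip_adapter(feat, w_down, w_up)
identity = clip_adapter(feat, w_down, w_up, alpha=0.0)
```

Only `w_down` and `w_up` would be trained; the residual blend keeps the adapted feature close to the zero-shot one, which is what makes few-shot tuning stable.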
- Tip-Adapter
- CALIP
- SHIP
- LoGoPrompt
- GRAM
- MaPLe
- PromptSR
- ProGrad
- Prompt Align
- APE
- CuPL
- WaffleCLIP
- R-AMT [[website]](https://wuw2019.github.io/R-AMT/)
- SVL-Adapter
- KAPT
- SuS-X
- Internet Explorer [[website]](https://internet-explorer-ssl.github.io/)
- REACT
- SEARLE
- CLIPpy
- CDUL
- DN [[website]](https://fengyuli-dev.github.io/dn-website/)
- TPT
- ReCLIP
- PLOT
- GEM
- SynthCLIP
- Visual Classification via Description [[website]](https://cv.cs.columbia.edu/sachit/classviadescr/)
- Modality Gap [[website]](https://modalitygap.readthedocs.io/en/latest/)
- [pdf - Adapter)
- Feature Distillation [[code]](https://github.com/SwinTransformer/Feature-Distillation)
- [pdf - CLIP)
- [pdf - shot-video-to-text)
- TargetCLIP
- CLIP-Guided Diffusion [[code]](https://github.com/nerdyrodent/CLIP-Guided-Diffusion) [[code]](https://github.com/crowsonkb/v-diffusion-pytorch)
- CLIP Variants
- awesome-clip
- Self-Critical Sequence Training
- CLIP-Caption-Reward
- Semantic-SAM
- SEEM
- Grounding DINO
- LoRA [[video]](https://www.youtube.com/watch?v=KEv-F5UkhxU&ab_channel=AICoffeeBreakwithLetitia) [[blog]](https://huggingface.co/blog/peft) [[blog]](https://www.ml6.eu/blogpost/low-rank-adaptation-a-technical-deep-dive) [[blog]](https://medium.com/@abdullahsamilguser/lora-low-rank-adaptation-of-large-language-models-7af929391fee) [[blog]](https://towardsdatascience.com/understanding-lora-low-rank-adaptation-for-finetuning-large-models-936bce1a07c6) [[hf docs]](https://huggingface.co/docs/diffusers/training/lora) [[library]](https://huggingface.co/docs/peft/index) [[library tutorial]](https://huggingface.co/learn/cookbook/prompt_tuning_peft)
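LoRA, the subject of the links above, freezes the pretrained weight matrix W and learns only a low-rank update BA scaled by α/r; B starts at zero so training begins from the unmodified model. A minimal NumPy sketch of the adapted forward pass and the parameter saving; the dimensions and rank are illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass through a LoRA-adapted linear layer: the frozen weight W
    is augmented with the trainable low-rank update B @ A, scaled by alpha/r."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # Gaussian init
B = np.zeros((d_out, r))                 # zero init: update is a no-op at start
x = rng.normal(size=(2, d_in))

# trainable parameters drop from d_out*d_in to r*(d_in + d_out)
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
```

At initialisation the adapted layer reproduces the frozen one exactly, and here only 512 parameters train instead of 4096.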
- QLoRA [[blog]](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- LLaMA [[llama-cpp-python]](https://github.com/abetlen/llama-cpp-python) [[blog]](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) [[video]](https://www.youtube.com/watch?v=E5OnoYF2oAk&t=1915s)
- Llama 2 [[demo]](https://labs.perplexity.ai/) [[demo]](https://huggingface.co/chat) [[blog]](https://ai.meta.com/llama/) [[blog]](https://ai.meta.com/resources/models-and-libraries/llama/) [[blog]](https://huggingface.co/blog/llama2) [[blog]](https://www.philschmid.de/llama-2) [[blog]](https://medium.com/towards-generative-ai/understanding-llama-2-architecture-its-ginormous-impact-on-genai-e278cb81bd5c) [[llama2.c]](https://github.com/karpathy/llama2.c) [[finetune script]](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da) [[finetune script]](https://www.philschmid.de/sagemaker-llama2-qlora) [[finetune script]](https://www.philschmid.de/instruction-tune-llama-2) [[tutorials]](https://github.com/amitsangani/Llama-2) [[yarn]](https://github.com/jquesnelle/yarn)
- Galactica [[official demo]](https://www.galactica.org/) [[demo]](https://huggingface.co/spaces/lewtun/galactica-demo)
- InstructGPT / RLHF [[rlhf hf]](https://huggingface.co/blog/rlhf) [[rlhf wandb]](https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx) [[code]](https://github.com/lucidrains/PaLM-rlhf-pytorch)
- GPT-4 [[technical report]](https://arxiv.org/pdf/2303.08774.pdf)
- Self-Instruct
- Alpaca [[model]](https://huggingface.co/chavinlo/alpaca-13b) [[demo]](https://alpaca-ai.ngrok.io/) [[gpt4-x-alpaca]](https://huggingface.co/chavinlo/gpt4-x-alpaca)
- Vicuna [[demo]](https://chat.lmsys.org/)
- Gemma [[bugs]](https://unsloth.ai/blog/gemma-bugs)
- [pdf - teacher)
- [pdf - transformers)
- Model Soups
- [pdf - priming)
- Falcon LLM
- x-transformers
- mlabonne [[train and deploy]](https://www.youtube.com/watch?v=Ma4clS-IdhA&t=1709s) [[supervised FT]](https://www.youtube.com/watch?v=NXevvEF3QVI&t=1s) [[how LLM Chatbots work]](https://www.youtube.com/watch?v=C6ZszXYPDDw&t=26s) [[finetuning tutorial]](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl) [[pytorch finetuning tutorial]](https://pytorch.org/blog/finetune-llms/) [[finetuning tutorial]](https://huggingface.co/learn/cookbook/fine_tuning_code_llm_on_single_gpu)
- Prompt Tuning [[code]](https://huggingface.co/docs/peft/task_guides/clm-prompt-tuning) [[blog]](https://ai.googleblog.com/2022/02/guiding-frozen-language-models-with.html?m=1) [[blog]](https://heidloff.net/article/introduction-to-prompt-tuning/)
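Prompt tuning, linked just above, trains only a handful of continuous "soft prompt" embeddings that are prepended to the frozen model's input embeddings; the backbone is never updated. A minimal sketch of that input construction; the dimensions are illustrative:

```python
import numpy as np

def prepend_soft_prompt(token_embs, soft_prompt):
    """Soft prompt tuning: k trainable embedding vectors are prepended to the
    frozen token embeddings before they enter the (frozen) model."""
    return np.concatenate([soft_prompt, token_embs], axis=0)

rng = np.random.default_rng(0)
d_model, k, seq_len = 32, 5, 10
soft_prompt = rng.normal(size=(k, d_model))   # the only trainable parameters
token_embs = rng.normal(size=(seq_len, d_model))
inputs = prepend_soft_prompt(token_embs, soft_prompt)
```

The trainable footprint is just k × d_model values per task, which is why a single frozen model can serve many tasks by swapping the prompt.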
- [pdf - tuning)
- LM-BFF
- Chain-of-Thought Prompting
- Plan-and-Solve Prompting
- Rephrase and Respond [[website]](https://uclaml.github.io/Rephrase-and-Respond/)
- Textual Inversion [[hf docs]](https://huggingface.co/docs/diffusers/training/text_inversion) [[blog]](https://medium.com/@onkarmishra/how-textual-inversion-works-and-its-applications-5e3fda4aa0bc)
- DreamBooth [[code]](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) [[website]](https://dreambooth.github.io/) [[hf docs]](https://huggingface.co/docs/diffusers/training/dreambooth)
- Visual Prompt Retrieval
- ILM-VP
- [pdf - icl)
- VisualCOMET
- LAION
- Conceptual 12M
- Winoground
- Transformers-VQA
- MMT-Retrieval
- Florence-VL
- unilm
- awesome-Vision-and-Language-Pre-training
- awesome-vision-language-pretraining-papers
- vqa
- image captioning
- image captioning
- scene graphs
- Comparing image captioning models
- Comparing visual question answering (VQA) models
- Generalized Visual Language Models
- Prompting in Vision CVPR23 Tutorial
- CVPR23 Tutorial
- CVPR22 Tutorial
- CVPR21 Tutorial
- CVPR20 Tutorial
Programming Languages

Keywords: deep-learning (5), vision-and-language (4), multimodal (3), pretraining (3), multimodal-deep-learning (3), pytorch (2), llm (2), vision-language-transformer (2), foundation-models (2), image-captioning (2), artificial-intelligence (2), vqa (2), chatbot (2), contrastive-learning (2), multi-modal-learning (2), jax (1), latent-diffusion-models (1), score-based-generative-modeling (1), stable-diffusion (1), stable-diffusion-diffusers (1), text2image (1), cross-modal-retrieval (1), multi-tasking (1), hateful-memes (1), dialog (1), tden (1), video-captioning (1), visual-question-answering (1), zero-shot-learning (1), clip (1), pre-training (1), databricks (1), dolly (1), gpt (1), attention-mechanism (1), transformers (1), deep-learning-library (1), multimodal-datasets (1), salesforce (1), vision-framework (1), vision-language-pretraining (1), visual-question-anwsering (1), diffusion (1), flax (1), image-generation (1), image2image (1), vl-ptms (1), attention-networks (1), awesome-list (1), multi-modal (1)