awesome-vision-language-pretraining
Awesome Vision-Language Pretraining Papers
https://github.com/fawazsammani/awesome-vision-language-pretraining
Papers
- VLMo
- METER
- WenLan
- InterBERT
- SemVLP
- E2E-VLP
- VinVL
- UFO
- Florence
- VILLA
- TDEN
- ERNIE-ViL
- Vokenization
- 12-in-1 - multi-task)
- KVL-BERT
- Oscar
- VIVO
- SOHO
- Pixel-BERT
- VLKD
- LightningDOT
- VirTex
- Uni-Perceiver - Perceiver)
- Uni-Perceiver v2 - Perceiver)
- DiHT
- VL-Match
- Prismer
- PaLM-E - e.github.io/) [[blog]](https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html)
- M3AE - geng/m3ae_public)
- MaskVLM
- DALL-E-2 - pytorch) [[website]](https://openai.com/dall-e-2/) [[blog]](http://adityaramesh.com/posts/dalle2/dalle2.html) [[blog]](https://www.assemblyai.com/blog/how-dall-e-2-actually-works/) [[blog]](https://medium.com/augmented-startups/how-does-dall-e-2-work-e6d492a2667f)
- KOSMOS-1
- VILA - Large-Model/VILA) [[hf page]](https://huggingface.co/Efficient-Large-Model)
- Plug and Pray
- LEMON
- IC3 - by-committee)
- TAP
- PICa
- CVLP
- ViLBERT - multi-task) [[code]](https://github.com/jiasenlu/vilbert_beta)
- Unified-VLP
- ImageBERT
- SimVLM
- ALBEF
- LXMERT
- X-LXMERT - lxmert)
- VisualBERT
- UNIMO
- UNIMO-2 - ptm.github.io/)
- BLIP - VQA-models)
- Uni-EDEN
- VisualGPT - CAIR/VisualGPT)
- MiniVLM
- XGPT
- ViTCAP
- UniT
- VL-BERT - BERT)
- Unicoder-VL
- UNITER
- ViLT - vqa)
- GLIPv2
- CoCa - pytorch) [[code]](https://github.com/mlfoundations/open_clip) [[colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb)
- Flamingo - pytorch) [[code]](https://github.com/mlfoundations/open_flamingo) [[code]](https://github.com/dhansmair/flamingo-mini) [[website]](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) [[blog]](https://wandb.ai/gladiator/Flamingo%20VLM/reports/DeepMind-Flamingo-A-Visual-Language-Model-for-Few-Shot-Learning--VmlldzoyOTgzMDI2) [[blog]](https://laion.ai/blog/open-flamingo/) [[blog]](https://laion.ai/blog/open-flamingo-v2/)
- BEiT-3
- UniCL
- UVLP
- OFA - Sys/OFA) [[models and demos]](https://huggingface.co/OFA-Sys)
- GPV-1 - 1/) [[website]](https://prior.allenai.org/projects/gpv)
- GPV-2
- TCL - smile/TCL)
- L-Verse
- FLAVA - model.github.io/) [[tutorial]](https://pytorch.org/tutorials/beginner/flava_finetuning_tutorial.html)
- COTS
- VL-ADAPTER
- Unified-IO - io.allenai.org/)
- ViLTA
- CapDet
- PTP - sg/ptp)
- X-VLM - 97/x-vlm)
- FewVLM
- CFM-ViT
- mPLUG
- PaLI - scaling-language-image-learning-in.html) [[code]](https://github.com/kyegomez/PALI3)
- GIT - VQA-models)
- DALL-E - E) [[code]](https://github.com/borisdayma/dalle-mini) [[code]](https://github.com/lucidrains/DALLE-pytorch) [[code]](https://github.com/kuprel/min-dalle) [[code]](https://github.com/robvanvolt/DALLE-models) [[code]](https://github.com/kakaobrain/minDALL-E) [[website]](https://openai.com/blog/dall-e/) [[video]](https://www.youtube.com/watch?v=j4xgkjWlfL4&t=1432s&ab_channel=YannicKilcher) [[video]](https://www.youtube.com/watch?v=jMqLTPcA9CQ&t=1034s&ab_channel=TheAIEpiphany) [[video]](https://www.youtube.com/watch?v=x_8uHX5KngE&ab_channel=TheAIEpiphany) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) [[blog]](https://wandb.ai/dalle-mini/dalle-mini/reports/Building-efficient-image-input-pipelines--VmlldzoyMjMxOTQw) [[blog]](https://ml.berkeley.edu/blog/posts/vq-vae/) [[blog]](https://ml.berkeley.edu/blog/posts/dalle2/) [[blog]](https://towardsdatascience.com/understanding-how-dall-e-mini-works-114048912b3b)
- DALL-E 3 - e-3)
- GigaGAN - pytorch) [[code]](https://github.com/jianzhnie/GigaGAN) [[website]](https://mingukkang.github.io/GigaGAN/)
- Parti - research/parti) [[code]](https://github.com/lucidrains/parti-pytorch) [[video]](https://www.youtube.com/watch?v=qS-iYnp00uc&ab_channel=YannicKilcher) [[blog]](https://parti.research.google/)
- Paella
- Make-A-Scene
- Make-A-Video - a-video-pytorch) [[blog]](https://makeavideo.studio/) [[blog]](https://ai.facebook.com/blog/generative-ai-text-to-video/) [[video]](https://www.youtube.com/watch?v=AcvmyqGgMh8&ab_channel=AICoffeeBreakwithLetitia) [[video]](https://www.youtube.com/watch?v=MmAJk2BD6WA)
- FIBER
- VL-BEiT - beit)
- MetaLM
- VL-T5 - min/VL-T5)
- UNICORN
- MI2P
- MDETR
- VLMixer
- ViCHA
- StoryDALL-E
- VLMAE
- MLIM
- MOFI
- Multimodal-CoT - science/mm-cot)
- GILL
- Language Pivoting
- Graph-Align
- PL-UIC
- SCL
- TaskRes
- EPIC
- HAAV
- FLM
- X-Decoder - Decoder/tree/main) [[xgpt code]](https://github.com/microsoft/X-Decoder/tree/xgpt) [[website]](https://x-decoder-vl.github.io/) [[demo]](https://huggingface.co/spaces/xdecoder/Demo) [[demo]](https://huggingface.co/spaces/xdecoder/Instruct-X-Decoder)
- PerVL
- TextManiA - ami/TextManiA) [[website]](https://moon-yb.github.io/TextManiA.github.io/) [[GAN Inversion]](https://arxiv.org/pdf/2004.00049.pdf)
- Cola
- K-LITE
- SINC
- Visual ChatGPT - chatgpt)
- CM3Leon
- MultiModal-GPT - mmlab/Multimodal-GPT)
- LLaVA+
- LLaVA-Interactive - VL/LLaVA-Interactive-Demo) [[website]](https://llava-vl.github.io/llava-interactive/) [[demo]](https://llavainteractive.ngrok.io/)
- NExT-Chat - ChatV/NExT-Chat) [[demo]](https://516398b33beb3e8b9f.gradio.live/) [[website]](https://next-chatv.github.io/)
- MiniGPT-4 - CAIR/MiniGPT-4) [[website]](https://minigpt-4.github.io/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/minigpt4)
- MiniGPT-v2 - CAIR/MiniGPT-4) [[website]](https://minigpt-v2.github.io/) [[demo]](https://876a8d3e814b8c3a8b.gradio.live/) [[demo]](https://huggingface.co/spaces/Vision-CAIR/MiniGPT-v2)
- LLaMA-Adapter - Adapter)
- LLaMA-Adapter V2 - Adapter) [[demo]](http://llama-adapter.opengvlab.com/)
- LaVIN
- InstructBLIP - Tutorials/blob/master/InstructBLIP/Inference_with_InstructBLIP.ipynb)
- Otter/MIMIC-IT - ntu.github.io/)
- CogVLM
- ImageBind
- TextBind
- MetaVL
- Instruction-ViT
- MultiInstruct - NLP/MultiInstruct)
- VisIT-Bench - Bench/) [[website]](https://visit-bench.github.io/) [[blog]](https://laion.ai/blog/visit_bench/) [[dataset]](https://huggingface.co/datasets/mlfoundations/VisIT-Bench) [[leaderboard]](https://huggingface.co/spaces/mlfoundations/VisIT-Bench-Leaderboard)
- GPT4RoI
- PandaGPT - gpt.github.io/)
- ChatBridge - chatbridge.github.io/)
- Video-LLaMA - NLP-SG/Video-LLaMA) [[demo]](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA)
- VideoChat - Anything)
- InternGPT
- mPLUG-Owl - PLUG/mPLUG-Owl) [[demo]](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl)
- VisionLLM - news/reports/Introducing-VisionLLM-A-New-Method-for-Multi-Modal-LLM-s--Vmlldzo0NTMzNzIz)
- X-LLM - LLM) [[website]](https://x-llm.github.io/)
- OBELICS - 80b) [[instruct model 9b]](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct) [[instruct model 80b]](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) [[demo]](https://huggingface.co/spaces/HuggingFaceM4/idefics_playground) [[dataset]](https://huggingface.co/datasets/HuggingFaceM4/OBELICS)
- EvALign-ICL - ICL) [[website]](https://evalign-icl.github.io/)
- VL-PET - PET)
- ICIS
- NExT-GPT - GPT/NExT-GPT) [[website]](https://next-gpt.github.io/) [[demo]](https://4271670c463565f1a4.gradio.live/)
- UnIVAL
- BUS
- VPGTrans
- PromptCap - Hu/PromptCap) [[website]](https://yushi-hu.github.io/promptcap_demo/) [[hf checkpoint]](https://huggingface.co/tifa-benchmark/promptcap-coco-vqa)
- P-Former
- TL;DR
- PMA-Net - Net)
- Encyclopedic VQA - research/google-research/tree/master/encyclopedic_vqa)
- CMOTA
- CPT
- TeS
- MP - Probing)
- LLaVA - liu/LLaVA) [[website]](https://llava-vl.github.io/) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/llava)
- M³IT - it.github.io/)
- GLIP
- BLIP-2 - Tutorials/tree/master/BLIP-2) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/blip-2) [[finetuning colab]](https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing) [[blog]](https://huggingface.co/blog/blip-2) (see the inference sketch at the end of this list)
- KOSMOS-2 - 2) [[code]](https://huggingface.co/docs/transformers/model_doc/kosmos-2) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/KOSMOS-2/Inference_with_KOSMOS_2_for_multimodal_grounding.ipynb) [[demo]](https://huggingface.co/spaces/ydshieh/Kosmos-2)
- ViP-LLaVA - cai/vip-llava) [[demo]](https://pages.cs.wisc.edu/~mucai/vip-llava.html) [[website]](https://vip-llava.github.io/) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViP-LLaVa/Inference_with_ViP_LLaVa_for_fine_grained_VQA.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/vipllava)
- Img2LLM - vqa) [[colab]](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
- PNP-VQA - vqa) [[colab]](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/pnp-vqa/pnp_vqa.ipynb)
- Prophet
- GRiT
- LLaVA Series - Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) [[LLaVA-NeXT]](https://github.com/LLaVA-VL/LLaVA-NeXT/) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llava_next) [[hf docs]](https://huggingface.co/docs/transformers/main/en/model_doc/llava_onevision) [[demo]](https://huggingface.co/spaces/merve/llava-next) [[hf card]](https://huggingface.co/llava-hf)
- InternVL
- MiniCPM-V - V)
- LLaVA-MORE
- Qwen-VL - VL) [[tutorial]](https://github.com/QwenLM/Qwen-VL/blob/master/TUTORIAL.md) [[blog]](https://qwenlm.github.io/blog/qwen-vl/) [[blog]](https://qwenlm.github.io/blog/qwen2-vl/) [[hf card]](https://huggingface.co/Qwen)
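Most of the recent models in this list (BLIP-2, InstructBLIP, LLaVA, KOSMOS-2, and others) can be used directly through the Hugging Face transformers integrations linked above. As a concrete starting point, here is a minimal BLIP-2 inference sketch; the Salesforce/blip2-opt-2.7b checkpoint, the example COCO image URL, and the float16/CUDA setup are illustrative assumptions, not requirements of the listed papers.

```python
# Minimal BLIP-2 inference sketch (captioning + prompted VQA).
# Assumptions: a recent transformers release, a CUDA GPU, and the
# Salesforce/blip2-opt-2.7b checkpoint (one of several published variants).
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)

# Unconditional captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Prompted VQA, using the "Question: ... Answer:" format from the BLIP-2 paper.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```

The same processor-plus-generate pattern carries over, with different model classes and checkpoints, to most of the other Hugging Face-hosted models above.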
Miscellaneous
- [pdf - instruct)
- [code
- [github - os) [[twitter post]](https://twitter.com/danielhanchen/status/1769550950270910630)
- [github
- x-transformers
- mlabonne - fine-tuning-LLMs) [[train and deploy]](https://www.youtube.com/watch?v=Ma4clS-IdhA&t=1709s) [[supervised FT]](https://www.youtube.com/watch?v=NXevvEF3QVI&t=1s) [[how LLM Chatbots work]](https://www.youtube.com/watch?v=C6ZszXYPDDw&t=26s) [[finetuning tutorial]](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl) [[pytorch finetuning tutorial]](https://pytorch.org/blog/finetune-llms/?utm_content=278057355&utm_medium=social&utm_source=linkedin&hss_channel=lcp-78618366) [[finetuning tutorial]](https://huggingface.co/learn/cookbook/fine_tuning_code_llm_on_single_gpu)
- [pdf - playground-tgi) [[blog]](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- [pdf - 120b) [[official demo]](https://www.galactica.org/) [[demo]](https://huggingface.co/spaces/lewtun/galactica-demo)
- [blog - 4) [[technical report]](https://arxiv.org/pdf/2303.08774.pdf)
- [website - lab/stanford_alpaca) [[model]](https://huggingface.co/chavinlo/alpaca-13b) [[demo]](https://alpaca-ai.ngrok.io/) [[gpt4-x-alpaca]](https://huggingface.co/chavinlo/gpt4-x-alpaca)
- [website - sys/FastChat) [[demo]](https://chat.lmsys.org/)
- [pdf - teacher)
- [pdf - transformers)
- [pdf - soups)
- [pdf - priming)
- [article
- Falcon LLM
- [pdf - research/prompt-tuning) [[code]](https://huggingface.co/docs/peft/task_guides/clm-prompt-tuning) [[blog]](https://ai.googleblog.com/2022/02/guiding-frozen-language-models-with.html?m=1) [[blog]](https://heidloff.net/article/introduction-to-prompt-tuning/)
- [pdf - tuning)
- [pdf - nlp/LM-BFF)
- [pdf - models-perform-reasoning-via.html?m=1)
- [pdf - Edgerunners/Plan-and-Solve-Prompting)
- [pdf - AI/visual_prompt_retrieval)
- [pdf - Group/ILM-VP)
- [pdf - icl)
- [pdf - and-Respond) [[website]](https://uclaml.github.io/Rephrase-and-Respond/)
- [blog - following/) [[rlhf hf]](https://huggingface.co/blog/rlhf) [[rlhf wandb]](https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx) [[code]](https://github.com/lucidrains/PaLM-rlhf-pytorch)
- [pdf - o?usp=sharing) [[video]](https://www.youtube.com/watch?v=KEv-F5UkhxU&ab_channel=AICoffeeBreakwithLetitia) [[blog]](https://huggingface.co/blog/peft) [[blog]](https://www.ml6.eu/blogpost/low-rank-adaptation-a-technical-deep-dive) [[blog]](https://medium.com/@abdullahsamilguser/lora-low-rank-adaptation-of-large-language-models-7af929391fee) [[blog]](https://towardsdatascience.com/understanding-lora-low-rank-adaptation-for-finetuning-large-models-936bce1a07c6) [[hf docs]](https://huggingface.co/docs/diffusers/training/lora) [[library]](https://huggingface.co/docs/peft/index) [[library tutorial]](https://huggingface.co/learn/cookbook/prompt_tuning_peft) (LoRA; see the fine-tuning sketch at the end of this list)
- [pdf - python]](https://github.com/abetlen/llama-cpp-python) [[blog]](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) [[video]](https://www.youtube.com/watch?v=E5OnoYF2oAk&t=1915s)
- [pdf - llama) [[demo]](https://labs.perplexity.ai/) [[demo]](https://huggingface.co/chat) [[blog]](https://ai.meta.com/llama/) [[blog]](https://ai.meta.com/resources/models-and-libraries/llama/) [[blog]](https://huggingface.co/blog/llama2) [[blog]](https://www.philschmid.de/llama-2) [[blog]](https://medium.com/towards-generative-ai/understanding-llama-2-architecture-its-ginormous-impact-on-genai-e278cb81bd5c) [[llama2.c]](https://github.com/karpathy/llama2.c) [[finetune script]](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da) [[finetune script]](https://www.philschmid.de/sagemaker-llama2-qlora) [[finetune script]](https://www.philschmid.de/instruction-tune-llama-2) [[tutorials]](https://github.com/amitsangani/Llama-2) [[yarn]](https://github.com/jquesnelle/yarn)
- [models
- [hf collections
- [blog - release-65d5efbccdbb8c4202ec078b) [[bugs]](https://unsloth.ai/blog/gemma-bugs)
- [pdf - inversion.github.io/) [[hf docs]](https://huggingface.co/docs/diffusers/training/text_inversion) [[blog]](https://medium.com/@onkarmishra/how-textual-inversion-works-and-its-applications-5e3fda4aa0bc)
- [pdf - Stable-Diffusion) [[code]](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) [[website]](https://dreambooth.github.io/) [[hf docs]](https://huggingface.co/docs/diffusers/training/dreambooth)
- [hf card - llama) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llama) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llama2) [[hf docs]](https://huggingface.co/docs/transformers/en/model_doc/llama3) [[llama report]](https://arxiv.org/pdf/2302.13971v1) [[llama2 report]](https://arxiv.org/pdf/2307.09288.pdf) [[llama3 report]](https://arxiv.org/pdf/2407.21783) [[llama3 report summary]](https://x.com/A_K_Nain/status/1815942598944547074) [[llama3.1 hf blog]](https://huggingface.co/blog/llama31) [[code llama report]](https://arxiv.org/pdf/2308.12950.pdf) Other Resources: [[llama2.c]](https://github.com/karpathy/llama2.c) [[llama.cpp]](https://github.com/ggerganov/llama.cpp) [[finetune script]](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da) [[finetune script]](https://www.philschmid.de/sagemaker-llama2-qlora) [[finetune script]](https://www.philschmid.de/instruction-tune-llama-2) [[finetune script]](https://www.philschmid.de/fsdp-qlora-llama3) [[tutorials]](https://github.com/amitsangani/Llama-2) [[yarn]](https://github.com/jquesnelle/yarn) [[openbio]](https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B) [[huggingface-llama-recipes]](https://github.com/huggingface/huggingface-llama-recipes/tree/main)
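Several entries in this section concern parameter-efficient fine-tuning of large (multimodal) language models, in particular LoRA and the PEFT library linked in the LoRA entry above. The sketch below shows the basic LoRA wrapping step; the facebook/opt-350m base model, the rank and alpha values, and the q_proj/v_proj target modules are illustrative assumptions that depend on the architecture being adapted.

```python
# Minimal LoRA wrapping sketch with the Hugging Face PEFT library.
# Assumptions: facebook/opt-350m as the base model and q_proj/v_proj as the
# adapted modules are illustrative; pick modules that exist in your model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Training then proceeds with a standard Trainer or a custom loop; only the low-rank adapter matrices receive gradients, which is what keeps memory use and checkpoint size small.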
CLIP-related
- FLIP
- CLIP Variants
- awesome-clip
- UPL
- ProDA
- CTP
- OpenCLIP - AI/scaling-laws-openclip) [[clip colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb) [[clip benchmark]](https://github.com/LAION-AI/CLIP_benchmark) [[hf models]](https://huggingface.co/models?library=open_clip)
- CLIPScore
- LiT - research/vision_transformer) [[website]](https://google-research.github.io/vision_transformer/lit/)
- SigLIP - research/big_vision) [[colab demo]](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb) [[hf notebook]](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SigLIP/Inference_with_(multilingual)_SigLIP%2C_a_better_CLIP_model.ipynb) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/siglip) [[hf models]](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba)
- ALIGN
- DeCLIP - GVT/DeCLIP)
- Counting-aware CLIP - clip-to-count.github.io/)
- ALIP
- STAIR
- FILIP
- SLIP
- WiSE-FT - ft)
- FLYP
- MAGIC
- ZeroCap - shot-image-to-text)
- CapDec - H?usp=sharing)
- DeCap - wei/DeCap)
- ViECap
- CLOSE
- xCLIP
- EVA - CLIP]](https://arxiv.org/pdf/2303.15389.pdf) [[EVA-02]](https://arxiv.org/pdf/2303.11331.pdf) [[code]](https://github.com/baaivision/EVA)
- VT-CLIP
- CLIP-ViL - vil/CLIP-ViL)
- RegionCLIP
- DenseCLIP
- E-CLIP
- X-CLIP - CLIP) [[code]](https://huggingface.co/docs/transformers/model_doc/xclip)
- MaskCLIP
- CLIPSeg - zero-shot.ipynb) [[demo]](https://huggingface.co/spaces/nielsr/CLIPSeg) [[demo]](https://huggingface.co/spaces/Sijuade/CLIPSegmentation) [[demo]](https://huggingface.co/spaces/taesiri/CLIPSeg) [[demo]](https://huggingface.co/spaces/aryswisnu/CLIPSeg2) [[demo]](https://huggingface.co/spaces/aryswisnu/CLIPSeg) [[blog]](https://huggingface.co/blog/clipseg-zero-shot)
- OWL-ViT - research/scenic/tree/main/scenic/projects/owl_vit) [[code]](https://huggingface.co/docs/transformers/model_doc/owlvit) [[code]](https://huggingface.co/docs/transformers/tasks/zero_shot_object_detection) [[colab]](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb) [[demo]](https://huggingface.co/spaces/adirik/OWL-ViT) [[demo]](https://huggingface.co/spaces/adirik/image-guided-owlvit) [[demo]](https://huggingface.co/spaces/johko/OWL-ViT) [[demo]](https://huggingface.co/spaces/kellyxiaowei/OWL-ViT) [[demo]](https://huggingface.co/spaces/wendys-llc/OWL-ViT)
- ClipCap
- VQGAN-CLIP - CLIP) [[code]](https://github.com/EleutherAI/vqgan-clip) [[code]](https://github.com/justinjohn0306/VQGAN-CLIP) [[code]](https://www.kaggle.com/code/basu369victor/playing-with-vqgan-clip/notebook) [[colab]](https://colab.research.google.com/github/dribnet/clipit/blob/master/demos/Moar_Settings.ipynb) [[colab]](https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN) [[colab]](https://colab.research.google.com/github/justinjohn0306/VQGAN-CLIP/blob/main/VQGAN%2BCLIP(Updated).ipynb)
- AltCLIP - Open/FlagAI) [[code]](https://huggingface.co/docs/transformers/model_doc/altclip)
- CLIPPO
- FDT
- DIME-FM - FM) [[website]](https://cs-people.bu.edu/sunxm/DIME-FM/)
- ViLLA
- BASIC
- CoOp
- CoCoOp
- RPO
- KgCoOp
- ECO
- UPT
- MVLPT
- DAPT
- LFA - fi/LFA)
- LaFTer
- TAP
- CLIP-Adapter - Adapter)
- Tip-Adapter - Adapter)
- CALIP
- CaFo
- SHIP
- LoGoPrompt
- GRAM
- MaPLe - prompt-learning)
- PromptSR
- ProGrad - align)
- APE
- CuPL
- WaffleCLIP
- R-AMT - AMT) [[website]](https://wuw2019.github.io/R-AMT/)
- SVL-Adapter
- KAPT
- SuS-X - X)
- Internet Explorer - explorer-ssl/internet-explorer) [[website]](https://internet-explorer-ssl.github.io/)
- REACT - vl.github.io/)
- SEARLE
- CLIPpy
- CDUL
- DN - dev/distribution-normalization) [[website]](https://fengyuli-dev.github.io/dn-website/)
- [pdf - menon/classify_by_description_release) [[website]](https://cv.cs.columbia.edu/sachit/classviadescr/)
- [pdf - Adapter)
- [pdf - tuning_ICCV_2023_supplemental.pdf) [[code]](https://github.com/SwinTransformer/Feature-Distillation)
- [pdf - CLIP)
- [pdf - shot-video-to-text)
- [pdf - chefer/TargetCLIP)
- [code - guided-diffusion) [[code]](https://github.com/nerdyrodent/CLIP-Guided-Diffusion) [[code]](https://github.com/crowsonkb/v-diffusion-pytorch)
- [video
- Prompt Align
- TPT
- ReCLIP
- PLOT
- GEM
- SynthCLIP
- [pdf - Liang/Modality-Gap) [[website]](https://modalitygap.readthedocs.io/en/latest/)
- CLIP - CLIP) [[code]](https://github.com/moein-shariatnia/OpenAI-CLIP) [[code]](https://github.com/lucidrains/x-clip) [[code]](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) [[hf docs]](https://huggingface.co/docs/transformers/model_doc/clip) [[website]](https://openai.com/blog/clip/) [[video]](https://www.youtube.com/watch?v=T9XSU0pKX2E&t=1455s&ab_channel=YannicKilcher) [[video]](https://www.youtube.com/watch?v=fQyHEXZB-nM&ab_channel=AleksaGordi%C4%87-TheAIEpiphany) [[video code]](https://www.youtube.com/watch?v=jwZQD0Cqz4o&t=4610s&ab_channel=TheAIEpiphany) [[CLIP_benchmark]](https://github.com/LAION-AI/CLIP_benchmark) [[clip-retrieval]](https://github.com/rom1504/clip-retrieval) [[clip-retrieval blog]](https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c) (see the zero-shot classification sketch at the end of this list)
- Alpha-CLIP - clip/)
- FGVP
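Nearly all of the CLIP-style models above share one usage pattern: embed an image and a set of text prompts with paired encoders, then rank the prompts by similarity. The sketch below shows zero-shot classification with the original CLIP through Hugging Face transformers (referenced from the CLIP entry above); the openai/clip-vit-base-patch32 checkpoint, the example image URL, and the prompt wording are illustrative assumptions.

```python
# Minimal zero-shot classification sketch with CLIP via transformers.
# Assumptions: the openai/clip-vit-base-patch32 checkpoint and the example
# image/labels are illustrative choices.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarities; a softmax turns them
# into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

SigLIP scores each image-text pair with an independent sigmoid instead of a softmax over candidates, and OpenCLIP exposes the same workflow for LAION-trained checkpoints, but the embed-and-compare structure is unchanged.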
Segmentation + Vision-Language
New Large-Scale Datasets
Libraries
- [github
- [github
- [github
- [github
- [github - torchmultimodal/) [[blog]](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/)
- Transformers-VQA
- MMT-Retrieval
Projects
Other Awesomes
Resources
- VisualCOMET
- LAION
- Conceptual 12M - research-datasets/conceptual-12m)
- Winoground
Survey Papers
Policy Gradients with Image Captioning