awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
https://github.com/gokayfem/awesome-vlm-architectures
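The entries below link out to papers and, where available, Hugging Face checkpoints or demos. As a quick way to experiment with the open models, many of them can be loaded through the Hugging Face `transformers` library. The sketch below is only illustrative and assumes the chosen checkpoint (here `llava-hf/llava-1.5-7b-hf`, used purely as an example) ships a `transformers` integration and fits on your hardware.

```python
# Illustrative only: querying one of the open VLMs from this list with Hugging Face
# transformers. "llava-hf/llava-1.5-7b-hf" is an example checkpoint id; any model
# with a transformers integration can be substituted.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"  # LLaVA-1.5 chat format

# Move tensors to the model device and cast floating-point inputs to fp16.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```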
- **PaliGemma 2: A Family of Versatile VLMs for Transfer**
- **AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders**
- **Apollo: An Exploration of Video Understanding in Large Multimodal Models**
- **Pixtral 12B: A Cutting-Edge Open Multimodal Language Model**
- **Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos**
- **Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding**
- **UI-TARS: Pioneering Automated GUI Interaction with Native Agents** ([Hugging Face](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT))
- **VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling** ([Hugging Face](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448))
- **VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**
- **SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model**
- **InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**
- **Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling**
- **LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation**
- **EVEv2: Improved Baselines for Encoder-Free Vision-Language Models**
- **Maya: An Instruction Finetuned Multilingual Multimodal Model**
- **MiniMax-01: Scaling Foundation Models with Lightning Attention**
- **NVLM: Open Frontier-Class Multimodal LLMs**
- **OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference**
Important References
- **KOSMOS-2: Grounding Multimodal Large Language Models to the World**
- **PaliGemma: A Versatile and Transferable 3B Vision-Language Model** ([demo](https://huggingface.co/spaces/big-vision/paligemma))
- **Idefics3-8B: Building and Better Understanding Vision-Language Models**
- **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning** ([demo](https://huggingface.co/spaces/TIGER-Lab/Mantis))
- **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models**
- **Parrot: Multilingual Visual Instruction Tuning**
- **EVLM: An Efficient Vision-Language Model for Visual Understanding**
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM** ([Hugging Face](https://huggingface.co/VITA-MLLM))
- **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities**
- **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding**
- **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models** ([Hugging Face](https://huggingface.co/Alpha-VLLM/SPHINX))
- **BLIP: Bootstrapping Language-Image Pre-training**
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**
- **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models**
- **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**
- **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones** ([demo](https://huggingface.co/spaces/llizhx/TinyGPT-V))
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**
- **GLaMM: Pixel Grounding Large Multimodal Model**
- **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models** ([demo](https://huggingface.co/spaces/LanguageBind/MoE-LLaVA))
- **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**
- **PaLI: A Jointly-Scaled Multilingual Language-Image Model**
- **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models** ([paper](https://huggingface.co/papers/2405.15738))
- **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models** ([paper](https://huggingface.co/papers/2407.16198))
- **VILA²: VILA Augmented VILA**
- **MiniCPM-V: A GPT-4V Level MLLM on Your Phone** ([Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6))
- **LLaVA-OneVision: Easy Visual Task Transfer** ([paper](https://huggingface.co/papers/2408.03326))
- **Idefics2**
- **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD** ([demo](https://huggingface.co/spaces/Willow123/InternLM-XComposer))
- **BakLLaVA**
- **Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models**
- **DeepSeek-VL: Towards Real-World Vision-Language Understanding** ([demo](https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B))
- **IDEFICS**
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** ([demo](https://huggingface.co/spaces/Qwen/Qwen-VL-Plus))
- **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**
- **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**
- **MouSi: Poly-Visual-Expert Vision-Language Models**
- **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**
- **Flamingo: a Visual Language Model for Few-Shot Learning**
- **ImageBind: One Embedding Space To Bind Them All**
- **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning** ([demo](https://llava.hliu.cc/))
- **ARIA: An Open Multimodal Native Mixture-of-Experts Model**
- **LLaVA-CoT: Let Vision Language Models Reason Step-by-Step**
- **DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**
- **Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series**
- **Moondream-next: Compact Vision-Language Model with Enhanced Capabilities**
- **MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming**
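A recurring building block across many of the architectures above is a contrastively trained two-tower image/text encoder (CLIP, SigLIP, MetaCLIP in the index below), which several of the listed models reuse as their vision backbone. The sketch below shows the basic image-text similarity scoring such encoders expose; `openai/clip-vit-base-patch32` is used purely as an example checkpoint.

```python
# Minimal sketch of contrastive image-text scoring with a CLIP-style two-tower encoder.
# "openai/clip-vit-base-patch32" is an example checkpoint; SigLIP-style models expose
# a similar interface through their own transformers classes.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog", "a diagram of a transformer"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image and each caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```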
Categories
- Important References (3)
- **Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models** (2)
- **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models** (2)
- Architectures (2)
- **SigLIP: Sigmoid Loss for Language Image Pre-Training** (1)
- **Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding** (1)
- **Pixtral 12B: A Cutting-Edge Open Multimodal Language Model** (1)
- **Apollo: An Exploration of Video Understanding in Large Multimodal Models** (1)
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training** (1)
- **Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos** (1)
- **Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series** (1)
- **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models** (1)
- **CLIP: Contrastive Language-Image Pre-training** (1)
- **SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model** (1)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** (1)
- **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones** (1)
- **DeepSeek-VL: Towards Real-World Vision-Language Understanding** (1)
- **OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference** (1)
- **OtterHD: A High-Resolution Multi-modality Model** (1)
- **MiniCPM-V: A GPT-4V Level MLLM on Your Phone** (1)
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** (1)
- **BLIP: Bootstrapping Language-Image Pre-training** (1)
- **MetaCLIP: Demystifying CLIP Data** (1)
- **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions** (1)
- **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale** (1)
- **InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling** (1)
- **Idefics2** (1)
- **VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding** (1)
- **Maya: An Instruction Finetuned Multilingual Multimodal Model** (1)
- **BakLLaVA** (1)
- **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models** (1)
- **CogVLM: Visual Expert for Pretrained Language Models** (1)
- **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning** (1)
- **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models** (1)
- **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding** (1)
- **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models** (1)
- **PaLI-3 Vision Language Models: Smaller, Faster, Stronger** (1)
- **Parrot: Multilingual Visual Instruction Tuning** (1)
- **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models** (1)
- **Fuyu-8B: A Multimodal Architecture for AI Agents** (1)
- **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning** (1)
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM** (1)
- **LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation** (1)
- **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning** (1)
- **Flamingo: a Visual Language Model for Few-Shot Learning** (1)
- **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models** (1)
- **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders** (1)
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding** (1)
- **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices** (1)
- **Ferret: Refer and Ground Anything Anywhere at Any Granularity** (1)
- **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output** (1)
- **GLaMM: Pixel Grounding Large Multimodal Model** (1)
- **PaliGemma 2: A Family of Versatile VLMs for Transfer** (1)
- **UI-TARS: Pioneering Automated GUI Interaction with Native Agents** (1)
- **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding** (1)
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** (1)
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** (1)
- **PaLM-E: An Embodied Multimodal Language Model** (1)
- **ImageBind: One Embedding Space To Bind Them All** (1)
- **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD** (1)
- **LLaVA-OneVision: Easy Visual Task Transfer** (1)
- **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models** (1)
- **VILA²: VILA Augmented VILA** (1)
- **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities** (1)
- **MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming** (1)
- **VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling** (1)
- **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning** (1)
- **MiniMax-01: Scaling Foundation Models with Lightning Attention** (1)
- **LLaVA-CoT: Let Vision Language Models Reason Step-by-Step** (1)
- **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning** (1)
- **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models** (1)
- **Idefics3-8B: Building and Better Understanding Vision-Language Models** (1)
- **NVLM: Open Frontier-Class Multimodal LLMs** (1)
- **DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding** (1)
- **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models** (1)
- **PaliGemma: A Versatile and Transferable 3B Vision-Language Model** (1)
- **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents** (1)
- **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models** (1)
- **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model** (1)
- **GLIP: Grounded Language-Image Pre-training** (1)
- **Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling** (1)
- **IDEFICS** (1)
- **EVEv2: Improved Baselines for Encoder-Free Vision-Language Models** (1)
- **EVLM: An Efficient Vision-Language Model for Visual Understanding** (1)
- **PaLI: A Jointly-Scaled Multilingual Language-Image Model** (1)
- **AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders** (1)
- **MouSi: Poly-Visual-Expert Vision-Language Models** (1)
- **ARIA: An Open Multimodal Native Mixture-of-Experts Model** (1)
- **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want** (1)
- **Moondream-next: Compact Vision-Language Model with Enhanced Capabilities** (1)