awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
https://github.com/gokayfem/awesome-vlm-architectures
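The entries below link out to papers and, where available, Hugging Face checkpoints or demos. As a quick way to experiment with the open models, many of them can be loaded through the Hugging Face `transformers` library. The sketch below is only illustrative and assumes the chosen checkpoint (here `llava-hf/llava-1.5-7b-hf`, used purely as an example) ships a `transformers` integration and fits on your hardware.

```python
# Illustrative only: querying one of the open VLMs from this list with Hugging Face
# transformers. "llava-hf/llava-1.5-7b-hf" is an example checkpoint id; any model
# with a transformers integration can be substituted.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"  # LLaVA-1.5 chat format

# Move tensors to the model device and cast floating-point inputs to fp16.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```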
- **PaliGemma 2: A Family of Versatile VLMs for Transfer**
- **AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders**
- **Apollo: An Exploration of Video Understanding in Large Multimodal Models**
- **Pixtral 12B: A Cutting-Edge Open Multimodal Language Model**
- **Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos**
- **Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding**
- **UI-TARS: Pioneering Automated GUI Interaction with Native Agents** ([Hugging Face](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT))
- **VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling** ([Hugging Face](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448))
- **VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**
- **SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model**
- **InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**
- **Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling**
- **LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation**
- **EVEv2: Improved Baselines for Encoder-Free Vision-Language Models**
- **Maya: An Instruction Finetuned Multilingual Multimodal Model**
- **MiniMax-01: Scaling Foundation Models with Lightning Attention**
- **NVLM: Open Frontier-Class Multimodal LLMs**
- **OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference**
Important References
- **KOSMOS-2: Grounding Multimodal Large Language Models to the World**
- **PaliGemma: A Versatile and Transferable 3B Vision-Language Model** ([demo](https://huggingface.co/spaces/big-vision/paligemma))
- **Idefics3-8B: Building and Better Understanding Vision-Language Models**
- **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning** ([demo](https://huggingface.co/spaces/TIGER-Lab/Mantis))
- **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models**
- **Parrot: Multilingual Visual Instruction Tuning**
- **EVLM: An Efficient Vision-Language Model for Visual Understanding**
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM** ([Hugging Face](https://huggingface.co/VITA-MLLM))
- **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities**
- **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding**
- **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models** ([Hugging Face](https://huggingface.co/Alpha-VLLM/SPHINX))
- **BLIP: Bootstrapping Language-Image Pre-training**
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**
- **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models**
- **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**
- **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones** ([demo](https://huggingface.co/spaces/llizhx/TinyGPT-V))
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**
- **GLaMM: Pixel Grounding Large Multimodal Model**
- **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models** ([demo](https://huggingface.co/spaces/LanguageBind/MoE-LLaVA))
- **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**
- **PaLI: A Jointly-Scaled Multilingual Language-Image Model**
- **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models** ([paper](https://huggingface.co/papers/2405.15738))
- **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models** ([paper](https://huggingface.co/papers/2407.16198))
- **VILA²: VILA Augmented VILA**
- **MiniCPM-V: A GPT-4V Level MLLM on Your Phone** ([Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6))
- **LLaVA-OneVision: Easy Visual Task Transfer** ([paper](https://huggingface.co/papers/2408.03326))
- **Idefics2**
- **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD** ([demo](https://huggingface.co/spaces/Willow123/InternLM-XComposer))
- **BakLLaVA**
- **Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models**
- **DeepSeek-VL: Towards Real-World Vision-Language Understanding** ([demo](https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B))
- **IDEFICS**
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** ([demo](https://huggingface.co/spaces/Qwen/Qwen-VL-Plus))
- **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**
- **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**
- **MouSi: Poly-Visual-Expert Vision-Language Models**
- **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**
- **Flamingo: a Visual Language Model for Few-Shot Learning**
- **ImageBind: One Embedding Space To Bind Them All**
- **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning** ([demo](https://llava.hliu.cc/))
- **ARIA: An Open Multimodal Native Mixture-of-Experts Model**
- **LLaVA-CoT: Let Vision Language Models Reason Step-by-Step**
- **DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**
- **Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series**
- **Moondream-next: Compact Vision-Language Model with Enhanced Capabilities**
- **MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming**
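A recurring building block across many of the architectures above is a contrastively trained two-tower image/text encoder (CLIP, SigLIP, MetaCLIP in the index below), which several of the listed models reuse as their vision backbone. The sketch below shows the basic image-text similarity scoring such encoders expose; `openai/clip-vit-base-patch32` is used purely as an example checkpoint.

```python
# Minimal sketch of contrastive image-text scoring with a CLIP-style two-tower encoder.
# "openai/clip-vit-base-patch32" is an example checkpoint; SigLIP-style models expose
# a similar interface through their own transformers classes.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog", "a diagram of a transformer"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image and each caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```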
Categories
- Important References (3)
- **Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models** (2)
- **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models** (2)
- Architectures (2)
- **SigLIP: Sigmoid Loss for Language Image Pre-Training** (1)
- **Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding** (1)
- **Pixtral 12B: A Cutting-Edge Open Multimodal Language Model** (1)
- **Apollo: An Exploration of Video Understanding in Large Multimodal Models** (1)
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training** (1)
- **Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos** (1)
- **Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series** (1)
- **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models** (1)
- **CLIP: Contrastive Language-Image Pre-training** (1)
- **SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model** (1)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** (1)
- **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones** (1)
- **DeepSeek-VL: Towards Real-World Vision-Language Understanding** (1)
- **OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference** (1)
- **OtterHD: A High-Resolution Multi-modality Model** (1)
- **MiniCPM-V: A GPT-4V Level MLLM on Your Phone** (1)
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** (1)
- **BLIP: Bootstrapping Language-Image Pre-training** (1)
- **MetaCLIP: Demystifying CLIP Data** (1)
- **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions** (1)
- **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale** (1)
- **InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling** (1)
- **Idefics2** (1)
- **VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding** (1)
- **Maya: An Instruction Finetuned Multilingual Multimodal Model** (1)
- **BakLLaVA** (1)
- **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models** (1)
- **CogVLM: Visual Expert for Pretrained Language Models** (1)
- **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning** (1)
- **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models** (1)
- **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding** (1)
- **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models** (1)
- **PaLI-3 Vision Language Models: Smaller, Faster, Stronger** (1)
- **Parrot: Multilingual Visual Instruction Tuning** (1)
- **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models** (1)
- **Fuyu-8B: A Multimodal Architecture for AI Agents** (1)
- **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning** (1)
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM** (1)
- **LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation** (1)
- **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning** (1)
- **Flamingo: a Visual Language Model for Few-Shot Learning** (1)
- **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models** (1)
- **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders** (1)
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding** (1)
- **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices** (1)
- **Ferret: Refer and Ground Anything Anywhere at Any Granularity** (1)
- **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output** (1)
- **GLaMM: Pixel Grounding Large Multimodal Model** (1)
- **PaliGemma 2: A Family of Versatile VLMs for Transfer** (1)
- **UI-TARS: Pioneering Automated GUI Interaction with Native Agents** (1)
- **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding** (1)
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** (1)
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** (1)
- **PaLM-E: An Embodied Multimodal Language Model** (1)
- **ImageBind: One Embedding Space To Bind Them All** (1)
- **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD** (1)
- **LLaVA-OneVision: Easy Visual Task Transfer** (1)
- **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models** (1)
- **VILA²: VILA Augmented VILA** (1)
- **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities** (1)
- **MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming** (1)
- **VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling** (1)
- **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning** (1)
- **MiniMax-01: Scaling Foundation Models with Lightning Attention** (1)
- **LLaVA-CoT: Let Vision Language Models Reason Step-by-Step** (1)
- **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning** (1)
- **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models** (1)
- **Idefics3-8B: Building and Better Understanding Vision-Language Models** (1)
- **NVLM: Open Frontier-Class Multimodal LLMs** (1)
- **DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding** (1)
- **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models** (1)
- **PaliGemma: A Versatile and Transferable 3B Vision-Language Model** (1)
- **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents** (1)
- **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models** (1)
- **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model** (1)
- **GLIP: Grounded Language-Image Pre-training** (1)
- **Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling** (1)
- **IDEFICS** (1)
- **EVEv2: Improved Baselines for Encoder-Free Vision-Language Models** (1)
- **EVLM: An Efficient Vision-Language Model for Visual Understanding** (1)
- **PaLI: A Jointly-Scaled Multilingual Language-Image Model** (1)
- **AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders** (1)
- **MouSi: Poly-Visual-Expert Vision-Language Models** (1)
- **ARIA: An Open Multimodal Native Mixture-of-Experts Model** (1)
- **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want** (1)
- **Moondream-next: Compact Vision-Language Model with Enhanced Capabilities** (1)