
awesome-vlm-architectures

Famous Vision Language Models and Their Architectures
https://github.com/gokayfem/awesome-vlm-architectures


  • **PaliGemma 2: A Family of Versatile VLMs for Transfer**

  • **AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders**

  • **Apollo: An Exploration of Video Understanding in Large Multimodal Models**

  • **Pixtral 12B: A Cutting-Edge Open Multimodal Language Model**

  • **Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos**

  • **Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/omni-research/Tarsier-7b)
  • **UI-TARS: Pioneering Automated GUI Interaction with Native Agents**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
  • **VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448)
  • **VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**

  • **SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model**

  • **InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**

  • **Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling**

  • **LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation**

  • **EVEv2: Improved Baselines for Encoder-Free Vision-Language Models**

  • **Maya: An Instruction Finetuned Multilingual Multimodal Model**

  • **MiniMax-01: Scaling Foundation Models with Lightning Attention**

  • **NVLM: Open Frontier-Class Multimodal LLMs**

  • **OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference**

  • Important References

  • **PaliGemma: A Versatile and Transferable 3B Vision-Language Model**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/big-vision/paligemma)
  • **Idefics3-8B: Building and Better Understanding Vision-Language Models**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/HuggingFaceM4/idefics3)
  • **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/TIGER-Lab/Mantis)
  • **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/collections/Salesforce/xgen-mm-1-models-and-datasets-662971d6cecbf3a7f80ecc2e)
  • **Parrot: Multilingual Visual Instruction Tuning**

  • **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2406.19389)
  • **EVLM: An Efficient Vision-Language Model for Visual Understanding**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.14177)
  • **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.15841)
  • **VITA: Towards Open-Source Interactive Omni Multimodal LLM**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/VITA-MLLM)
  • **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/gokaygokay/Florence-2)
  • **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/collections/THUDM/cogvlm2-6645f36a29948b67dc4eef75)
  • **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models**

    • [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/Alpha-VLLM/SPHINX)
  • **BLIP: Bootstrapping Language-Image Pre-training**

  • **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/hysts/InstructBLIP)
  • **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models**

    • **KOSMOS-2: Grounding Multimodal Large Language Models to the World**

      • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ydshieh/Kosmos-2)
  • **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**

  • **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/llizhx/TinyGPT-V)
  • **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**

  • **GLaMM: Pixel Grounding Large Multimodal Model**

  • **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model**

  • **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/LanguageBind/MoE-LLaVA)
  • **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**

  • **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices**

  • **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models**

  • **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models**

  • **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**

  • **CogVLM: Visual Expert for Pretrained Language Models**

  • **Ferret: Refer and Ground Anything Anywhere at Any Granularity**

  • Architectures

    • **Fuyu-8B: A Multimodal Architecture for AI Agents**

    • **IDEFICS**

  • **Fuyu-8B: A Multimodal Architecture for AI Agents**

    • [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/adept/fuyu-8b)
  • **OtterHD: A High-Resolution Multi-modality Model**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Otter-AI/OtterHD-Demo)
  • **PaLI: A Jointly-Scaled Multilingual Language-Image Model**

  • **PaLI-3 Vision Language Models: Smaller, Faster, Stronger**

  • **PaLM-E: An Embodied Multimodal Language Model**

  • **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**

  • **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**

  • **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**

  • **CLIP: Contrastive Language-Image Pre-training** (see the loss sketch at the end of this list)

  • **MetaCLIP: Demystifying CLIP Data**

  • **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want**

  • **GLIP: Grounded Language-Image Pre-training**

  • **SigLIP: Sigmoid Loss for Language Image Pre-Training** (see the loss sketch at the end of this list)

  • **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**

  • **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning**

  • **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/Willow123/InternLM-XComposer)
  • **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2405.15738)
  • **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.16198)
  • **VILA²: VILA Augmented VILA**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.17453)
  • **MiniCPM-V: A GPT-4V Level MLLM on Your Phone**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/openbmb/MiniCPM-V-2_6)
  • **LLaVA-OneVision: Easy Visual Task Transfer**

    • [Website](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2408.03326)
  • **Idefics2**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/HuggingFaceM4/idefics-8b)
  • **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**

  • **BakLLaVA**

    • [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/SkunkworksAI/BakLLaVA-1)
  • **Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models**

  • **DeepSeek-VL: Towards Real-World Vision-Language Understanding**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B)
  • **IDEFICS**

  • **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Qwen/Qwen-VL-Plus)
  • **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Salesforce/BLIP2)
  • **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**

    • [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat)
  • **MouSi: Poly-Visual-Expert Vision-Language Models**

  • **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**

  • **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**

  • **Flamingo: a Visual Language Model for Few-Shot Learning**

  • **ImageBind: One Embedding Space To Bind Them All**

  • **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning**

    • [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://llava.hliu.cc/)
  • **ARIA: An Open Multimodal Native Mixture-of-Experts Model**

  • **LLaVA-CoT: Let Vision Language Models Reason Step-by-Step**

  • **DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**

  • **Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series**

  • **Moondream-next: Compact Vision-Language Model with Enhanced Capabilities**

  • **MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming**

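The CLIP and SigLIP entries above name the two contrastive pre-training objectives behind many of the vision encoders listed here, so a minimal, hedged sketch of both losses follows. It is not taken from any repository in this list; the batch size, embedding dimension, temperature, and bias values are illustrative assumptions only.

```python
# Sketch of the two image-text pre-training losses named above, assuming
# paired (N, D) image and text embeddings. Hyperparameter values are
# illustrative, not the ones used by the original models.
import torch
import torch.nn.functional as F


def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric softmax (InfoNCE) loss over a batch of pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """SigLIP-style pairwise sigmoid loss: every (i, j) pair is treated as
    an independent binary classification, so no batch-wide softmax is needed."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b                  # (N, N)
    # +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)


if __name__ == "__main__":
    imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
    print(f"CLIP loss:   {clip_loss(imgs, txts).item():.4f}")
    print(f"SigLIP loss: {siglip_loss(imgs, txts).item():.4f}")
```

In the actual models the temperature and bias are learned parameters; the point of the sigmoid formulation is that it removes the all-pairs softmax normalisation CLIP requires, which is what lets SigLIP scale to large batches.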