Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-vlm-architectures

Famous Vision Language Models and Their Architectures
https://github.com/gokayfem/awesome-vlm-architectures

  • Architectures

    • **IDEFICS**

    • **PaliGemma: A Versatile and Transferable 3B Vision-Language Model**

      • [Code](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md) · [🤗 Space](https://huggingface.co/spaces/big-vision/paligemma)
    • **Idefics3-8B: Building and Better Understanding Vision-Language Models**

      • [🤗 Space](https://huggingface.co/spaces/HuggingFaceM4/idefics3)
    • **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**

      • [Code](https://github.com/InternLM/InternLM-XComposer) · [🤗 Space](https://huggingface.co/spaces/Willow123/InternLM-XComposer)
    • **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning**

      • [Code](https://github.com/TIGER-AI-Lab/Mantis) · [🤗 Space](https://huggingface.co/spaces/TIGER-Lab/Mantis)
    • **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models**

      • [🤗 Collection](https://huggingface.co/collections/Salesforce/xgen-mm-1-models-and-datasets-662971d6cecbf3a7f80ecc2e)
    • **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**

      • [🤗 Paper](https://huggingface.co/papers/2405.15738)
    • **Parrot: Multilingual Visual Instruction Tuning**

    • **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**

      • [🤗 Paper](https://huggingface.co/papers/2406.19389)
    • **EVLM: An Efficient Vision-Language Model for Visual Understanding**

      • [🤗 Paper](https://huggingface.co/papers/2407.14177)
    • **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**

      • [🤗 Paper](https://huggingface.co/papers/2407.15841)
    • **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models**

      • [🤗 Paper](https://huggingface.co/papers/2407.16198)
    • **VILA²: VILA Augmented VILA**

      • [🤗 Paper](https://huggingface.co/papers/2407.17453)
    • **MiniCPM-V: A GPT-4V Level MLLM on Your Phone**

      • [Code](https://github.com/OpenBMB/MiniCPM-V) · [🤗 Model](https://huggingface.co/openbmb/MiniCPM-V-2_6)
    • **LLaVA-OneVision: Easy Visual Task Transfer**

      • [Blog](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) · [🤗 Paper](https://huggingface.co/papers/2408.03326)
    • **VITA: Towards Open-Source Interactive Omni Multimodal LLM**

      • [Code](https://github.com/VITA-MLLM/VITA) · [🤗 Models](https://huggingface.co/VITA-MLLM)
    • **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**

      • [🤗 Space](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat)
    • **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities**

      • [🤗 Space](https://huggingface.co/spaces/gokaygokay/Florence-2)
    • **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding**

      • [🤗 Collection](https://huggingface.co/collections/THUDM/cogvlm2-6645f36a29948b67dc4eef75)
    • **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**

      • [Code](https://github.com/QwenLM/Qwen-VL) · [🤗 Space](https://huggingface.co/spaces/Qwen/Qwen-VL-Plus)
    • **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models**

      • [Code](https://github.com/Alpha-VLLM/LLaMA2-Accessory) · [🤗 Model](https://huggingface.co/Alpha-VLLM/SPHINX)
    • **BLIP: Bootstrapping Language-Image Pre-training**

    • **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**

      • [🤗 Space](https://huggingface.co/spaces/Salesforce/BLIP2)
    • **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**

      • [🤗 Space](https://huggingface.co/spaces/hysts/InstructBLIP)
    • **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models**

    • **KOSMOS-2: Grounding Multimodal Large Language Models to the World**

      • [🤗 Space](https://huggingface.co/spaces/ydshieh/Kosmos-2)
    • **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**

    • **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**

    • **MouSi: Poly-Visual-Expert Vision-Language Models**

    • **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**

    • **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones**

      • [🤗 Space](https://huggingface.co/spaces/llizhx/TinyGPT-V)
    • **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**

    • **GLaMM: Pixel Grounding Large Multimodal Model**

    • **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model**

    • **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**

      • [Code](https://github.com/PKU-YuanGroup/MoE-LLaVA) · [🤗 Space](https://huggingface.co/spaces/LanguageBind/MoE-LLaVA)
    • **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**

    • **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices**

    • **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models**

    • **Flamingo: a Visual Language Model for Few-Shot Learning**

    • **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models**

    • **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**

      • [Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)
    • **CogVLM: Visual Expert for Pretrained Language Models**

    • **Ferret: Refer and Ground Anything Anywhere at Any Granularity**

    • **Fuyu-8B: A Multimodal Architecture for AI Agents**

      • [🤗 Model](https://huggingface.co/adept/fuyu-8b)
    • **OtterHD: A High-Resolution Multi-modality Model**

      • [🤗 Space](https://huggingface.co/spaces/Otter-AI/OtterHD-Demo)
    • **PaLI: A Jointly-Scaled Multilingual Language-Image Model**

    • **PaLI-3 Vision Language Models: Smaller, Faster, Stronger**

    • **PaLM-E: An Embodied Multimodal Language Model**

    • **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**

    • **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**

    • **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**

    • **CLIP: Contrastive Language-Image Pre-training**

    • **MetaCLIP: Demystifying CLIP Data**

    • **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want**

    • **GLIP: Grounded Language-Image Pre-training**

    • **ImageBind: One Embedding Space To Bind Them All**

    • **SigLIP: Sigmoid Loss for Language Image Pre-Training**

    • **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**

      • [Code](https://github.com/google-research/vision_transformer)
    • **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning**

      • [Code](https://github.com/haotian-liu/LLaVA) · [Demo](https://llava.hliu.cc/)
    • **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning**

    • **Idefics2**

      • [🤗 Space](https://huggingface.co/spaces/HuggingFaceM4/idefics-8b)
    • **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**

    • **BakLLaVA**

      • [🤗 Model](https://huggingface.co/SkunkworksAI/BakLLaVA-1)
    • **DeepSeek-VL: Towards Real-World Vision-Language Understanding**

      • [Code](https://github.com/deepseek-ai/DeepSeek-VL) · [🤗 Space](https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B)
  • Important References
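
Most entries above link to a Hugging Face checkpoint or Space rather than to usage notes. As a minimal, hedged sketch of trying one of the listed LLaVA-family models locally with the transformers library (the checkpoint id `llava-hf/llava-1.5-7b-hf` and the sample image URL are illustrative assumptions, not links taken from this list):

```python
# Minimal sketch: querying a LLaVA-1.5-style checkpoint with Hugging Face transformers.
# The checkpoint id and image URL are illustrative assumptions, not entries from this list.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single consumer GPU
    device_map="auto",          # requires the accelerate package
)

# Sample COCO image often used in transformers docs; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 chat format: the <image> placeholder marks where visual tokens are inserted.
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```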

Sub Categories
  • **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale** (4)
  • **Fuyu-8B: A Multimodal Architecture for AI Agents** (2)
  • **IDEFICS** (2)
  • **Flamingo: a Visual Language Model for Few-Shot Learning** (1)
  • **SigLIP: Sigmoid Loss for Language Image Pre-Training** (1)
  • **KOSMOS-2: Grounding Multimodal Large Language Models to the World** (1)
  • **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training** (1)
  • **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models** (1)
  • **CLIP: Contrastive Language-Image Pre-training** (1)
  • **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** (1)
  • **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones** (1)
  • **DeepSeek-VL: Towards Real-World Vision-Language Understanding** (1)
  • **OtterHD: A High-Resolution Multi-modality Model** (1)
  • **MiniCPM-V: A GPT-4V Level MLLM on Your Phone** (1)
  • **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** (1)
  • **BLIP: Bootstrapping Language-Image Pre-training** (1)
  • **MetaCLIP: Demystifying CLIP Data** (1)
  • **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions** (1)
  • **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning** (1)
  • **VITA: Towards Open-Source Interactive Omni Multimodal LLM** (1)
  • **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning** (1)
  • **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models** (1)
  • **Parrot: Multilingual Visual Instruction Tuning** (1)
  • **PaLI-3 Vision Language Models: Smaller, Faster, Stronger** (1)
  • **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models** (1)
  • **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding** (1)
  • **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models** (1)
  • **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning** (1)
  • **CogVLM: Visual Expert for Pretrained Language Models** (1)
  • **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models** (1)
  • **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models** (1)
  • **BakLLaVA** (1)
  • **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning** (1)
  • **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities** (1)
  • **VILA²: VILA Augmented VILA** (1)
  • **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models** (1)
  • **LLaVA-OneVision: Easy Visual Task Transfer** (1)
  • **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD** (1)
  • **ImageBind: One Embedding Space To Bind Them All** (1)
  • **PaLM-E: An Embodied Multimodal Language Model** (1)
  • **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** (1)
  • **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** (1)
  • **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding** (1)
  • **GLaMM: Pixel Grounding Large Multimodal Model** (1)
  • **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output** (1)
  • **Ferret: Refer and Ground Anything Anywhere at Any Granularity** (1)
  • **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices** (1)
  • **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding** (1)
  • **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders** (1)
  • **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models** (1)
  • **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want** (1)
  • **MouSi: Poly-Visual-Expert Vision-Language Models** (1)
  • **PaLI: A Jointly-Scaled Multilingual Language-Image Model** (1)
  • **EVLM: An Efficient Vision-Language Model for Visual Understanding** (1)
  • **GLIP: Grounded Language-Image Pre-training** (1)
  • **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model** (1)
  • **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models** (1)
  • **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents** (1)
  • **PaliGemma: A Versatile and Transferable 3B Vision-Language Model** (1)
  • **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models** (1)
  • **Idefics3-8B: Building and Better Understanding Vision-Language Models** (1)
  • **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models** (1)
  • **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning** (1)
  • **Idefics2** (1)