awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
https://github.com/gokayfem/awesome-vlm-architectures
Architectures
- **IDEFICS**
- **PaliGemma: A Versatile and Transferable 3B Vision-Language Model**
  - [GitHub](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md) [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/big-vision/paligemma)
- **Idefics3-8B: Building and Better Understanding Vision-Language Models**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/HuggingFaceM4/idefics3)
- **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**
  - [GitHub](https://github.com/InternLM/InternLM-XComposer) [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/Willow123/InternLM-XComposer)
- **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning**
  - [GitHub](https://github.com/TIGER-AI-Lab/Mantis) [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/TIGER-Lab/Mantis)
- **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/collections/Salesforce/xgen-mm-1-models-and-datasets-662971d6cecbf3a7f80ecc2e)
- **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2405.15738)
- **Parrot: Multilingual Visual Instruction Tuning**
- **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2406.19389)
- **EVLM: An Efficient Vision-Language Model for Visual Understanding**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.14177)
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.15841)
- **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.16198)
- **VILA²: VILA Augmented VILA**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2407.17453)
- **MiniCPM-V: A GPT-4V Level MLLM on Your Phone**
  - [GitHub](https://github.com/OpenBMB/MiniCPM-V) [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/openbmb/MiniCPM-V-2_6)
- **LLaVA-OneVision: Easy Visual Task Transfer**
  - [Website](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/papers/2408.03326)
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM**
  - [GitHub](https://github.com/VITA-MLLM/VITA) [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/VITA-MLLM)
- **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat)
- **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/gokaygokay/Florence-2)
- **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding**
  - [![HuggingFace](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/collections/THUDM/cogvlm2-6645f36a29948b67dc4eef75)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
  - [GitHub](https://github.com/QwenLM/Qwen-VL) [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Qwen/Qwen-VL-Plus)
- **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models**
  - [GitHub](https://github.com/Alpha-VLLM/LLaMA2-Accessory) [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/Alpha-VLLM/SPHINX)
- **BLIP: Bootstrapping Language-Image Pre-training**
- **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**
  - [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Salesforce/BLIP2)
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**
  - [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/hysts/InstructBLIP)
- **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models**
- **KOSMOS-2: Grounding Multimodal Large Language Models to the World**
  - [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/ydshieh/Kosmos-2)
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training**
- **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**
- **MouSi: Poly-Visual-Expert Vision-Language Models**
- **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**
- **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones**
  - [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/llizhx/TinyGPT-V)
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**
- **GLaMM: Pixel Grounding Large Multimodal Model**
  - [GitHub](https://github.com/mbzuai-oryx/groundingLMM)
- **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model**
- **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**
  - [GitHub](https://github.com/PKU-YuanGroup/MoE-LLaVA) [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/LanguageBind/MoE-LLaVA)
- **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**
  - [GitHub](https://github.com/mlpc-ucsd/BLIVA)
- **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices**
  - [GitHub](https://github.com/Meituan-AutoML/MobileVLM)
- **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models**
- **Flamingo: a Visual Language Model for Few-Shot Learning**
- **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models**
- **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**
  - [GitHub](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)
- **CogVLM: Visual Expert for Pretrained Language Models**
- **Ferret: Refer and Ground Anything Anywhere at Any Granularity**
  - [GitHub](https://github.com/apple/ml-ferret)
- **Fuyu-8B: A Multimodal Architecture for AI Agents**
  - [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/adept/fuyu-8b)
- **OtterHD: A High-Resolution Multi-modality Model**
  - [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Otter-AI/OtterHD-Demo)
- **PaLI: A Jointly-Scaled Multilingual Language-Image Model**
  - [GitHub](https://github.com/google-research/big_vision)
- **PaLI-3 Vision Language Models: Smaller, Faster, Stronger**
- **PaLM-E: An Embodied Multimodal Language Model**
  - [Website](https://palm-e.github.io)
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**
  - [GitHub](https://github.com/Vision-CAIR/MiniGPT-4)
- **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**
- **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**
- **CLIP: Contrastive Language-Image Pre-training**
- **MetaCLIP: Demystifying CLIP Data**
- **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want**
- **GLIP: Grounded Language-Image Pre-training**
- **ImageBind: One Embedding Space To Bind Them All**
- **SigLIP: Sigmoid Loss for Language Image Pre-Training**
- **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**
  - [GitHub](https://github.com/google-research/vision_transformer)
- **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning**
  - [GitHub](https://github.com/haotian-liu/LLaVA) [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://llava.hliu.cc/)
- **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning**
- **Idefics2**
  - [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/HuggingFaceM4/idefics-8b)
- **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**
- **BakLLaVA**
  - [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/SkunkworksAI/BakLLaVA-1)
- **DeepSeek-VL: Towards Real-World Vision-Language Understanding**
  - [GitHub](https://github.com/deepseek-ai/DeepSeek-VL) [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B)
Important References
- **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale**
Sub Categories
- **ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale** (4)
- **Fuyu-8B: A Multimodal Architecture for AI Agents** (2)
- **IDEFICS** (2)
- **Flamingo: a Visual Language Model for Few-Shot Learning** (1)
- **SigLIP: Sigmoid Loss for Language Image Pre-Training** (1)
- **KOSMOS-2: Grounding Multimodal Large Language Models to the World** (1)
- **COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training** (1)
- **SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models** (1)
- **CLIP: Contrastive Language-Image Pre-training** (1)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** (1)
- **TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones** (1)
- **DeepSeek-VL: Towards Real-World Vision-Language Understanding** (1)
- **OtterHD: A High-Resolution Multi-modality Model** (1)
- **MiniCPM-V: A GPT-4V Level MLLM on Your Phone** (1)
- **MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models** (1)
- **BLIP: Bootstrapping Language-Image Pre-training** (1)
- **MetaCLIP: Demystifying CLIP Data** (1)
- **BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions** (1)
- **MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning** (1)
- **VITA: Towards Open-Source Interactive Omni Multimodal LLM** (1)
- **LLaVA 1.5: Improved Baselines with Visual Instruction Tuning** (1)
- **ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models** (1)
- **Parrot: Multilingual Visual Instruction Tuning** (1)
- **PaLI-3 Vision Language Models: Smaller, Faster, Stronger** (1)
- **SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models** (1)
- **CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding** (1)
- **LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models** (1)
- **MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning** (1)
- **CogVLM: Visual Expert for Pretrained Language Models** (1)
- **xGen-MM (BLIP-3): An Open-Source Framework for Building Powerful and Responsible Large Multimodal Models** (1)
- **KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models** (1)
- **BakLLaVA** (1)
- **MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning** (1)
- **Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities** (1)
- **VILA²: VILA Augmented VILA** (1)
- **OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models** (1)
- **LLaVA-OneVision: Easy Visual Task Transfer** (1)
- **InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD** (1)
- **ImageBind: One Embedding Space To Bind Them All** (1)
- **PaLM-E: An Embodied Multimodal Language Model** (1)
- **SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models** (1)
- **InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning** (1)
- **OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding** (1)
- **GLaMM: Pixel Grounding Large Multimodal Model** (1)
- **InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output** (1)
- **Ferret: Refer and Ground Anything Anywhere at Any Granularity** (1)
- **MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices** (1)
- **CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding** (1)
- **EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders** (1)
- **FROZEN: Multimodal Few-Shot Learning with Frozen Language Models** (1)
- **Alpha-CLIP: A CLIP Model Focusing on Wherever You Want** (1)
- **MouSi: Poly-Visual-Expert Vision-Language Models** (1)
- **PaLI: A Jointly-Scaled Multilingual Language-Image Model** (1)
- **EVLM: An Efficient Vision-Language Model for Visual Understanding** (1)
- **GLIP: Grounded Language-Image Pre-training** (1)
- **u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model** (1)
- **INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models** (1)
- **LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents** (1)
- **PaliGemma: A Versatile and Transferable 3B Vision-Language Model** (1)
- **MoE-LLaVA: Mixture of Experts for Large Vision-Language Models** (1)
- **Idefics3-8B: Building and Better Understanding Vision-Language Models** (1)
- **BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models** (1)
- **LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning** (1)
- **Idefics2** (1)