# awesome-unified-multimodal-models
📖 A repository organizing papers, code, and other resources related to unified multimodal models.
https://github.com/showlab/awesome-unified-multimodal-models
## Table of Contents <!-- omit in toc -->

- [Unified Multimodal Understanding and Generation](#unified-multimodal-understanding-and-generation)
- [Tokenizer](#tokenizer)
- [Multi Experts](#multi-experts)
- [Acknowledgements](#acknowledgements)
## Unified Multimodal Understanding and Generation
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation [[Code](https://github.com/deepseek-ai/Janus)]
- Emu3: Next-Token Prediction is All You Need
- MIO: A Foundation Model on Multimodal Tokens
- MonoFormer: One Transformer for Both Diffusion and Autoregression
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation [[Code](https://github.com/showlab/Show-o)]
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation [[Code](https://github.com/GAIR-NLP/anole)]
- X-VILA: Cross-Modality Alignment for Large Language Model
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [[Code](https://github.com/AILab-CVC/SEED-X)]
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models [[Code](https://github.com/dvlab-research/MGM)]
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- World Model on Million-Length Video And Language With Blockwise RingAttention
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [[Code](https://github.com/OpenGVLab/MM-Interleaved)]
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- Emu2: Generative Multimodal Models are In-Context Learners
- Gemini: A Family of Highly Capable Multimodal Models
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
- DreamLLM: Synergistic Multimodal Comprehension and Creation
- Making LLaMA SEE and Draw with SEED Tokenizer
- NExT-GPT: Any-to-Any Multimodal LLM
- LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- Planting a SEED of Vision in Large Language Model
- Emu: Generative Pretraining in Multimodality
- CoDi: Any-to-Any Generation via Composable Diffusion
- Multimodal unified attention networks for vision-and-language interactions
- UniMuMo: Unified Text, Music, and Motion Generation
- MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation
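Many of the autoregressive models above (e.g. Chameleon, Emu3, AnyGPT) share one core mechanic: images are mapped to discrete codes by a visual tokenizer and interleaved with text tokens in a single sequence, so one transformer can be trained with plain next-token prediction for both understanding and generation. Below is a minimal sketch of that mixed-modal sequence construction; all vocabulary sizes, special tokens, and token IDs are made-up placeholders, not any particular model's layout.

```python
from typing import List

# Hypothetical shared vocabulary: text tokens first, then shifted image codes.
TEXT_VOCAB = 32_000       # placeholder text vocabulary size
IMAGE_CODEBOOK = 8_192    # placeholder visual-tokenizer codebook size
BOI = TEXT_VOCAB + IMAGE_CODEBOOK      # <begin-of-image> delimiter (placeholder)
EOI = TEXT_VOCAB + IMAGE_CODEBOOK + 1  # <end-of-image> delimiter (placeholder)

def image_code_to_token(code: int) -> int:
    """Shift a visual-tokenizer code into the shared vocabulary."""
    return TEXT_VOCAB + code

def build_sequence(text_tokens: List[int], image_codes: List[int]) -> List[int]:
    """Interleave text and image tokens into one flat sequence:
    [text ...] <boi> [image codes ...] <eoi>
    A single transformer can then model the whole stream autoregressively."""
    return text_tokens + [BOI] + [image_code_to_token(c) for c in image_codes] + [EOI]

# Toy example: a 3-token caption followed by a tiny 2x2 grid of image codes.
seq = build_sequence(text_tokens=[17, 204, 9], image_codes=[5, 4091, 77, 813])
print(seq)  # one flat token stream for next-token prediction
```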
## Tokenizer

- Cosmos Tokenizer: A suite of image and video neural tokenizers [[Code](https://github.com/NVIDIA/Cosmos-Tokenizer)]
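Discrete variants of such neural tokenizers typically compress pixels via vector quantization: an encoder produces continuous feature vectors, each vector is snapped to its nearest codebook entry, and a decoder reconstructs from the quantized grid. The toy sketch below shows just the quantization step with a random, untrained codebook; the shapes and sizes are illustrative and this is not Cosmos Tokenizer's actual API.

```python
import torch

torch.manual_seed(0)
codebook = torch.randn(8192, 64)  # illustrative: 8192 codes of dimension 64

def quantize(features: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Snap each feature vector to its nearest codebook entry.
    features: (N, 64) continuous encoder outputs.
    Returns (indices, quantized): indices are the discrete 'image tokens'."""
    # Squared L2 distance from every feature to every code: shape (N, 8192).
    d = torch.cdist(features, codebook) ** 2
    indices = d.argmin(dim=1)          # one discrete code per feature vector
    return indices, codebook[indices]  # token IDs and their code embeddings

feats = torch.randn(16, 64)            # e.g. a 4x4 feature grid, flattened
ids, quant = quantize(feats)
print(ids.shape, quant.shape)          # torch.Size([16]) torch.Size([16, 64])
```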
## Multi Experts
## Acknowledgements

- This list is inspired by [Awesome-Video-Diffusion](https://github.com/showlab/Awesome-Video-Diffusion) and [Awesome-MLLM-Hallucination](https://github.com/showlab/Awesome-MLLM-Hallucination).