# Awesome Unified Multimodal Models

https://github.com/aidc-ai/awesome-unified-multimodal-models
## Awesome Papers & Datasets

### Text-and-Image Unified Models
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | arXiv | 2025/05/23 | [Github](https://github.com/M-E-AGI-Lab/Muddit) | - |
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
- MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025/05/21 | [Github](https://github.com/Gen-Verse/MMaDA) | [Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) |
- Unified Multimodal Discrete Diffusion
- Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05/12 | [Github](https://github.com/selftok-team/SelftokTokenizer) | - |
- UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
- OmniGen2: Exploration to Advanced Multimodal Generation
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | arXiv | 2025/06/03 | [Github](https://github.com/PKU-YuanGroup/UniWorld-V1) | - |
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation | arXiv | 2025/06/10 | [Github](https://github.com/PKU-YuanGroup/UniWorld-V1) | - |
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023/12/14 | [Github](https://github.com/AILab-CVC/VL-GPT) | - |
- MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
- OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
- SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
- Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12/31 | [Github](https://github.com/zijieli-Jlee/Dual-Diffusion) | - |
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
- UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
- Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11/28 | [Github](https://github.com/zhijie-group/Orthus) | - |
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
- Liquid: Language Models are Scalable and Unified Multi-modal Generators
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | 2024/07/08 | [Github](https://github.com/GAIR-NLP/anole) | - |
- ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR | 2024/09/06 | [Github](https://github.com/mit-han-lab/vila-u) | [Demo](https://vila-u.hanlab.ai/) |
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024/03/27 | [Github](https://github.com/dvlab-research/MGM) | [Demo](http://103.170.5.190:7860/) |
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv | 2024/01/18 | [Github](https://github.com/OpenGVLab/MM-Interleaved) | - |
- Emu3: Next-Token Prediction is All You Need
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- World Model on Million-Length Video And Language With Blockwise RingAttention
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
- UniTok: A Unified Tokenizer for Visual Generation and Understanding
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
- Generative Multimodal Models are In-Context Learners
- DreamLLM: Synergistic Multimodal Comprehension and Creation
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- Emu: Generative Pretraining in Multimodality
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Transfer between Modalities with MetaQueries
- Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10/02 | [Github](https://github.com/AILab-CVC/SEED) | [Demo](https://huggingface.co/spaces/AILab-CVC/SEED-LLaMA) |
- VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv | 2025/04/03 | [Github](https://github.com/VARGPT-family/VARGPT-v1.1) | - |
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv | 2025/04/02 | [Github](https://github.com/illume-unified-mllm/ILLUME_plus) | - |
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | 2025/01/21 | [Github](https://github.com/VARGPT-family/VARGPT) | - |
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR | 2024/12/04 | [Github](https://github.com/ByteFlow-AI/TokenFlow) | - |
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11/12 | [Github](https://github.com/deepseek-ai/Janus) | [Demo](https://huggingface.co/spaces/deepseek-ai/JanusFlow-1.3B) |
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | 2025/01/29 | [Github](https://github.com/deepseek-ai/Janus) | [Demo](https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B) |
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | 2024/10/17 | [Github](https://github.com/deepseek-ai/Janus) | [Demo](https://huggingface.co/spaces/deepseek-ai/Janus-1.3B) |
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
- Planting a SEED of Vision in Large Language Model | arXiv | 2023/07/16 | [Github](https://github.com/AILab-CVC/SEED) | [Demo](https://huggingface.co/spaces/AILab-CVC/SEED-LLaMA) |
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation
- Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05/20 | [Github](https://github.com/bytedance-seed/BAGEL) | [Demo](https://demo.bagel-ai.org/) |
- MonoFormer: One Transformer for Both Diffusion and Autoregression
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08/22 | [Github](https://github.com/showlab/Show-o) | [Demo](https://huggingface.co/spaces/showlab/Show-o) |
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08/20 | [Github](https://github.com/lucidrains/transfusion-pytorch) | - |
- Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | arXiv | 2025/08/08 | [Github](https://github.com/HL-hanlin/Bifrost-1) | - |
- Qwen-Image Technical Report | arXiv | 2025/08/04 | [Github](https://github.com/QwenLM/Qwen-Image) | [Demo](https://huggingface.co/spaces/Qwen/Qwen-Image) |
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again | arXiv | 2025/07/29 | [Github](https://github.com/X-Omni-Team/X-Omni) | [Demo](https://huggingface.co/collections/X-Omni/x-omni-spaces-6888c64f38446f1efc402de7) |
- Ovis-U1 Technical Report | arXiv | 2025/06/28 | [Github](https://github.com/AIDC-AI/Ovis-U1) | [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B) |
- TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning | arXiv | 2025/08/11 | [Github](https://github.com/DruryXu/TBAC-UniImage) | - |
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04/22 | [Github](https://github.com/AILab-CVC/SEED-X) | [Demo](https://arc.tencent.com/en/ai-demos/multimodal) |
- Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025/06/15 | [Github](https://github.com/showlab/Show-o/tree/main/show-o2) | - |

### Benchmark for Evaluation
- ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | arXiv | 2025/06/15 | [Github](https://github.com/llllly26/ComplexBench-Edit)
- EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits
- RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
- MMGen-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models | arXiv | 2025/05/26 | [Github](https://github.com/hanghuacs/MMIG-Bench)
- KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
- ImgEdit: A Unified Image Editing Dataset and Benchmark | arXiv | 2025/05/26 | [Github](https://github.com/PKU-YuanGroup/ImgEdit)
- On Path to Multimodal Generalist: General-Level and General-Bench | ICML | 2025/05/07 | [Github](https://github.com/path2generalist/General-Level)
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | arXiv | 2024/08/01 | [Github](https://github.com/yuweihao/MM-Vet/tree/main/v2)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2024/04/27 | [Github](https://github.com/X-PLUG/mPLUG-Owl)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | ICLR | 2024/02/11 | [Github](https://github.com/lmb-freiburg/ovqa)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models | arXiv | 2023/11/28 | [Github](https://github.com/AILab-CVC/SEED-Bench/tree/main/SEED-Bench-2)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2023/11/27 | [Github](https://github.com/MMMU-Benchmark/MMMU)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | ICML | 2023/08/04 | [Github](https://github.com/yuweihao/MM-Vet)
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023/07/30 | [Github](https://github.com/AILab-CVC/SEED-Bench)
- MMBench: Is Your Multi-modal Model an All-around Player? | ECCV | 2023/07/12 | [Github](https://github.com/open-compass/MMBench)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- Step1X-Edit: A Practical Framework for General Image Editing | arXiv | 2025/04/28 | [Github](https://github.com/stepfun-ai/Step1X-Edit)
- DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
- T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation | TPAMI | 2025/03/08 | [Github](https://github.com/Karine-Huang/T2I-CompBench)
- IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment
- AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
- I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
- VQA: Visual Question Answering
- CompBench: Benchmarking Complex Instruction-guided Image Editing
- Holistic Evaluation of Text-To-Image Models | NeurIPS | 2023/11/07 | [Github](https://github.com/stanford-crfm/helm)
- Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | ICLR | 2023/10/27 | [Github](https://github.com/j-min/DSG)
- EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | arXiv | 2023/10/03 | [Github](https://github.com/deep-ml-research/editval_code)
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | NeurIPS | 2023/07/12 | [Github](https://github.com/Karine-Huang/T2I-CompBench)
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | NeurIPS | 2023/06/16 | [Github](https://github.com/OSU-NLP-Group/MagicBrush)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | ICCV | 2023/03/21 | [Github](https://github.com/Yushi-Hu/tifa)
- ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
- Evaluating Text-to-Visual Generation with Image-to-Text Generation
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks
- DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
- HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
- Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | ICCV | 2022/02/08 | [Github](https://github.com/j-min/DallEval)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025/05/19 | [Github](https://github.com/huawei-lin/VTBench)
- UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025/05/15 | [Github](https://github.com/xmed-lab/UniEval)
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
- Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024/11/26 | [Github](https://github.com/Dongping-Chen/ISG)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | ICLR | 2024/10/14 | [Github](https://github.com/Lillianwei-h/MMIE)
- Holistic Evaluation for Interleaved Text-and-Image Generation
- OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
- TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes | arXiv | 2025/08/05 | [Github](https://github.com/NJU-PCALab/TextCrafter)
- OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation | arXiv | 2025/06/26 | [Github](https://github.com/OneIG-Bench/OneIG-Benchmark)
- ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | arXiv | 2025/06/03 | [Github](https://github.com/ByteDance-Seed/BM-code)
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | arXiv | 2025/05/27 | [Github](https://github.com/PKU-YuanGroup/WISE)
- WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation
- Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | COLM | 2024/06/11 | [Github](https://github.com/zeyofu/Commonsense-T2I)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing | ICLR | 2024/04/15 | [Github](https://github.com/UCSC-VLAA/HQ-Edit)
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models | CVPR | 2024/03/25 | [Github](https://github.com/thu-nics/FlashEval)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022/06/22 | [Github](https://github.com/google-research/parti/blob/main/PartiPrompts.tsv)

### Any-to-Any Multimodal Models
- M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
- Spider: Any-to-Many Multimodal LLM
- X-VILA: Cross-Modality Alignment for Large Language Model
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12/28 | [Github](https://github.com/allenai/unified-io-2) | - |
- NExT-GPT: Any-to-Any Multimodal LLM | ICML | 2023/09/11 | [Github](https://github.com/NExT-GPT/NExT-GPT) | - |
- MIO: A Foundation Model on Multimodal Tokens | arXiv | 2024/09/26 | [Github](https://github.com/MIO-Team/MIO) | - |

### Dataset
- Cambrian-10M | [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/abs/2406.16860) | NeurIPS | 2024/06/24 |
- ShareGPT4V | [ShareGPT4V: Improving large multi-modal models with better captions](https://arxiv.org/pdf/2311.12793) | ECCV | 2023/11/21 |
- CapsFusion-120M | [CapsFusion: Rethinking image-text data at scale](https://arxiv.org/pdf/2310.20550) | CVPR | 2023/10/31 |
- GRIT | [Kosmos-2: Grounding multimodal large language models to the world](https://arxiv.org/pdf/2306.14824) | ICLR | 2023/06/26 |
- DataComp
- Laion-COCO | [LAION COCO: 600M synthetic captions from LAION2B-en](https://laion.ai/blog/laion-coco/) | - | 2022/09/15 |
- COYO | [COYO-700M: Image-text pair dataset](https://github.com/kakaobrain/coyo-dataset) | - | 2022/08/31 |
- Laion | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402) | NeurIPS | 2022/03/31 |
- Wukong | [Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark](https://arxiv.org/pdf/2202.06767) | NeurIPS | 2022/02/14 |
- RedCaps | [RedCaps: Web-curated image-text data created by the people, for the people](https://arxiv.org/pdf/2111.11431) | NeurIPS | 2021/11/22 |
- Infinity-MM | [Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data](https://arxiv.org/pdf/2410.18558) | arXiv | 2024/10/24 |
- LLaVA-OneVision | [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/pdf/2408.03326) | TMLR | 2024/08/06 |
- SynCD | [Generating multi-image synthetic data for text-to-image customization](https://arxiv.org/pdf/2502.01720) | arXiv | 2025/02/03 |
- X2I-subject-driven
- Subjects200K
- MultiGen-20M
- LAION-Face | [General facial representation learning in a visual-linguistic manner](https://arxiv.org/pdf/2112.03109) | CVPR | 2021/12/06 |
- PD12M | [Public Domain 12M: A highly aesthetic image-text dataset with novel governance mechanisms](https://arxiv.org/pdf/2410.23144) | arXiv | 2024/10/30 |
- SFHQ-T2I | - | - | 2024/10/06 |
- text-to-image-2M | - | - | 2024/09/13 |
- DenseFusion | [DenseFusion-1M: Merging vision experts for comprehensive multimodal perception](https://arxiv.org/pdf/2407.08303) | NeurIPS | 2024/07/11 |
- Megalith | - | - | 2024/07/01 |
- PixelProse
- CosmicMan-HQ 1.0 | [CosmicMan: A text-to-image foundation model for humans](https://arxiv.org/pdf/2404.01294) | CVPR | 2024/04/01 |
- AnyWord-3M
- BLIP3o-60k | [BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset](https://arxiv.org/pdf/2505.09568) | arXiv | 2025/05/14 |
- TextAtlas5M | [TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation](https://arxiv.org/pdf/2502.07870) | arXiv | 2025/02/11 |
- EliGen TrainSet | [EliGen: Entity-Level Controlled Image Generation with Regional Attention](https://arxiv.org/pdf/2501.01097) | arXiv | 2025/01/02 |
- JourneyDB
- RenderedText | - | - | 2023/06/30 |
- Mario-10M
- SAM | [Segment anything](https://scontent-dfw5-2.xx.fbcdn.net/v/t39.2365-6/10000000_900554171201033_1602411987825904100_n.pdf?_nc_cat=100&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=B3oBIrInbQUQ7kNvwEaPRAg&_nc_oc=Admn--QE4uSxaSrSevMzE9NUEkdPzlxF28dIu1Pi3-T9Wv87G_eomLxfVv1_LurC1lk&_nc_zt=14&_nc_ht=scontent-dfw5-2.xx&_nc_gid=RT_BFXUYrx0OfvqBSy2btQ&oh=00_AfG2YboYUGvYXugQ45dgIz3h8g0B8YDqxpf0ra9lVAa_EQ&oe=6816D3A7) | ICCV | 2023/04/05 |
- LAION-Aesthetics | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402) | NeurIPS | 2022/08/16 |
- CC-12M | [Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts](https://arxiv.org/pdf/2102.08981) | CVPR | 2021/02/17 |
- AnyEdit | [AnyEdit: Mastering unified high-quality image editing for any idea](https://arxiv.org/pdf/2411.15738) | CVPR | 2024/11/24 |
- OmniEdit
- UltraEdit | [UltraEdit: Instruction-based fine-grained image editing at scale](https://arxiv.org/pdf/2407.05282) | NeurIPS | 2024/07/07 |
- SEED-Data-Edit | [SEED-Data-Edit technical report: A hybrid dataset for instructional image editing](https://arxiv.org/pdf/2405.04007) | arXiv | 2024/05/07 |
- HQ-Edit | [HQ-Edit: A high-quality dataset for instruction-based image editing](https://arxiv.org/pdf/2404.09990) | arXiv | 2024/04/15 |
- Magicbrush | [MagicBrush: A manually annotated dataset for instruction-guided image editing](https://arxiv.org/pdf/2306.10012) | NeurIPS | 2023/06/16 |
- InstructP2P
- OBELICS | [OBELICS: An open web-scale filtered dataset of interleaved image-text documents](https://arxiv.org/pdf/2306.16527) | NeurIPS | 2023/06/21 |
- Multimodal C4 | [Multimodal C4: An open, billion-scale corpus of images interleaved with text](https://arxiv.org/pdf/2304.06939) | NeurIPS | 2023/04/14 |

### Applications and Opportunities
- UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
- On Fairness of Unified Multimodal Large Language Model for Image Generation
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/01/29 | [Github](https://github.com/CaraJ7/T2I-R1) | - |