# Awesome Unified Multimodal Models

https://github.com/aidc-ai/awesome-unified-multimodal-models
## Awesome Papers & Datasets

### Text-and-Image Unified Models
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | arXiv | 2025/05/23 | [Github](https://github.com/M-E-AGI-Lab/Muddit) | - |
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
- MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025/05/21 | [Github](https://github.com/Gen-Verse/MMaDA) | [Demo](https://huggingface.co/spaces/Gen-Verse/MMaDA) |
- Unified Multimodal Discrete Diffusion
- Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05/12 | [Github](https://github.com/selftok-team/SelftokTokenizer) | - |
- UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
- OmniGen2: Exploration to Advanced Multimodal Generation
- Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
- UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | arXiv | 2025/06/03 | [Github](https://github.com/PKU-YuanGroup/UniWorld-V1) | - |
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation | arXiv | 2025/06/10 | [Github](https://github.com/PKU-YuanGroup/UniWorld-V1) | - |
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023/12/14 | [Github](https://github.com/AILab-CVC/VL-GPT) | - |
- MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
- Ming-Omni: A Unified Multimodal Model for Perception and Generation
- OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
- SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
- Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12/31 | [Github](https://github.com/zijieli-Jlee/Dual-Diffusion) | - |
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
- UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
- Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11/28 | [Github](https://github.com/zhijie-group/Orthus) | - |
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
- Liquid: Language Models are Scalable and Unified Multi-modal Generators
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | 2024/07/08 | [Github](https://github.com/GAIR-NLP/anole) | - |
- ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR | 2024/09/06 | [Github](https://github.com/mit-han-lab/vila-u) | [Demo](https://vila-u.hanlab.ai/) |
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024/03/27 | [Github](https://github.com/dvlab-research/MGM) | [Demo](http://103.170.5.190:7860/) |
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv | 2024/01/18 | [Github](https://github.com/OpenGVLab/MM-Interleaved) | - |
- Emu3: Next-Token Prediction is All You Need
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
- World Model on Million-Length Video And Language With Blockwise RingAttention
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
- UniTok: A Unified Tokenizer for Visual Generation and Understanding
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
- Generative Multimodal Models are In-Context Learners
- DreamLLM: Synergistic Multimodal Comprehension and Creation
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
- Emu: Generative Pretraining in Multimodality
- Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
- Transfer between Modalities with MetaQueries
- Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10/02 | [Github](https://github.com/AILab-CVC/SEED) | [Demo](https://huggingface.co/spaces/AILab-CVC/SEED-LLaMA) |
- VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv | 2025/04/03 | [Github](https://github.com/VARGPT-family/VARGPT-v1.1) | - |
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv | 2025/04/02 | [Github](https://github.com/illume-unified-mllm/ILLUME_plus) | - |
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | 2025/01/21 | [Github](https://github.com/VARGPT-family/VARGPT) | - |
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR | 2024/12/04 | [Github](https://github.com/ByteFlow-AI/TokenFlow) | - |
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11/12 | [Github](https://github.com/deepseek-ai/Janus) | [Demo](https://huggingface.co/spaces/deepseek-ai/JanusFlow-1.3B) |
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | 2025/01/29 | [Github](https://github.com/deepseek-ai/Janus) | [Demo](https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B) |
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | 2024/10/17 | [Github](https://github.com/deepseek-ai/Janus) | [Demo](https://huggingface.co/spaces/deepseek-ai/Janus-1.3B) |
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
- Planting a SEED of Vision in Large Language Model | arXiv | 2023/07/16 | [Github](https://github.com/AILab-CVC/SEED) | [Demo](https://huggingface.co/spaces/AILab-CVC/SEED-LLaMA) |
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
- LMFusion: Adapting Pretrained Language Models for Multimodal Generation
- Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05/20 | [Github](https://github.com/bytedance-seed/BAGEL) | [Demo](https://demo.bagel-ai.org/) |
- MonoFormer: One Transformer for Both Diffusion and Autoregression
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08/22 | [Github](https://github.com/showlab/Show-o) | [Demo](https://huggingface.co/spaces/showlab/Show-o) |
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08/20 | [Github](https://github.com/lucidrains/transfusion-pytorch) | - |
- Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | arXiv | 2025/08/08 | [Github](https://github.com/HL-hanlin/Bifrost-1) | - |
- Qwen-Image Technical Report | arXiv | 2025/08/04 | [Github](https://github.com/QwenLM/Qwen-Image) | [Demo](https://huggingface.co/spaces/Qwen/Qwen-Image) |
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again | arXiv | 2025/07/29 | [Github](https://github.com/X-Omni-Team/X-Omni) | [Demo](https://huggingface.co/collections/X-Omni/x-omni-spaces-6888c64f38446f1efc402de7) |
- Ovis-U1 Technical Report | arXiv | 2025/06/28 | [Github](https://github.com/AIDC-AI/Ovis-U1) | [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B) |
- TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning | arXiv | 2025/08/11 | [Github](https://github.com/DruryXu/TBAC-UniImage) | - |
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04/22 | [Github](https://github.com/AILab-CVC/SEED-X) | [Demo](https://arc.tencent.com/en/ai-demos/multimodal) |
- Show-o2: Improved Native Unified Multimodal Models | arXiv | 2025/06/15 | [Github](https://github.com/showlab/Show-o/tree/main/show-o2) | - |

### Benchmark for Evaluation
- ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | arXiv | 2025/06/15 | [Github](https://github.com/llllly26/ComplexBench-Edit)
- EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits
- RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
- MMGen-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models | arXiv | 2025/05/26 | [Github](https://github.com/hanghuacs/MMIG-Bench)
- KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
- ImgEdit: A Unified Image Editing Dataset and Benchmark | arXiv | 2025/05/26 | [Github](https://github.com/PKU-YuanGroup/ImgEdit)
- On Path to Multimodal Generalist: General-Level and General-Bench | ICML | 2025/05/07 | [Github](https://github.com/path2generalist/General-Level)
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | arXiv | 2024/08/01 | [Github](https://github.com/yuweihao/MM-Vet/tree/main/v2)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2024/04/27 | [Github](https://github.com/X-PLUG/mPLUG-Owl)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | ICLR | 2024/02/11 | [Github](https://github.com/lmb-freiburg/ovqa)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models | arXiv | 2023/11/28 | [Github](https://github.com/AILab-CVC/SEED-Bench/tree/main/SEED-Bench-2)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2023/11/27 | [Github](https://github.com/MMMU-Benchmark/MMMU)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | ICML | 2023/08/04 | [Github](https://github.com/yuweihao/MM-Vet)
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023/07/30 | [Github](https://github.com/AILab-CVC/SEED-Bench)
- MMBench: Is Your Multi-modal Model an All-around Player? | ECCV | 2023/07/12 | [Github](https://github.com/open-compass/MMBench)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- Step1X-Edit: A Practical Framework for General Image Editing | arXiv | 2025/04/28 | [Github](https://github.com/stepfun-ai/Step1X-Edit)
- DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
- T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation | TPAMI | 2025/03/08 | [Github](https://github.com/Karine-Huang/T2I-CompBench)
- IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment
- AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
- I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
- VQA: Visual Question Answering
- CompBench: Benchmarking Complex Instruction-guided Image Editing
- Holistic Evaluation of Text-To-Image Models | NeurIPS | 2023/11/07 | [Github](https://github.com/stanford-crfm/helm)
- Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | ICLR | 2023/10/27 | [Github](https://github.com/j-min/DSG)
- EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | arXiv | 2023/10/03 | [Github](https://github.com/deep-ml-research/editval_code)
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | NeurIPS | 2023/07/12 | [Github](https://github.com/Karine-Huang/T2I-CompBench)
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | NeurIPS | 2023/06/16 | [Github](https://github.com/OSU-NLP-Group/MagicBrush)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | ICCV | 2023/03/21 | [Github](https://github.com/Yushi-Hu/tifa)
- ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
- Evaluating Text-to-Visual Generation with Image-to-Text Generation
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks
- DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
- HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
- Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | ICCV | 2022/02/08 | [Github](https://github.com/j-min/DallEval)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025/05/19 | [Github](https://github.com/huawei-lin/VTBench)
- UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025/05/15 | [Github](https://github.com/xmed-lab/UniEval)
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
- Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024/11/26 | [Github](https://github.com/Dongping-Chen/ISG)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | ICLR | 2024/10/14 | [Github](https://github.com/Lillianwei-h/MMIE)
- Holistic Evaluation for Interleaved Text-and-Image Generation
- OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
- TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes | arXiv | 2025/08/05 | [Github](https://github.com/NJU-PCALab/TextCrafter)
- OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation | arXiv | 2025/06/26 | [Github](https://github.com/OneIG-Bench/OneIG-Benchmark)
- ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | arXiv | 2025/06/03 | [Github](https://github.com/ByteDance-Seed/BM-code)
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | arXiv | 2025/05/27 | [Github](https://github.com/PKU-YuanGroup/WISE)
- WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation
- Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | COLM | 2024/06/11 | [Github](https://github.com/zeyofu/Commonsense-T2I)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing | ICLR | 2024/04/15 | [Github](https://github.com/UCSC-VLAA/HQ-Edit)
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models | CVPR | 2024/03/25 | [Github](https://github.com/thu-nics/FlashEval)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022/06/22 | [Github](https://github.com/google-research/parti/blob/main/PartiPrompts.tsv)

### Any-to-Any Multimodal Models
- M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
- Spider: Any-to-Many Multimodal LLM
- X-VILA: Cross-Modality Alignment for Large Language Model
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12/28 | [Github](https://github.com/allenai/unified-io-2) | - |
- NExT-GPT: Any-to-Any Multimodal LLM | ICML | 2023/09/11 | [Github](https://github.com/NExT-GPT/NExT-GPT) | - |
- MIO: A Foundation Model on Multimodal Tokens | arXiv | 2024/09/26 | [Github](https://github.com/MIO-Team/MIO) | - |

### Dataset
- Cambrian-10M | [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/abs/2406.16860) | NeurIPS | 2024/06/24 |
- ShareGPT4V | [ShareGPT4V: Improving large multi-modal models with better captions](https://arxiv.org/pdf/2311.12793) | ECCV | 2023/11/21 |
- CapsFusion-120M | [CapsFusion: Rethinking image-text data at scale](https://arxiv.org/pdf/2310.20550) | CVPR | 2023/10/31 |
- GRIT | [Kosmos-2: Grounding multimodal large language models to the world](https://arxiv.org/pdf/2306.14824) | ICLR | 2023/06/26 |
- DataComp
- Laion-COCO | [LAION COCO: 600M synthetic captions from LAION2B-en](https://laion.ai/blog/laion-coco/) | - | 2022/09/15 |
- COYO | [COYO-700M: Image-text pair dataset](https://github.com/kakaobrain/coyo-dataset) | - | 2022/08/31 |
- Laion | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402) | NeurIPS | 2022/03/31 |
- Wukong | [Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark](https://arxiv.org/pdf/2202.06767) | NeurIPS | 2022/02/14 |
- RedCaps | [RedCaps: Web-curated image-text data created by the people, for the people](https://arxiv.org/pdf/2111.11431) | NeurIPS | 2021/11/22 |
- Infinity-MM | [Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data](https://arxiv.org/pdf/2410.18558) | arXiv | 2024/10/24 |
- LLaVA-OneVision | [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/pdf/2408.03326) | TMLR | 2024/08/06 |
- SynCD | [Generating multi-image synthetic data for text-to-image customization](https://arxiv.org/pdf/2502.01720) | arXiv | 2025/02/03 |
- X2I-subject-driven
- Subjects200K
- MultiGen-20M
- LAION-Face | [General facial representation learning in a visual-linguistic manner](https://arxiv.org/pdf/2112.03109) | CVPR | 2021/12/06 |
- PD12M | [Public Domain 12M: A highly aesthetic image-text dataset with novel governance mechanisms](https://arxiv.org/pdf/2410.23144) | arXiv | 2024/10/30 |
- SFHQ-T2I | - | - | 2024/10/06 |
- text-to-image-2M | - | - | 2024/09/13 |
- DenseFusion | [DenseFusion-1M: Merging vision experts for comprehensive multimodal perception](https://arxiv.org/pdf/2407.08303) | NeurIPS | 2024/07/11 |
- Megalith | - | - | 2024/07/01 |
- PixelProse
- CosmicMan-HQ 1.0 | [CosmicMan: A text-to-image foundation model for humans](https://arxiv.org/pdf/2404.01294) | CVPR | 2024/04/01 |
- AnyWord-3M
- BLIP3o-60k | [BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset](https://arxiv.org/pdf/2505.09568) | arXiv | 2025/05/14 |
- TextAtlas5M | [TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation](https://arxiv.org/pdf/2502.07870) | arXiv | 2025/02/11 |
- EliGen TrainSet | [EliGen: Entity-Level Controlled Image Generation with Regional Attention](https://arxiv.org/pdf/2501.01097) | arXiv | 2025/01/02 |
- JourneyDB
- RenderedText | - | - | 2023/06/30 |
- Mario-10M
- SAM | [Segment anything](https://scontent-dfw5-2.xx.fbcdn.net/v/t39.2365-6/10000000_900554171201033_1602411987825904100_n.pdf?_nc_cat=100&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=B3oBIrInbQUQ7kNvwEaPRAg&_nc_oc=Admn--QE4uSxaSrSevMzE9NUEkdPzlxF28dIu1Pi3-T9Wv87G_eomLxfVv1_LurC1lk&_nc_zt=14&_nc_ht=scontent-dfw5-2.xx&_nc_gid=RT_BFXUYrx0OfvqBSy2btQ&oh=00_AfG2YboYUGvYXugQ45dgIz3h8g0B8YDqxpf0ra9lVAa_EQ&oe=6816D3A7) | ICCV | 2023/04/05 |
- LAION-Aesthetics | [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/pdf/2210.08402) | NeurIPS | 2022/08/16 |
- CC-12M | [Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts](https://arxiv.org/pdf/2102.08981) | CVPR | 2021/02/17 |
- AnyEdit | [AnyEdit: Mastering unified high-quality image editing for any idea](https://arxiv.org/pdf/2411.15738) | CVPR | 2024/11/24 |
- OmniEdit
- UltraEdit | [UltraEdit: Instruction-based fine-grained image editing at scale](https://arxiv.org/pdf/2407.05282) | NeurIPS | 2024/07/07 |
- SEED-Data-Edit | [SEED-Data-Edit technical report: A hybrid dataset for instructional image editing](https://arxiv.org/pdf/2405.04007) | arXiv | 2024/05/07 |
- HQ-Edit | [HQ-Edit: A high-quality dataset for instruction-based image editing](https://arxiv.org/pdf/2404.09990) | arXiv | 2024/04/15 |
- Magicbrush | [MagicBrush: A manually annotated dataset for instruction-guided image editing](https://arxiv.org/pdf/2306.10012) | NeurIPS | 2023/06/16 |
- InstructP2P
- OBELICS | [OBELICS: An open web-scale filtered dataset of interleaved image-text documents](https://arxiv.org/pdf/2306.16527) | NeurIPS | 2023/06/21 |
- Multimodal C4 | [Multimodal C4: An open, billion-scale corpus of images interleaved with text](https://arxiv.org/pdf/2304.06939) | NeurIPS | 2023/04/14 |

### Applications and Opportunities
- UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
- On Fairness of Unified Multimodal Large Language Model for Image Generation
- T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/01/29 | [Github](https://github.com/CaraJ7/T2I-R1) | - |