Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
List: awesome-diffusion-categorized
https://github.com/wangkai930418/awesome-diffusion-categorized
A collection of diffusion model papers categorized by their subareas.
- Host: GitHub
- URL: https://github.com/wangkai930418/awesome-diffusion-categorized
- Owner: wangkai930418
- Created: 2023-05-19T16:06:31.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-29T08:59:14.000Z (about 2 months ago)
- Last Synced: 2024-10-29T10:07:02.574Z (about 2 months ago)
- Topics: continual-learning, controlnet, detection, diffusion, diffusion-model, diffusion-models, few-shot, image-edit, inpainting, inversion, segmentation, stable-diffusion, text-guided, tracking
- Size: 190 KB
- Stars: 1,232
- Watchers: 56
- Forks: 62
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project:
- ultimate-awesome - awesome-diffusion-categorized - Collection of diffusion model papers categorized by their subareas. (Other Lists / Monkey C Lists)
README
# Awesome Diffusion Categorized
## Contents
- [Accelerate](#accelerate)
- [Image Restoration](#image-restoration)
- [Colorization](#colorization)
- [Face Restoration](#face-restoration)
- [Storytelling](#storytelling)
- [Virtual Try On](#try-on)
- [Drag Edit](#drag-edit)
- [Diffusion Inversion](#diffusion-models-inversion)
- [Text-Guided Editing](#text-guided-image-editing)
- [Continual Learning](#continual-learning)
- [Remove Concept](#remove-concept)
- [New Concept Learning](#new-concept-learning)
- [T2I augmentation](#t2i-diffusion-model-augmentation)
- [Spatial Control](#spatial-control)
- [Image Translation](#i2i-translation)
- [Seg & Detect & Track](#segmentation-detection-tracking)
- [Adding Conditions](#additional-conditions)
- [Few-Shot](#few-shot)
- [Inpainting](#sd-inpaint)
- [Layout](#layout-generation)
- [Text Generation](#text-generation)
- [Super Resolution](#super-resolution)
- [Video Generation](#video-generation)
- [Video Editing](#video-editing)

## Accelerate
**PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models** \
[[ICLR 2024 Spotlight](https://arxiv.org/abs/2401.05252)]
[[Diffusers 1](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pixart)]
[[Diffusers 2](https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS)]
[[Project](https://pixart-alpha.github.io/)]
[[Code](https://github.com/PixArt-alpha/PixArt-alpha?tab=readme-ov-file)]

**SDXL-Turbo: Adversarial Diffusion Distillation** \
[[Website](https://arxiv.org/abs/2311.17042)]
[[Diffusers 1](https://huggingface.co/stabilityai/sdxl-turbo)]
[[Diffusers 2](https://huggingface.co/docs/diffusers/en/using-diffusers/sdxl_turbo)]
[[Project](https://huggingface.co/stabilityai)]
[[Code](https://github.com/Stability-AI/generative-models)]

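The SDXL-Turbo entry above links its Diffusers integration; as a usage illustration, here is a minimal sketch of single-step sampling with that pipeline (the prompt and dtype settings are arbitrary choices; adversarially distilled models are sampled without classifier-free guidance, hence `guidance_scale=0.0`):

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load the SDXL-Turbo checkpoint linked in the entry above.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Adversarial Diffusion Distillation enables single-step sampling;
# the model is trained without classifier-free guidance.
image = pipe(
    "a photo of a corgi wearing sunglasses",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("sdxl_turbo.png")
```
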
**Trajectory Consistency Distillation: Improved Latent Consistency Distillation by Semi-Linear Consistency Function with Trajectory Mapping** \
[[Website](https://arxiv.org/abs/2402.19159)]
[[Diffusers 1](https://huggingface.co/h1t/TCD-SDXL-LoRA)]
[[Diffusers 2](https://huggingface.co/docs/diffusers/en/using-diffusers/inference_with_tcd_lora)]
[[Project](https://mhh0318.github.io/tcd/)]
[[Code](https://github.com/jabir-zheng/TCD)]

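Similarly, a short sketch of 4-step sampling with the TCD-SDXL LoRA linked above, using the `TCDScheduler` shipped in Diffusers (the base model choice and the `eta` value follow the entry's Diffusers guide and should be treated as assumptions):

```python
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Swap in the trajectory-consistency scheduler and attach the TCD LoRA.
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("h1t/TCD-SDXL-LoRA")
pipe.fuse_lora()

# Few-step sampling; eta controls per-step stochasticity in TCD.
image = pipe(
    "a cinematic photo of a lighthouse at dusk",
    num_inference_steps=4,
    guidance_scale=0.0,
    eta=0.3,
).images[0]
```
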
**LCM-LoRA: A Universal Stable-Diffusion Acceleration Module** \
[[Website](https://arxiv.org/abs/2311.05556)]
[[Diffusers](https://huggingface.co/docs/diffusers/en/using-diffusers/inference_with_lcm?lcm-lora=LCM-LoRA#lora)]
[[Project](https://latent-consistency-models.github.io/)]
[[Code](https://github.com/luosiallen/latent-consistency-model)]

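Since LCM-LoRA is pitched as a plug-in acceleration module, here is a minimal sketch of attaching it to a stock Stable Diffusion pipeline, following the Diffusers guide linked above (adapter id `latent-consistency/lcm-lora-sdv1-5`; the base model and prompt are arbitrary):

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# LCM-LoRA plugs into an existing pipeline: swap the scheduler,
# then load the distilled LoRA weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# 4 steps instead of the usual 25-50; LCM works best with low guidance.
image = pipe(
    "an astronaut riding a horse, detailed oil painting",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
```
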
**Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference** \
[[Website](https://arxiv.org/abs/2310.04378)]
[[Diffusers](https://huggingface.co/docs/diffusers/api/pipelines/latent_consistency_models)]
[[Code](https://github.com/luosiallen/latent-consistency-model)]

**DMD2: Improved Distribution Matching Distillation for Fast Image Synthesis** \
[[NeurIPS 2024 Oral](https://arxiv.org/abs/2405.14867)]
[[Project](https://tianweiy.github.io/dmd2/)]
[[Code](https://github.com/tianweiy/DMD2)]

**DMD1: One-step Diffusion with Distribution Matching Distillation** \
[[CVPR 2024](https://arxiv.org/abs/2311.18828)]
[[Project](https://tianweiy.github.io/dmd/)]
[[Code](https://github.com/devrimcavusoglu/dmd)]

**Consistency Models** \
[[ICML 2023](https://proceedings.mlr.press/v202/song23a.html)]
[[Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/consistency_models)]
[[Code](https://github.com/openai/consistency_models)]

**SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation** \
[[CVPR 2024](https://arxiv.org/abs/2312.05239)]
[[Project](https://vinairesearch.github.io/SwiftBrush/)]
[[Code](https://github.com/VinAIResearch/SwiftBrush)]

**SwiftBrush V2: Make Your One-Step Diffusion Model Better Than Its Teacher** \
[[ECCV 2024](https://arxiv.org/abs/2408.14176)]
[[Project](https://swiftbrushv2.github.io/)]
[[Code](https://github.com/VinAIResearch/SwiftBrush)]

**CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2310.01407)]
[[Project](https://fast-codi.github.io/)]
[[Code](https://github.com/fast-codi/CoDi)]

**PCM: Phased Consistency Model** \
[[NeurIPS 2024](https://arxiv.org/abs/2405.18407)]
[[Project](https://g-u-n.github.io/projects/pcm/)]
[[Code](https://github.com/G-U-N/Phased-Consistency-Model)]

**Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation** \
[[NeurIPS 2024](https://arxiv.org/abs/2406.06890)]
[[Project](https://yhzhai.github.io/mcm/)]
[[Code](https://github.com/yhZhai/mcm)]

**KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis** \
[[NeurIPS 2024](https://arxiv.org/abs/2312.04005)]
[[Project](https://youngwanlee.github.io/KOALA/)]
[[Code](https://github.com/youngwanLEE/sdxl-koala)]

**Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation** \
[[Website](https://arxiv.org/abs/2406.02347)]
[[Project](https://gojasper.github.io/flash-diffusion-project/)]
[[Code](https://github.com/gojasper/flash-diffusion)]

**Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2411.19108)]
[[Project](https://liewfeng.github.io/TeaCache/)]
[[Code](https://github.com/LiewFeng/TeaCache)]

**Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models** \
[[Website](https://doi.org/10.48550/arXiv.2410.11081)]
[[Project](https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/)]
[[Code](https://github.com/xandergos/sCM-mnist)]

**Adaptive Caching for Faster Video Generation with Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2411.02397)]
[[Project](https://adacache-dit.github.io/)]
[[Code](https://github.com/AdaCache-DiT/AdaCache)]

**FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality** \
[[Website](https://arxiv.org/abs/2410.19355)]
[[Project](https://vchitect.github.io/FasterCache/)]
[[Code](https://github.com/Vchitect/FasterCache)]

**SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions** \
[[Website](https://arxiv.org/abs/2403.16627)]
[[Project](https://idkiro.github.io/sdxs/)]
[[Code](https://github.com/IDKiro/sdxs)]

**Reward Guided Latent Consistency Distillation** \
[[Website](https://arxiv.org/abs/2403.11027)]
[[Project](https://rg-lcd.github.io/)]
[[Code](https://github.com/Ji4chenLi/rg-lcd)]

**T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching** \
[[Website](https://arxiv.org/abs/2402.14167)]
[[Project](https://t-stitch.github.io/)]
[[Code](https://github.com/NVlabs/T-Stitch)]

**Relational Diffusion Distillation for Efficient Image Generation** \
[[ACM MM 2024 (Oral)](https://arxiv.org/abs/2410.07679)]
[[Code](https://github.com/cantbebetter2/RDD)]

**UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs** \
[[CVPR 2024](https://arxiv.org/abs/2311.09257)]
[[Code](https://github.com/xuyanwu/SIDDMs-UFOGen)]

**SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow** \
[[ECCV 2024](https://arxiv.org/abs/2407.12718)]
[[Code](https://github.com/yuanzhi-zhu/SlimFlow)]

**Accelerating Image Generation with Sub-path Linear Approximation Model** \
[[ECCV 2024](https://arxiv.org/abs/2404.13903)]
[[Code](https://github.com/MCG-NJU/SPLAM)]

**Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models** \
[[NeurIPS 2023](https://arxiv.org/abs/2305.18455)]
[[Code](https://github.com/pkulwj1994/diff_instruct)]

**Fast and Memory-Efficient Video Diffusion Using Streamlined Inference** \
[[NeurIPS 2024](https://arxiv.org/abs/2411.01171)]
[[Code](https://github.com/wuyushuwys/FMEDiffusion)]

**A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models** \
[[ICML 2024](https://arxiv.org/abs/2408.05927)]
[[Code](https://github.com/taehong-moon/ee-diffusion)]

**Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation** \
[[ICML 2024](https://arxiv.org/abs/2404.04057)]
[[Code](https://github.com/mingyuanzhou/SiD)]

**InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation** \
[[ICLR 2024](https://arxiv.org/abs/2309.06380)]
[[Code](https://github.com/gnobitab/instaflow)]

**Accelerating Vision Diffusion Transformers with Skip Branches** \
[[Website](https://arxiv.org/abs/2411.17616)]
[[Code](https://github.com/OpenSparseLLMs/Skip-DiT)]

**One Step Diffusion via Shortcut Models** \
[[Website](https://arxiv.org/abs/2410.12557)]
[[Code](https://github.com/kvfrans/shortcut-models)]

**DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach** \
[[Website](https://arxiv.org/abs/2410.09633)]
[[Code](https://github.com/razvanmatisan/duodiff)]

**A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training** \
[[Website](https://arxiv.org/abs/2405.17403)]
[[Code](https://github.com/nus-hpc-ai-lab/speed)]

**Stable Consistency Tuning: Understanding and Improving Consistency Models** \
[[Website](https://arxiv.org/abs/2410.18958)]
[[Code](https://github.com/G-U-N/Stable-Consistency-Tuning)]

**SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.08887)]
[[Code](https://github.com/williechai/speedup-plugin-for-stable-diffusions)]

**Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching** \
[[Website](https://arxiv.org/abs/2406.01733)]
[[Code](https://github.com/horseee/learning-to-cache)]

**Distribution Backtracking Builds A Faster Convergence Trajectory for Diffusion Distillation** \
[[Website](https://arxiv.org/abs/2408.15991)]
[[Code](https://github.com/SYZhang0805/DisBack)]

**Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2406.01561)]
[[Code](https://github.com/mingyuanzhou/SiD-LSG)]

**Diffusion Models Are Innate One-Step Generators** \
[[Website](https://arxiv.org/abs/2405.20750)]
[[Code](https://github.com/Zyriix/GDD)]

**Distilling Diffusion Models into Conditional GANs** \
[[ECCV 2024](https://arxiv.org/abs/2405.05967)]
[[Project](https://mingukkang.github.io/Diffusion2GAN/)]

**Cache Me if You Can: Accelerating Diffusion Models through Block Caching** \
[[CVPR 2024](https://arxiv.org/abs/2312.03209)]
[[Project](https://fwmb.github.io/blockcaching/)]

**Plug-and-Play Diffusion Distillation** \
[[CVPR 2024](https://arxiv.org/abs/2406.01954)]
[[Project](https://5410tiffany.github.io/plug-and-play-diffusion-distillation.github.io/)]

**SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds** \
[[NeurIPS 2023](https://arxiv.org/abs/2306.00980)]
[[Project](https://snap-research.github.io/SnapFusion/)]

**SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance** \
[[Website](https://arxiv.org/abs/2412.02687)]
[[Project](https://snoopi-onestep.github.io/)]

**NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training** \
[[Website](https://arxiv.org/abs/2412.02030)]
[[Project](https://chendaryen.github.io/NitroFusion.github.io/)]

**Truncated Consistency Models** \
[[Website](https://arxiv.org/abs/2410.14895)]
[[Project](https://truncated-cm.github.io/)]

**Multi-student Diffusion Distillation for Better One-step Generators** \
[[Website](https://arxiv.org/abs/2410.23274)]
[[Project](https://research.nvidia.com/labs/toronto-ai/MSD/index_hidden.html)]

**Effortless Efficiency: Low-Cost Pruning of Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.02852)]
[[Project](https://yangzhang-v5.github.io/EcoDiff/)]

**FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.10356)]

**One-Step Diffusion Distillation through Score Implicit Matching** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.16794)]

**Inference-Time Diffusion Model Distillation** \
[[Website](https://arxiv.org/abs/2412.08871)]

**HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration** \
[[Website](https://arxiv.org/abs/2410.01723)]

**Diff-Instruct\*: Towards Human-Preferred One-step Text-to-image Generative Models** \
[[Website](https://arxiv.org/abs/2410.20898)]

**MLCM: Multistep Consistency Distillation of Latent Diffusion Model** \
[[Website](https://arxiv.org/abs/2406.05768)]

**EM Distillation for One-step Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.16852)]

**LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding** \
[[Website](https://arxiv.org/abs/2410.03355)]

**Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference** \
[[Website](https://arxiv.org/abs/2412.02962)]

**Importance-based Token Merging for Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.16720)]

**Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation** \
[[Website](https://arxiv.org/abs/2405.05224)]

**Accelerating Diffusion Models with One-to-Many Knowledge Distillation** \
[[Website](https://arxiv.org/abs/2410.04191)]

**Accelerating Video Diffusion Models via Distribution Matching** \
[[Website](https://arxiv.org/abs/2412.05899)]

**TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution** \
[[Website](https://arxiv.org/abs/2410.07663)]

**DDIL: Improved Diffusion Distillation With Imitation Learning** \
[[Website](https://arxiv.org/abs/2410.11971)]

**OSV: One Step is Enough for High-Quality Image to Video Generation** \
[[Website](https://arxiv.org/abs/2409.11367)]

**Target-Driven Distillation: Consistency Distillation with Target Timestep Selection and Decoupled Guidance** \
[[Website](https://arxiv.org/abs/2409.01347)]

**Token Caching for Diffusion Transformer Acceleration** \
[[Website](https://arxiv.org/abs/2409.18523)]

**DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization** \
[[Website](https://arxiv.org/abs/2410.16942)]

**Flow Generator Matching** \
[[Website](https://arxiv.org/abs/2410.19310)]

**Multistep Distillation of Diffusion Models via Moment Matching** \
[[Website](https://arxiv.org/abs/2406.04103)]

**SFDDM: Single-fold Distillation for Diffusion models** \
[[Website](https://arxiv.org/abs/2405.14961)]

**LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.11098)]

**CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion** \
[[Website](https://arxiv.org/abs/2403.05121)]

**SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation** \
[[Website](https://arxiv.org/abs/2403.01505)]

**SDXL-Lightning: Progressive Adversarial Diffusion Distillation** \
[[Website](https://arxiv.org/abs/2402.13929)]

**Adaptive Non-Uniform Timestep Sampling for Diffusion Model Training** \
[[Website](https://arxiv.org/abs/2411.09998)]

**TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2411.18263)]

### Train-Free
**AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising** \
[[NeurIPS 2024](https://arxiv.org/abs/2406.06911)]
[[Project](https://czg1225.github.io/asyncdiff_page/)]
[[Code](https://github.com/czg1225/AsyncDiff)]

**Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.09873)]
[[Project](https://jiakangyuan.github.io/AdaptiveDiffusion-project-page/)]
[[Code](https://github.com/UniModal4Reasoning/AdaptiveDiffusion)]

**DeepCache: Accelerating Diffusion Models for Free** \
[[CVPR 2024](https://arxiv.org/abs/2312.00858)]
[[Project](https://horseee.github.io/Diffusion_DeepCache/)]
[[Code](https://github.com/horseee/DeepCache)]

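DeepCache is training-free and wraps an existing pipeline; here is a sketch following the helper API documented in the repository linked above (the `cache_interval`/`cache_branch_id` values are the README's defaults, treated here as assumptions):

```python
import torch
from DeepCache import DeepCacheSDHelper  # pip install DeepCache
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Cache high-level U-Net features and reuse them for several steps
# instead of recomputing the full backbone at every denoising step.
helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)
helper.enable()

image = pipe("a watercolor landscape of mountains at sunrise").images[0]
helper.disable()  # restore the unmodified pipeline
```
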
**Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference** \
[[NeurIPS 2024](https://arxiv.org/abs/2312.09608)]
[[Code](https://github.com/hutaihang/faster-diffusion)]

**DiTFastAttn: Attention Compression for Diffusion Transformer Models** \
[[NeurIPS 2024](https://arxiv.org/abs/2406.08552)]
[[Code](https://github.com/thu-nics/DiTFastAttn)]

**Structural Pruning for Diffusion Models** \
[[NeurIPS 2023](https://arxiv.org/abs/2305.10924)]
[[Code](https://github.com/VainF/Diff-Pruning)]

**AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration** \
[[ICCV 2023](https://arxiv.org/abs/2309.10438)]
[[Code](https://github.com/lilijiangg/AutoDiffusion)]

**Agent Attention: On the Integration of Softmax and Linear Attention** \
[[ECCV 2024](https://arxiv.org/abs/2312.08874)]
[[Code](https://github.com/LeapLabTHU/Agent-Attention)]

**Token Merging for Fast Stable Diffusion** \
[[CVPRW 2024](https://arxiv.org/abs/2303.17604)]
[[Code](https://github.com/dbolya/tomesd)]

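Token Merging is likewise a drop-in patch; a minimal sketch using the `tomesd` package linked above (`ratio` is the fraction of tokens merged; 0.5 is the README's suggested quality/speed trade-off):

```python
import torch
import tomesd  # pip install tomesd
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Merge ~50% of redundant spatial tokens inside attention blocks,
# trading a little fidelity for a large, training-free speed-up.
tomesd.apply_patch(pipe, ratio=0.5)

image = pipe("a studio photo of a vintage camera").images[0]
tomesd.remove_patch(pipe)  # undo the patch when done
```
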
**FORA: Fast-Forward Caching in Diffusion Transformer Acceleration** \
[[Website](https://arxiv.org/abs/2407.01425)]
[[Code](https://github.com/prathebaselva/FORA)]

**Real-Time Video Generation with Pyramid Attention Broadcast** \
[[Website](https://arxiv.org/abs/2408.12588)]
[[Code](https://github.com/NUS-HPC-AI-Lab/VideoSys)]

**Accelerating Diffusion Transformers with Token-wise Feature Caching** \
[[Website](https://arxiv.org/abs/2410.05317)]
[[Code](https://github.com/Shenyi-Z/ToCa)]

**TGATE-V1: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.02747v1)]
[[Code](https://github.com/HaozheLiu-ST/T-GATE)]

**TGATE-V2: Faster Diffusion via Temporal Attention Decomposition** \
[[Website](https://arxiv.org/abs/2404.02747v2)]
[[Code](https://github.com/HaozheLiu-ST/T-GATE)]

**SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2411.10510)]
[[Code](https://github.com/Roblox/SmoothCache)]

**Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2405.05252)]
[[Project](https://atedm.github.io/)]

**Token Fusion: Bridging the Gap between Token Pruning and Token Merging** \
[[WACV 2024](https://arxiv.org/abs/2312.01026)]

**Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding** \
[[Website](https://arxiv.org/abs/2410.01699)]

**PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future** \
[[Website](https://arxiv.org/abs/2408.08822)]

**Δ-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2406.01125)]

**Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step** \
[[Website](https://arxiv.org/abs/2410.14919)]

**Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences** \
[[Website](https://arxiv.org/abs/2410.18881)]

**Fast constrained sampling in pre-trained diffusion models** \
[[Website](https://arxiv.org/abs/2410.18804)]

## Image Restoration
**Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model** \
[[ICLR 2023 oral](https://arxiv.org/abs/2212.00490)]
[[Project](https://wyhuai.github.io/ddnm.io/)]
[[Code](https://github.com/wyhuai/DDNM)]

**Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild** \
[[CVPR 2024](https://arxiv.org/abs/2401.13627)]
[[Project](https://supir.xpixel.group/)]
[[Code](https://github.com/Fanghua-Yu/SUPIR)]

**Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model** \
[[CVPR 2024](https://arxiv.org/abs/2403.11157)]
[[Project](https://isee-laboratory.github.io/DiffUIR/)]
[[Code](https://github.com/iSEE-Laboratory/DiffUIR)]

**Zero-Reference Low-Light Enhancement via Physical Quadruple Priors** \
[[CVPR 2024](https://arxiv.org/abs/2403.12933)]
[[Project](https://daooshee.github.io/QuadPrior-Website/)]
[[Code](https://github.com/daooshee/QuadPrior/)]

**From Posterior Sampling to Meaningful Diversity in Image Restoration** \
[[ICLR 2024](https://arxiv.org/abs/2310.16047)]
[[Project](https://noa-cohen.github.io/MeaningfulDiversityInIR/)]
[[Code](https://github.com/noa-cohen/MeaningfulDiversityInIR)]

**Generative Diffusion Prior for Unified Image Restoration and Enhancement** \
[[CVPR 2023](https://arxiv.org/abs/2304.01247)]
[[Project](https://generativediffusionprior.github.io/)]
[[Code](https://github.com/Fayeben/GenerativeDiffusionPrior)]

**MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration** \
[[ECCV 2024](https://arxiv.org/abs/2407.10833)]
[[Project](https://renyulin-f.github.io/MoE-DiffIR.github.io/)]
[[Code](https://github.com/renyulin-f/MoE-DiffIR)]

**Image Restoration with Mean-Reverting Stochastic Differential Equations** \
[[ICML 2023](https://arxiv.org/abs/2301.11699)]
[[Project](https://algolzw.github.io/ir-sde/index.html)]
[[Code](https://github.com/Algolzw/image-restoration-sde)]

**PhoCoLens: Photorealistic and Consistent Reconstruction in Lensless Imaging** \
[[NeurIPS 2024 Spotlight](https://arxiv.org/abs/2409.17996)]
[[Project](https://phocolens.github.io/)]
[[Code](https://github.com/PhoCoLens)]

**Denoising Diffusion Models for Plug-and-Play Image Restoration** \
[[CVPR 2023 Workshop NTIRE](https://arxiv.org/abs/2305.08995)]
[[Project](https://yuanzhi-zhu.github.io/DiffPIR/)]
[[Code](https://github.com/yuanzhi-zhu/DiffPIR)]

**FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration** \
[[Website](https://arxiv.org/abs/2412.01427)]
[[Project](https://foundir.net/)]
[[Code](https://github.com/House-Leo/FoundIR)]

**Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing** \
[[Website](https://arxiv.org/abs/2407.01521)]
[[Project](https://daps-inverse-problem.github.io/)]
[[Code](https://github.com/zhangbingliang2019/DAPS)]

**Solving Video Inverse Problems Using Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2409.02574)]
[[Project](https://solving-video-inverse.github.io/main/)]
[[Code](https://github.com/solving-video-inverse/codes)]

**Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration** \
[[Website](https://arxiv.org/abs/2410.04811)]
[[Project](https://zhu-zhiyu.github.io/FLUX-IR/)]
[[Code](https://github.com/ZHU-Zhiyu/FLUX-IR)]

**AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion** \
[[Website](https://arxiv.org/abs/2310.10123)]
[[Project](https://jiangyitong.github.io/AutoDIR_webpage/)]
[[Code](https://github.com/jiangyitong/AutoDIR)]

**FlowIE: Efficient Image Enhancement via Rectified Flow** \
[[CVPR 2024 oral](https://arxiv.org/abs/2406.00508)]
[[Code](https://github.com/EternalEvan/FlowIE)]

**ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting** \
[[NeurIPS 2023 (Spotlight)](https://arxiv.org/abs/2307.12348)]
[[Code](https://github.com/zsyOAOA/ResShift)]

**GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration** \
[[ICML 2023 oral](https://arxiv.org/abs/2301.12686)]
[[Code](https://github.com/sony/gibbsddrm)]

**Diffusion Priors for Variational Likelihood Estimation and Image Denoising** \
[[NeurIPS 2024 Spotlight](https://arxiv.org/abs/2410.17521)]
[[Code](https://github.com/HUST-Tan/DiffusionVI)]

**Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance** \
[[CVPR 2024](https://arxiv.org/abs/2312.16519)]
[[Code](https://github.com/tirer-lab/DDPG)]

**DiffIR: Efficient Diffusion Model for Image Restoration** \
[[ICCV 2023](https://arxiv.org/abs/2303.09472)]
[[Code](https://github.com/Zj-BinXia/DiffIR)]

**LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2407.08939)]
[[Code](https://github.com/JianghaiSCU/LightenDiffusion)]

**Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model** \
[[ECCV 2024](https://arxiv.org/abs/2408.13459)]
[[Code](https://github.com/Chen-Rao/VD-Diff)]

**DAVI: Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problem** \
[[ECCV 2024](https://arxiv.org/abs/2407.16125)]
[[Code](https://github.com/mlvlab/DAVI)]

**Low-Light Image Enhancement with Wavelet-based Diffusion Models** \
[[SIGGRAPH Asia 2023](https://arxiv.org/abs/2306.00306)]
[[Code](https://github.com/JianghaiSCU/Diffusion-Low-Light)]

**Residual Denoising Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2308.13712)]
[[Code](https://github.com/nachifur/RDDM)]

**Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks** \
[[CVPR 2024](https://arxiv.org/abs/2403.00644)]
[[Code](https://github.com/yuhaoliu7456/Diff-Plugin)]

**Deep Equilibrium Diffusion Restoration with Parallel Sampling** \
[[CVPR 2024](https://arxiv.org/abs/2311.11600)]
[[Code](https://github.com/caojiezhang/deqir)]

**ReFIR: Grounding Large Restoration Models with Retrieval Augmentation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.05601)]
[[Code](https://github.com/csguoh/ReFIR)]

**DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.18666)]
[[Code](https://github.com/shallowdream204/DreamClear)]

**Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression** \
[[Website](https://arxiv.org/abs/2412.08912)]
[[Code](https://github.com/alimd94/DiQP)]

**Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems** \
[[Website](https://arxiv.org/abs/2407.11288)]
[[Code](https://github.com/ualcalar17/ZAPS)]

**UniProcessor: A Text-induced Unified Low-level Image Processor** \
[[Website](https://arxiv.org/abs/2407.20928)]
[[Code](https://github.com/IntMeGroup/UniProcessor)]

**Refusion: Enabling Large-Size Realistic Image Restoration with Latent-Space Diffusion Models** \
[[CVPR 2023 Workshop NTIRE](https://arxiv.org/abs/2304.08291)]
[[Code](https://github.com/Algolzw/image-restoration-sde)]

**Equipping Diffusion Models with Differentiable Spatial Entropy for Low-Light Image Enhancement** \
[[CVPR 2024 Workshop NTIRE](https://arxiv.org/abs/2404.09735)]
[[Code](https://github.com/shermanlian/spatial-entropy-loss)]

**PnP-Flow: Plug-and-Play Image Restoration with Flow Matching** \
[[Website](https://arxiv.org/abs/2410.02423)]
[[Code](https://github.com/annegnx/PnP-Flow)]

**Deep Data Consistency: a Fast and Robust Diffusion Model-based Solver for Inverse Problems** \
[[Website](https://arxiv.org/abs/2405.10748)]
[[Code](https://github.com/Hanyu-Chen373/DeepDataConsistency)]

**Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration** \
[[Website](https://arxiv.org/abs/2308.08730)]
[[Code](https://github.com/wlydlut/C2F-DFT)]

**Stimulating the Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling** \
[[Website](https://arxiv.org/abs/2307.03992)]
[[Code](https://github.com/Li-Tong-621/DMID)]

**Solving Linear Inverse Problems Provably via Posterior Sampling with Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2307.00619)]
[[Code](https://github.com/liturout/psld)]

**Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior** \
[[Website](https://arxiv.org/abs/2406.09389)]
[[Code](https://github.com/ztMotaLee/Sagiri)]

**Frequency Compensated Diffusion Model for Real-scene Dehazing** \
[[Website](https://arxiv.org/abs/2308.10510)]
[[Code](https://github.com/W-Jilly/frequency-compensated-diffusion-model-pytorch)]

**Efficient Image Deblurring Networks based on Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.05907)]
[[Code](https://github.com/bnm6900030/swintormer)]

**Blind Image Restoration via Fast Diffusion Inversion** \
[[Website](https://arxiv.org/abs/2405.19572)]
[[Code](https://github.com/hamadichihaoui/BIRD)]

**DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.16749)]
[[Code](https://github.com/sun-umn/DMPlug)]

**Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling** \
[[Website](https://arxiv.org/abs/2305.16965)]
[[Code](https://github.com/GongyeLiu/SSD)]

**Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration** \
[[Website](https://arxiv.org/abs/2406.18516)]
[[Code](https://github.com/KangLiao929/Noise-DA/)]

**Unlimited-Size Diffusion Restoration** \
[[Website](https://arxiv.org/abs/2303.00354)]
[[Code](https://github.com/wyhuai/DDNM/tree/main/hq_demo)]

**VmambaIR: Visual State Space Model for Image Restoration** \
[[Website](https://arxiv.org/abs/2403.11423)]
[[Code](https://github.com/AlphacatPlus/VmambaIR)]

**Using diffusion model as constraint: Empower Image Restoration Network Training with Diffusion Model** \
[[Website](https://arxiv.org/abs/2406.19030)]
[[Code](https://github.com/JosephTiTan/DiffLoss)]

**Super-resolving Real-world Image Illumination Enhancement: A New Dataset and A Conditional Diffusion Model** \
[[Website](https://arxiv.org/abs/2410.12961)]
[[Code](https://github.com/Yaofang-Liu/Super-Resolving)]

**TIP: Text-Driven Image Processing with Semantic and Restoration Instructions** \
[[ECCV 2024](https://arxiv.org/abs/2312.11595)]
[[Project](https://chenyangqiqi.github.io/tip/)]

**Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.16152)]
[[Project](https://giannisdaras.github.io/warped_diffusion.github.io/)]

**GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration** \
[[Website](https://arxiv.org/abs/2411.17687)]
[[Project](https://sudraj2002.github.io/gendegpage/)]

**VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.00156)]
[[Project](https://vision-xl.github.io/)]

**Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model** \
[[ICCV 2023](https://arxiv.org/abs/2308.13164)]

**Multiscale Structure Guided Diffusion for Image Deblurring** \
[[ICCV 2023](https://arxiv.org/abs/2212.01789)]

**Boosting Image Restoration via Priors from Pre-trained Models** \
[[CVPR 2024](https://arxiv.org/abs/2403.06793)]

**A Modular Conditional Diffusion Framework for Image Reconstruction** \
[[Website](https://arxiv.org/abs/2411.05993)]

**Unpaired Photo-realistic Image Deraining with Energy-informed Diffusion Model** \
[[Website](https://arxiv.org/abs/2407.17193)]

**Particle-Filtering-based Latent Diffusion for Inverse Problems** \
[[Website](https://arxiv.org/abs/2408.13868)]

**Bayesian Conditioned Diffusion Models for Inverse Problem** \
[[Website](https://arxiv.org/abs/2406.09768)]

**ReCo-Diff: Explore Retinex-Based Condition Strategy in Diffusion Model for Low-Light Image Enhancement** \
[[Website](https://arxiv.org/abs/2312.12826)]

**Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration** \
[[Website](https://arxiv.org/abs/2312.02918)]

**Tell Me What You See: Text-Guided Real-World Image Denoising** \
[[Website](https://arxiv.org/abs/2312.10191)]

**Zero-LED: Zero-Reference Lighting Estimation Diffusion Model for Low-Light Image Enhancement** \
[[Website](https://arxiv.org/abs/2403.02879)]

**Prototype Clustered Diffusion Models for Versatile Inverse Problems** \
[[Website](https://arxiv.org/abs/2407.09768)]

**AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement** \
[[Website](https://arxiv.org/abs/2407.14900)]

**Taming Generative Diffusion for Universal Blind Image Restoration** \
[[Website](https://arxiv.org/abs/2408.11287)]

**Efficient Image Restoration through Low-Rank Adaptation and Stable Diffusion XL** \
[[Website](https://arxiv.org/abs/2408.17060)]

**Empirical Bayesian image restoration by Langevin sampling with a denoising diffusion implicit prior** \
[[Website](https://arxiv.org/abs/2409.04384)]

**Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration** \
[[Website](https://arxiv.org/abs/2409.03455)]

**FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process** \
[[Website](https://arxiv.org/abs/2409.07451)]

**Diffusion State-Guided Projected Gradient for Inverse Problems** \
[[Website](https://arxiv.org/abs/2410.03463)]

**InstantIR: Blind Image Restoration with Instant Generative Reference** \
[[Website](https://arxiv.org/abs/2410.06551)]

**Score-Based Variational Inference for Inverse Problems** \
[[Website](https://arxiv.org/abs/2410.05646)]

**Towards Flexible and Efficient Diffusion Low Light Enhancer** \
[[Website](https://arxiv.org/abs/2410.12346)]

**G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving** \
[[Website](https://arxiv.org/abs/2410.14710)]

**AllRestorer: All-in-One Transformer for Image Restoration under Composite Degradations** \
[[Website](https://arxiv.org/abs/2411.10708)]

**DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration** \
[[Website](https://arxiv.org/abs/2411.18745)]

**Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion** \
[[Website](https://arxiv.org/abs/2412.00557)]

**DIVD: Deblurring with Improved Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2412.00773)]

**Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration** \
[[Website](https://arxiv.org/abs/2412.00878)]

**Enhancing and Accelerating Diffusion-Based Inverse Problem Solving through Measurements Optimization** \
[[Website](https://arxiv.org/abs/2412.03941)]

**Are Conditional Latent Diffusion Models Effective for Image Restoration?** \
[[Website](https://arxiv.org/abs/2412.09324)]

## Colorization
**Control Color: Multimodal Diffusion-based Interactive Image Colorization** \
[[Website](https://arxiv.org/abs/2402.10855)]
[[Project](https://zhexinliang.github.io/Control_Color/)]
[[Code](https://github.com/ZhexinLiang/Control-Color)]

**Multimodal Semantic-Aware Automatic Colorization with Diffusion Prior** \
[[Website](https://arxiv.org/abs/2404.16678)]
[[Project](https://servuskk.github.io/ColorDiff-Image/)]
[[Code](https://github.com/servuskk/ColorDiff-Image)]

**ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text** \
[[Website](https://arxiv.org/abs/2401.01456)]
[[Code](https://github.com/tellurion-kanata/colorizeDiffusion)]

**Diffusing Colors: Image Colorization with Text Guided Diffusion** \
[[SIGGRAPH Asia 2023](https://arxiv.org/abs/2312.04145)]
[[Project](https://pub.res.lightricks.com/diffusing-colors/)]

**Enhancing Diffusion Posterior Sampling for Inverse Problems by Integrating Crafted Measurements** \
[[Website](https://arxiv.org/abs/2411.09850)]

**DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models** \
[[Website](https://arxiv.org/abs/2308.01655)]

## Face Restoration
**DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior** \
[[Website](https://arxiv.org/abs/2308.15070)]
[[Project](https://0x3f3f3f3fun.github.io/projects/diffbir/)]
[[Code](https://github.com/XPixelGroup/DiffBIR)]

**OSDFace: One-Step Diffusion Model for Face Restoration** \
[[Website](https://arxiv.org/abs/2411.17163)]
[[Project](https://jkwang28.github.io/OSDFace-web/)]
[[Code](https://github.com/jkwang28/OSDFace)]

**ReF-LDM: A Latent Diffusion Model for Reference-based Face Image Restoration** \
[[Website](https://arxiv.org/abs/2412.05043)]
[[Project](https://chiweihsiao.github.io/refldm.github.io/)]
[[Code](https://github.com/ChiWeiHsiao/ref-ldm)]

**InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention** \
[[Website](https://arxiv.org/abs/2412.06753)]
[[Project](https://snap-research.github.io/InstantRestore/)]
[[Code](https://github.com/snap-research/InstantRestore)]

**DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration** \
[[CVPR 2023](https://arxiv.org/abs/2303.06885)]
[[Code](https://github.com/Kaldwin0106/DR2_Drgradation_Remover)]

**PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance** \
[[NeurIPS 2023](https://arxiv.org/abs/2309.10810)]
[[Code](https://github.com/pq-yang/pgdiff)]

**DifFace: Blind Face Restoration with Diffused Error Contraction** \
[[Website](https://arxiv.org/abs/2212.06512)]
[[Code](https://github.com/zsyOAOA/DifFace)]

**AuthFace: Towards Authentic Blind Face Restoration with Face-oriented Generative Diffusion Prior** \
[[Website](https://arxiv.org/abs/2410.09864)]
[[Code](https://github.com/EthanLiang99/AuthFace)]

**RestorerID: Towards Tuning-Free Face Restoration with ID Preservation** \
[[Website](https://arxiv.org/abs/2411.14125)]
[[Code](https://github.com/YingJiacheng/RestorerID)]

**Towards Real-World Blind Face Restoration with Generative Diffusion Prior** \
[[Website](https://arxiv.org/abs/2312.15736)]
[[Code](https://github.com/chenxx89/BFRffusion)]

**Towards Unsupervised Blind Face Restoration using Diffusion Prior** \
[[Website](https://arxiv.org/abs/2410.04618)]
[[Project](https://dt-bfr.github.io/)]

**DiffBFR: Bootstrapping Diffusion Model Towards Blind Face Restoration** \
[[Website](https://arxiv.org/abs/2305.04517)]

**CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.06106)]

**DiffMAC: Diffusion Manifold Hallucination Correction for High Generalization Blind Face Restoration** \
[[Website](https://arxiv.org/abs/2403.10098)]

**Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling** \
[[Website](https://arxiv.org/abs/2409.08906)]

**Overcoming False Illusions in Real-World Face Restoration with Multi-Modal Guided Diffusion Model** \
[[Website](https://arxiv.org/abs/2410.04161)]

**DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration** \
[[Website](https://arxiv.org/abs/2411.10508)]

## Storytelling
⭐⭐**Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2306.00973)]
[[Project](https://haoningwu3639.github.io/StoryGen_Webpage/)]
[[Code](https://github.com/haoningwu3639/StoryGen)]

⭐⭐**Training-Free Consistent Text-to-Image Generation** \
[[SIGGRAPH 2024](https://arxiv.org/abs/2402.03286)]
[[Project](https://consistory-paper.github.io/)]
[[Code](https://github.com/kousw/experimental-consistory)]

**The Chosen One: Consistent Characters in Text-to-Image Diffusion Models** \
[[SIGGRAPH 2024](https://arxiv.org/abs/2311.10093)]
[[Project](https://omriavrahami.com/the-chosen-one/)]
[[Code](https://github.com/ZichengDuan/TheChosenOne)]

**DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation** \
[[Website](https://arxiv.org/abs/2412.07589)]
[[Project](https://jianzongwu.github.io/projects/diffsensei/)]
[[Code](https://github.com/jianzongwu/DiffSensei)]

**AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation** \
[[Website](https://arxiv.org/abs/2406.01388)]
[[Project](https://howe183.github.io/AutoStudio.io/)]
[[Code](https://github.com/donahowe/AutoStudio)]

**StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation** \
[[Website](https://arxiv.org/abs/2405.01434)]
[[Project](https://storydiffusion.github.io/)]
[[Code](https://github.com/HVision-NKU/StoryDiffusion)]

**StoryGPT-V: Large Language Models as Consistent Story Visualizers** \
[[Website](https://arxiv.org/abs/2312.02252)]
[[Project](https://storygpt-v.s3.amazonaws.com/index.html)]
[[Code](https://github.com/xiaoqian-shen/StoryGPT-V)]

**Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation** \
[[Website](https://arxiv.org/abs/2307.06940)]
[[Project](https://ailab-cvc.github.io/Animate-A-Story/)]
[[Code](https://github.com/AILab-CVC/Animate-A-Story)]

**TaleCrafter: Interactive Story Visualization with Multiple Characters** \
[[Website](https://arxiv.org/abs/2305.18247)]
[[Project](https://ailab-cvc.github.io/TaleCrafter/)]
[[Code](https://github.com/AILab-CVC/TaleCrafter)]

**Story-Adapter: A Training-free Iterative Framework for Long Story Visualization** \
[[Website](https://arxiv.org/abs/2410.06244)]
[[Project](https://jwmao1.github.io/storyadapter/)]
[[Code](https://github.com/jwmao1/story-adapter)]

**DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation** \
[[Website](https://arxiv.org/abs/2411.16657)]
[[Project](https://dreamrunner-story2video.github.io/)]
[[Code](https://github.com/wz0919/DreamRunner)]

**ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions** \
[[Website](https://arxiv.org/abs/2412.01987)]
[[Project](https://soczech.github.io/showhowto/)]
[[Code](https://github.com/soCzech/showhowto)]

**StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion** \
[[ECCV 2024](https://arxiv.org/abs/2404.05979)]
[[Code](https://github.com/tobran/StoryImager)]

**Make-A-Story: Visual Memory Conditioned Consistent Story Generation** \
[[CVPR 2023](https://arxiv.org/abs/2211.13319)]
[[Code](https://github.com/ubc-vision/Make-A-Story)]

**StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization** \
[[AAAI 2025](https://arxiv.org/abs/2412.07375)]
[[Code](https://github.com/Aria-Zhangjl/StoryWeaver)]

**StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation** \
[[Website](https://arxiv.org/abs/2409.12576)]
[[Code](https://github.com/RedAIGC/StoryMaker)]

**SEED-Story: Multimodal Long Story Generation with Large Language Model** \
[[Website](https://arxiv.org/abs/2407.08683)]
[[Code](https://github.com/TencentARC/SEED-Story)]

**Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2211.10950)]
[[Code](https://github.com/xichenpan/ARLDM)]

**Masked Generative Story Transformer with Character Guidance and Caption Augmentation** \
[[Website](https://arxiv.org/abs/2403.08502)]
[[Code](https://github.com/chrispapa2000/maskgst)]

**StoryBench: A Multifaceted Benchmark for Continuous Story Visualization** \
[[Website](https://arxiv.org/abs/2308.11606)]
[[Code](https://github.com/google/storybench)]

**Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.02482)]
[[Code](https://github.com/muzishen/RCDMs)]

**DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion** \
[[Website](https://arxiv.org/abs/2407.12899)]
[[Project](https://dream-xyz.github.io/dreamstory)]

**Multi-Shot Character Consistency for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2412.07750)]
[[Project](https://research.nvidia.com/labs/par/video_storyboarding/)]

**MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising** \
[[Website](https://arxiv.org/abs/2312.10899)]
[[Project](https://magicscroll.github.io/)]

**Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis** \
[[ICASSP 2024](https://arxiv.org/abs/2309.09553)]

**CogCartoon: Towards Practical Story Visualization** \
[[Website](https://arxiv.org/abs/2312.10718)]

**Generating coherent comic with rich story using ChatGPT and Stable Diffusion** \
[[Website](https://arxiv.org/abs/2305.11067)]

**Improved Visual Story Generation with Adaptive Context Modeling** \
[[Website](https://arxiv.org/abs/2305.16811)]

**Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control** \
[[Website](https://arxiv.org/abs/2312.07549)]

**Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models** \
[[Website](https://arxiv.org/abs/2302.03900)]

**Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.11852)]

**ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.02820)]

**Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection** \
[[Website](https://arxiv.org/abs/2409.19624)]

**StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration** \
[[Website](https://arxiv.org/abs/2411.04925)]

**Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention** \
[[Website](https://arxiv.org/abs/2411.19261)]

## Try On
**TryOnDiffusion: A Tale of Two UNets** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Zhu_TryOnDiffusion_A_Tale_of_Two_UNets_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2306.08276)]
[[Project](https://tryondiffusion.github.io/)]
[[Official Code](https://github.com/tryonlabs/tryondiffusion)]
[[Unofficial Code](https://github.com/fashn-AI/tryondiffusion)]

**StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On** \
[[CVPR 2024](https://arxiv.org/abs/2312.01725)]
[[Project](https://rlawjdghek.github.io/StableVITON/)]
[[Code](https://github.com/rlawjdghek/stableviton)]

**VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding** \
[[Website](https://arxiv.org/abs/2408.12340)]
[[Project](https://vton-handfit.github.io/)]
[[Code](https://github.com/VTON-HandFit/VTON-HandFit)]

**IMAGDressing-v1: Customizable Virtual Dressing** \
[[Website](https://arxiv.org/abs/2407.12705)]
[[Project](https://imagdressing.github.io/)]
[[Code](https://github.com/muzishen/IMAGDressing)]

**OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person** \
[[Website](https://arxiv.org/abs/2407.16224)]
[[Project](https://humanaigc.github.io/outfit-anyone/)]
[[Code](https://github.com/HumanAIGC/OutfitAnyone)]

**AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.04146)]
[[Project](https://crayon-shinchan.github.io/AnyDressing/)]
[[Code](https://github.com/Crayon-Shinchan/AnyDressing)]

**ViViD: Video Virtual Try-on using Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.11794)]
[[Project](https://becauseimbatman0.github.io/ViViD)]
[[Code](https://github.com/BecauseImBatman0/ViViD)]

**GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting** \
[[Website](https://arxiv.org/abs/2405.07472)]
[[Project](https://haroldchen19.github.io/gsvton/)]
[[Code](https://github.com/HaroldChen19/GaussianVTON)]

**Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images** \
[[Website](https://arxiv.org/abs/2311.16094)]
[[Project](https://cuiaiyu.github.io/StreetTryOn/)]
[[Code](https://github.com/cuiaiyu/street-tryon-benchmark)]

**From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation** \
[[Website](https://arxiv.org/abs/2404.15267)]
[[Project](https://huanngzh.github.io/Parts2Whole/)]
[[Code](https://github.com/huanngzh/Parts2Whole)]

**PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns** \
[[Website](https://arxiv.org/abs/2312.04534)]
[[Project](https://ningshuliang.github.io/2023/Arxiv/index.html)]
[[Code](https://github.com/ningshuliang/PICTURE)]

**StableGarment: Garment-Centric Generation via Stable Diffusion** \
[[Website](https://arxiv.org/abs/2403.10783)]
[[Project](https://raywang335.github.io/stablegarment.github.io/)]
[[Code](https://github.com/logn-2024/StableGarment)]

**Improving Diffusion Models for Virtual Try-on** \
[[Website](https://arxiv.org/abs/2403.05139)]
[[Project](https://idm-vton.github.io/)]
[[Code](https://github.com/yisol/IDM-VTON)]

**D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On** \
[[ECCV 2024](https://arxiv.org/abs/2407.15111)]
[[Code](https://github.com/Jerome-Young/D4-VTON)]

**Improving Virtual Try-On with Garment-focused Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2409.08258)]
[[Code](https://github.com/siqi0905/GarDiff/tree/master)]

**Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On** \
[[CVPR 2024](https://arxiv.org/abs/2404.01089)]
[[Code](https://github.com/gal4way/tpd)]

**Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow** \
[[ACM MM 2023](https://arxiv.org/abs/2308.06101)]
[[Code](https://github.com/bcmi/DCI-VTON-Virtual-Try-On)]

**LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On** \
[[ACM MM 2023](https://arxiv.org/abs/2305.13501)]
[[Code](https://github.com/miccunifi/ladi-vton)]

**OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on** \
[[Website](https://arxiv.org/abs/2403.01779)]
[[Code](https://github.com/levihsu/OOTDiffusion)]

**CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Model** \
[[Website](https://arxiv.org/abs/2407.15886)]
[[Code](https://github.com/Zheng-Chong/CatVTON)]

**Learning Flow Fields in Attention for Controllable Person Image Generation** \
[[Website](https://arxiv.org/abs/2412.08486)]
[[Code](https://github.com/franciszzj/Leffa)]

**DreamPaint: Few-Shot Inpainting of E-Commerce Items for Virtual Try-On without 3D Modeling** \
[[Website](https://arxiv.org/abs/2305.01257)]
[[Code](https://github.com/EmergingUnicorns/DeepPaint)]

**CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model** \
[[Website](https://arxiv.org/abs/2311.18405)]
[[Code](https://github.com/zengjianhao/cat-dm)]

**MV-VTON: Multi-View Virtual Try-On with Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.17364)]
[[Code](https://github.com/hywang2002/MV-VTON)]

**M&M VTO: Multi-Garment Virtual Try-On and Editing** \
[[CVPR 2024 Highlight](https://arxiv.org/abs/2406.04542)]
[[Project](https://mmvto.github.io/)]

**WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2407.10625)]
[[Project](https://wildvidfit-project.github.io/)]

**Fashion-VDM: Video Diffusion Model for Virtual Try-On** \
[[SIGGRAPH Asia 2024](https://arxiv.org/abs/2411.00225)]
[[Project](https://johannakarras.github.io/Fashion-VDM/)]

**Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos** \
[[Website](https://arxiv.org/abs/2404.17571)]
[[Project](https://mengtingchen.github.io/tunnel-try-on-page/)]

**Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild** \
[[Website](https://arxiv.org/abs/2406.15331)]
[[Project](https://nadavorzech.github.io/max4zero.github.io/)]

**TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.18350)]
[[Project](https://rizavelioglu.github.io/tryoffdiff/)]

**Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All** \
[[Website](https://arxiv.org/abs/2401.13795)]
[[Project](https://diffuse2choose.github.io/)]

**Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment** \
[[Website](https://arxiv.org/abs/2403.12965)]
[[Project](https://mengtingchen.github.io/wear-any-way-page/)]

**VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2405.18326)]
[[Project](https://zhengjun-ai.github.io/viton-dit-page/)]

**AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario** \
[[Website](https://arxiv.org/abs/2405.18172)]
[[Project](https://colorful-liyu.github.io/anyfit-page/)]

**Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism** \
[[Website](https://arxiv.org/abs/2412.09822)]
[[Project](https://zhengjun-ai.github.io/dynamic-tryon-page/)]

**FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on** \
[[IJCAI 2024](https://arxiv.org/abs/2404.14162)]

**GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon** \
[[Website](https://arxiv.org/abs/2406.02184)]

**WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on** \
[[Website](https://arxiv.org/abs/2312.03667)]

**Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles** \
[[Website](https://arxiv.org/abs/2401.11239)]

**Mobile Fitting Room: On-device Virtual Try-on via Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.01877)]

**Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.07371)]

**ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On** \
[[Website](https://arxiv.org/abs/2403.13951)]

**ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model** \
[[Website](https://arxiv.org/abs/2404.04833)]

**AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion** \
[[Website](https://arxiv.org/abs/2408.11553)]

**DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing** \
[[Website](https://arxiv.org/abs/2409.01086)]

**TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On** \
[[Website](https://arxiv.org/abs/2411.17017)]

**Controllable Human Image Generation with Personalized Multi-Garments** \
[[Website](https://arxiv.org/abs/2411.16801)]

**RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation** \
[[Website](https://arxiv.org/abs/2411.19528)]

**SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.10178)]

## Drag Edit
**DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models** \
[[ICLR 2024](https://openreview.net/forum?id=OEL4FJMg1b)]
[[Website](https://arxiv.org/abs/2307.02421)]
[[Project](https://mc-e.github.io/project/DragonDiffusion/)]
[[Code](https://github.com/MC-E/DragonDiffusion)]

**Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2305.10973)]
[[Project](https://vcai.mpi-inf.mpg.de/projects/DragGAN/)]
[[Code](https://github.com/XingangPan/DragGAN)]

**Readout Guidance: Learning Control from Diffusion Features** \
[[CVPR 2024 Highlight](https://arxiv.org/abs/2312.02150)]
[[Project](https://readout-guidance.github.io/)]
[[Code](https://github.com/google-research/readout_guidance)]

**FreeDrag: Feature Dragging for Reliable Point-based Image Editing** \
[[CVPR 2024](https://arxiv.org/abs/2307.04684)]
[[Project](https://lin-chen.site/projects/freedrag/)]
[[Code](https://github.com/LPengYang/FreeDrag)]

**DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing** \
[[CVPR 2024](https://arxiv.org/abs/2306.14435)]
[[Project](https://yujun-shi.github.io/projects/dragdiffusion.html)]
[[Code](https://github.com/Yujun-Shi/DragDiffusion)]

**InstaDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos** \
[[Website](https://arxiv.org/abs/2405.13722)]
[[Project](https://instadrag.github.io/)]
[[Code](https://github.com/magic-research/InstaDrag)]

**GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.07206)]
[[Project](https://gooddrag.github.io/)]
[[Code](https://github.com/zewei-Zhang/GoodDrag)]**Repositioning the Subject within Image** \
[[Website](https://arxiv.org/abs/2401.16861)]
[[Project](https://yikai-wang.github.io/seele/)]
[[Code](https://github.com/Yikai-Wang/ReS)]**Drag-A-Video: Non-rigid Video Editing with Point-based Interaction** \
[[Website](https://arxiv.org/abs/2312.02936)]
[[Project](https://drag-a-video.github.io/)]
[[Code](https://github.com/tyshiwo1/drag-a-video)]**ObjCtrl-2.5D: Training-free Object Control with Camera Poses** \
[[Website](https://arxiv.org/abs/2412.07721)]
[[Project](https://wzhouxiff.github.io/projects/ObjCtrl-2.5D/)]
[[Code](https://github.com/wzhouxiff/ObjCtrl-2.5D)]**DragAnything: Motion Control for Anything using Entity Representation** \
[[Website](https://arxiv.org/abs/2403.07420)]
[[Project](https://weijiawu.github.io/draganything_page/)]
[[Code](https://github.com/showlab/DragAnything)]**InstantDrag: Improving Interactivity in Drag-based Image Editing** \
[[Website](https://arxiv.org/abs/2409.08857)]
[[Project](https://joonghyuk.com/instantdrag-web/)]
[[Code](https://github.com/alex4727/InstantDrag)]**DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing** \
[[CVPR 2024](https://arxiv.org/abs/2402.02583)]
[[Code](https://github.com/MC-E/DragonDiffusion)]**Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation** \
[[CVPR 2024](https://arxiv.org/abs/2404.01050)]
[[Code](https://github.com/haofengl/DragNoise)]**DragVideo: Interactive Drag-style Video Editing** \
[[ECCV 2024](https://arxiv.org/abs/2312.02216)]
[[Code](https://github.com/rickyskywalker/dragvideo-official)]**RotationDrag: Point-based Image Editing with Rotated Diffusion Features** \
[[Website](https://arxiv.org/abs/2401.06442)]
[[Code](https://github.com/Tony-Lowe/RotationDrag)]**TrackGo: A Flexible and Efficient Method for Controllable Video Generation** \
[[Website](https://arxiv.org/abs/2408.11475)]
[[Project](https://zhtjtcz.github.io/TrackGo-Page/)]**DragText: Rethinking Text Embedding in Point-based Image Editing** \
[[Website](https://arxiv.org/abs/2407.17843)]
[[Project](https://micv-yonsei.github.io/dragtext2025/)]**OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2412.09623)]
[[Project](https://lwq20020127.github.io/OmniDrag/)]**FastDrag: Manipulate Anything in One Step** \
[[Website](https://arxiv.org/abs/2405.15769)]
[[Project](https://fastdrag-site.github.io/)]**DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory** \
[[Website](https://arxiv.org/abs/2308.08089)]
[[Project](https://www.microsoft.com/en-us/research/project/dragnuwa/)]**StableDrag: Stable Dragging for Point-based Image Editing** \
[[Website](https://arxiv.org/abs/2403.04437)]
[[Project](https://stabledrag.github.io/)]**DiffUHaul: A Training-Free Method for Object Dragging in Images** \
[[Website](https://arxiv.org/abs/2406.01594)]
[[Project](https://omriavrahami.com/diffuhaul/)]**RegionDrag: Fast Region-Based Image Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.18247)]**Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators** \
[[Website](https://arxiv.org/abs/2401.18085)]**Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing** \
[[Website](https://arxiv.org/abs/2410.03097)]**AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing** \
[[Website](https://arxiv.org/abs/2410.12696)]## Diffusion Models Inversion
⭐⭐⭐**Null-text Inversion for Editing Real Images using Guided Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Mokady_NULL-Text_Inversion_for_Editing_Real_Images_Using_Guided_Diffusion_Models_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.09794)]
[[Project](https://null-text-inversion.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt/#null-text-inversion-for-editing-real-images)]
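
The idea behind null-text inversion is compact: record a DDIM inversion trajectory under the source prompt, then at each sampling step optimize only the unconditional ("null") embedding so that classifier-free-guided sampling stays on that trajectory. A schematic sketch, assuming placeholder handles (`unet`, `scheduler`, `cond_emb`, `uncond_emb`, and a recorded trajectory `z_traj`) from a standard, frozen Stable Diffusion setup rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def cfg_eps(unet, z, t, cond_emb, null_emb, scale=7.5):
    # classifier-free guidance with a *learned* null embedding
    e_c = unet(z, t, encoder_hidden_states=cond_emb).sample
    e_u = unet(z, t, encoder_hidden_states=null_emb).sample
    return e_u + scale * (e_c - e_u)

def null_text_inversion(unet, scheduler, z_traj, cond_emb, uncond_emb,
                        inner_steps=10):
    # assumes unet parameters are frozen; only null_emb is trained
    null_embs, z = [], z_traj[-1]            # start from the inverted noise
    for i, t in enumerate(scheduler.timesteps):
        target = z_traj[-(i + 2)]            # next latent on the DDIM trajectory
        null_emb = uncond_emb.clone().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=1e-2)
        for _ in range(inner_steps):         # fit the null embedding at this step
            z_prev = scheduler.step(cfg_eps(unet, z, t, cond_emb, null_emb),
                                    t, z).prev_sample
            loss = F.mse_loss(z_prev, target)
            opt.zero_grad(); loss.backward(); opt.step()
        null_embs.append(null_emb.detach())
        with torch.no_grad():                # take the step with the tuned null
            z = scheduler.step(cfg_eps(unet, z, t, cond_emb, null_emb),
                               t, z).prev_sample
    return null_embs                         # reuse these when editing the prompt
```

⭐⭐**Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code** \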
[[ICLR 2024](https://openreview.net/forum?id=FoMZ4ljhVw)]
[[Website](https://arxiv.org/abs/2310.01506)]
[[Project](https://cure-lab.github.io/PnPInversion/)]
[[Code](https://github.com/cure-lab/DirectInversion/tree/main)]

⭐**Inversion-Based Creativity Transfer with Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Zhang_Inversion-Based_Style_Transfer_With_Diffusion_Models_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.13203)]
[[Code](https://github.com/zyxElsa/InST)]

⭐**EDICT: Exact Diffusion Inversion via Coupled Transformations** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Wallace_EDICT_Exact_Diffusion_Inversion_via_Coupled_Transformations_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.12446)]
[[Code](https://github.com/salesforce/edict)]
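
For context, the vanilla DDIM inversion that exact-inversion methods like EDICT, BDIA, and BELM improve on is just the deterministic sampler run in reverse, reusing the model's noise prediction at each step. A minimal sketch with assumed handles (`unet`, `alphas_cumprod`, a VAE-encoded latent `z0`, and ascending `timesteps` are placeholders for a standard Stable Diffusion setup):

```python
import torch

@torch.no_grad()
def ddim_invert(z0, prompt_emb, unet, alphas_cumprod, timesteps):
    """Map a clean latent z0 to noise, recording the trajectory."""
    z, traj = z0, [z0]
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):  # t ascending
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = unet(z, t_cur, encoder_hidden_states=prompt_emb).sample
        z0_pred = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # predicted x0
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps  # re-noise one step
        traj.append(z)
    return traj  # traj[-1] is the "inverted noise" that editing starts from
```

⭐**Improving Negative-Prompt Inversion via Proximal Guidance** \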
[[Website](https://arxiv.org/abs/2306.05414)]
[[Code](https://github.com/phymhan/prompt-to-prompt)]

**An Edit Friendly DDPM Noise Space: Inversion and Manipulations** \
[[CVPR 2024](https://arxiv.org/abs/2304.06140)]
[[Project](https://inbarhub.github.io/DDPM_inversion/)]
[[Code](https://github.com/inbarhub/DDPM_inversion)]
[[Demo](https://huggingface.co/spaces/LinoyTsaban/edit_friendly_ddpm_inversion)]

**Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing** \
[[NeurIPS 2023](https://neurips.cc/virtual/2023/poster/72801)]
[[Website](https://arxiv.org/abs/2309.15664)]
[[Code](https://github.com/wangkai930418/DPL)]

**Inversion-Free Image Editing with Natural Language** \
[[CVPR 2024](https://arxiv.org/abs/2312.04965)]
[[Project](https://sled-group.github.io/InfEdit/index.html)]
[[Code](https://github.com/sled-group/InfEdit)]

**LEDITS++: Limitless Image Editing using Text-to-Image Models** \
[[CVPR 2024](https://arxiv.org/abs/2311.16711)]
[[Project](https://leditsplusplus-project.static.hf.space/index.html)]
[[Code](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/ledits_pp)]
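
LEDITS++ is available directly in Diffusers, so its inversion-plus-semantic-guidance recipe can be tried in a few lines; a sketch along the lines of the pipeline documentation (the image URL is a placeholder):

```python
import torch
from diffusers import LEditsPPPipelineStableDiffusion
from diffusers.utils import load_image

pipe = LEditsPPPipelineStableDiffusion.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/portrait.png")  # placeholder URL
_ = pipe.invert(image=image, num_inversion_steps=50, skip=0.1)

edited = pipe(
    editing_prompt=["glasses", "smiling"],
    reverse_editing_direction=[False, False],  # add the concepts, don't remove
    edit_guidance_scale=[5.0, 7.5],
    edit_threshold=[0.9, 0.9],
).images[0]
```

**Noise Map Guidance: Inversion with Spatial Context for Real Image Editing** \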
[[ICLR 2024](https://openreview.net/forum?id=mhgm0IXtHw)]
[[Website](https://arxiv.org/abs/2402.04625)]
[[Code](https://github.com/hansam95/nmg)]

**ReNoise: Real Image Inversion Through Iterative Noising** \
[[ECCV 2024](https://arxiv.org/abs/2403.14602)]
[[Project](https://garibida.github.io/ReNoise-Inversion/)]
[[Code](https://github.com/garibida/ReNoise-Inversion)]

**IterInv: Iterative Inversion for Pixel-Level T2I Models** \
[[NeurIPS-W 2023](https://neurips.cc/virtual/2023/74859)]
[[Openreview](https://openreview.net/forum?id=mSGmzVo0aS)]
[[NeuripsW](https://neurips.cc/virtual/2023/workshop/66539#wse-detail-74859)]
[[Website](https://arxiv.org/abs/2310.19540)]
[[Code](https://github.com/Tchuanm/IterInv)]

**DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models** \
[[Website](https://arxiv.org/abs/2410.08207)]
[[Project](https://hexiaoxiao-cs.github.io/DICE/)]
[[Code](https://github.com/hexiaoxiao-cs/DICE)]

**Object-aware Inversion and Reassembly for Image Editing** \
[[Website](https://arxiv.org/abs/2310.12149)]
[[Project](https://aim-uofa.github.io/OIR-Diffusion/)]
[[Code](https://github.com/aim-uofa/OIR)]

**A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/papers/Wu_A_Latent_Space_of_Stochastic_Diffusion_Models_for_Zero-Shot_Image_ICCV_2023_paper.pdf)]
[[Code](https://github.com/humansensinglab/cycle-diffusion)]

**Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2403.11105)]
[[Code](https://github.com/leeruibin/SPDInv)]

**LocInv: Localization-aware Inversion for Text-Guided Image Editing** \
[[CVPR 2024 AI4CC workshop](https://arxiv.org/abs/2405.01496)]
[[Code](https://github.com/wangkai930418/DPL)]

**Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling** \
[[IJCAI 2024](https://arxiv.org/abs/2305.16965)]
[[Code](https://github.com/gongyeliu/ssd)]

**StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing** \
[[Website](https://arxiv.org/abs/2303.15649)]
[[Code](https://github.com/sen-mao/StyleDiffusion)]

**Generating Non-Stationary Textures using Self-Rectification** \
[[Website](https://arxiv.org/abs/2401.02847)]
[[Code](https://github.com/xiaorongjun000/Self-Rectification)]

**Exact Diffusion Inversion via Bi-directional Integration Approximation** \
[[Website](https://arxiv.org/abs/2307.10829)]
[[Code](https://github.com/guoqiang-zhang-x/BDIA)]

**IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models** \
[[Website](https://arxiv.org/abs/2412.01794)]
[[Code](https://github.com/X1716/IQA-Adapter)]

**Fixed-point Inversion for Text-to-image diffusion models** \
[[Website](https://arxiv.org/abs/2312.12540)]
[[Code](https://github.com/dvirsamuel/FPI)]

**Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing** \
[[Website](https://arxiv.org/abs/2403.09468)]
[[Code](https://github.com/furiosa-ai/eta-inversion)]

**Effective Real Image Editing with Accelerated Iterative Diffusion Inversion** \
[[ICCV 2023 Oral](https://openaccess.thecvf.com/content/ICCV2023/html/Pan_Effective_Real_Image_Editing_with_Accelerated_Iterative_Diffusion_Inversion_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2309.04907)]

**BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.07273)]

**Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.18756)]

**BARET: Balanced Attention based Real image Editing driven by Target-text Inversion** \
[[WACV 2024](https://arxiv.org/abs/2312.05482)]

**Wavelet-Guided Acceleration of Text Inversion in Diffusion-Based Image Editing** \
[[ICASSP 2024](https://arxiv.org/abs/2401.09794)]

**Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing** \
[[Website](https://arxiv.org/abs/2408.13395)]

**Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations** \
[[Website](https://arxiv.org/abs/2410.10792)]

**Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.16807)]

**Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2211.07825)]

**SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing** \
[[Website](https://arxiv.org/abs/2409.10476)]

**Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.04441)]

**KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing** \
[[Website](https://arxiv.org/abs/2309.16608)]

**Tuning-Free Inversion-Enhanced Control for Consistent Image Editing** \
[[Website](https://arxiv.org/abs/2312.14611)]

**LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance** \
[[Website](https://arxiv.org/abs/2307.00522)]

## Text Guided Image Editing

⭐⭐⭐**Prompt-to-Prompt Image Editing with Cross Attention Control** \
[[ICLR 2023](https://openreview.net/forum?id=_CDixzkzeyb)]
[[Website](https://arxiv.org/abs/2208.01626)]
[[Project](https://prompt-to-prompt.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt)]
[[Replicate Demo](https://replicate.com/cjwbw/prompt-to-prompt)]
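
The mechanism behind prompt-to-prompt is that edits are steered by where cross-attention maps are injected: for an initial fraction of the sampling steps, the edited pass reuses the source pass's maps, after which its own maps take over. A toy illustration of that injection schedule (random tensors stand in for real attention maps; the shapes follow SD's 64x64 latent and 77 text tokens, and none of this is the official repo's API):

```python
import torch

def injected_attention(src_attn, edit_attn, step, total_steps, cross_frac=0.8):
    """Word-swap rule: reuse source maps early, edited maps late."""
    return src_attn if step < cross_frac * total_steps else edit_attn

heads, pixels, tokens = 8, 64 * 64, 77
src_attn = torch.rand(heads, pixels, tokens)   # from the source-prompt pass
edit_attn = torch.rand(heads, pixels, tokens)  # from the edited-prompt pass

maps_per_step = [injected_attention(src_attn, edit_attn, s, 50) for s in range(50)]
```

⭐⭐⭐**Zero-shot Image-to-Image Translation** \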
[[SIGGRAPH 2023](https://arxiv.org/abs/2302.03027)]
[[Project](https://pix2pixzero.github.io/)]
[[Code](https://github.com/pix2pixzero/pix2pix-zero)]
[[Replicate Demo](https://replicate.com/cjwbw/pix2pix-zero)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/v0.16.0/api/pipelines/stable_diffusion/pix2pix_zero)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py)]

⭐⭐**InstructPix2Pix: Learning to Follow Image Editing Instructions** \
[[CVPR 2023 (Highlight)](https://openaccess.thecvf.com/content/CVPR2023/html/Brooks_InstructPix2Pix_Learning_To_Follow_Image_Editing_Instructions_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.09800)]
[[Project](https://www.timothybrooks.com/instruct-pix2pix/)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/v0.13.0/en/api/pipelines/stable_diffusion/pix2pix)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py)]
[[Official Code](https://github.com/timothybrooks/instruct-pix2pix)]
[[Dataset](http://instruct-pix2pix.eecs.berkeley.edu/)]
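
Since the released InstructPix2Pix checkpoint is wired into Diffusers, instruction-based editing takes only a few lines; a sketch following the pipeline docs (the input-image URL is a placeholder):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/mountain.jpg")  # placeholder URL
edited = pipe(
    "make it look like winter",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # higher = stay closer to the input image
    guidance_scale=7.5,        # higher = follow the instruction more strongly
).images[0]
```

⭐⭐**Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation** \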
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Tumanyan_Plug-and-Play_Diffusion_Features_for_Text-Driven_Image-to-Image_Translation_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.12572)]
[[Project](https://pnp-diffusion.github.io/sm/index.html)]
[[Code](https://github.com/MichalGeyer/plug-and-play)]
[[Dataset](https://www.dropbox.com/sh/8giw0uhfekft47h/AAAF1frwakVsQocKczZZSX6La?dl=0)]
[[Replicate Demo](https://replicate.com/daanelson/plug_and_play_image_translation)]
[[Demo](https://huggingface.co/spaces/hysts/PnP-diffusion-features)]

⭐**DiffEdit: Diffusion-based semantic image editing with mask guidance** \
[[ICLR 2023](https://openreview.net/forum?id=3lge0p5o-M-)]
[[Website](https://arxiv.org/abs/2210.11427)]
[[Unofficial Code](https://paperswithcode.com/paper/diffedit-diffusion-based-semantic-image)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/api/pipelines/diffedit)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_diffedit.py)]
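
DiffEdit's three stages (contrast the source and target noise predictions to get a mask, invert the image, then denoise only inside the mask) map one-to-one onto the Diffusers pipeline; a sketch per its documentation (the image URL is a placeholder):

```python
import torch
from diffusers import (DDIMInverseScheduler, DDIMScheduler,
                       StableDiffusionDiffEditPipeline)
from diffusers.utils import load_image

pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

image = load_image("https://example.com/fruit_bowl.png")  # placeholder URL
source, target = "a bowl of apples", "a bowl of oranges"

mask = pipe.generate_mask(image=image, source_prompt=source, target_prompt=target)
inv_latents = pipe.invert(prompt=source, image=image).latents
edited = pipe(prompt=target, mask_image=mask, image_latents=inv_latents).images[0]
```

⭐**Imagic: Text-Based Real Image Editing with Diffusion Models** \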
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Kawar_Imagic_Text-Based_Real_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2210.09276)]
[[Project](https://imagic-editing.github.io/)]
[[Diffusers](https://github.com/huggingface/diffusers/tree/main/examples/community#imagic-stable-diffusion)]

⭐**Inpaint Anything: Segment Anything Meets Image Inpainting** \
[[Website](https://arxiv.org/abs/2304.06790)]
[[Code 1](https://github.com/geekyutao/Inpaint-Anything)]
[[Code 2](https://github.com/sail-sg/EditAnything)]

**MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Cao_MasaCtrl_Tuning-Free_Mutual_Self-Attention_Control_for_Consistent_Image_Synthesis_and_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.08465)]
[[Project](https://ljzycmd.github.io/projects/MasaCtrl/)]
[[Code](https://github.com/TencentARC/MasaCtrl)]
[[Demo](https://huggingface.co/spaces/TencentARC/MasaCtrl)]
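
MasaCtrl's central trick is easy to show in isolation: during editing, self-attention keeps the edited branch's queries but attends to keys and values cached from the source branch, so the new layout gets painted with the original content. A toy, shape-only sketch (random tensors, not the official implementation):

```python
import torch

def mutual_self_attention(q_edit, k_src, v_src):
    """Edited-branch queries attend to source-branch keys/values."""
    scale = q_edit.shape[-1] ** -0.5
    attn = torch.softmax(q_edit @ k_src.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_src

tokens, dim = 4096, 64                    # 64x64 latent, per-head channel dim
q_edit = torch.randn(8, tokens, dim)      # queries from the *edited* pass
k_src = torch.randn(8, tokens, dim)       # keys cached from the *source* pass
v_src = torch.randn(8, tokens, dim)       # values cached from the *source* pass
out = mutual_self_attention(q_edit, k_src, v_src)  # (8, 4096, 64)
```

**Collaborative Score Distillation for Consistent Visual Synthesis** \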
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/73044)]
[[Website](https://arxiv.org/abs/2307.04787)]
[[Project](https://subin-kim-cv.github.io/CSD/)]
[[Code](https://github.com/subin-kim-cv/CSD)]

**Visual Instruction Inversion: Image Editing via Visual Prompting** \
[[NeurIPS 2023](https://neurips.cc/virtual/2023/poster/70612)]
[[Website](https://arxiv.org/abs/2307.14331)]
[[Project](https://thaoshibe.github.io/visii/)]
[[Code](https://github.com/thaoshibe/visii)]

**Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models** \
[[NeurIPS 2023](https://openreview.net/forum?id=lOCHMGO6ow)]
[[Website](https://arxiv.org/abs/2306.09869)]
[[Code](https://github.com/EnergyAttention/Energy-Based-CrossAttention)]

**Localizing Object-level Shape Variations with Text-to-Image Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Patashnik_Localizing_Object-Level_Shape_Variations_with_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.11306)]
[[Project](https://orpatashnik.github.io/local-prompt-mixing/)]
[[Code](https://github.com/orpatashnik/local-prompt-mixing)]

**Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance** \
[[Website](https://arxiv.org/abs/2210.05559)]
[[Code1](https://github.com/chenwu98/unified-generative-zoo)]
[[Code2](https://github.com/chenwu98/cycle-diffusion)]
[[Diffusers Code](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cycle_diffusion)]

**PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models** \
[[Website](https://arxiv.org/abs/2303.17546)]
[[Project](https://vidit98.github.io/publication/conference-paper/pair_diff.html)]
[[Code](https://github.com/Picsart-AI-Research/PAIR-Diffusion)]
[[Demo](https://huggingface.co/spaces/PAIR/PAIR-Diffusion)]

**SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models** \
[[CVPR 2024](https://arxiv.org/abs/2312.06739)]
[[Project](https://yuzhou914.github.io/SmartEdit/)]
[[Code](https://github.com/TencentARC/SmartEdit)]

**Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing** \
[[CVPR 2024](https://arxiv.org/abs/2311.18608)]
[[Project](https://hyelinnam.github.io/CDS/)]
[[Code](https://github.com/HyelinNAM/CDS)]

**Text-Driven Image Editing via Learnable Regions** \
[[CVPR 2024](https://arxiv.org/abs/2311.16432)]
[[Project](https://yuanze-lin.me/LearnableRegions_page/)]
[[Code](https://github.com/yuanze-lin/Learnable_Regions)]

**Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators** \
[[ICLR 2024](https://arxiv.org/abs/2401.18085)]
[[Project](https://dangeng.github.io/motion_guidance/)]
[[Code](https://github.com/dangeng/motion_guidance/)]

**TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models** \
[[SIGGRAPH Asia 2024](https://arxiv.org/abs/2408.00735)]
[[Project](https://turboedit-paper.github.io/)]
[[Code](https://github.com/GiilDe/turbo-edit)]

**Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps** \
[[NeurIPS 2024](https://arxiv.org/abs/2406.14539)]
[[Project](https://yandex-research.github.io/invertible-cd/)]
[[Code](https://github.com/yandex-research/invertible-cd)]

**Zero-shot Image Editing with Reference Imitation** \
[[Website](https://arxiv.org/abs/2406.07547)]
[[Project](https://xavierchen34.github.io/MimicBrush-Page/)]
[[Code](https://github.com/ali-vilab/MimicBrush)]

**OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision** \
[[Website](https://arxiv.org/abs/2411.07199)]
[[Project](https://tiger-ai-lab.github.io/OmniEdit/)]
[[Code](https://github.com/TIGER-AI-Lab/OmniEdit)]

**MultiBooth: Towards Generating All Your Concepts in an Image from Text** \
[[Website](https://arxiv.org/abs/2404.14239)]
[[Project](https://multibooth.github.io/)]
[[Code](https://github.com/chenyangzhu1/MultiBooth)]

**Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting** \
[[Website](https://arxiv.org/abs/2404.14007)]
[[Project](https://zwl666666.github.io/infusion/)]
[[Code](https://github.com/zwl666666/infusion)]

**StyleBooth: Image Style Editing with Multimodal Instruction** \
[[Website](https://arxiv.org/abs/2404.12154)]
[[Project](https://ali-vilab.github.io/stylebooth-page/)]
[[Code](https://github.com/modelscope/scepter)]

**SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing** \
[[Website](https://arxiv.org/abs/2404.05717)]
[[Project](https://swap-anything.github.io/)]
[[Code](https://github.com/eric-ai-lab/swap-anything)]

**EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods** \
[[Website](https://arxiv.org/abs/2310.02426)]
[[Project](https://deep-ml-research.github.io/editval/#home)]
[[Code](https://github.com/deep-ml-research/editval_code)]

**InsightEdit: Towards Better Instruction Following for Image Editing** \
[[Website](https://arxiv.org/abs/2411.17323)]
[[Project](https://poppyxu.github.io/InsightEdit_web/)]
[[Code](https://github.com/poppyxu/InsightEdit)]

**InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions** \
[[Website](https://arxiv.org/abs/2305.18047)]
[[Project](https://qianwangx.github.io/InstructEdit/)]
[[Code](https://github.com/QianWangX/InstructEdit)]

**MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path** \
[[Website](https://arxiv.org/abs/2303.16765)]
[[Project](https://qianwangx.github.io/MDP-Diffusion/)]
[[Code](https://github.com/QianWangX/MDP-Diffusion)]

**HIVE: Harnessing Human Feedback for Instructional Visual Editing** \
[[Website](https://arxiv.org/abs/2303.09618)]
[[Project](https://shugerdou.github.io/hive/)]
[[Code](https://github.com/salesforce/HIVE)]

**FaceStudio: Put Your Face Everywhere in Seconds** \
[[Website](https://arxiv.org/abs/2312.02663)]
[[Project](https://icoz69.github.io/facestudio/)]
[[Code](https://github.com/xyynafc/FaceStudio)]

**Towards Small Object Editing: A Benchmark Dataset and A Training-Free Approach** \
[[Website](https://arxiv.org/abs/2411.01545)]
[[Project](https://soebench.github.io/)]
[[Code](https://github.com/panqihe-zjut/SOEBench)]

**Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.04410)]
[[Project](https://shi-labs.github.io/Smooth-Diffusion/)]
[[Code](https://github.com/SHI-Labs/Smooth-Diffusion)]

**FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction** \
[[Website](https://arxiv.org/abs/2409.18071)]
[[Project](https://freeedit.github.io/)]
[[Code](https://github.com/hrz2000/FreeEdit)]

**MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance** \
[[Website](https://arxiv.org/abs/2312.11396)]
[[Project](https://mag-edit.github.io/)]
[[Code](https://github.com/HelenMao/MAG-Edit)]

**LIME: Localized Image Editing via Attention Regularization in Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.09256)]
[[Project](https://enis.dev/LIME/)]
[[Code](https://github.com/enisimsar/LIME)]

**MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond** \
[[Website](https://arxiv.org/abs/2401.03221)]
[[Project](https://mirrordiffusion.github.io/)]
[[Code](https://github.com/MirrorDiffusion/MirrorDiffusion)]

**MagicQuill: An Intelligent Interactive Image Editing System** \
[[Website](https://arxiv.org/abs/2411.09703)]
[[Project](https://magicquill.art/demo/)]
[[Code](https://github.com/magic-quill/magicquill)]

**Scaling Concept With Text-Guided Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.24151)]
[[Project](https://wikichao.github.io/ScalingConcept/)]
[[Code](https://github.com/WikiChao/ScalingConcept)]

**Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control** \
[[Website](https://arxiv.org/abs/2405.12970)]
[[Project](https://faceadapter.github.io/face-adapter.github.io/)]
[[Code](https://github.com/FaceAdapter/Face-Adapter)]

**FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models** \
[[Website](https://arxiv.org/abs/2412.08629)]
[[Project](https://matankleiner.github.io/flowedit/)]
[[Code](https://github.com/fallenshock/FlowEdit)]

**FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning** \
[[Website](https://arxiv.org/abs/2408.03355)]
[[Project](https://fastedit-sd.github.io/)]
[[Code](https://github.com/JasonCodeMaker/FastEdit)]

**Steering Rectified Flow Models in the Vector Field for Controlled Image Generation** \
[[Website](https://arxiv.org/abs/2412.00100)]
[[Project](https://flowchef.github.io/)]
[[Code](https://github.com/FlowChef/flowchef)]

**Delta Denoising Score** \
[[Website](https://arxiv.org/abs/2304.07090)]
[[Project](https://delta-denoising-score.github.io/)]
[[Code](https://github.com/google/prompt-to-prompt/blob/main/DDS_zeroshot.ipynb)]

**InstantSwap: Fast Customized Concept Swapping across Sharp Shape Differences** \
[[Website](https://arxiv.org/abs/2412.01197)]
[[Project](https://instantswap.github.io/)]
[[Code](https://github.com/chenyangzhu1/InstantSwap)]

**UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2210.09477)]
[[Code](https://github.com/xuduo35/UniTune)]

**Learning to Follow Object-Centric Image Editing Instructions Faithfully** \
[[EMNLP 2023](https://arxiv.org/abs/2310.19145)]
[[Code](https://github.com/tuhinjubcse/faithfuledits_emnlp2023)]

**GroupDiff: Diffusion-based Group Portrait Editing** \
[[ECCV 2024](https://arxiv.org/abs/2409.14379)]
[[Code](https://github.com/yumingj/GroupDiff)]

**TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing** \
[[CVPR 2024](https://arxiv.org/abs/2404.11120)]
[[Code](https://github.com/SherryXTChen/TiNO-Edit)]

**ZONE: Zero-Shot Instruction-Guided Local Editing** \
[[CVPR 2024](https://arxiv.org/abs/2312.16794)]
[[Code](https://github.com/lsl001006/ZONE)]

**Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation** \
[[CVPR 2024](https://arxiv.org/abs/2312.10113)]
[[Code](https://github.com/guoqincode/focus-on-your-instruction)]

**DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation** \
[[ECCV 2024](https://arxiv.org/abs/2403.11415)]
[[Code](https://github.com/dreamsampler/dream-sampler)]

**FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing** \
[[ECCV 2024](https://arxiv.org/abs/2407.17850)]
[[Code](https://github.com/kookie12/FlexiEdit)]

**Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing** \
[[ECCV 2024](https://arxiv.org/abs/2409.01322)]
[[Code](https://github.com/FusionBrainLab/Guide-and-Rescale)]

**Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks** \
[[AAAI 2024](https://arxiv.org/abs/2401.07709)]
[[Code](https://github.com/xiaotianqing/instdiffedit)]

**FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference** \
[[AAAI 2024](https://arxiv.org/abs/2305.17423)]
[[Code](https://github.com/pku-dair/hetu)]

**Face Aging via Diffusion-based Editing** \
[[BMVC 2023](https://arxiv.org/abs/2309.11321)]
[[Code](https://github.com/MunchkinChen/FADING)]

**Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing** \
[[Website](https://arxiv.org/abs/2411.19652)]
[[Code](https://github.com/Mowenyii/Uniform-Attention-Maps)]

**FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing** \
[[Website](https://arxiv.org/abs/2408.12429)]
[[Code](https://github.com/a-new-b/flex_edit)]

**Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing** \
[[Website](https://arxiv.org/abs/2407.20232)]
[[Code](https://github.com/fabvio/SANE)]

**PostEdit: Posterior Sampling for Efficient Zero-Shot Image Editing** \
[[Website](https://arxiv.org/abs/2410.04844)]
[[Code](https://github.com/TFNTF/PostEdit)]

**DiT4Edit: Diffusion Transformer for Image Editing** \
[[Website](https://arxiv.org/abs/2411.03286)]
[[Code](https://github.com/fkyyyy/DiT4Edit)]

**Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing** \
[[Website](https://arxiv.org/abs/2407.17847)]
[[Code](https://github.com/mobiushy/move-act)]

**EditWorld: Simulating World Dynamics for Instruction-Following Image Editing** \
[[Website](https://arxiv.org/abs/2405.14785)]
[[Code](https://github.com/YangLing0818/EditWorld)]

**ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing** \
[[Website](https://arxiv.org/abs/2404.04376)]
[[Code](https://github.com/poloclub/clickdiffusion)]

**Differential Diffusion: Giving Each Pixel Its Strength** \
[[Website](https://arxiv.org/abs/2306.00950)]
[[Code](https://github.com/exx8/differential-diffusion)]

**Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing** \
[[Website](https://arxiv.org/abs/2403.13551)]
[[Code](https://github.com/ground-a-score/ground-a-score)]

**InstructDiffusion: A Generalist Modeling Interface for Vision Tasks** \
[[Website](https://arxiv.org/abs/2309.03895)]
[[Code](https://github.com/cientgu/instructdiffusion)]

**Region-Aware Diffusion for Zero-shot Text-driven Image Editing** \
[[Website](https://arxiv.org/abs/2302.11797v1)]
[[Code](https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model)]

**Forgedit: Text Guided Image Editing via Learning and Forgetting** \
[[Website](https://arxiv.org/abs/2309.10556)]
[[Code](https://github.com/witcherofresearch/Forgedit)]

**AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing** \
[[Website](https://arxiv.org/abs/2312.08019)]
[[Code](https://github.com/anonymouspony/adap-edit)]

**An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control** \
[[Website](https://arxiv.org/abs/2403.04880)]
[[Code](https://github.com/collovlabs/d-edit)]

**FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.11895)]
[[Code](https://github.com/thermal-dynamics/freediff)]

**Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance** \
[[Website](https://arxiv.org/abs/2401.02126)]
[[Code](https://github.com/kihensarn/ti-guided-edit)]

**SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing** \
[[Website](https://arxiv.org/abs/2401.03433)]
[[Code](https://github.com/jingjiqinggong/specp2p)]

**FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing** \
[[Website](https://arxiv.org/abs/2412.07517)]
[[Code](https://github.com/HolmesShuan/FireFlow-Fast-Inversion-of-Rectified-Flow-for-Image-Semantic-Editing)]

**PromptFix: You Prompt and We Fix the Photo** \
[[Website](https://arxiv.org/abs/2405.16785)]
[[Code](https://github.com/yeates/PromptFix)]

**FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation** \
[[Website](https://arxiv.org/abs/2408.00998)]
[[Code](https://github.com/XiangGao1102/FBSDiff)]

**Conditional Score Guidance for Text-Driven Image-to-Image Translation** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/71103)]
[[Website](https://arxiv.org/abs/2305.18007)]

**Emu Edit: Precise Image Editing via Recognition and Generation Tasks** \
[[CVPR 2024](https://arxiv.org/abs/2311.10089)]
[[Project](https://emu-edit.metademolab.com/)]

**ByteEdit: Boost, Comply and Accelerate Generative Image Editing** \
[[ECCV 2024](https://arxiv.org/abs/2404.04860)]
[[Project](https://byte-edit.github.io/)]

**Watch Your Steps: Local Image and Scene Editing by Text Instructions** \
[[ECCV 2024](https://arxiv.org/abs/2308.08947)]
[[Project](https://ashmrz.github.io/WatchYourSteps/)]

**TurboEdit: Instant text-based image editing** \
[[ECCV 2024](https://arxiv.org/abs/2408.08332)]
[[Project](https://betterze.github.io/TurboEdit/)]

**Novel Object Synthesis via Adaptive Text-Image Harmony** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.20823)]
[[Project](https://xzr52.github.io/ATIH/)]

**UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics** \
[[Website](https://arxiv.org/abs/2412.07774)]
[[Project](https://xavierchen34.github.io/UniReal-Page/)]

**HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads** \
[[Website](https://arxiv.org/abs/2411.15034)]
[[Project](https://yuci-gpt.github.io/headrouter/)]

**MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.00985)]
[[Project](https://mingzhenhuang.com/projects/MultiEdits.html)]

**BrushEdit: All-In-One Image Inpainting and Editing** \
[[Website](https://arxiv.org/abs/2412.10316)]
[[Project](https://liyaowei-stu.github.io/project/BrushEdit/)]

**Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.07232)]
[[Project](https://research.nvidia.com/labs/par/addit/)]

**FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers** \
[[Website](https://arxiv.org/abs/2412.09611)]
[[Project](https://fluxspace.github.io/)]

**SeedEdit: Align Image Re-Generation to Image Editing** \
[[Website](https://arxiv.org/abs/2411.06686)]
[[Project](https://team.doubao.com/en/special/seededit)]

**Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection** \
[[Website](https://arxiv.org/abs/2405.16823)]
[[Project](https://unifyediting.github.io/)]

**Generative Image Layer Decomposition with Visual Effects** \
[[Website](https://arxiv.org/abs/2411.17864)]
[[Project](https://rayjryang.github.io/LayerDecomp/)]

**Editable Image Elements for Controllable Synthesis** \
[[Website](https://arxiv.org/abs/2404.16029)]
[[Project](https://jitengmu.github.io/Editable_Image_Elements/)]

**SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing** \
[[Website](https://arxiv.org/abs/2410.11815)]
[[Project](https://bestzzhang.github.io/SGEdit/)]

**SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion** \
[[Website](https://arxiv.org/abs/2412.04301)]
[[Project](https://swift-edit.github.io/#)]

**ReGeneration Learning of Diffusion Models with Rich Prompts for Zero-Shot Image Translation** \
[[Website](https://arxiv.org/abs/2305.04651)]
[[Project](https://yupeilin2388.github.io/publication/ReDiffuser)]

**GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.19645)]
[[Project](https://gantastic.github.io/)]

**MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers** \
[[Website](https://arxiv.org/abs/2309.04372)]
[[Project](https://oppo-mente-lab.github.io/moe_controller/)]

**FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing** \
[[Website](https://arxiv.org/abs/2403.18605)]
[[Project](https://flex-edit.github.io/)]

**GeoDiffuser: Geometry-Based Image Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.14403)]
[[Project](https://ivl.cs.brown.edu/research/geodiffuser.html)]

**SOEDiff: Efficient Distillation for Small Object Editing** \
[[Website](https://arxiv.org/abs/2405.09114)]
[[Project](https://soediff.github.io/)]

**Click2Mask: Local Editing with Dynamic Mask Generation** \
[[Website](https://arxiv.org/abs/2409.08272)]
[[Project](https://omeregev.github.io/click2mask/)]

**Stable Flow: Vital Layers for Training-Free Image Editing** \
[[Website](https://arxiv.org/abs/2411.14430)]
[[Project](https://omriavrahami.com/stable-flow/)]

**Iterative Multi-granular Image Editing using Diffusion Models** \
[[WACV 2024](https://arxiv.org/abs/2309.00613)]

**Text-to-image Editing by Image Information Removal** \
[[WACV 2024](https://arxiv.org/abs/2305.17489)]

**TexSliders: Diffusion-Based Texture Editing in CLIP Space** \
[[SIGGRAPH 2024](https://arxiv.org/abs/2405.00672)]

**Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models** \
[[CVPR 2023 AI4CC Workshop](https://arxiv.org/abs/2305.15779)]

**Learning Feature-Preserving Portrait Editing from Generated Pairs** \
[[Website](https://arxiv.org/abs/2407.20455)]

**EmoEdit: Evoking Emotions through Image Manipulation** \
[[Website](https://arxiv.org/abs/2405.12661)]

**DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images** \
[[Website](https://arxiv.org/abs/2404.18020)]

**LayerDiffusion: Layered Controlled Image Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.18676)]

**iEdit: Localised Text-guided Image Editing with Weak Supervision** \
[[Website](https://arxiv.org/abs/2305.05947)]

**User-friendly Image Editing with Minimal Text Input: Leveraging Captioning and Injection Techniques** \
[[Website](https://arxiv.org/abs/2306.02717)]

**PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing** \
[[Website](https://arxiv.org/abs/2306.16894)]

**PRedItOR: Text Guided Image Editing with Diffusion Prior** \
[[Website](https://arxiv.org/abs/2302.07979v2)]

**FEC: Three Finetuning-free Methods to Enhance Consistency for Real Image Editing** \
[[Website](https://arxiv.org/abs/2309.14934)]

**The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing** \
[[Website](https://arxiv.org/abs/2311.01410)]

**Image Translation as Diffusion Visual Programmers** \
[[Website](https://arxiv.org/abs/2312.16794)]

**Latent Inversion with Timestep-aware Sampling for Training-free Non-rigid Editing** \
[[Website](https://arxiv.org/abs/2402.08601)]

**LoMOE: Localized Multi-Object Editing via Multi-Diffusion** \
[[Website](https://arxiv.org/abs/2403.00437)]

**Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing** \
[[Website](https://arxiv.org/abs/2403.03431)]

**DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation** \
[[Website](https://arxiv.org/abs/2403.04997)]

**InstructGIE: Towards Generalizable Image Editing** \
[[Website](https://arxiv.org/abs/2403.05018)]

**LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing** \
[[Website](https://arxiv.org/abs/2403.12585)]

**Uncovering the Text Embedding in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.01154)]

**Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer** \
[[Website](https://arxiv.org/abs/2404.06835)]

**Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion** \
[[Website](https://arxiv.org/abs/2405.15313)]

**Text Guided Image Editing with Automatic Concept Locating and Forgetting** \
[[Website](https://arxiv.org/abs/2405.19708)]

**The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP** \
[[Website](https://arxiv.org/abs/2406.00457)]

**LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing** \
[[Website](https://arxiv.org/abs/2406.17236)]

**Achieving Complex Image Edits via Function Aggregation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2408.08495)]

**Prompt-Softbox-Prompt: A free-text Embedding Control for Image Editing** \
[[Website](https://arxiv.org/abs/2408.13623)]

**InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models** \
[[Website](https://arxiv.org/abs/2409.11734)]

**PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM** \
[[Website](https://arxiv.org/abs/2410.05710)]

**Augmentation-Driven Metric for Balancing Preservation and Modification in Text-Guided Image Editing** \
[[Website](https://arxiv.org/abs/2410.11374)]

**Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing** \
[[Website](https://arxiv.org/abs/2410.10496)]

**ERDDCI: Exact Reversible Diffusion via Dual-Chain Inversion for High-Quality Image Editing** \
[[Website](https://arxiv.org/abs/2410.14247)]

**ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.03982)]

**ColorEdit: Training-free Image-Guided Color editing with diffusion model** \
[[Website](https://arxiv.org/abs/2411.10232)]

**GalaxyEdit: Large-Scale Image Editing Dataset with Enhanced Diffusion Adapter** \
[[Website](https://arxiv.org/abs/2411.13794)]

**Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing** \
[[Website](https://arxiv.org/abs/2411.15843)]

**Pathways on the Image Manifold: Image Editing via Video Generation** \
[[Website](https://arxiv.org/abs/2411.16819)]

**LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair** \
[[Website](https://arxiv.org/abs/2411.19156)]

**Action-based image editing guided by human instructions** \
[[Website](https://arxiv.org/abs/2412.04558)]

**Addressing Attribute Leakages in Diffusion-based Image Editing without Training** \
[[Website](https://arxiv.org/abs/2412.04715)]

## Continual Learning

**RGBD2: Generative Scene Synthesis via Incremental View Inpainting using RGBD Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/papers/Lei_RGBD2_Generative_Scene_Synthesis_via_Incremental_View_Inpainting_Using_RGBD_CVPR_2023_paper.pdf)]
[[Website](https://arxiv.org/abs/2212.05993)]
[[Project](https://jblei.site/proj/rgbd-diffusion)]
[[Code](https://github.com/Karbo123/RGBD-Diffusion)]

**Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning** \
[[ECCV 2024 Oral](https://arxiv.org/abs/2409.01128)]
[[Code](https://github.com/jinglin-liang/DDDR)]

**How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.17594)]
[[Code](https://github.com/JiahuaDong/CIFC)]

**CLoG: Benchmarking Continual Learning of Image Generation Models** \
[[Website](https://arxiv.org/abs/2406.04584)]
[[Code](https://github.com/linhaowei1/CLoG)]

**Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models** \
[[Website](https://arxiv.org/abs/2305.10120)]
[[Code](https://github.com/clear-nus/selective-amnesia)]

**Continual Learning of Diffusion Models with Generative Distillation** \
[[Website](https://arxiv.org/abs/2311.14028)]
[[Code](https://github.com/atenrev/difussion_continual_learning)]

**Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning** \
[[Website](https://arxiv.org/abs/2311.18266)]
[[Code](https://github.com/KerryDRX/ESCORT)]

**Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA** \
[[TMLR](https://arxiv.org/abs/2304.06027)]
[[Project](https://jamessealesmith.github.io/continual-diffusion/)]

**Assessing Open-world Forgetting in Generative Image Model Customization** \
[[Website](https://arxiv.org/abs/2410.14159)]
[[Project](https://hecoding.github.io/open-world-forgetting/)]

**Class-Incremental Learning using Diffusion Model for Distillation and Replay** \
[[ICCV 2023 VCL workshop best paper](https://arxiv.org/abs/2306.17560)]

**Create Your World: Lifelong Text-to-Image Diffusion** \
[[Website](https://arxiv.org/abs/2309.04430)]

**Low-Rank Continual Personalization of Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.04891)]

**Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.00700)]

**Online Continual Learning of Video Diffusion Models From a Single Video Stream** \
[[Website](https://arxiv.org/abs/2406.04814)]

**Exploring Continual Learning of Diffusion Models** \
[[Website](https://arxiv.org/abs/2303.15342)]

**DiracDiffusion: Denoising and Incremental Reconstruction with Assured Data-Consistency** \
[[Website](https://arxiv.org/abs/2303.14353)]

**DiffusePast: Diffusion-based Generative Replay for Class Incremental Semantic Segmentation** \
[[Website](https://arxiv.org/abs/2308.01127)]

**Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters** \
[[Website](https://arxiv.org/abs/2311.18763)]

**Premonition: Using Generative Models to Preempt Future Data Changes in Continual Learning** \
[[Website](https://arxiv.org/abs/2403.07356)]

**MuseumMaker: Continual Style Customization without Catastrophic Forgetting** \
[[Website](https://arxiv.org/abs/2404.16612)]

**Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion** \
[[Website](https://arxiv.org/abs/2411.05544)]

## Remove Concept

**Ablating Concepts in Text-to-Image Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Kumari_Ablating_Concepts_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.13516)]
[[Project](https://www.cs.cmu.edu/~concept-ablation/)]
[[Code](https://github.com/nupurkmr9/concept-ablation)]

**Erasing Concepts from Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Gandikota_Erasing_Concepts_from_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.07345)]
[[Project](https://erasing.baulab.info/)]
[[Code](https://github.com/rohitgandikota/erasing)]
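
The erasure objective in "Erasing Concepts from Diffusion Models" (ESD) is compact enough to sketch: fine-tune a student copy of the UNet so that, when conditioned on the concept to erase, it matches a negatively guided target produced by the frozen teacher. A schematic with assumed UNet-like callables `student` and `teacher` (not the official training code):

```python
import torch
import torch.nn.functional as F

def esd_loss(student, teacher, z_t, t, concept_emb, null_emb, eta=1.0):
    """Push the student's prediction for the concept *away* from the concept."""
    with torch.no_grad():
        e_null = teacher(z_t, t, encoder_hidden_states=null_emb).sample
        e_conc = teacher(z_t, t, encoder_hidden_states=concept_emb).sample
        # negative guidance: a target that steps away from the concept direction
        target = e_null - eta * (e_conc - e_null)
    pred = student(z_t, t, encoder_hidden_states=concept_emb).sample
    return F.mse_loss(pred, target)
```

**Paint by Inpaint: Learning to Add Image Objects by Removing Them First** \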
[[Website](https://arxiv.org/abs/2404.18212)]
[[Project](https://rotsteinnoam.github.io/Paint-by-Inpaint/)]
[[Code](https://github.com/RotsteinNoam/Paint-by-Inpaint)]

**One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications** \
[[Website](https://arxiv.org/abs/2312.16145)]
[[Project](https://lyumengyao.github.io/projects/spm)]
[[Code](https://github.com/Con6924/SPM)]

**Editing Massive Concepts in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.13807)]
[[Project](https://silentview.github.io/EMCID/)]
[[Code](https://github.com/SilentView/EMCID)]

**Memories of Forgotten Concepts** \
[[Website](https://arxiv.org/abs/2412.00782)]
[[Project](https://matanr.github.io/Memories_of_Forgotten_Concepts/)]
[[Code](https://github.com/matanr/Memories_of_Forgotten_Concepts/)]

**STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models** \
[[Website](https://arxiv.org/abs/2408.16807)]
[[Project](https://koushiksrivats.github.io/robust-concept-erasing/)]
[[Code](https://github.com/koushiksrivats/robust-concept-erasing)]

**Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models** \
[[ICML 2023 workshop](https://arxiv.org/abs/2307.05977v1)]
[[Code](https://github.com/nannullna/safe-diffusion)]

**Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2407.12383)]
[[Code](https://github.com/CharlesGong12/RECE)]

**Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion** \
[[ECCV 2024](https://arxiv.org/abs/2407.21032)]
[[Code](https://github.com/nannullna/safeguard-hfi)]

**Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.15618)]
[[Code](https://github.com/tuananhbui89/Erasing-Adversarial-Preservation)]

**Unveiling Concept Attribution in Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.02542)]
[[Code](https://github.com/mail-research/CAD-attribution4diffusion)]

**TraSCE: Trajectory Steering for Concept Erasure** \
[[Website](https://arxiv.org/abs/2412.07658)]
[[Code](https://github.com/anubhav1997/TraSCE/)]

**Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts** \
[[Website](https://arxiv.org/abs/2410.12777)]
[[Code](https://github.com/sail-sg/Meta-Unlearning)]

**ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion** \
[[Website](https://arxiv.org/abs/2404.17230)]
[[Code](https://github.com/potato-kitty/objectadd)]

**Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2303.17591)]
[[Code](https://github.com/SHI-Labs/Forget-Me-Not)]

**Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.15234)]
[[Code](https://github.com/OPTML-Group/AdvUnlearn)]

**ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning** \
[[Website](https://arxiv.org/abs/2405.19237)]
[[Code](https://github.com/ruchikachavhan/concept-prune)]

**Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models** \
[[Website](https://arxiv.org/abs/2305.10120)]
[[Code](https://github.com/clear-nus/selective-amnesia)]

**Add-SD: Rational Generation without Manual Reference** \
[[Website](https://arxiv.org/abs/2407.21016)]
[[Code](https://github.com/ylingfeng/Add-SD)]

**RealEra: Semantic-level Concept Erasure via Neighbor-Concept Mining** \
[[Website](https://arxiv.org/abs/2410.09140)]
[[Project](https://realerasing.github.io/RealEra/)]

**MACE: Mass Concept Erasure in Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2403.06135)]

**Continuous Concepts Removal in Text-to-image Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.00580)]

**Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.00357)]

**Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.08074)]

**Direct Unlearning Optimization for Robust and Safe Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2407.21035)]

**Prompt Sliders for Fine-Grained Control, Editing and Erasing of Concepts in Diffusion Model** \
[[Website](https://arxiv.org/abs/2409.16535)]

**Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning** \
[[Website](https://arxiv.org/abs/2405.07288)]

**Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.05873)]

**Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers** \
[[Website](https://arxiv.org/abs/2311.17717)]

**All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.12807)]

**EraseDiff: Erasing Data Influence in Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.05779)]

**UnlearnCanvas: A Stylized Image Dataset to Benchmark Machine Unlearning for Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.11846)]

**Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts** \
[[Website](https://arxiv.org/abs/2402.11846)]

**R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model** \
[[Website](https://arxiv.org/abs/2405.16341)]

**Pruning for Robust Concept Erasing in Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.16534)]

**Unlearning Concepts from Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.14209)]

**EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts** \
[[Website](https://arxiv.org/abs/2408.01014)]

**Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning** \
[[Website](https://arxiv.org/abs/2410.05664)]

**Understanding the Impact of Negative Prompts: When and How Do They Take Effect?** \
[[Website](https://arxiv.org/abs/2406.02965)]

**Model Integrity when Unlearning with T2I Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.02068)]

**Learning to Forget using Hypernetworks** \
[[Website](https://arxiv.org/abs/2412.00761)]

**Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters** \
[[Website](https://arxiv.org/abs/2412.06143)]

## New Concept Learning

⭐⭐⭐**DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation** \
[[CVPR 2023 Honorable Mention](https://openaccess.thecvf.com/content/CVPR2023/html/Ruiz_DreamBooth_Fine_Tuning_Text-to-Image_Diffusion_Models_for_Subject-Driven_Generation_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2208.12242)]
[[Project](https://dreambooth.github.io/)]
[[Official Dataset](https://github.com/google/dreambooth)]
[[Unofficial Code](https://github.com/XavierXiao/Dreambooth-Stable-Diffusion)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/training/dreambooth)]
[[Diffusers Code](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth)]
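
Once the Diffusers DreamBooth script linked above has produced a checkpoint, it loads like any other pipeline, with the rare identifier token (commonly "sks") standing in for the learned subject. A sketch, where `./dreambooth-output` is a placeholder for the training output directory:

```python
import torch
from diffusers import StableDiffusionPipeline

# "./dreambooth-output" is a placeholder for the directory written by
# the Diffusers DreamBooth training example.
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-output", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of sks dog in a bucket",
             num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("sks_dog_bucket.png")
```

⭐⭐⭐**An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion** \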
[[ICLR 2023 top-25%](https://openreview.net/forum?id=NAQvF08TcyG)]
[[Website](https://arxiv.org/abs/2208.01618)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/training/text_inversion)]
[[Diffusers Code](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion)]
[[Code](https://github.com/rinongal/textual_inversion)]
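
Learned textual-inversion embeddings are tiny files that drop into any compatible pipeline. A sketch using a public concept from the sd-concepts-library, whose `<cat-toy>` placeholder token appears in the Diffusers docs:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Loads the learned embedding and registers its placeholder token.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a <cat-toy> floating in space", num_inference_steps=50).images[0]
image.save("cat_toy_space.png")
```

⭐⭐**Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion** \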
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Kumari_Multi-Concept_Customization_of_Text-to-Image_Diffusion_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2212.04488)]
[[Project](https://www.cs.cmu.edu/~custom-diffusion/)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/main/en/training/custom_diffusion)]
[[Diffusers Code](https://github.com/huggingface/diffusers/tree/main/examples/custom_diffusion)]
[[Code](https://github.com/adobe-research/custom-diffusion)]

⭐⭐**ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement** \
[[ECCV 2024](https://arxiv.org/abs/2407.07197)]
[[Project](https://moatifbutt.github.io/colorpeel/)]
[[Code](https://github.com/moatifbutt/color-peel)]

⭐⭐**ReVersion: Diffusion-Based Relation Inversion from Images** \
[[Website](https://arxiv.org/abs/2303.13495)]
[[Project](https://ziqihuangg.github.io/projects/reversion.html)]
[[Code](https://github.com/ziqihuangg/ReVersion)]

⭐**SINE: SINgle Image Editing with Text-to-Image Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Zhang_SINE_SINgle_Image_Editing_With_Text-to-Image_Diffusion_Models_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2212.04489)]
[[Project](https://zhang-zx.github.io/SINE/)]
[[Code](https://github.com/zhang-zx/SINE)]

⭐**Break-A-Scene: Extracting Multiple Concepts from a Single Image** \
[[SIGGRAPH Asia 2023](https://arxiv.org/abs/2305.16311)]
[[Project](https://omriavrahami.com/break-a-scene/)]
[[Code](https://github.com/google/break-a-scene)]

⭐**Concept Decomposition for Visual Exploration and Inspiration** \
[[SIGGRAPH Asia 2023](https://arxiv.org/abs/2305.18203)]
[[Project](https://inspirationtree.github.io/inspirationtree/)]
[[Code](https://github.com/google/inspiration_tree)]

**Cones: Concept Neurons in Diffusion Models for Customized Generation** \
[[ICML 2023 Oral](https://icml.cc/virtual/2023/oral/25582)]
[[ICML 2023 Oral](https://dl.acm.org/doi/10.5555/3618408.3619298)]
[[Website](https://arxiv.org/abs/2303.05125)]
[[Code](https://github.com/Johanan528/Cones)]

**BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/70870)]
[[Website](https://arxiv.org/abs/2305.14720)]
[[Project](https://dxli94.github.io/BLIP-Diffusion-website/)]
[[Code](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion)]

**Inserting Anybody in Diffusion Models via Celeb Basis** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/71823)]
[[Website](https://arxiv.org/abs/2306.00926)]
[[Project](https://celeb-basis.github.io/)]
[[Code](https://github.com/ygtxr1997/celebbasis)]

**Controlling Text-to-Image Diffusion by Orthogonal Finetuning** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/72033)]
[[Website](https://arxiv.org/abs/2306.07280)]
[[Project](https://oft.wyliu.com/)]
[[Code](https://github.com/Zeju1997/oft)]

**Photoswap: Personalized Subject Swapping in Images** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/70336)]
[[Website](https://arxiv.org/abs/2305.18286)]
[[Project](https://photoswap.github.io/)]
[[Code](https://github.com/eric-ai-lab/photoswap)]

**Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/71844)]
[[Website](https://arxiv.org/abs/2305.18292)]
[[Project](https://showlab.github.io/Mix-of-Show/)]
[[Code](https://github.com/TencentARC/Mix-of-Show)]

**ITI-GEN: Inclusive Text-to-Image Generation** \
[[ICCV 2023 Oral](https://arxiv.org/abs/2309.05569)]
[[Website](https://arxiv.org/abs/2309.05569)]
[[Project](https://czhang0528.github.io/iti-gen)]
[[Code](https://github.com/humansensinglab/ITI-GEN)]

**Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Liu_Unsupervised_Compositional_Concepts_Discovery_with_Text-to-Image_Generative_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2306.05357)]
[[Project](https://energy-based-model.github.io/unsupervised-concept-discovery/)]
[[Code](https://github.com/nanlliu/Unsupervised-Compositional-Concepts-Discovery)]

**ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation** \
[[ICCV 2023 Oral](https://openaccess.thecvf.com/content/ICCV2023/html/Wei_ELITE_Encoding_Visual_Concepts_into_Textual_Embeddings_for_Customized_Text-to-Image_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2302.13848)]
[[Code](https://github.com/csyxwei/ELITE)]

**A Neural Space-Time Representation for Text-to-Image Personalization** \
[[SIGGRAPH Asia 2023](https://arxiv.org/abs/2305.15391)]
[[Project](https://neuraltextualinversion.github.io/NeTI/)]
[[Code](https://github.com/NeuralTextualInversion/NeTI)]

**Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2302.12228)]
[[Project](https://tuning-encoder.github.io/)]
[[Code](https://github.com/mkshing/e4t-diffusion)]

**Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/71329)]
[[Website](https://arxiv.org/abs/2302.04841)]
[[Code](https://github.com/yandex-research/DVAR)]

**ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction** \
[[ECCV 2024](https://arxiv.org/abs/2407.07077)]
[[Project](https://haoosz.github.io/ConceptExpress/)]
[[Code](https://github.com/haoosz/ConceptExpress)]

**Face2Diffusion for Fast and Editable Face Personalization** \
[[CVPR 2024](https://arxiv.org/abs/2403.05094)]
[[Project](https://mapooon.github.io/Face2DiffusionPage/)]
[[Code](https://github.com/mapooon/Face2Diffusion)]

**Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models** \
[[CVPR 2024](https://arxiv.org/abs/2404.04243)]
[[Project](https://mudi-t2i.github.io/)]
[[Code](https://github.com/agwmon/MuDI)]**CapHuman: Capture Your Moments in Parallel Universes** \
[[CVPR 2024](https://arxiv.org/abs/2402.00627)]
[[Project](https://caphuman.github.io/)]
[[Code](https://github.com/VamosC/CapHuman)]**Style Aligned Image Generation via Shared Attention** \
[[CVPR 2024](https://arxiv.org/abs/2312.02133)]
[[Project](https://style-aligned-gen.github.io/)]
[[Code](https://github.com/google/style-aligned/)]**FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition** \
[[CVPR 2024](https://arxiv.org/abs/2405.13870v1)]
[[Project](https://aim-uofa.github.io/FreeCustom/)]
[[Code](https://github.com/aim-uofa/FreeCustom)]**DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization** \
[[CVPR 2024](https://arxiv.org/abs/2402.09812)]
[[Project](https://ku-cvlab.github.io/DreamMatcher/)]
[[Code](https://github.com/KU-CVLAB/DreamMatcher)]**Material Palette: Extraction of Materials from a Single Image** \
[[CVPR 2024](https://arxiv.org/abs/2311.17060)]
[[Project](https://astra-vision.github.io/MaterialPalette/)]
[[Code](https://github.com/astra-vision/MaterialPalette)]**Learning Continuous 3D Words for Text-to-Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2402.08654)]
[[Project](https://ttchengab.github.io/continuous_3d_words/)]
[[Code](https://github.com/ttchengab/continuous_3d_words_code/)]**ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models** \
[[AAAI 2024](https://arxiv.org/abs/2306.04695)]
[[Project](https://conceptbed.github.io/)]
[[Code](https://github.com/conceptbed/evaluations)]**Direct Consistency Optimization for Compositional Text-to-Image Personalization** \
[[NeurIPS 2024](https://arxiv.org/abs/2402.12004)]
[[Project](https://dco-t2i.github.io/)]
[[Code](https://github.com/kyungmnlee/dco)]**The Hidden Language of Diffusion Models** \
[[ICLR 2024](https://arxiv.org/abs/2306.00966)]
[[Project](https://hila-chefer.github.io/Conceptor/)]
[[Code](https://github.com/hila-chefer/Conceptor)]**ZeST: Zero-Shot Material Transfer from a Single Image** \
[[ECCV 2024](https://arxiv.org/abs/2404.06425)]
[[Project](https://ttchengab.github.io/zest/)]
[[Code](https://github.com/ttchengab/zest_code)]**UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization** \
[[Website](https://arxiv.org/abs/2408.05939)]
[[Project](https://aigcdesigngroup.github.io/UniPortrait-Page/)]
[[Code](https://github.com/junjiehe96/UniPortrait)]**MagicFace: Training-free Universal-Style Human Image Customized Synthesis** \
[[Website](https://arxiv.org/abs/2408.07433)]
[[Project](https://codegoat24.github.io/MagicFace/)]
[[Code](https://github.com/CodeGoat24/MagicFace)]**LCM-Lookahead for Encoder-based Text-to-Image Personalization** \
[[Website](https://arxiv.org/abs/2404.03620)]
[[Project](https://lcm-lookahead.github.io/)]
[[Code](https://github.com/OrLichter/lcm-lookahead)]**EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM** \
[[Website](https://arxiv.org/abs/2412.09618)]
[[Project](https://easyref-gen.github.io/)]
[[Code](https://github.com/TempleX98/EasyRef)]**AITTI: Learning Adaptive Inclusive Token for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2406.12805)]
[[Project](https://itsmag11.github.io/AITTI/)]
[[Code](https://github.com/itsmag11/AITTI)]**MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance** \
[[Website](https://arxiv.org/abs/2406.07209)]
[[Project](https://ms-diffusion.github.io/)]
[[Code](https://github.com/MS-Diffusion/MS-Diffusion)]**ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance** \
[[Website](https://arxiv.org/abs/2405.17532)]
[[Project](https://classdiffusion.github.io/)]
[[Code](https://github.com/Rbrq03/ClassDiffusion)]**MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2405.05806)]
[[Project](https://masterweaver.github.io/)]
[[Code](https://github.com/csyxwei/MasterWeaver)]**Customizing Text-to-Image Models with a Single Image Pair** \
[[Website](https://arxiv.org/abs/2405.01536)]
[[Project](https://paircustomization.github.io/)]
[[Code](https://github.com/PairCustomization/PairCustomization)]**DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation** \
[[Website](https://arxiv.org/abs/2410.02067)]
[[Project](https://disenvisioner.github.io/)]
[[Code](https://github.com/EnVision-Research/DisEnvisioner)]**ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving** \
[[Website](https://arxiv.org/abs/2404.16771)]
[[Project](https://ssugarwh.github.io/consistentid.github.io/)]
[[Code](https://github.com/JackAILab/ConsistentID)]**ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning** \
[[Website](https://arxiv.org/abs/2404.15449)]
[[Project](https://idaligner.github.io/)]
[[Code](https://github.com/Weifeng-Chen/ID-Aligner)]**CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.15677)]
[[Project](https://qinghew.github.io/CharacterFactory/)]
[[Code](https://github.com/qinghew/CharacterFactory)]**Customizing Text-to-Image Diffusion with Camera Viewpoint Control** \
[[Website](https://arxiv.org/abs/2404.12333)]
[[Project](https://customdiffusion360.github.io/)]
[[Code](https://github.com/customdiffusion360/custom-diffusion360)]**Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization** \
[[Website](https://arxiv.org/abs/2403.14155)]
[[Project](https://ldynx.github.io/harmony-zero-t2i/)]
[[Code](https://github.com/ldynx/harmony-zero-t2i)]**StyleDrop: Text-to-Image Generation in Any Style** \
[[Website](https://arxiv.org/abs/2306.00983)]
[[Project](https://styledrop.github.io/)]
[[Code](https://github.com/zideliu/StyleDrop-PyTorch)]**FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention** \
[[Website](https://arxiv.org/abs/2305.10431)]
[[Project](https://fastcomposer.mit.edu/)]
[[Code](https://github.com/mit-han-lab/fastcomposer)]**AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning** \
[[Website](https://arxiv.org/abs/2307.04725)]
[[Project](https://animatediff.github.io/)]
[[Code](https://github.com/guoyww/animatediff/)]
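
AnimateDiff is also integrated in Diffusers. A minimal sketch, assuming the authors' `guoyww/animatediff-motion-adapter-v1-5-2` motion adapter; the base checkpoint id is only an example, and any SD 1.5-style personalized checkpoint should slot in:

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Motion module released by the AnimateDiff authors
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
# Example base model; any SD 1.5-based (personalized) checkpoint works here
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

frames = pipe("a corgi running on the beach", num_frames=16).frames[0]
export_to_gif(frames, "corgi.gif")
```
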
**Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning** \
[[Website](https://arxiv.org/abs/2307.11410)]
[[Project](https://oppo-mente-lab.github.io/subject_diffusion/)]
[[Code](https://github.com/OPPO-Mente-Lab/Subject-Diffusion)]**Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion** \
[[Website](https://arxiv.org/abs/2303.08767)]
[[Project](https://hiper0.github.io/)]
[[Code](https://github.com/HiPer0/HiPer)]**MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.13370)]
[[Project](https://correr-zhou.github.io/MagicTailor/)]
[[Code](https://github.com/correr-zhou/MagicTailor)]**DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning** \
[[Website](https://arxiv.org/abs/2211.11337)]
[[Project](https://www.sysu-hcp.net/projects/dreamartist/index.html)]
[[Code](https://github.com/7eu7d7/DreamArtist-stable-diffusion)]**SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing** \
[[Website](https://arxiv.org/abs/2310.08094)]
[[Project](https://jarrentwu1031.github.io/SingleInsert-web/)]
[[Code](https://github.com/JarrentWu1031/SingleInsert)]**CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.19784)]
[[Project](https://jiangyzy.github.io/CustomNet/)]
[[Code](https://github.com/TencentARC/CustomNet)]**When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation** \
[[Website](https://arxiv.org/abs/2311.17461)]
[[Project](https://csxmli2016.github.io/projects/w-plus-adapter/)]
[[Code](https://github.com/csxmli2016/w-plus-adapter)]**InstantID: Zero-shot Identity-Preserving Generation in Seconds** \
[[Website](https://arxiv.org/abs/2401.07519)]
[[Project](https://instantid.github.io/)]
[[Code](https://github.com/InstantID/InstantID)]**PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding** \
[[Website](https://arxiv.org/abs/2312.04461)]
[[Project](https://photo-maker.github.io/)]
[[Code](https://github.com/TencentARC/PhotoMaker)]**CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization** \
[[Website](https://arxiv.org/abs/2311.14631)]
[[Project](https://royzhao926.github.io/CatVersion-page/)]
[[Code](https://github.com/RoyZhao926/CatVersion)]**DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.14216)]
[[Project](https://briannlongzhao.github.io/DreamDistribution/)]
[[Code](https://github.com/briannlongzhao/DreamDistribution)]**λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space** \
[[Website](https://arxiv.org/abs/2402.05195)]
[[Project](https://eclipse-t2i.github.io/Lambda-ECLIPSE/)]
[[Code](https://github.com/eclipse-t2i/lambda-eclipse-inference)]**Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models** \
[[Website](https://arxiv.org/abs/2309.07986)]
[[Project](https://jmhb0.github.io/viewneti/)]
[[Code](https://github.com/jmhb0/view_neti)]**Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition** \
[[Website](https://arxiv.org/abs/2402.15504)]
[[Project](https://danielchyeh.github.io/Gen4Gen/)]
[[Code](https://github.com/louisYen/Gen4Gen)]**StableIdentity: Inserting Anybody into Anywhere at First Sight** \
[[Website](https://arxiv.org/abs/2401.15975)]
[[Project](https://qinghew.github.io/StableIdentity/)]
[[Code](https://github.com/qinghew/StableIdentity)]**DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model** \
[[Website](https://arxiv.org/abs/2402.17412)]
[[Project](https://diffusekrona.github.io/)]
[[Code](https://github.com/IBM/DiffuseKronA)]**TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder** \
[[Website](https://arxiv.org/abs/2409.08248)]
[[Project](https://textboost.github.io/)]
[[Code](https://github.com/nahyeonkaty/textboost)]**EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance** \
[[Website](https://arxiv.org/abs/2409.08091)]
[[Project](https://zichengduan.github.io/pages/EZIGen/index.html)]
[[Code](https://github.com/ZichengDuan/EZIGen)]**OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.10983)]
[[Project](https://kongzhecn.github.io/omg-project/)]
[[Code](https://github.com/kongzhecn/OMG/)]**MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation** \
[[Website](https://arxiv.org/abs/2404.05674)]
[[Project](https://moma-adapter.github.io/)]
[[Code](https://github.com/bytedance/MoMA/tree/main)]**ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs** \
[[Website](https://arxiv.org/abs/2311.13600)]
[[Project](https://ziplora.github.io/)]
[[Code](https://github.com/mkshing/ziplora-pytorch)]
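
For context, the tuning-free baseline that ZipLoRA improves on is plain linear merging of independently trained LoRAs, which Diffusers exposes directly. A minimal sketch with placeholder LoRA paths and trigger tokens:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Placeholder paths: one subject LoRA and one style LoRA trained for SDXL
pipe.load_lora_weights("path/to/subject_lora", adapter_name="subject")
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")

# Global linear weighting; ZipLoRA instead learns per-column merge coefficients
pipe.set_adapters(["subject", "style"], adapter_weights=[1.0, 0.8])

image = pipe("a sks dog in watercolor style").images[0]
image.save("merged_lora.png")
```
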
**CSGO: Content-Style Composition in Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2408.16766)]
[[Project](https://csgo-gen.github.io/)]
[[Code](https://github.com/instantX-research/CSGO)]**DreamSteerer: Enhancing Source Image Conditioned Editability using Personalized Diffusion Models** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.11208)]
[[Code](https://github.com/Dijkstra14/DreamSteerer)]**Customized Generation Reimagined: Fidelity and Editability Harmonized** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06727.pdf)]
[[Code](https://github.com/jinjianRick/DCI_ICO)]**Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning** \
[[ECCV 2024](https://arxiv.org/abs/2407.06642)]
[[Code](https://github.com/wfanyue/DPG-T2I-Personalization)]**High-fidelity Person-centric Subject-to-Image Synthesis** \
[[CVPR 2024](https://arxiv.org/abs/2311.10329)]
[[Code](https://github.com/codegoat24/face-diffuser)]**ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation** \
[[SIGGRAPH Asia 2023](https://arxiv.org/abs/2305.16225)]
[[Code](https://github.com/zyxElsa/ProSpect)]**Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier** \
[[WACV 2025](https://arxiv.org/abs/2410.22317)]
[[Code](https://github.com/wangkai930418/mc_ti)]**Multiresolution Textual Inversion** \
[[NeurIPS 2022 workshop](https://arxiv.org/abs/2211.17115)]
[[Code](https://github.com/giannisdaras/multires_textual_inversion)]
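
Most inversion-style methods above distill a concept into learned token embeddings; in Diffusers such embeddings are loaded with `load_textual_inversion`. A minimal sketch using the publicly shared `sd-concepts-library/cat-toy` embedding, whose trigger token is `<cat-toy>`:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a learned concept embedding; locally trained embedding files work too
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a <cat-toy> sitting on a park bench").images[0]
image.save("cat_toy.png")
```
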
**Compositional Inversion for Stable Diffusion Models** \
[[AAAI 2024](https://arxiv.org/abs/2312.08048)]
[[Code](https://github.com/zhangxulu1996/Compositional-Inversion)]**Decoupled Textual Embeddings for Customized Image Generation** \
[[AAAI 2024](https://arxiv.org/abs/2312.11826)]
[[Code](https://github.com/PrototypeNx/DETEX)]**DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning** \
[[NeurIPS 2024](https://arxiv.org/abs/2411.04571)]
[[Code](https://github.com/Ldhlwh/DomainGallery)]**TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation** \
[[Website](https://arxiv.org/abs/2410.05591)]
[[Code](https://github.com/KwonGihyun/TweedieMix)]**Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation** \
[[Website](https://arxiv.org/abs/2409.17920)]
[[Code](https://github.com/hqhQAQ/MIP-Adapter)]**Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2408.03632)]
[[Code](https://github.com/Nihukat/Concept-Conductor)]**RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance** \
[[Website](https://arxiv.org/abs/2405.14677)]
[[Code](https://github.com/feifeiobama/RectifID)]**PuLID: Pure and Lightning ID Customization via Contrastive Alignment** \
[[Website](https://arxiv.org/abs/2404.16022)]
[[Code](https://github.com/ToTheBeginning/PuLID)]**Cross Initialization for Personalized Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2312.15905)]
[[Code](https://github.com/lyupang/crossinitialization)]**Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach** \
[[Website](https://arxiv.org/abs/2305.13579)]
[[Code](https://github.com/drboog/profusion)]**SVDiff: Compact Parameter Space for Diffusion Fine-Tuning** \
[[Website](https://arxiv.org/abs/2303.11305)]
[[Code](https://github.com/mkshing/svdiff-pytorch)]**ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2306.00971)]
[[Code](https://github.com/haoosz/vico)]**AerialBooth: Mutual Information Guidance for Text Controlled Aerial View Synthesis from a Single Image** \
[[Website](https://arxiv.org/abs/2311.15040)]
[[Code](https://github.com/divyakraman/AerialBooth2023)]**A Closer Look at Parameter-Efficient Tuning in Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.15478)]
[[Code](https://github.com/Xiang-cd/unet-finetune)]**FaceChain-FACT: Face Adapter with Decoupled Training for Identity-preserved Personalization** \
[[Website](https://arxiv.org/abs/2410.12312)]
[[Code](https://github.com/modelscope/facechain)]**Controllable Textual Inversion for Personalized Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2304.05265)]
[[Code](https://github.com/jnzju/COTI)]**Cross-domain Compositing with Pretrained Diffusion Models** \
[[Website](https://arxiv.org/abs/2302.10167)]
[[Code](https://github.com/cross-domain-compositing/cross-domain-compositing)]**Concept-centric Personalization with Large-scale Diffusion Priors** \
[[Website](https://arxiv.org/abs/2312.08195)]
[[Code](https://github.com/PRIV-Creation/Concept-centric-Personalization)]**Customization Assistant for Text-to-image Generation** \
[[Website](https://arxiv.org/abs/2312.03045)]
[[Code](https://github.com/drboog/profusion)]**Cones 2: Customizable Image Synthesis with Multiple Subjects** \
[[Website](https://arxiv.org/abs/2305.19327v1)]
[[Code](https://github.com/ali-vilab/cones-v2)]**LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.11627)]
[[Code](https://github.com/Young98CN/LoRA_Composer)]**AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization** \
[[Website](https://arxiv.org/abs/2405.17965)]
[[Code](https://github.com/junjie-shentu/AttenCraft)]**CusConcept: Customized Visual Concept Decomposition with Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.00398)]
[[Code](https://github.com/xzLcan/CusConcept)]**HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/01439.pdf)]
[[Website](https://arxiv.org/abs/2410.08192)]
[[Project](https://sites.google.com/view/hybridbooth)]**Language-Informed Visual Concept Learning** \
[[ICLR 2024](https://arxiv.org/abs/2312.03587)]
[[Project](https://ai.stanford.edu/~yzzhang/projects/concept-axes/)]**Key-Locked Rank One Editing for Text-to-Image Personalization** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2305.01644)]
[[Project](https://research.nvidia.com/labs/par/Perfusion/)]**Diffusion in Style** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/papers/Everaert_Diffusion_in_Style_ICCV_2023_paper.pdf)]
[[Project](https://ivrl.github.io/diffusion-in-style/)]**RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization** \
[[CVPR 2024](https://arxiv.org/abs/2403.00483)]
[[Project](https://corleone-huang.github.io/realcustom/)]**RealCustom++: Representing Images as Real-Word for Real-Time Customization** \
[[Website](https://arxiv.org/abs/2408.09744)]
[[Project](https://corleone-huang.github.io/RealCustom_plus_plus/)]**Personalized Residuals for Concept-Driven Text-to-Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2405.12978)]
[[Project](https://cusuh.github.io/personalized-residuals/)]**LogoSticker: Inserting Logos into Diffusion Models for Customized Generation** \
[[ECCV 2024](https://arxiv.org/abs/2407.13752)]
[[Project](https://mingkangz.github.io/logosticker/)]**Diffusion Self-Distillation for Zero-Shot Customized Image Generation** \
[[Website](https://arxiv.org/abs/2411.18616)]
[[Project](https://primecai.github.io/dsd/)]**RelationBooth: Towards Relation-Aware Customized Object Generation** \
[[Website](https://arxiv.org/abs/2410.23280)]
[[Project](https://shi-qingyu.github.io/RelationBooth/)]**LoRACLR: Contrastive Adaptation for Customization of Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.09622)]
[[Project](https://loraclr.github.io/)]**InstructBooth: Instruction-following Personalized Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2312.03011)]
[[Project](https://sites.google.com/view/instructbooth)]**AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2406.05000)]
[[Project](https://attndreambooth.github.io/)]**MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation** \
[[Website](https://arxiv.org/abs/2404.11565)]
[[Project](https://snap-research.github.io/mixture-of-attention/)]**ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation** \
[[Website](https://arxiv.org/abs/2412.08645)]
[[Project](https://object-mate.com/)]**PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization** \
[[Website](https://arxiv.org/abs/2312.06354)]
[[Project](https://portraitbooth.github.io/)]**Subject-driven Text-to-Image Generation via Apprenticeship Learning** \
[[Website](https://arxiv.org/abs/2304.00186)]
[[Project](https://open-vision-language.github.io/suti/)]**Orthogonal Adaptation for Modular Customization of Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.02432)]
[[Project](https://ryanpo.com/ortha/)]**Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation** \
[[Website](https://arxiv.org/abs/2306.08247)]
[[Project](https://bigaandsmallq.github.io/COW/)]**HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2307.06949)]
[[Project](https://hyperdreambooth.github.io/)]**Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models** \
[[Website](https://arxiv.org/abs/2307.06925)]
[[Project](https://datencoder.github.io/)]**$P+$: Extended Textual Conditioning in Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2303.09522)]
[[Project](https://prompt-plus.github.io/)]**PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2309.05793)]
[[Project](https://photoverse2d.github.io/)]**InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning** \
[[Website](https://arxiv.org/abs/2304.03411)]
[[Project](https://jshi31.github.io/InstantBooth/)]**Total Selfie: Generating Full-Body Selfies** \
[[Website](https://arxiv.org/abs/2308.14740)]
[[Project](https://homes.cs.washington.edu/~boweiche/project_page/totalselfie/)]**PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation** \
[[Website](https://arxiv.org/abs/2411.17048)]
[[Project](https://personalvideo.github.io/)]**DreamTuner: Single Image is Enough for Subject-Driven Generation** \
[[Website](https://arxiv.org/abs/2312.13691)]
[[Project](https://dreamtuner-diffusion.github.io/)]**SerialGen: Personalized Image Generation by First Standardization Then Personalization** \
[[Website](https://arxiv.org/abs/2412.01485)]
[[Project](https://serialgen.github.io/)]**PALP: Prompt Aligned Personalization of Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2401.06105)]
[[Project](https://prompt-aligned.github.io/)]**TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion** \
[[CVPR 2024](https://arxiv.org/abs/2401.09416)]
[[Project](https://texturedreamer.github.io/)]**Visual Style Prompting with Swapping Self-Attention** \
[[Website](https://arxiv.org/abs/2402.12974)]
[[Project](https://curryjung.github.io/VisualStylePrompt/)]**Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm** \
[[Website](https://arxiv.org/abs/2403.11781)]
[[Project](https://infinite-id.github.io/)]**Non-confusing Generation of Customized Concepts in Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.06914)]
[[Project](https://clif-official.github.io/clif/)]**Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models** \
[[NeurIPS 2024](https://arxiv.org/abs/2411.01179)]**ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image** \
[[ECCV 2024](https://arxiv.org/abs/2402.11849)]**Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models** \
[[CVPR 2024](https://arxiv.org/abs/2404.03913)]**JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2407.06187)]**DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models** \
[[AAAI 2024](https://arxiv.org/abs/2309.06933)]**FreeTuner: Any Subject in Any Style with Training-free Diffusion** \
[[Website](https://arxiv.org/abs/2405.14201)]**Towards Prompt-robust Face Privacy Protection via Adversarial Decoupling Augmentation Framework** \
[[Website](https://arxiv.org/abs/2305.03980)]**InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser** \
[[Website](https://arxiv.org/abs/2311.15040)]**DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2305.03374)]**Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2304.02642)]**Gradient-Free Textual Inversion** \
[[Website](https://arxiv.org/abs/2304.05818)]**Identity Encoder for Personalized Diffusion** \
[[Website](https://arxiv.org/abs/2304.07429)]**Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation** \
[[Website](https://arxiv.org/abs/2303.09319)]**ELODIN: Naming Concepts in Embedding Spaces** \
[[Website](https://arxiv.org/abs/2303.04001)]**Generate Anything Anywhere in Any Scene** \
[[Website](https://arxiv.org/abs/2306.17154)]**Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model** \
[[Website](https://arxiv.org/abs/2306.07596)]**Face0: Instantaneously Conditioning a Text-to-Image Model on a Face** \
[[Website](https://arxiv.org/abs/2306.06638v1)]**MagiCapture: High-Resolution Multi-Concept Portrait Customization** \
[[Website](https://arxiv.org/abs/2309.06895)]**A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization** \
[[Website](https://arxiv.org/abs/2311.04315)]**DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics** \
[[Website](https://arxiv.org/abs/2311.09753)]**An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2311.11919)]**Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.13833)]**Memory-Efficient Personalization using Quantized Diffusion Model** \
[[Website](https://arxiv.org/abs/2401.04339)]**BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.13974)]**Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization** \
[[Website](https://arxiv.org/abs/2401.16762)]**Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding** \
[[Website](https://arxiv.org/abs/2401.15708)]**SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation** \
[[Website](https://arxiv.org/abs/2402.00631)]**Visual Concept-driven Image Generation with Text-to-Image Diffusion Model** \
[[Website](https://arxiv.org/abs/2402.11487)]**IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2403.13535)]**MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration** \
[[Website](https://arxiv.org/abs/2403.15059)]**DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation** \
[[Website](https://arxiv.org/abs/2403.19235)]**OneActor: Consistent Character Generation via Cluster-Conditioned Guidance** \
[[Website](https://arxiv.org/abs/2404.10267)]**StyleMaster: Towards Flexible Stylized Image Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.15287)]**Exploring Diffusion Models' Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks** \
[[Website](https://arxiv.org/abs/2405.19931)]**Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter** \
[[Website](https://arxiv.org/abs/2406.02881)]**PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction** \
[[Website](https://arxiv.org/abs/2406.05641)]**AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2406.18893)]**Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation** \
[[Website](https://arxiv.org/abs/2407.09779)]**PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control** \
[[Website](https://arxiv.org/abs/2408.05083)]**MagicID: Flexible ID Fidelity Generation System** \
[[Website](https://arxiv.org/abs/2408.09248)]**CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization** \
[[Website](https://arxiv.org/abs/2408.15914)]**ArtiFade: Learning to Generate High-quality Subject from Blemished Images** \
[[Website](https://arxiv.org/abs/2409.03745)]**CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization** \
[[Website](https://arxiv.org/abs/2409.05606)]**Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis** \
[[Website](https://arxiv.org/abs/2409.19111)]**Event-Customized Image Generation** \
[[Website](https://arxiv.org/abs/2410.02483)]**Learning to Customize Text-to-Image Diffusion in Diverse Context** \
[[Website](https://arxiv.org/abs/2410.10058)]**HYPNOS: Highly Precise Foreground-focused Diffusion Finetuning for Inanimate Objects** \
[[Website](https://arxiv.org/abs/2410.14265)]**Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator** \
[[Website](https://arxiv.org/abs/2411.15466)]**Foundation Cures Personalization: Recovering Facial Personalized Models' Prompt Consistency** \
[[Website](https://arxiv.org/abs/2411.15277)]**Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects** \
[[Website](https://arxiv.org/abs/2411.18936)]**DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.19390)]

## T2I Diffusion Model augmentation
⭐⭐⭐**Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2301.13826)]
[[Project](https://yuval-alaluf.github.io/Attend-and-Excite/)]
[[Official Code](https://github.com/yuval-alaluf/Attend-and-Excite)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_attend_and_excite.py)]
[[Diffusers doc](https://huggingface.co/docs/diffusers/api/pipelines/attend_and_excite)]
[[Replicate Demo](https://replicate.com/daanelson/attend-and-excite)]
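
A minimal sketch of the Diffusers integration; the model id and token indices follow the Diffusers doc example, and `pipe.get_indices(prompt)` reports the index of each prompt token for other prompts:

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Indices of "cat" and "frog" in the tokenized prompt (see pipe.get_indices(prompt))
image = pipe(
    prompt,
    token_indices=[2, 5],
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,  # denoising steps during which latents are optimized
).images[0]
image.save("cat_and_frog.png")
```
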
**SEGA: Instructing Diffusion using Semantic Dimensions** \
[[NeurIPS 2023](https://openreview.net/forum?id=KIPAIy329j&referrer=%5Bthe%20profile%20of%20Patrick%20Schramowski%5D(%2Fprofile%3Fid%3D~Patrick_Schramowski1))]
[[Website](https://arxiv.org/abs/2301.12247)]
[[Code](https://github.com/ml-research/semantic-image-editing)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/semantic_stable_diffusion/pipeline_semantic_stable_diffusion.py)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/api/pipelines/semantic_stable_diffusion)]
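
A minimal SEGA sketch via the Diffusers pipeline; each list entry describes one semantic edit direction, and `reverse_editing_direction` flips whether the concept is added or removed (values mirror the doc example):

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    editing_prompt=["smiling, smile"],   # semantic direction to apply
    reverse_editing_direction=[False],   # False = push towards the concept
    edit_guidance_scale=[5.0],
    edit_warmup_steps=[10],
    edit_threshold=[0.99],
)
out.images[0].save("sega_smile.png")
```
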
**Improving Sample Quality of Diffusion Models Using Self-Attention Guidance** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Hong_Improving_Sample_Quality_of_Diffusion_Models_Using_Self-Attention_Guidance_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2210.00939)]
[[Project](https://ku-cvlab.github.io/Self-Attention-Guidance/)]
[[Code Official](https://github.com/KU-CVLAB/Self-Attention-Guidance)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/api/pipelines/self_attention_guidance)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_sag.py)]
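
A minimal sketch of the Diffusers pipeline; `sag_scale` sets the strength of the self-attention guidance term, and `0.0` recovers plain classifier-free guidance:

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse on mars",
    sag_scale=0.75,       # self-attention guidance strength
    guidance_scale=7.5,   # regular classifier-free guidance
).images[0]
image.save("sag.png")
```
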
**Expressive Text-to-Image Generation with Rich Text** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Ge_Expressive_Text-to-Image_Generation_with_Rich_Text_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.06720)]
[[Project](https://rich-text-to-image.github.io/)]
[[Code](https://github.com/SongweiGe/rich-text-to-image)]
[[Demo](https://huggingface.co/spaces/songweig/rich-text-to-image)]**Editing Implicit Assumptions in Text-to-Image Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Orgad_Editing_Implicit_Assumptions_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.08084)]
[[Project](https://time-diffusion.github.io/)]
[[Code](https://github.com/bahjat-kawar/time-diffusion)]
[[Demo](https://huggingface.co/spaces/bahjat-kawar/time-diffusion)]**ElasticDiffusion: Training-free Arbitrary Size Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2311.18822)]
[[Project](https://elasticdiffusion.github.io/)]
[[Code](https://github.com/moayedhajiali/elasticdiffusion-official)]
[[Demo](https://replicate.com/moayedhajiali/elasticdiffusion)]**MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Zhao_MagicFusion_Boosting_Text-to-Image_Generation_Performance_by_Fusing_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.13126)]
[[Project](https://magicfusion.github.io/)]
[[Code](https://github.com/MagicFusion/MagicFusion.github.io)]**Discriminative Class Tokens for Text-to-Image Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Schwartz_Discriminative_Class_Tokens_for_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.17155)]
[[Project](https://vesteinn.github.io/disco/)]
[[Code](https://github.com/idansc/discriminative_class_tokens)]**Compositional Visual Generation with Composable Diffusion Models** \
[[ECCV 2022](https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/6940_ECCV_2022_paper.php)]
[[Website](https://arxiv.org/abs/2206.01714)]
[[Project](https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/)]
[[Code](https://github.com/energy-based-model/Compositional-Visual-Generation-with-Composable-Diffusion-Models-PyTorch)]**DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2402.19481)]
[[Project](https://hanlab.mit.edu/projects/distrifusion)]
[[Code](https://github.com/mit-han-lab/distrifuser)]
[[Blog](https://hanlab.mit.edu/blog/distrifusion)]**Diffusion Self-Guidance for Controllable Image Generation** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/70344)]
[[Website](https://arxiv.org/abs/2306.00986)]
[[Project](https://dave.ml/selfguidance/)]
[[Code](https://github.com/Sainzerjj/Free-Guidance-Diffusion)]**ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/72054)]
[[Website](https://arxiv.org/abs/2304.05977)]
[[Code](https://github.com/THUDM/ImageReward)]**DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/72425)]
[[Website](https://arxiv.org/abs/2306.14685)]
[[Code](https://github.com/ximinng/DiffSketcher)]**Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/72543)]
[[Website](https://arxiv.org/abs/2306.08877)]
[[Code](https://github.com/RoyiRa/Syntax-Guided-Generation)]**DemoFusion: Democratising High-Resolution Image Generation With No $$$** \
[[CVPR 2024](https://arxiv.org/abs/2311.16973)]
[[Project](https://ruoyidu.github.io/demofusion/demofusion.html)]
[[Code](https://github.com/PRIS-CV/DemoFusion)]**Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2403.05239)]
[[Project](https://hcplayercvpr2024.github.io/)]
[[Code](https://github.com/hcplayercvpr2024/hcplayer)]**Training Diffusion Models with Reinforcement Learning** \
[[ICLR 2024](https://arxiv.org/abs/2305.13301)]
[[Project](https://rl-diffusion.github.io/)]
[[Code](https://github.com/kvablack/ddpo-pytorch)]**Divide & Bind Your Attention for Improved Generative Semantic Nursing**\
[[BMVC 2023 Oral](https://arxiv.org/abs/2307.10864)]
[[Project](https://sites.google.com/view/divide-and-bind)]
[[Code](https://github.com/boschresearch/Divide-and-Bind)]**Make It Count: Text-to-Image Generation with an Accurate Number of Objects** \
[[Website](https://arxiv.org/abs/2406.10210)]
[[Project](https://make-it-count-paper.github.io//)]
[[Code](https://github.com/Litalby1/make-it-count)]**OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction** \
[[Website](https://arxiv.org/abs/2410.04932)]
[[Project](https://len-li.github.io/omnibooth-web/)]
[[Code](https://github.com/EnVision-Research/OmniBooth)]**Margin-aware Preference Optimization for Aligning Diffusion Models without Reference** \
[[Website](https://arxiv.org/abs/2406.06424)]
[[Project](https://mapo-t2i.github.io/)]
[[Code](https://github.com/mapo-t2i/mapo)]**Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step** \
[[Website](https://arxiv.org/abs/2406.04314)]
[[Project](https://rockeycoss.github.io/spo.github.io/)]
[[Code](https://github.com/RockeyCoss/SPO)]**Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2403.07860)]
[[Project](https://shihaozhaozsh.github.io/LaVi-Bridge/)]
[[Code](https://github.com/ShihaoZhaoZSH/LaVi-Bridge)]**MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts** \
[[Website](https://arxiv.org/abs/2410.23332)]
[[Project](https://sites.google.com/view/mole4diffuser/)]
[[Code](https://github.com/JiePKU/MoLE)]**Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets** \
[[Website](https://arxiv.org/abs/2412.07775)]
[[Project](https://nabla-gfn.github.io/)]
[[Code](https://github.com/lzzcd001/nabla-gfn)]**CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching** \
[[Website](https://arxiv.org/abs/2404.03653)]
[[Project](https://caraj7.github.io/comat/)]
[[Code](https://github.com/CaraJ7/CoMat)]**Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions** \
[[Website](https://arxiv.org/abs/2403.17064)]
[[Project](https://compvis.github.io/attribute-control/)]
[[Code](https://github.com/CompVis/attribute-control)]**Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance** \
[[Website](https://arxiv.org/abs/2403.17377)]
[[Project](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/)]
[[Code](https://github.com/KU-CVLAB/Perturbed-Attention-Guidance)]**Real-World Image Variation by Aligning Diffusion Inversion Chain** \
[[Website](https://arxiv.org/abs/2305.18729)]
[[Project](https://rival-diff.github.io/)]
[[Code](https://github.com/julianjuaner/RIVAL/)]**FreeU: Free Lunch in Diffusion U-Net** \
[[Website](https://arxiv.org/abs/2309.11497)]
[[Project](https://chenyangsi.top/FreeU/)]
[[Code](https://github.com/ChenyangSi/FreeU)]
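
FreeU is exposed in Diffusers as a one-line switch that re-weights the U-Net's backbone (`b1`, `b2`) and skip (`s1`, `s2`) features at inference time. A minimal sketch with the SD 1.5 values suggested in the FreeU repo:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Amplify backbone features, damp skip features; undo with pipe.disable_freeu()
pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.5, b2=1.6)

image = pipe("an astronaut riding a horse on mars").images[0]
image.save("freeu.png")
```
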
**GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis** \
[[Website](https://arxiv.org/abs/2412.06089)]
[[Project](https://dair-iitd.github.io/GraPE/)]
[[Code](https://github.com/dair-iitd/GraPE)]**ConceptLab: Creative Generation using Diffusion Prior Constraints** \
[[Website](https://arxiv.org/abs/2308.02669)]
[[Project](https://kfirgoldberg.github.io/ConceptLab/)]
[[Code](https://github.com/kfirgoldberg/ConceptLab)]**Aligning Text-to-Image Diffusion Models with Reward Backpropagation** \
[[Website](https://arxiv.org/abs/2310.03739)]
[[Project](https://align-prop.github.io/)]
[[Code](https://github.com/mihirp1998/AlignProp/)]**Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models** \
[[Website](https://arxiv.org/abs/2310.07653)]
[[Project](https://minidalle3.github.io/)]
[[Code](https://github.com/Zeqiang-Lai/Mini-DALLE3)]**ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.07702)]
[[Project](https://yingqinghe.github.io/scalecrafter/)]
[[Code](https://github.com/YingqingHe/ScaleCrafter)]**One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls** \
[[Website](https://arxiv.org/abs/2311.15744)]
[[Project](https://jabir-zheng.github.io/OneMoreStep/)]
[[Code](https://github.com/mhh0318/OneMoreStep)]**TokenCompose: Grounding Diffusion with Token-level Supervision**\
[[Website](https://arxiv.org/abs/2312.03626)]
[[Project](https://mlpc-ucsd.github.io/TokenCompose/)]
[[Code](https://github.com/mlpc-ucsd/TokenCompose)]**DiffusionGPT: LLM-Driven Text-to-Image Generation System** \
[[Website](https://arxiv.org/abs/2401.10061)]
[[Project](https://diffusiongpt.github.io/)]
[[Code](https://github.com/DiffusionGPT/DiffusionGPT)]**Decompose and Realign: Tackling Condition Misalignment in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2306.14408)]
[[Project](https://wileewang.github.io/Decompose-and-Realign/)]
[[Code](https://github.com/EnVision-Research/Decompose-and-Realign)]**Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support** \
[[Website](https://arxiv.org/abs/2401.14688)]
[[Project](https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B)]
[[Code](https://github.com/IDEA-CCNL/Fooocus-Taiyi-XL)]**ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations** \
[[Website](https://arxiv.org/abs/2312.04655)]
[[Project](https://eclipse-t2i.vercel.app/)]
[[Code](https://github.com/eclipse-t2i/eclipse-inference)]**MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion** \
[[Website](https://arxiv.org/abs/2402.12741)]
[[Project](https://measure-infinity.github.io/mulan/)]
[[Code](https://github.com/measure-infinity/mulan-code)]**ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.02084)]
[[Project](https://res-adapter.github.io/)]
[[Code](https://github.com/bytedance/res-adapter)]**Stylus: Automatic Adapter Selection for Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.18928)]
[[Project](https://stylus-diffusion.github.io/)]
[[Code](https://github.com/stylus-diffusion/stylus)]**MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.09977)]
[[Project](https://nithin-gk.github.io/maxfusion.github.io/)]
[[Code](https://github.com/Nithin-GK/MaxFusion)]**Negative Token Merging: Image-based Adversarial Feature Guidance** \
[[Website](https://arxiv.org/abs/2412.01339)]
[[Project](https://negtome.github.io/)]
[[Code](https://github.com/1jsingh/negtome)]**Iterative Object Count Optimization for Text-to-image Diffusion Models** \
[[Website](https://arxiv.org/abs/2408.11721)]
[[Project](https://ozzafar.github.io/count_token/)]
[[Code](https://github.com/ozzafar/discriminative_class_tokens_for_counting)]**ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment** \
[[Website](https://arxiv.org/abs/2403.05135)]
[[Project](https://ella-diffusion.github.io/)]
[[Code](https://github.com/ELLA-Diffusion/ELLA)]**HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts** \
[[Website](https://arxiv.org/abs/2409.02919)]
[[Project](https://liuxinyv.github.io/HiPrompt/)]
[[Code](https://github.com/Liuxinyv/HiPrompt)]**Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis** \
[[Website](https://arxiv.org/abs/2411.17769)]
[[Project](https://itsmag11.github.io/Omegance/)]
[[Code](https://github.com/itsmag11/Omegance)]**TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation** \
[[Website](https://arxiv.org/abs/2404.18919)]
[[Project](https://howe140.github.io/theatergen.io/)]
[[Code](https://github.com/donahowe/Theatergen)]**SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models** \
[[ACM MM 2023 Oral](https://arxiv.org/abs/2305.05189)]
[[Code](https://github.com/Qrange-group/SUR-adapter)]**Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models** \
[[ICLR 2024](https://arxiv.org/abs/2402.05375)]
[[Code](https://github.com/sen-mao/SuppressEOT)]**Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis** \
[[NeurIPS 2024](https://arxiv.org/abs/2411.07132)]
[[Code](https://github.com/hutaihang/ToMe)]**Dynamic Prompt Optimizing for Text-to-Image Generation** \
[[CVPR 2024](https://arxiv.org/abs/2404.04095)]
[[Code](https://github.com/Mowenyii/PAE)]**Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2403.08381)]
[[Code](https://github.com/PangzeCheung/SingDiffusion)]**Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance** \
[[CVPR 2024](https://arxiv.org/abs/2404.05384)]
[[Code](https://github.com/SmilesDZgk/S-CFG)]**InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization** \
[[CVPR 2024](https://arxiv.org/abs/2404.04650)]
[[Code](https://github.com/xiefan-guo/initno)]**Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2404.07389)]
[[Code](https://github.com/YasminZhang/EBAMA/tree/master)]**On Discrete Prompt Optimization for Diffusion Models** \
[[ICML 2024](https://arxiv.org/abs/2407.01606)]
[[Code](https://github.com/ruocwang/dpo-diffusion)]**Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function** \
[[NeurIPS 2024](https://arxiv.org/abs/2409.19967)]
[[Code](https://github.com/I2-Multimedia-Lab/Magnet)]**Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization** \
[[ACM MM 2024](https://arxiv.org/abs/2410.12700)]
[[Code](https://github.com/achernarwang/LiVO)]**DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models** \
[[NeurIPS 2023](https://arxiv.org/abs/2305.16381)]
[[Code](https://github.com/google-research/google-research/tree/master/dpok)]**Diffusion Model Alignment Using Direct Preference Optimization** \
[[Website](https://arxiv.org/abs/2311.12908)]
[[Code](https://github.com/SalesforceAIResearch/DiffusionDPO)]**SePPO: Semi-Policy Preference Optimization for Diffusion Alignment** \
[[Website](https://arxiv.org/abs/2410.05255)]
[[Code](https://github.com/DwanZhang-AI/SePPO)]**Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback** \
[[Website](https://arxiv.org/abs/2412.00122)]
[[Code](https://github.com/kingniu0329/Visions)]**Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.16333)]
[[Code](https://github.com/TruthAI-Lab/PCIG)]**Progressive Compositionality In Text-to-Image Generative Models** \
[[Website](https://arxiv.org/abs/2410.16719)]
[[Code](https://github.com/evansh666/EvoGen)]**Improving Long-Text Alignment for Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.11817)]
[[Code](https://github.com/luping-liu/LongAlign)]**Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization** \
[[Website](https://arxiv.org/abs/2406.06382)]
[[Code](https://github.com/yigu1008/Diffusion-RPO)]**RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images** \
[[Website](https://arxiv.org/abs/2409.03644)]
[[Code](https://github.com/Wangbenzhi/RealisHuman)]**Aggregation of Multi Diffusion Models for Enhancing Learned Representations** \
[[Website](https://arxiv.org/abs/2410.01262)]
[[Code](https://github.com/hammour-steak/amdm)]**AID: Attention Interpolation of Text-to-Image Diffusion** \
[[Website](https://arxiv.org/abs/2403.17924)]
[[Code](https://github.com/QY-H00/attention-interpolation-diffusion)]**Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance** \
[[Website](https://arxiv.org/abs/2410.22376)]
[[Code](https://github.com/krafton-ai/Rare2Frequent)]**FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis** \
[[Website](https://arxiv.org/abs/2403.12963)]
[[Code](https://github.com/LeonHLJ/FouriScale)]**ORES: Open-vocabulary Responsible Visual Synthesis** \
[[Website](https://arxiv.org/abs/2308.13785)]
[[Code](https://github.com/kodenii/ores)]**Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness** \
[[Website](https://arxiv.org/abs/2302.10893)]
[[Code](https://github.com/ml-research/fair-diffusion)]**Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models** \
[[Website](https://arxiv.org/abs/2406.07844)]
[[Code](https://github.com/ArmanZarei/Mitigating-T2I-Comp-Issues)]**IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2410.07171)]
[[Code](https://github.com/YangLing0818/IterComp)]**InstructG2I: Synthesizing Images from Multimodal Attributed Graphs** \
[[Website](https://arxiv.org/abs/2410.07157)]
[[Code](https://github.com/PeterGriffinJin/InstructG2I)]**Detector Guidance for Multi-Object Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2306.02236)]
[[Code](https://github.com/luping-liu/Detector-Guidance)]**Designing a Better Asymmetric VQGAN for StableDiffusion** \
[[Website](https://arxiv.org/abs/2306.04632)]
[[Code](https://github.com/buxiangzhiren/Asymmetric_VQGAN)]**FABRIC: Personalizing Diffusion Models with Iterative Feedback** \
[[Website](https://arxiv.org/abs/2307.10159)]
[[Code](https://github.com/sd-fabric/fabric)]**Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.16223)]
[[Code](https://github.com/SHI-Labs/Prompt-Free-Diffusion)]**Progressive Text-to-Image Diffusion with Soft Latent Direction** \
[[Website](https://arxiv.org/abs/2309.09466)]
[[Code](https://github.com/babahui/progressive-text-to-image)]**Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy** \
[[Website](https://arxiv.org/abs/2310.09247)]
[[Code](https://github.com/yandex-research/text-to-img-hypernymy)]**TraDiffusion: Trajectory-Based Training-Free Image Generation** \
[[Website](https://arxiv.org/abs/2408.09739)]
[[Code](https://github.com/och-mac/TraDiffusion)]**If at First You Don’t Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection** \
[[Website](https://arxiv.org/abs/2305.13308)]
[[Code](https://github.com/ExplainableML/ImageSelect)]**LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts** \
[[Website](https://arxiv.org/abs/2310.10640)]
[[Code](https://github.com/hananshafi/llmblueprint)]**Making Multimodal Generation Easier: When Diffusion Models Meet LLMs** \
[[Website](https://arxiv.org/abs/2310.08949)]
[[Code](https://github.com/zxy556677/EasyGen)]**Enhancing Diffusion Models with Text-Encoder Reinforcement Learning** \
[[Website](https://arxiv.org/abs/2311.15657)]
[[Code](https://github.com/chaofengc/texforce)]**AltDiffusion: A Multilingual Text-to-Image Diffusion Model** \
[[Website](https://arxiv.org/abs/2308.09991)]
[[Code](https://github.com/superhero-7/AltDiffusion)]**It is all about where you start: Text-to-image generation with seed selection** \
[[Website](https://arxiv.org/abs/2304.14530)]
[[Code](https://github.com/dvirsamuel/SeedSelect)]**End-to-End Diffusion Latent Optimization Improves Classifier Guidance** \
[[Website](https://arxiv.org/abs/2303.13703)]
[[Code](https://github.com/salesforce/doodl)]**Correcting Diffusion Generation through Resampling** \
[[Website](https://arxiv.org/abs/2312.06038)]
[[Code](https://github.com/ucsb-nlp-chang/diffusion_resampling)]**Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs** \
[[Website](https://arxiv.org/abs/2401.11708)]
[[Code](https://github.com/YangLing0818/RPG-DiffusionMaster)]**Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation** \
[[Website](https://arxiv.org/abs/2411.18301)]
[[Code](https://github.com/wtybest/enmmdit)]**A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2402.12760)]
[[Code](https://github.com/naylenv/uf-fgtg)]**PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement** \
[[Website](https://arxiv.org/abs/2403.04014)]
[[Code](https://github.com/ma-labo/promptcharm)]**Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.06381)]
[[Code](https://github.com/YaNgZhAnG-V5/attention_regulation)]**Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2403.07860)]
[[Code](https://github.com/ShihaoZhaoZSH/LaVi-Bridge)]**Aligning Few-Step Diffusion Models with Dense Reward Difference Learning** \
[[Website](https://arxiv.org/abs/2411.11727)]
[[Code](https://github.com/ZiyiZhang27/sdpo)]**LightIt: Illumination Modeling and Control for Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2403.10615)]
[[Project](https://peter-kocsis.github.io/LightIt/)]

**Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.21638)]
[[Project](https://deepaksridhar.github.io/factorgraphdiffusion.github.io/)]

**Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG** \
[[Website](https://arxiv.org/abs/2412.09614)]
[[Project](https://context-canvas.github.io/)]

**Scalable Ranked Preference Optimization for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2410.18013)]
[[Project](https://snap-research.github.io/RankDPO/)]

**A Noise is Worth Diffusion Guidance** \
[[Website](https://arxiv.org/abs/2412.03895)]
[[Project](https://cvlab-kaist.github.io/NoiseRefine/)]

**LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors** \
[[Website](https://arxiv.org/abs/2412.04460)]
[[Project](https://layerfusion.github.io/)]

**ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2410.01731)]
[[Project](https://comfygen-paper.github.io/)]

**LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2407.00737)]
[[Project](https://xiaobul.github.io/LLM4GEN/)]

**RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance** \
[[Website](https://arxiv.org/abs/2405.17661)]
[[Project](https://sbyebss.github.io/refdrop/)]

**UniFL: Improve Stable Diffusion via Unified Feedback Learning** \
[[Website](https://arxiv.org/abs/2404.05595)]
[[Project](https://uni-fl.github.io/)]

**Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2412.02168)]
[[Project](https://generative-photography.github.io/project/)]

**ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting** \
[[Website](https://arxiv.org/abs/2411.17176)]
[[Project](https://chengyou-jia.github.io/ChatGen-Home/)]

**Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2403.16990)]
[[Project](https://omer11a.github.io/bounded-attention/)]

**Semantic Guidance Tuning for Text-To-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.15964)]
[[Project](https://korguy.github.io/)]

**Amazing Combinatorial Creation: Acceptable Swap-Sampling for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2310.01819)]
[[Project](https://asst2i.github.io/anon/)]

**Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation** \
[[Website](https://arxiv.org/abs/2401.17664)]
[[Project](https://vlislab22.github.io/ImageAnything/)]

**DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling** \
[[Website](https://arxiv.org/abs/2412.00759)]
[[Project](https://shelsin.github.io/dymo.github.io/)]

**Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation** \
[[Website](https://arxiv.org/abs/2402.10491)]
[[Project](https://guolanqing.github.io/Self-Cascade/)]

**FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes** \
[[Website](https://arxiv.org/abs/2402.18331)]
[[Project](https://finediffusion.github.io/)]

**Lazy Diffusion Transformer for Interactive Image Editing** \
[[Website](https://arxiv.org/abs/2404.12382)]
[[Project](https://lazydiffusion.github.io/)]

**Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis** \
[[Website](https://arxiv.org/abs/2404.13686)]
[[Project](https://hyper-sd.github.io/)]

**Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.13706)]
[[Project](https://cs-people.bu.edu/vpetsiuk/arc/)]

**Norm-guided latent space exploration for text-to-image generation** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/70922)]
[[Website](https://arxiv.org/abs/2306.08687)]

**Improving Diffusion-Based Image Synthesis with Context Prediction** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/70058)]
[[Website](https://arxiv.org/abs/2401.02015)]

**GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections** \
[[ECCV 2024](https://arxiv.org/abs/2408.12352)]

**MultiGen: Zero-shot Image Generation from Multi-modal Prompt** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/01296.pdf)]

**On Mechanistic Knowledge Localization in Text-to-Image Generative Models** \
[[ICML 2024](https://arxiv.org/abs/2405.01008)]

**Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.00447)]

**Generating Compositional Scenes via Text-to-image RGBA Instance Generation** \
[[NeurIPS 2024](https://arxiv.org/abs/2411.10913)]

**A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization** \
[[Website](https://arxiv.org/abs/2410.00321)]

**PROUD: PaRetO-gUided Diffusion Model for Multi-objective Generation** \
[[Website](https://arxiv.org/abs/2407.04493)]

**Exposure Diffusion: HDR Image Generation by Consistent LDR denoising** \
[[Website](https://arxiv.org/abs/2405.14304)]

**Information Theoretic Text-to-Image Alignment** \
[[Website](https://arxiv.org/abs/2405.20759)]

**Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2404.09976)]

**Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control** \
[[Website](https://arxiv.org/abs/2404.13766)]

**Aligning Diffusion Models by Optimizing Human Utility** \
[[Website](https://arxiv.org/abs/2404.04465)]

**Instruct-Imagen: Image Generation with Multi-modal Instruction** \
[[Website](https://arxiv.org/abs/2401.01952)]

**CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.06059)]

**MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask** \
[[Website](https://arxiv.org/abs/2309.04399)]

**Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images** \
[[Website](https://arxiv.org/abs/2308.16582)]

**Text2Layer: Layered Image Generation using Latent Diffusion Model** \
[[Website](https://arxiv.org/abs/2307.09781)]

**Stimulating the Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling** \
[[Website](https://arxiv.org/abs/2307.03992)]

**A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation** \
[[Website](https://arxiv.org/abs/2310.16656)]

**UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion** \
[[Website](https://arxiv.org/abs/2401.13388)]

**Improving Compositional Text-to-image Generation with Large Vision-Language Models** \
[[Website](https://arxiv.org/abs/2310.06311)]

**Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else** \
[[Website](https://arxiv.org/abs/2310.07419)]

**Unseen Image Synthesis with Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.09213)]

**AnyLens: A Generative Diffusion Model with Any Rendering Lens** \
[[Website](https://arxiv.org/abs/2311.17609)]

**Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering** \
[[Website](https://arxiv.org/abs/2401.06345)]

**Text2Street: Controllable Text-to-image Generation for Street Views** \
[[Website](https://arxiv.org/abs/2402.04504)]

**Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2402.10210)]

**Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Model** \
[[Website](https://arxiv.org/abs/2402.13490)]

**Debiasing Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.14577)]

**Stochastic Conditional Diffusion Models for Semantic Image Synthesis** \
[[Website](https://arxiv.org/abs/2402.16506)]

**Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion** \
[[Website](https://arxiv.org/abs/2402.16305)]

**Transparent Image Layer Diffusion using Latent Transparency** \
[[Website](https://arxiv.org/abs/2402.17113)]

**Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2402.17245)]

**HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances** \
[[Website](https://arxiv.org/abs/2403.01693)]

**StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.04965)]

**Make Me Happier: Evoking Emotions Through Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.08255)]

**Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model** \
[[Website](https://arxiv.org/abs/2403.11077)]

**LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model** \
[[Website](https://arxiv.org/abs/2403.11929)]

**AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2403.13352)]

**U-Sketch: An Efficient Approach for Sketch to Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.18425)]

**ECNet: Effective Controllable Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.18417)]

**TextCraftor: Your Text Encoder Can be Image Quality Controller** \
[[Website](https://arxiv.org/abs/2403.18978)]

**Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding** \
[[Website](https://arxiv.org/abs/2404.11589)]

**Towards Better Text-to-Image Generation Alignment via Attention Modulation** \
[[Website](https://arxiv.org/abs/2404.13899)]

**Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model** \
[[Website](https://arxiv.org/abs/2405.15330)]

**SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance** \
[[Website](https://arxiv.org/abs/2405.15321)]

**Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance** \
[[Website](https://arxiv.org/abs/2406.04551)]

**Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2408.00230)]

**FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting** \
[[Website](https://arxiv.org/abs/2408.11706)]

**Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models** \
[[Website](https://arxiv.org/abs/2408.14135)]

**SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation** \
[[Website](https://arxiv.org/abs/2409.01327)]

**Training-Free Sketch-Guided Diffusion with Latent Optimization** \
[[Website](https://arxiv.org/abs/2409.00313)]

**Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization** \
[[Website](https://arxiv.org/abs/2410.03190)]

**Sparse Repellency for Shielded Generation in Text-to-image Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.06025)]

**Training-free Diffusion Model Alignment with Sampling Demons** \
[[Website](https://arxiv.org/abs/2410.05760)]

**MinorityPrompt: Text to Minority Image Generation via Prompt Optimization** \
[[Website](https://arxiv.org/abs/2410.07838)]

**Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.10166)]

**Saliency Guided Optimization of Diffusion Latents** \
[[Website](https://arxiv.org/abs/2410.10257)]

**Preference Optimization with Multi-Sample Comparisons** \
[[Website](https://arxiv.org/abs/2410.12138)]

**CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning** \
[[Website](https://arxiv.org/abs/2410.11963)]

**Redefining in Dictionary: Towards a Enhanced Semantic Understanding of Creative Generation** \
[[Website](https://arxiv.org/abs/2410.24160)]

**Investigating Conceptual Blending of a Diffusion Model for Improving Nonword-to-Image Generation** \
[[Website](https://arxiv.org/abs/2411.03595)]

**Improving image synthesis with diffusion-negative sampling** \
[[Website](https://arxiv.org/abs/2411.05473)]

**Golden Noise for Diffusion Models: A Learning Framework** \
[[Website](https://arxiv.org/abs/2411.09502)]

**Test-time Conditional Text-to-Image Synthesis Using Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.10800)]

**Decoupling Training-Free Guided Diffusion by ADMM** \
[[Website](https://arxiv.org/abs/2411.12773)]

**Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps** \
[[Website](https://arxiv.org/abs/2411.15236)]

**Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2411.16503)]

**TKG-DM: Training-free Chroma Key Content Generation Diffusion Model** \
[[Website](https://arxiv.org/abs/2411.15580)]

**Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory** \
[[Website](https://arxiv.org/abs/2411.17472)]

**CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2411.16783)]

**Reward Incremental Learning in Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2411.17310)]

**QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain** \
[[Website](https://arxiv.org/abs/2411.19534)]

**Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds** \
[[Website](https://arxiv.org/abs/2411.18810)]

**Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models** \
[[Website](https://arxiv.org/abs/2412.02237)]

**The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation** \
[[Website](https://arxiv.org/abs/2412.05101)]

**ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance** \
[[Website](https://arxiv.org/abs/2412.06163)]

**Visual Lexicon: Rich Image Features in Language Space** \
[[Website](https://arxiv.org/abs/2412.06774)]

**BudgetFusion: Perceptually-Guided Adaptive Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.05780)]

## Spatial Control

**MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation** \
[[ICML 2023](https://icml.cc/virtual/2023/poster/23809)]
[[ICML 2023](https://dl.acm.org/doi/10.5555/3618408.3618482)]
[[Website](https://arxiv.org/abs/2302.08113)]
[[Project](https://multidiffusion.github.io/)]
[[Code](https://github.com/omerbt/MultiDiffusion)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_panorama.py)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/api/pipelines/panorama)]
[[Replicate Demo](https://replicate.com/omerbt/multidiffusion)]

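A minimal usage sketch of the Diffusers panorama pipeline linked above; the checkpoint, prompt, and output width are illustrative:

```python
# MultiDiffusion panorama generation via diffusers (illustrative values).
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# Overlapping diffusion paths are fused so a 512-px model yields a seamless panorama.
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
image.save("panorama.png")
```

**SceneComposer: Any-Level Semantic Image Synthesis** \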
[[CVPR 2023 Highlight](https://openaccess.thecvf.com/content/CVPR2023/papers/Zeng_SceneComposer_Any-Level_Semantic_Image_Synthesis_CVPR_2023_paper.pdf)]
[[Website](https://arxiv.org/abs/2211.11742)]
[[Project](https://zengyu.me/scenec/)]
[[Code](https://github.com/zengxianyu/scenec)]

**GLIGEN: Open-Set Grounded Text-to-Image Generation** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Li_GLIGEN_Open-Set_Grounded_Text-to-Image_Generation_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2301.07093)]
[[Code](https://github.com/gligen/GLIGEN)]
[[Demo](https://huggingface.co/spaces/gligen/demo)]

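A minimal sketch of box-grounded generation with the GLIGEN pipeline in Diffusers; the checkpoint name, phrases, and boxes below are illustrative:

```python
# Grounded text-to-image generation with GLIGEN via diffusers.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a waterbottle and a carton on a table",
    gligen_phrases=["a waterbottle", "a carton"],  # one phrase per box
    gligen_boxes=[[0.14, 0.4, 0.38, 0.9], [0.56, 0.4, 0.86, 0.9]],  # xyxy in [0, 1]
    gligen_scheduled_sampling_beta=1.0,  # fraction of steps that apply grounding
    num_inference_steps=50,
).images[0]
image.save("gligen_boxes.png")
```

**Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis** \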
[[ICLR 2023](https://openreview.net/forum?id=PUIqjT4rzq7)]
[[Website](https://arxiv.org/abs/2212.05032)]
[[Project](https://weixi-feng.github.io/structure-diffusion-guidance/)]
[[Code](https://github.com/shunk031/training-free-structured-diffusion-guidance)]

**Visual Programming for Text-to-Image Generation and Evaluation** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/69940)]
[[Website](https://arxiv.org/abs/2305.15328)]
[[Project](https://vp-t2i.github.io/)]
[[Code](https://github.com/j-min/VPGen)]

**GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation** \
[[ICLR 2024](https://openreview.net/forum?id=xBfQZWeDRH)]
[[Website](https://arxiv.org/abs/2306.04607)]
[[Project](https://kaichen1998.github.io/projects/geodiffusion/)]
[[Code](https://github.com/KaiChen1998/GeoDiffusion/tree/main)]

**GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.20474)]
[[Project](https://groundit-visualai.github.io/)]
[[Code](https://github.com/KAIST-Visual-AI-Group/GrounDiT/)]

**ReCo: Region-Controlled Text-to-Image Generation** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/papers/Yang_ReCo_Region-Controlled_Text-to-Image_Generation_CVPR_2023_paper.pdf)]
[[Website](https://arxiv.org/abs/2211.15518)]
[[Code](https://github.com/microsoft/ReCo)]

**Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Wu_Harnessing_the_Spatial-Temporal_Attention_of_Diffusion_Models_for_High-Fidelity_Text-to-Image_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.03869)]
[[Code](https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn)]

**BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Xie_BoxDiff_Text-to-Image_Synthesis_with_Training-Free_Box-Constrained_Diffusion_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2307.10816)]
[[Code](https://github.com/Sierkinhane/BoxDiff)]

**Dense Text-to-Image Generation with Attention Modulation** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Kim_Dense_Text-to-Image_Generation_with_Attention_Modulation_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2308.12964)]
[[Code](https://github.com/naver-ai/densediffusion)]

**LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models** \
[[Website](https://arxiv.org/abs/2305.13655)]
[[Project](https://llm-grounded-diffusion.github.io/)]
[[Code](https://github.com/TonyLianLong/LLM-groundedDiffusion)]
[[Demo](https://huggingface.co/spaces/longlian/llm-grounded-diffusion)]
[[Blog](https://bair.berkeley.edu/blog/2023/05/23/lmd/)]

**StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control** \
[[CVPR 2024](https://arxiv.org/abs/2403.09055)]
[[Code](https://github.com/ironjr/StreamMultiDiffusion)]
[[Project](https://jaerinlee.com/research/streammultidiffusion)]

**MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis** \
[[CVPR 2024](https://arxiv.org/abs/2402.05408)]
[[Project](https://migcproject.github.io/)]
[[Code](https://github.com/limuloo/MIGC)]

**Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language** \
[[Website](https://arxiv.org/abs/2406.20085)]
[[Project](https://yichengchen24.github.io/projects/autocherrypicker/)]
[[Code](https://github.com/yichengchen24/ACP)]

**Training-Free Layout Control with Cross-Attention Guidance** \
[[Website](https://arxiv.org/abs/2304.03373)]
[[Project](https://silent-chen.github.io/layout-guidance/)]
[[Code](https://github.com/silent-chen/layout-guidance)]

**ROICtrl: Boosting Instance Control for Visual Generation** \
[[Website](https://arxiv.org/abs/2411.17949)]
[[Project](https://roictrl.github.io/)]
[[Code](https://github.com/showlab/ROICtrl)]

**CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation** \
[[Website](https://arxiv.org/abs/2412.03859)]
[[Project](https://creatilayout.github.io/)]
[[Code](https://github.com/HuiZhang0812/CreatiLayout)]

**Directed Diffusion: Direct Control of Object Placement through Attention Guidance** \
[[Website](https://arxiv.org/abs/2302.13153)]
[[Project](https://hohonu-vicml.github.io/DirectedDiffusion.Page/)]
[[Code](https://github.com/hohonu-vicml/DirectedDiffusion)]

**Grounded Text-to-Image Synthesis with Attention Refocusing** \
[[Website](https://arxiv.org/abs/2306.05427)]
[[Project](https://attention-refocusing.github.io/)]
[[Code](https://github.com/Attention-Refocusing/attention-refocusing)]

**eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers** \
[[Website](https://arxiv.org/abs/2211.01324)]
[[Project](https://research.nvidia.com/labs/dir/eDiff-I/)]
[[Code](https://github.com/cloneofsimo/paint-with-words-sd)]

**LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2308.05095)]
[[Project](https://layoutllm-t2i.github.io/)]
[[Code](https://github.com/LayoutLLM-T2I/LayoutLLM-T2I)]

**Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.13921)]
[[Project](https://oppo-mente-lab.github.io/compositional_t2i/)]
[[Code](https://github.com/OPPO-Mente-Lab/attention-mask-control)]

**R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation** \
[[Website](https://arxiv.org/abs/2310.08872)]
[[Project](https://sagileo.github.io/Region-and-Boundary/)]
[[Code](https://github.com/StevenShaw1999/RnB)]

**FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition** \
[[Website](https://arxiv.org/abs/2312.07536)]
[[Project](https://genforce.github.io/freecontrol/)]
[[Code](https://github.com/genforce/freecontrol)]

**InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.05849)]
[[Project](https://jiuntian.github.io/interactdiffusion/)]
[[Code](https://github.com/jiuntian/interactdiffusion)]

**Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following** \
[[Website](https://arxiv.org/abs/2311.17002)]
[[Project](https://ranni-t2i.github.io/Ranni/)]
[[Code](https://github.com/ali-vilab/Ranni)]

**InstanceDiffusion: Instance-level Control for Image Generation** \
[[Website](https://arxiv.org/abs/2402.03290)]
[[Project](https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/)]
[[Code](https://github.com/frank-xwang/InstanceDiffusion)]

**Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis** \
[[CVPR 2024](https://arxiv.org/abs/2402.18078)]
[[Code](https://github.com/YanzuoLu/CFLD)]

**NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging** \
[[CVPR 2024](https://arxiv.org/abs/2403.03485)]
[[Code](https://github.com/univ-esuty/noisecollage)]

**Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2308.06027)]
[[Code](https://github.com/endo-yuki-t/MAG)]

**Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation** \
[[Website](https://arxiv.org/abs/2409.04847)]
[[Code](https://github.com/cplusx/rich_context_L2I/tree/main)]

**Enhancing Object Coherence in Layout-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2311.10522)]
[[Code](https://github.com/CodeGoat24/EOCNet)]

**Training-free Regional Prompting for Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2411.02395)]
[[Code](https://github.com/instantX-research/Regional-Prompting-FLUX)]

**DivCon: Divide and Conquer for Progressive Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2403.06400)]
[[Code](https://github.com/DivCon-gen/DivCon)]

**RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.12908)]
[[Code](https://github.com/YangLing0818/RealCompo)]

**HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation** \
[[Website](https://arxiv.org/abs/2410.14324)]
[[Code](https://github.com/360CVGroup/HiCo_T2I)]

**Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis** \
[[ECCV 2024](https://arxiv.org/abs/2311.18435)]
[[Project](https://qizipeng.github.io/LRDiff_projectPage/)]

**ReCorD: Reasoning and Correcting Diffusion for HOI Generation** \
[[ACM MM 2024](https://arxiv.org/abs/2407.17911)]
[[Project](https://alberthkyhky.github.io/ReCorD/)]

**Compositional Text-to-Image Generation with Dense Blob Representations** \
[[Website](https://arxiv.org/abs/2405.08246)]
[[Project](https://blobgen-2d.github.io/)]

**GroundingBooth: Grounding Text-to-Image Customization** \
[[Website](https://arxiv.org/abs/2409.08520)]
[[Project](https://groundingbooth.github.io/)]

**Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2311.15773)]
[[Project](https://simm-t2i.github.io/SimM/)]

**ReGround: Improving Textual and Spatial Grounding at No Cost** \
[[Website](https://arxiv.org/abs/2403.13589)]
[[Project](https://re-ground.github.io/)]

**DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception** \
[[CVPR 2024](https://arxiv.org/abs/2403.13304)]

**Guided Image Synthesis via Initial Image Editing in Diffusion Model** \
[[ACM MM 2023](https://arxiv.org/abs/2305.03382)]

**Training-free Composite Scene Generation for Layout-to-Image Synthesis** \
[[ECCV 2024](https://arxiv.org/abs/2407.13609)]

**LSReGen: Large-Scale Regional Generator via Backward Guidance Framework** \
[[Website](https://arxiv.org/abs/2407.15066)]

**Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion** \
[[Website](https://arxiv.org/abs/2404.14768)]

**Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching** \
[[Website](https://arxiv.org/abs/2408.13858)]

**Boundary Attention Constrained Zero-Shot Layout-To-Image Generation** \
[[Website](https://arxiv.org/abs/2411.10495)]

**Enhancing Image Layout Control with Loss-Guided Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.14101)]

**GLoD: Composing Global Contexts and Local Details in Image Generation** \
[[Website](https://arxiv.org/abs/2404.15447)]

**A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis** \
[[Website](https://arxiv.org/abs/2306.14544)]

**Controllable Text-to-Image Generation with GPT-4** \
[[Website](https://arxiv.org/abs/2305.18583)]

**Localized Text-to-Image Generation for Free via Cross Attention Control** \
[[Website](https://arxiv.org/abs/2306.14636)]

**Training-Free Location-Aware Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2304.13427)]

**Composite Diffusion | whole >= \Sigma parts** \
[[Website](https://arxiv.org/abs/2307.13720)]

**Continuous Layout Editing of Single Images with Diffusion Models** \
[[Website](https://arxiv.org/abs/2306.13078)]

**Zero-shot spatial layout conditioning for text-to-image diffusion models** \
[[Website](https://arxiv.org/abs/2306.13754)]

**Obtaining Favorable Layouts for Multiple Object Generation** \
[[Website](https://arxiv.org/abs/2405.00791)]

**LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2311.12342)]

**Self-correcting LLM-controlled Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.16090)]

**Joint Generative Modeling of Scene Graphs and Images via Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.01130)]

**Spatial-Aware Latent Initialization for Controllable Image Generation** \
[[Website](https://arxiv.org/abs/2401.16157)]

**Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control** \
[[Website](https://arxiv.org/abs/2402.13404)]

**ObjBlur: A Curriculum Learning Approach With Progressive Object-Level Blurring for Improved Layout-to-Image Generation** \
[[Website](https://arxiv.org/abs/2404.07564)]

**The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise** \
[[Website](https://arxiv.org/abs/2406.01970)]

**Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis** \
[[Website](https://arxiv.org/abs/2406.04032)]

**SpotActor: Training-Free Layout-Controlled Consistent Image Generation** \
[[Website](https://arxiv.org/abs/2409.04801)]

**IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2409.08240)]

**Scribble-Guided Diffusion for Training-free Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2409.08026)]

**3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2410.12669)]

**Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement** \
[[Website](https://arxiv.org/abs/2411.06558)]

## I2I translation

⭐⭐⭐**SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations** \
[[ICLR 2022](https://openreview.net/forum?id=aBsCjcPu_tE)]
[[Website](https://arxiv.org/abs/2108.01073)]
[[Project](https://sde-image-editing.github.io/)]
[[Code](https://github.com/ermongroup/SDEdit)]

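SDEdit's noise-then-denoise recipe is what the Diffusers image-to-image pipeline exposes: the guide image is partially noised (`strength`) and then denoised toward the prompt. A minimal sketch with illustrative checkpoint, input, and values:

```python
# SDEdit-style guided synthesis via the diffusers img2img pipeline.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png").resize((512, 512))  # rough guide image
image = pipe(
    "a fantasy landscape, detailed oil painting",
    image=init_image,
    strength=0.75,  # noise level: 0 keeps the input, 1 ignores it
    guidance_scale=7.5,
).images[0]
image.save("sdedit_result.png")
```

⭐⭐⭐**DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation** \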
[[CVPR 2022](https://openaccess.thecvf.com/content/CVPR2022/html/Kim_DiffusionCLIP_Text-Guided_Diffusion_Models_for_Robust_Image_Manipulation_CVPR_2022_paper.html)]
[[Website](https://arxiv.org/abs/2110.02711)]
[[Code](https://github.com/gwang-kim/DiffusionCLIP)]

**CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation** \
[[NeurIPS 2023](https://neurips.cc/virtual/2023/poster/69913)]
[[Website](https://arxiv.org/abs/2310.13165)]
[[Project](https://cyclenetweb.github.io/)]
[[Code](https://github.com/sled-group/cyclenet)]

**DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations** \
[[CVPR 2024](https://arxiv.org/abs/2403.06951)]
[[Project](https://tianhao-qi.github.io/DEADiff/)]
[[Code](https://github.com/Tianhao-Qi/DEADiff_code)]

**Diffusion-based Image Translation using Disentangled Style and Content Representation** \
[[ICLR 2023](https://openreview.net/forum?id=Nayau9fwXU)]
[[Website](https://arxiv.org/abs/2209.15264)]
[[Code](https://github.com/cyclomon/DiffuseIT)]

**FlexIT: Towards Flexible Semantic Image Translation** \
[[CVPR 2022](https://openaccess.thecvf.com/content/CVPR2022/html/Couairon_FlexIT_Towards_Flexible_Semantic_Image_Translation_CVPR_2022_paper.html)]
[[Website](https://arxiv.org/abs/2203.04705)]
[[Code](https://github.com/facebookresearch/semanticimagetranslation)]

**Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Yang_Zero-Shot_Contrastive_Loss_for_Text-Guided_Diffusion_Image_Style_Transfer_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.08622)]
[[Code](https://github.com/YSerin/ZeCon)]

**E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation** \
[[ICML 2024](https://arxiv.org/abs/2401.06127)]
[[Project](https://yifanfanfanfan.github.io/e2gan/)]
[[Code](https://github.com/Yifanfanfanfan/Yifanfanfanfan.github.io/tree/main/e2gan)]

**Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.07008)]
[[Project](https://sooyeon-go.github.io/eye_for_an_eye/)]
[[Code](https://github.com/sooyeon-go/eye_for_an_eye)]

**Cross-Image Attention for Zero-Shot Appearance Transfer** \
[[Website](https://arxiv.org/abs/2311.03335)]
[[Project](https://garibida.github.io/cross-image-attention/)]
[[Code](https://github.com/garibida/cross-image-attention)]

**FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.14429)]
[[Project](https://rickhh.github.io/FashionR2R/)]
[[Code](https://github.com/Style3D/FashionR2R)]

**Diffusion Guided Domain Adaptation of Image Generators** \
[[Website](https://arxiv.org/abs/2212.04473)]
[[Project](https://styleganfusion.github.io/)]
[[Code](https://github.com/KunpengSong/styleganfusion)]

**Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.12092)]
[[Project](https://sliders.baulab.info/)]
[[Code](https://github.com/rohitgandikota/sliders)]

**FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.15636)]
[[Project](https://freestylefreelunch.github.io/)]
[[Code](https://github.com/FreeStyleFreeLunch/FreeStyle)]

**FilterPrompt: Guiding Image Transfer in Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.13263)]
[[Project](https://meaoxixi.github.io/FilterPrompt/)]
[[Code](https://github.com/Meaoxixi/FilterPrompt)]

**Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization** \
[[ECCV 2024](https://arxiv.org/abs/2407.04245)]
[[Code](https://github.com/Kaminyou/Dense-Normalization)]

**One-Shot Structure-Aware Stylized Image Synthesis** \
[[CVPR 2024](https://arxiv.org/abs/2402.17275)]
[[Code](https://github.com/hansam95/osasis)]

**BBDM: Image-to-image Translation with Brownian Bridge Diffusion Models** \
[[CVPR 2023](https://arxiv.org/abs/2205.07680)]
[[Code](https://github.com/xuekt98/BBDM)]

**Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile** \
[[AAAI 2024](https://arxiv.org/abs/2403.05093)]
[[Code](https://github.com/ykykyk112/STIG)]

**Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation** \
[[AAAI 2024](https://arxiv.org/abs/2407.03006)]
[[Code](https://github.com/XiangGao1102/FCDiffusion)]

**ZePo: Zero-Shot Portrait Stylization with Faster Sampling** \
[[ACM MM 2024](https://arxiv.org/abs/2408.05492)]
[[Code](https://github.com/liujin112/ZePo)]

**DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer** \
[[ACM MM Asia 2024](https://arxiv.org/abs/2410.15007)]
[[Code](https://github.com/I2-Multimedia-Lab/DiffuseST)]

**TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control** \
[[Website](https://arxiv.org/abs/2410.10133)]
[[Code](https://github.com/weichaozeng/TextCtrl)]

**Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance** \
[[Website](https://arxiv.org/abs/2306.04396)]
[[Code](https://github.com/submissionanon18/agg)]

**Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis** \
[[Website](https://arxiv.org/abs/2408.16845)]
[[Code](https://zelaki.github.io/localdiff/)]

**PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions** \
[[Website](https://arxiv.org/abs/2409.15278)]
[[Code](https://github.com/AFeng-x/PixWizard)]

**GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis** \
[[Website](https://arxiv.org/abs/2401.15282)]
[[Code](https://github.com/isbrycee/GEM-Glass-Segmentor)]

**CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion** \
[[Website](https://arxiv.org/abs/2401.14066)]
[[Code](https://github.com/haha-lisa/creativesynth)]

**PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering** \
[[Website](https://arxiv.org/abs/2403.05053)]
[[Code](https://github.com/CodeGoat24/PrimeComposer)]

**One-Step Image Translation with Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2403.12036)]
[[Code](https://github.com/GaParmar/img2img-turbo)]

**D2Styler: Advancing Arbitrary Style Transfer with Discrete Diffusion Methods** \
[[Website](https://arxiv.org/abs/2408.03558)]
[[Code](https://github.com/Onkarsus13/D2Styler)]

**StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Wang_StyleDiffusion_Controllable_Disentangled_Style_Transfer_via_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2308.07863)]

**ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors** \
[[ACM MM 2023](https://arxiv.org/abs/2311.05463)]

**High-Fidelity Diffusion-based Image Editing** \
[[AAAI 2024](https://arxiv.org/abs/2312.15707)]

**EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/02096.pdf)]

**Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer** \
[[Website](https://arxiv.org/abs/2410.01366)]

**UniHDA: Towards Universal Hybrid Domain Adaptation of Image Generators** \
[[Website](https://arxiv.org/abs/2401.12596)]

**Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation** \
[[Website](https://arxiv.org/abs/2406.14762)]

**TEXTOC: Text-driven Object-Centric Style Transfer** \
[[Website](https://arxiv.org/abs/2408.08461)]

**Seed-to-Seed: Image Translation in Diffusion Seed Space** \
[[Website](https://arxiv.org/abs/2409.00654)]

**Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation** \
[[Website](https://arxiv.org/abs/2409.08077)]

**Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image Translation** \
[[Website](https://arxiv.org/abs/2411.14863)]

## Segmentation Detection Tracking

**ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models** \
[[CVPR 2023 Highlight](https://arxiv.org/abs/2303.04803)]
[[Project](https://jerryxu.net/ODISE/)]
[[Code](https://github.com/NVlabs/ODISE)]
[[Demo](https://huggingface.co/spaces/xvjiarui/ODISE)]

**LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation** \
[[ICCV 2023](https://arxiv.org/abs/2303.12343)]
[[Project](https://koutilya-pnvr.github.io/LD-ZNet/)]
[[Code](https://github.com/koutilya-pnvr/LD-ZNet)]

**Text-Image Alignment for Diffusion-Based Perception** \
[[CVPR 2024](https://openaccess.thecvf.com/content/CVPR2024/html/Kondapaneni_Text-Image_Alignment_for_Diffusion-Based_Perception_CVPR_2024_paper.html)]
[[Website](https://arxiv.org/abs/2310.00031)]
[[Project](https://www.vision.caltech.edu/tadp/)]
[[Code](https://github.com/damaggu/TADP)]

**Stochastic Segmentation with Conditional Categorical Diffusion Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Zbinden_Stochastic_Segmentation_with_Conditional_Categorical_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.08888)]
[[Code](https://github.com/LarsDoorenbos/ccdm-stochastic-segmentation)]

**DDP: Diffusion Model for Dense Visual Prediction** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Ji_DDP_Diffusion_Model_for_Dense_Visual_Prediction_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.17559)]
[[Code](https://github.com/JiYuanFeng/DDP)]

**DiffusionDet: Diffusion Model for Object Detection** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Chen_DiffusionDet_Diffusion_Model_for_Object_Detection_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.09788)]
[[Code](https://github.com/shoufachen/diffusiondet)]

**OVTrack: Open-Vocabulary Multiple Object Tracking** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Li_OVTrack_Open-Vocabulary_Multiple_Object_Tracking_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.08408)]
[[Project](https://www.vis.xyz/pub/ovtrack/)]

**SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/71719)]
[[Website](https://arxiv.org/abs/2312.12425)]
[[Code](https://github.com/MengyuWang826/SegRefiner)]

**DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction** \
[[CVPR 2024](https://arxiv.org/abs/2403.02075)]
[[Project](https://diffmot.github.io/)]
[[Code](https://github.com/Kroery/DiffMOT)]

**Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features** \
[[Website](https://arxiv.org/abs/2406.02842)]
[[Project](https://diffcut-segmentation.github.io/)]
[[Code](https://github.com/PaulCouairon/DiffCut)]

**Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion** \
[[Website](https://arxiv.org/abs/2308.12469)]
[[Project](https://sites.google.com/view/diffseg/home)]
[[Code](https://github.com/PotatoTian/DiffSeg)]

**InstaGen: Enhancing Object Detection by Training on Synthetic Dataset** \
[[Website](https://arxiv.org/abs/2402.05937)]
[[Project](https://fcjian.github.io/InstaGen/)]
[[Code](https://github.com/fcjian/InstaGen)]

**InvSeg: Test-Time Prompt Inversion for Semantic Segmentation** \
[[Website](https://arxiv.org/abs/2410.11473)]
[[Project](https://jylin8100.github.io/InvSegProject/)]
[[Code](https://github.com/jyLin8100/InvSeg)]

**SMITE: Segment Me In TimE** \
[[Website](https://arxiv.org/abs/2410.18538)]
[[Project](https://segment-me-in-time.github.io/)]
[[Code](https://github.com/alimohammadiamirhossein/smite/)]

**Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.21708)]
[[Code](https://github.com/XiaRho/MADM)]

**Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model** \
[[ECCV 2024](https://arxiv.org/abs/2407.05352)]
[[Code](https://github.com/nini0919/DiffPNG)]

**ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model** \
[[Website](https://arxiv.org/abs/2408.15548)]
[[Code](https://github.com/Tankowa/ConsistencyTrack)]

**SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow** \
[[Website](https://arxiv.org/abs/2405.20282)]
[[Code](https://github.com/wang-chaoyang/SemFlow)]

**Delving into the Trajectory Long-tail Distribution for Muti-object Tracking** \
[[Website](https://arxiv.org/abs/2403.04700)]
[[Code](https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT)]

**Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.16947)]
[[Code](https://github.com/QianWangX/VidSeg_diffusion)]

**Scribble Hides Class: Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label** \
[[Website](https://arxiv.org/abs/2402.17555)]
[[Code](https://github.com/Zxl19990529/Class-driven-Scribble-Promotion-Network)]

**Personalize Segment Anything Model with One Shot** \
[[Website](https://arxiv.org/abs/2305.03048)]
[[Code](https://github.com/ZrrSkywalker/Personalize-SAM)]

**DiffusionTrack: Diffusion Model For Multi-Object Tracking** \
[[Website](https://arxiv.org/abs/2308.09905)]
[[Code](https://github.com/rainbowluocs/diffusiontrack)]

**MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation** \
[[Website](https://arxiv.org/abs/2309.13042)]
[[Code](https://github.com/Jiahao000/MosaicFusion)]

**A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting** \
[[Website](https://arxiv.org/abs/2401.10227)]
[[Code](https://github.com/segments-ai/latent-diffusion-segmentation)]

**Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation** \
[[Website](https://arxiv.org/abs/2309.05956)]
[[Code](https://github.com/gyhandy/Text2Image-for-Detection)]

**UniGS: Unified Representation for Image Generation and Segmentation** \
[[Website](https://arxiv.org/abs/2312.01985)]
[[Code](https://github.com/qqlu/Entity)]

**Placing Objects in Context via Inpainting for Out-of-distribution Segmentation** \
[[Website](https://arxiv.org/abs/2402.16392)]
[[Code](https://github.com/naver/poc)]

**MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation** \
[[Website](https://arxiv.org/abs/2403.11194)]
[[Code](https://github.com/Valkyrja3607/MaskDiffusion)]

**Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation** \
[[Website](https://arxiv.org/abs/2403.12042)]
[[Code](https://github.com/buxiangzhiren/VD-IT)]

**Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.14291)]
[[Code](https://github.com/vpulab/ovam)]

**No Annotations for Object Detection in Art through Stable Diffusion** \
[[Website](https://arxiv.org/abs/2412.06286)]
[[Code](https://github.com/patrick-john-ramos/nada)]

**EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models** \
[[ICLR 2024](https://openreview.net/forum?id=YqyTXmF8Y2)]
[[Website](https://arxiv.org/abs/2401.11739)]
[[Project](https://kmcode1.github.io/Projects/EmerDiff/)]

**Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation** \
[[CVPR 2024](https://arxiv.org/abs/2404.06542)]
[[Project](https://aimagelab.github.io/freeda/)]

**FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.20105)]
[[Project](https://bcorrad.github.io/freesegdiff/)]

**ReferEverything: Towards Segmenting Everything We Can Speak of in Videos** \
[[Website](https://arxiv.org/abs/2410.23287)]
[[Project](https://miccooper9.github.io/projects/ReferEverything/)]

**DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models** \
[[Website](https://arxiv.org/abs/2303.11681)]
[[Project](https://weijiawu.github.io/DiffusionMask/)]

**Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Peng_Diffusion-based_Image_Translation_with_Label_Guidance_for_Domain_Adaptive_Semantic_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2308.12350)]

**SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection** \
[[CVPR 2024](https://arxiv.org/abs/2402.17323)]

**Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers** \
[[ECCV 2024](https://arxiv.org/abs/2407.08394)]

**Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.02369)]

**Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation** \
[[WACV 2024](https://arxiv.org/abs/2312.01850)]

**Boosting Few-Shot Detection with Large Language Models and Layout-to-Image Synthesis** \
[[ACCV 2024](https://arxiv.org/abs/2410.06841)]

**A Simple Background Augmentation Method for Object Detection with Diffusion Model** \
[[Website](https://arxiv.org/abs/2408.00350)]

**Unveiling the Power of Diffusion Features For Personalized Segmentation and Retrieval** \
[[Website](https://arxiv.org/abs/2405.18025)]

**SLiMe: Segment Like Me** \
[[Website](https://arxiv.org/abs/2309.03179)]

**ASAM: Boosting Segment Anything Model with Adversarial Tuning** \
[[Website](https://arxiv.org/abs/2405.00256)]

**Diffusion Features to Bridge Domain Gap for Semantic Segmentation** \
[[Website](https://arxiv.org/abs/2406.00777)]

**MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation** \
[[Website](https://arxiv.org/abs/2303.05105)]

**DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery** \
[[Website](https://arxiv.org/abs/2303.09813)]

**Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models** \
[[Website](https://arxiv.org/abs/2308.16777)]

**Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter** \
[[Website](https://arxiv.org/abs/2309.02773)]

**Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion** \
[[Website](https://arxiv.org/abs/2309.01369v1)]

**From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2309.04109)]

**Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation** \
[[Website](https://arxiv.org/abs/2309.15726)]

**Patch-based Selection and Refinement for Early Object Detection** \
[[Website](https://arxiv.org/abs/2311.02274)]

**TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.00651)]

**Towards Granularity-adjusted Pixel-level Semantic Annotation** \
[[Website](https://arxiv.org/abs/2312.02420)]

**Gen2Det: Generate to Detect** \
[[Website](https://arxiv.org/abs/2312.04566)]

**Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors** \
[[Website](https://arxiv.org/abs/2401.16459)]

**ConsistencyDet: Robust Object Detector with Denoising Paradigm of Consistency Model** \
[[Website](https://arxiv.org/abs/2404.07773)]

**Diverse Generation while Maintaining Semantic Coordination: A Diffusion-Based Data Augmentation Method for Object Detection** \
[[Website](https://arxiv.org/abs/2408.02891)]

**Generative Edge Detection with Stable Diffusion** \
[[Website](https://arxiv.org/abs/2410.03080)]

**DINTR: Tracking via Diffusion-based Interpolation** \
[[Website](https://arxiv.org/abs/2410.10053)]

**Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking** \
[[Website](https://arxiv.org/abs/2410.09243)]

**DiffuMask-Editor: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability** \
[[Website](https://arxiv.org/abs/2411.01819)]

**Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation** \
[[Website](https://arxiv.org/abs/2411.10411)]

**Panoptic Diffusion Models: co-generation of images and segmentation maps** \
[[Website](https://arxiv.org/abs/2412.02929)]

## Additional conditions

⭐⭐⭐**Adding Conditional Control to Text-to-Image Diffusion Models** \
[[ICCV 2023 best paper](https://openaccess.thecvf.com/content/ICCV2023/html/Zhang_Adding_Conditional_Control_to_Text-to-Image_Diffusion_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2302.05543)]
[[Official Code](https://github.com/lllyasviel/controlnet)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/using-diffusers/controlnet)]
[[Diffusers Code](https://github.com/huggingface/diffusers/tree/main/examples/controlnet)]

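A minimal sketch of canny-edge conditioning with the Diffusers ControlNet pipeline linked above; checkpoints, thresholds, and the input image are illustrative:

```python
# Canny-conditioned generation with ControlNet via diffusers.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Build the conditioning image by edge-detecting an input photo.
source = np.array(load_image("input.png"))
edges = cv2.Canny(source, 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe("a futuristic city at night", image=control).images[0]
image.save("controlnet_canny.png")
```

⭐⭐**T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models** \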
[[Website](https://arxiv.org/abs/2302.08453)]
[[Official Code](https://github.com/TencentARC/T2I-Adapter)]
[[Diffusers Code](https://github.com/huggingface/diffusers/tree/main/examples/t2i_adapter)]

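A comparable sketch for T2I-Adapter through Diffusers (see the Diffusers code link above); the adapter checkpoint and the precomputed edge map are illustrative:

```python
# Sketch/edge conditioning with a T2I-Adapter via diffusers.
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

canny_map = load_image("canny_edges.png")  # precomputed edge map
image = pipe("a colorful tropical bird", image=canny_map).images[0]
image.save("t2i_adapter_canny.png")
```

**SketchKnitter: Vectorized Sketch Generation with Diffusion Models** \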
[[ICLR 2023 Spotlight](https://openreview.net/forum?id=4eJ43EN2g6l&noteId=fxpTz_vCdO)]
[[ICLR 2023 Spotlight](https://iclr.cc/virtual/2023/poster/11832)]
[[Website](https://openreview.net/pdf?id=4eJ43EN2g6l)]
[[Code](https://github.com/XDUWQ/SketchKnitter/tree/75ded224e91f5ecf7e225c031b32cb97508443b9)]

**Freestyle Layout-to-Image Synthesis** \
[[CVPR 2023 highlight](https://openaccess.thecvf.com/content/CVPR2023/html/Xue_Freestyle_Layout-to-Image_Synthesis_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.14412)]
[[Project](https://essunny310.github.io/FreestyleNet/)]
[[Code](https://github.com/essunny310/freestylenet)]

**Collaborative Diffusion for Multi-Modal Face Generation and Editing** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Huang_Collaborative_Diffusion_for_Multi-Modal_Face_Generation_and_Editing_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.10530)]
[[Project](https://ziqihuangg.github.io/projects/collaborative-diffusion.html)]
[[Code](https://github.com/ziqihuangg/Collaborative-Diffusion)]

**HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Ju_HumanSD_A_Native_Skeleton-Guided_Diffusion_Model_for_Human_Image_Generation_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.04269)]
[[Project](https://idea-research.github.io/HumanSD/)]
[[Code](https://github.com/IDEA-Research/HumanSD)]

**FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Yu_FreeDoM_Training-Free_Energy-Guided_Conditional_Diffusion_Model_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.09833)]
[[Code](https://github.com/vvictoryuki/freedom)]

**Sketch-Guided Text-to-Image Diffusion Models** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2211.13752)]
[[Project](https://sketch-guided-diffusion.github.io/)]
[[Code](https://github.com/Mikubill/sketch2img)]

**Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive** \
[[ICLR 2024](https://arxiv.org/abs/2401.08815)]
[[Project](https://yumengli007.github.io/ALDM/)]
[[Code](https://github.com/boschresearch/ALDM)]

**IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts** \
[[Website](https://arxiv.org/abs/2408.03209)]
[[Project](https://unity-research.github.io/IP-Adapter-Instruct.github.io/)]
[[Code](https://github.com/unity-research/IP-Adapter-Instruct)]

**ControlNeXt: Powerful and Efficient Control for Image and Video Generation** \
[[Website](https://arxiv.org/abs/2408.06070)]
[[Project](https://pbihao.github.io/projects/controlnext/index.html)]
[[Code](https://github.com/dvlab-research/ControlNeXt)]

**Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance** \
[[Website](https://arxiv.org/abs/2406.07540)]
[[Project](https://genforce.github.io/ctrl-x/)]
[[Code](https://github.com/genforce/ctrl-x)]

**Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model** \
[[Website](https://arxiv.org/abs/2404.09967)]
[[Project](https://ctrl-adapter.github.io/)]
[[Code](https://github.com/HL-hanlin/Ctrl-Adapter)]

**IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2308.06721)]
[[Project](https://ip-adapter.github.io/)]
[[Code](https://github.com/tencent-ailab/ip-adapter)]

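A minimal sketch of image prompting with the IP-Adapter integration in Diffusers; the weight name, scale, and reference image are illustrative:

```python
# Image prompting with IP-Adapter via diffusers.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # balance image prompt vs. text prompt

reference = load_image("style_reference.png")
image = pipe(
    "a cat sitting on a sofa",
    ip_adapter_image=reference,  # the reference image acts as an extra prompt
).images[0]
image.save("ip_adapter_result.png")
```

**Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis** \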
[[Website](https://arxiv.org/abs/2412.03150)]
[[Project](https://cvlab-kaist.github.io/AM-Adapter/)]
[[Code](https://github.com/cvlab-kaist/AM-Adapter)]

**DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2412.03255)]
[[Project](https://hithqd.github.io/projects/Dynamiccontrol/)]
[[Code](https://github.com/hithqd/DynamicControl)]

**A Simple Approach to Unifying Diffusion-based Conditional Generation** \
[[Website](https://arxiv.org/abs/2410.11439)]
[[Project](https://lixirui142.github.io/unicon-diffusion/)]
[[Code](https://github.com/lixirui142/UniCon)]

**HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion** \
[[Website](https://arxiv.org/abs/2310.08579)]
[[Project](https://snap-research.github.io/HyperHuman/)]
[[Code](https://github.com/snap-research/HyperHuman)]

**Late-Constraint Diffusion Guidance for Controllable Image Synthesis** \
[[Website](https://arxiv.org/abs/2305.11520)]
[[Project](https://alonzoleeeooo.github.io/LCDG/)]
[[Code](https://github.com/AlonzoLeeeooo/LCDG)]

**Composer: Creative and controllable image synthesis with composable conditions** \
[[Website](https://arxiv.org/abs/2302.09778)]
[[Project](https://damo-vilab.github.io/composer-page/)]
[[Code](https://github.com/damo-vilab/composer)]

**DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.15194)]
[[Project](https://sungnyun.github.io/diffblender/)]
[[Code](https://github.com/sungnyun/diffblender)]

**Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation** \
[[Website](https://arxiv.org/abs/2306.00964)]
[[Project](https://mhh0318.github.io/cocktail/)]
[[Code](https://github.com/mhh0318/Cocktail)]

**UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild** \
[[Website](https://arxiv.org/abs/2305.11147)]
[[Project](https://canqin001.github.io/UniControl-Page/)]
[[Code](https://github.com/salesforce/UniControl)]

**Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.16322)]
[[Project](https://shihaozhaozsh.github.io/unicontrolnet/)]
[[Code](https://github.com/ShihaoZhaoZSH/Uni-ControlNet)]

**LooseControl: Lifting ControlNet for Generalized Depth Conditioning** \
[[Website](https://arxiv.org/abs/2312.03079)]
[[Project](https://shariqfarooq123.github.io/loose-control/)]
[[Code](https://github.com/shariqfarooq123/LooseControl)]

**X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model** \
[[Website](https://arxiv.org/abs/2312.02238)]
[[Project](https://showlab.github.io/X-Adapter/)]
[[Code](https://github.com/showlab/X-Adapter)]

**ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.06573)]
[[Project](https://vislearn.github.io/ControlNet-XS/)]
[[Code](https://github.com/vislearn/ControlNet-XS)]

**ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet** \
[[Website](https://arxiv.org/abs/2312.03154)]
[[Project](https://soon-yau.github.io/visconet/)]
[[Code](https://github.com/soon-yau/visconet)]

**SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior** \
[[Website](https://arxiv.org/abs/2403.09638)]
[[Project](https://air-discover.github.io/SCP-Diff/)]
[[Code](https://github.com/AIR-DISCOVER/SCP-Diff-Toolkit)]

**Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis** \
[[ICLR 2024](https://arxiv.org/abs/2401.09048)]
[[Code](https://github.com/tomtom1103/compose-and-conquer/)]

**It's All About Your Sketch: Democratising Sketch Control in Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2403.07234)]
[[Code](https://github.com/subhadeepkoley/DemoSketch2RGB)]

**CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation** \
[[Website](https://arxiv.org/abs/2410.09400)]
[[Code](https://github.com/xyfJASON/ctrlora)]

**Universal Guidance for Diffusion Models** \
[[Website](https://arxiv.org/abs/2302.07121)]
[[Code](https://github.com/arpitbansal297/Universal-Guided-Diffusion)]**Meta ControlNet: Enhancing Task Adaptation via Meta Learning** \
[[Website](https://arxiv.org/abs/2312.01255)]
[[Code](https://github.com/JunjieYang97/Meta-ControlNet)]**Local Conditional Controlling for Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.08768)]
[[Code](https://github.com/YibooZhao/Local-Control)]**KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.01595)]
[[Code](https://github.com/aminK8/KnobGen)]**Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC** \
[[Website](https://arxiv.org/abs/2412.05619)]
[[Code](https://github.com/tobran/ONE-PIC)]**OminiControl: Minimal and Universal Control for Diffusion Transformer** \
[[Website](https://arxiv.org/abs/2411.15098)]
[[Code](https://github.com/Yuanshi9815/OminiControl)]**Modulating Pretrained Diffusion Models for Multimodal Image Synthesis** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2302.12764)]
[[Project](https://mcm-diffusion.github.io/)]**SpaText: Spatio-Textual Representation for Controllable Image Generation**\
[[CVPR 2023](https://arxiv.org/abs/2211.14305)]
[[Project](https://omriavrahami.com/spatext/)]**CCM: Adding Conditional Controls to Text-to-Image Consistency Models** \
[[ICML 2024](https://arxiv.org/abs/2312.06971)]
[[Project](https://swiftforce.github.io/CCM/)]**Dreamguider: Improved Training free Diffusion-based Conditional Generation** \
[[Website](https://arxiv.org/abs/2406.02549)]
[[Project](https://nithin-gk.github.io/dreamguider.github.io/)]**ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback** \
[[Website](https://arxiv.org/abs/2404.07987)]
[[Project](https://liming-ai.github.io/ControlNet_Plus_Plus/)]**AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2406.18958)]
[[Project](https://any-control.github.io/)]**BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion** \
[[Website](https://arxiv.org/abs/2404.04544)]
[[Project](https://janeyeon.github.io/beyond-scene/)]**FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection** \
[[Website](https://arxiv.org/abs/2312.09252)]
[[Project](https://samsunglabs.github.io/FineControlNet-project-page/)]**Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor** \
[[Website](https://arxiv.org/abs/2305.20082)]
[[Project](https://control4darxiv.github.io/)]**SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing** \
[[Website](https://arxiv.org/abs/2312.11392)]
[[Project](https://scedit.github.io/)]**CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models** \
[[Website](https://arxiv.org/abs/2405.07913)]
[[Project](https://compvis.github.io/LoRAdapter/)]**Sketch-Guided Scene Image Generation** \
[[Website](https://arxiv.org/abs/2407.06469)]**SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation** \
[[Website](https://arxiv.org/abs/2308.10156)]**Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation** \
[[Website](https://arxiv.org/abs/2306.00914)]**Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt** \
[[Website](https://arxiv.org/abs/2306.04607)]**Adding 3D Geometry Control to Diffusion Models** \
[[Website](https://arxiv.org/abs/2306.08103)]**LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation** \
[[Website](https://arxiv.org/abs/2302.08908)]**JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling** \
[[Website](https://arxiv.org/abs/2310.06347)]**Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons** \
[[Website](https://arxiv.org/abs/2401.13363)]**Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt** \
[[Website](https://arxiv.org/abs/2404.05331)]**FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2405.04834)]**Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2406.02485)]**Label-free Neural Semantic Image Synthesis** \
[[Website](https://arxiv.org/abs/2407.01790)]
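
For orientation: most entries above follow the ControlNet recipe of injecting an extra spatial condition (edges, pose, depth, ...) into a frozen text-to-image backbone. A minimal sketch with the diffusers library, assuming the illustrative `lllyasviel/sd-controlnet-canny` and `runwayml/stable-diffusion-v1-5` checkpoints and a placeholder edge map:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach a Canny-edge ControlNet to a frozen SD 1.5 backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edge_map = load_image("edges.png")  # placeholder: precomputed Canny edge map
image = pipe("a futuristic city at dusk", image=edge_map).images[0]
image.save("controlled.png")
```

## Few-Shot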
**Discriminative Diffusion Models as Few-shot Vision and Language Learners** \
[[Website](https://arxiv.org/abs/2305.10722)]
[[Code](https://github.com/eric-ai-lab/dsd)]**Few-Shot Diffusion Models** \
[[Website](https://arxiv.org/abs/2205.15463)]
[[Code](https://github.com/georgosgeorgos/few-shot-diffusion-models)]**Few-shot Semantic Image Synthesis with Class Affinity Transfer** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Careil_Few-Shot_Semantic_Image_Synthesis_With_Class_Affinity_Transfer_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2304.02321)]**DiffAlign: Few-shot learning using diffusion based synthesis and alignment** \
[[Website](https://arxiv.org/abs/2212.05404)]**Few-shot Image Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2211.03264)]**Lafite2: Few-shot Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2210.14124)]**Few-Shot Task Learning through Inverse Generative Modeling** \
[[Website](https://arxiv.org/abs/2411.04987)]

## SD-inpaint
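
Most papers below build on or compare against the standard Stable Diffusion inpainting setup: denoise only inside a user-supplied mask while keeping the rest of the image fixed. A minimal diffusers sketch for reference, with an illustrative checkpoint and placeholder file names:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("scene.png")  # placeholder: image to edit
mask_image = load_image("mask.png")   # placeholder: white = region to repaint
result = pipe(
    prompt="a red sports car parked on the street",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted.png")
```
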
**Paint by Example: Exemplar-based Image Editing with Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Yang_Paint_by_Example_Exemplar-Based_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2211.13227)]
[[Code](https://github.com/Fantasy-Studio/Paint-by-Example)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/api/pipelines/paint_by_example)]
[[Diffusers Code](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/paint_by_example/pipeline_paint_by_example.py)]
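
Paint by Example ships as an official diffusers pipeline (linked above); a short usage sketch, where the three input images are placeholders you supply:

```python
import torch
from diffusers import PaintByExamplePipeline
from diffusers.utils import load_image

# Exemplar-guided inpainting: a reference image, not a text prompt,
# specifies what to paint into the masked region.
pipe = PaintByExamplePipeline.from_pretrained(
    "Fantasy-Studio/Paint-by-Example", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("scene.png")         # placeholder: image to edit
mask_image = load_image("mask.png")          # placeholder: region to replace
example_image = load_image("reference.png")  # placeholder: exemplar object

result = pipe(
    image=init_image, mask_image=mask_image, example_image=example_image
).images[0]
result.save("composited.png")
```

**GLIDE: Towards photorealistic image generation and editing with text-guided diffusion model** \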
[[ICML 2022 Spotlight](https://icml.cc/virtual/2022/spotlight/16340)]
[[Website](https://arxiv.org/abs/2112.10741)]
[[Code](https://github.com/openai/glide-text2im)]**Blended Diffusion for Text-driven Editing of Natural Images** \
[[CVPR 2022](https://openaccess.thecvf.com/content/CVPR2022/html/Avrahami_Blended_Diffusion_for_Text-Driven_Editing_of_Natural_Images_CVPR_2022_paper.html)]
[[Website](https://arxiv.org/abs/2111.14818)]
[[Project](https://omriavrahami.com/blended-diffusion-page/)]
[[Code](https://github.com/omriav/blended-diffusion)]**Blended Latent Diffusion** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2206.02779)]
[[Project](https://omriavrahami.com/blended-latent-diffusion-page/)]
[[Code](https://github.com/omriav/blended-latent-diffusion)]**TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Lu_TF-ICON_Diffusion-Based_Training-Free_Cross-Domain_Image_Composition_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2307.12493)]
[[Project](https://shilin-lu.github.io/tf-icon.github.io/)]
[[Code](https://github.com/Shilin-LU/TF-ICON)]**Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Wang_Imagen_Editor_and_EditBench_Advancing_and_Evaluating_Text-Guided_Image_Inpainting_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2212.06909)]
[[Code](https://github.com/fenglinglwb/PSM)]**Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models** \
[[ICML 2023](https://icml.cc/virtual/2023/poster/24127)]
[[Website](https://arxiv.org/abs/2304.03322)]
[[Code](https://github.com/ucsb-nlp-chang/copaint)]**Coherent and Multi-modality Image Inpainting via Latent Space Optimization** \
[[Website](https://arxiv.org/abs/2407.08019)]
[[Project](https://pilot-page.github.io/)]
[[Code](https://github.com/Lingzhi-Pan/PILOT)]**Inst-Inpaint: Instructing to Remove Objects with Diffusion Models** \
[[Website](https://arxiv.org/abs/2304.03246)]
[[Project](http://instinpaint.abyildirim.com/)]
[[Code](https://github.com/abyildirim/inst-inpaint)]
[[Demo](https://huggingface.co/spaces/abyildirim/inst-inpaint)]**Anywhere: A Multi-Agent Framework for Reliable and Diverse Foreground-Conditioned Image Inpainting** \
[[Website](https://arxiv.org/abs/2404.18598)]
[[Project](https://anywheremultiagent.github.io/)]
[[Code](https://github.com/Sealical/anywhere-multi-agent)]**CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.09368)]
[[Project](https://yigitekin.github.io/CLIPAway/)]
[[Code](https://github.com/YigitEkin/CLIPAway)]**AnyDoor: Zero-shot Object-level Image Customization** \
[[Website](https://arxiv.org/abs/2307.09481)]
[[Project](https://damo-vilab.github.io/AnyDoor-Page/)]
[[Code](https://github.com/damo-vilab/AnyDoor)]**A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting** \
[[Website](https://arxiv.org/abs/2312.03594)]
[[Project](https://powerpaint.github.io/)]
[[Code](https://github.com/open-mmlab/mmagic/tree/main/projects/powerpaint)]**Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation** \
[[Website](https://arxiv.org/abs/2409.01055)]
[[Project](https://follow-your-canvas.github.io/)]
[[Code](https://github.com/mayuelala/FollowYourCanvas)]**Towards Language-Driven Video Inpainting via Multimodal Large Language Models** \
[[Website](https://arxiv.org/abs/2401.10226)]
[[Project](https://jianzongwu.github.io/projects/rovi/)]
[[Code](https://github.com/jianzongwu/Language-Driven-Video-Inpainting)]**Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections** \
[[Website](https://arxiv.org/abs/2409.14677)]
[[Project](https://val.cds.iisc.ac.in/reflecting-reality.github.io/)]
[[Code](https://github.com/val-iisc/Reflecting-Reality)]**Improving Text-guided Object Inpainting with Semantic Pre-inpainting**\
[[ECCV 2024](https://arxiv.org/abs/2409.08260)]
[[Code](https://github.com/Nnn-s/CATdiffusion)]**FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior** \
[[ECCV 2024](https://arxiv.org/abs/2407.04947)]
[[Code](https://github.com/aim-uofa/FreeCompose)]**360-Degree Panorama Generation from Few Unregistered NFoV Images** \
[[ACM MM 2023](https://arxiv.org/abs/2308.14686)]
[[Code](https://github.com/shanemankiw/Panodiff)]**Delving Globally into Texture and Structure for Image Inpainting**\
[[ACM MM 2022](https://arxiv.org/abs/2209.08217)]
[[Code](https://github.com/htyjers/DGTS-Inpainting)]**ControlEdit: A MultiModal Local Clothing Image Editing Method** \
[[Website](https://arxiv.org/abs/2409.14720)]
[[Code](https://github.com/cd123-cd/ControlEdit)]**DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting** \
[[Website](https://arxiv.org/abs/2411.17223)]
[[Code](https://github.com/mycfhs/DreamMix)]**Training-and-prompt-free General Painterly Harmonization Using Image-wise Attention Sharing** \
[[Website](https://arxiv.org/abs/2404.12900)]
[[Code](https://github.com/BlueDyee/TF-GPH)]**What to Preserve and What to Transfer: Faithful, Identity-Preserving Diffusion-based Hairstyle Transfer** \
[[Website](https://arxiv.org/abs/2408.16450)]
[[Code](https://github.com/cychungg/HairFusion)]**Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model** \
[[Website](https://arxiv.org/abs/2407.16982)]
[[Code](https://github.com/OpenGVLab/Diffree)]**Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting** \
[[Website](https://arxiv.org/abs/2403.19898)]
[[Code](https://github.com/htyjers/StrDiffusion)]**Reference-based Image Composition with Sketch via Structure-aware Diffusion Model** \
[[Website](https://arxiv.org/abs/2304.09748)]
[[Code](https://github.com/kangyeolk/Paint-by-Sketch)]**Image Inpainting via Iteratively Decoupled Probabilistic Modeling** \
[[Website](https://arxiv.org/abs/2212.02963)]
[[Code](https://github.com/fenglinglwb/PSM)]**ControlCom: Controllable Image Composition using Diffusion Model** \
[[Website](https://arxiv.org/abs/2308.10040)]
[[Code](https://github.com/bcmi/ControlCom-Image-Composition)]**Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model** \
[[Website](https://arxiv.org/abs/2310.07222)]
[[Code](https://github.com/ysy31415/unipaint)]**MagicRemover: Tuning-free Text-guided Image Inpainting with Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.02848)]
[[Code](https://github.com/exisas/Magicremover)]**HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.14091)]
[[Code](https://github.com/Picsart-AI-Research/HD-Painter)]**BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion** \
[[Website](https://arxiv.org/abs/2403.06976)]
[[Code](https://github.com/TencentARC/BrushNet)]**Sketch-guided Image Inpainting with Partial Discrete Diffusion Process** \
[[Website](https://arxiv.org/abs/2404.11949)]
[[Code](https://github.com/vl2g/Sketch-Inpainting)]**ReMOVE: A Reference-free Metric for Object Erasure** \
[[Website](https://arxiv.org/abs/2409.00707)]
[[Code](https://github.com/chandrasekaraditya/ReMOVE)]**Modification Takes Courage: Seamless Image Stitching via Reference-Driven Inpainting** \
[[Website](https://arxiv.org/abs/2411.10309)]
[[Code](https://github.com/yayoyo66/RDIStitcher)]**MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior** \
[[Website](https://arxiv.org/abs/2409.10090)]
[[Code](https://github.com/weijing-tao/MotionCom)]**AddMe: Zero-shot Group-photo Synthesis by Inserting People into Scenes** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03028.pdf)]
[[Project](https://addme-awesome.github.io/page/)]**Text2Place: Affordance-aware Text Guided Human Placement** \
[[ECCV 2024](https://arxiv.org/abs/2407.15446)]
[[Project](https://rishubhpar.github.io/Text2Place/)]**IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation** \
[[CVPR 2024](https://arxiv.org/abs/2403.10701)]
[[Project](https://song630.github.io/IMPRINT-Project-Page/)]**Matting by Generation** \
[[SIGGRAPH 2024](https://arxiv.org/abs/2407.21017)]
[[Project](https://lightchaserx.github.io/matting-by-generation/)]**PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.21966)]
[[Project](https://prefpaint.github.io/)]**Taming Latent Diffusion Model for Neural Radiance Field Inpainting** \
[[Website](https://arxiv.org/abs/2404.09995)]
[[Project](https://hubert0527.github.io/MALD-NeRF/)]**SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control** \
[[Website](https://arxiv.org/abs/2312.05039)]
[[Project](https://smartmask-gen.github.io/)]**Towards Stable and Faithful Inpainting** \
[[Website](https://arxiv.org/abs/2312.04831)]
[[Project](https://yikai-wang.github.io/asuka/)]**Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos** \
[[Website](https://arxiv.org/abs/2403.13044)]
[[Project](https://magic-fixup.github.io/)]**ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion** \
[[Website](https://arxiv.org/abs/2403.18818)]
[[Project](https://objectdrop.github.io/)]**TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization** \
[[ACM MM 2024](https://arxiv.org/abs/2408.03637)]**Semantically Consistent Video Inpainting with Conditional Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.00251)]**Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention**\
[[Website](https://arxiv.org/abs/2312.03556)]**Outline-Guided Object Inpainting with Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.16421)]**SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model** \
[[Website](https://arxiv.org/abs/2212.05034)]**Gradpaint: Gradient-Guided Inpainting with Diffusion Models** \
[[Website](https://arxiv.org/abs/2309.09614)]**Infusion: Internal Diffusion for Video Inpainting** \
[[Website](https://arxiv.org/abs/2311.01090)]**Rethinking Referring Object Removal** \
[[Website](https://arxiv.org/abs/2403.09128)]**Tuning-Free Image Customization with Image and Text Guidance** \
[[Website](https://arxiv.org/abs/2403.12658)]**VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model** \
[[Website](https://arxiv.org/abs/2406.01059)]**FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image** \
[[Website](https://arxiv.org/abs/2406.07865)]**InsertDiffusion: Identity Preserving Visualization of Objects through a Training-Free Diffusion Architecture** \
[[Website](https://arxiv.org/abs/2407.10592)]**Thinking Outside the BBox: Unconstrained Generative Object Compositing** \
[[Website](https://arxiv.org/abs/2409.04559)]**Content-aware Tile Generation using Exterior Boundary Inpainting** \
[[Website](https://arxiv.org/abs/2409.14184)]**AnyLogo: Symbiotic Subject-Driven Diffusion System with Gemini Status** \
[[Website](https://arxiv.org/abs/2409.17740)]**TD-Paint: Faster Diffusion Inpainting Through Time Aware Pixel Conditioning** \
[[Website](https://arxiv.org/abs/2410.09306)]**MagicEraser: Erasing Any Objects via Semantics-Aware Control** \
[[Website](https://arxiv.org/abs/2410.10207)]**I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting** \
[[Website](https://arxiv.org/abs/2411.19050)]**VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference** \
[[Website](https://arxiv.org/abs/2411.18929)]**FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting** \
[[Website](https://arxiv.org/abs/2412.00427)]**PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control** \
[[Website](https://arxiv.org/abs/2412.01223)]**Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment** \
[[Website](https://arxiv.org/abs/2412.00306)]**Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion** \
[[Website](https://arxiv.org/abs/2412.00857)]**Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting** \
[[Website](https://arxiv.org/abs/2412.03812)]**AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image Inpainting** \
[[Website](https://arxiv.org/abs/2412.08149)]**RAD: Region-Aware Diffusion Models for Image Inpainting** \
[[Website](https://arxiv.org/abs/2412.09191)]

## Layout Generation
**LayoutDM: Discrete Diffusion Model for Controllable Layout Generation** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Inoue_LayoutDM_Discrete_Diffusion_Model_for_Controllable_Layout_Generation_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.08137)]
[[Project](https://cyberagentailab.github.io/layout-dm/)]
[[Code](https://github.com/CyberAgentAILab/layout-dm)]**Desigen: A Pipeline for Controllable Design Template Generation** \
[[CVPR 2024](https://arxiv.org/abs/2403.09093)]
[[Project](https://whaohan.github.io/desigen/)]
[[Code](https://github.com/whaohan/desigen)]**DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Levi_DLT_Conditioned_layout_generation_with_Joint_Discrete-Continuous_Diffusion_Layout_Transformer_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.03755)]
[[Code](https://github.com/wix-incubator/DLT)]**LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Zhang_LayoutDiffusion_Improving_Graphic_Layout_Generation_by_Discrete_Diffusion_Probabilistic_Models_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.11589)]
[[Code](https://github.com/microsoft/LayoutGeneration/tree/main/LayoutDiffusion)]**DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation** \
[[Website](https://arxiv.org/abs/2412.00381)]
[[Code](https://github.com/deadsmither5/DogLayout)]**LayoutDM: Transformer-based Diffusion Model for Layout Generation** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Chai_LayoutDM_Transformer-Based_Diffusion_Model_for_Layout_Generation_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2305.02567)]**Unifying Layout Generation with a Decoupled Diffusion Model** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Hui_Unifying_Layout_Generation_With_a_Decoupled_Diffusion_Model_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.05049)]**PLay: Parametrically Conditioned Layout Generation using Latent Diffusion** \
[[ICML 2023](https://dl.acm.org/doi/10.5555/3618408.3618624)]
[[Website](https://arxiv.org/abs/2301.11529)]**Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints** \
[[ICLR 2024](https://arxiv.org/abs/2402.04754)]**SLayR: Scene Layout Generation with Rectified Flow** \
[[Website](https://arxiv.org/abs/2412.05003)]**CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model** \
[[Website](https://arxiv.org/abs/2407.15233)]**Diffusion-based Document Layout Generation** \
[[Website](https://arxiv.org/abs/2303.10787)]**Dolfin: Diffusion Layout Transformers without Autoencoder** \
[[Website](https://arxiv.org/abs/2310.16305)]**LayoutFlow: Flow Matching for Layout Generation** \
[[Website](https://arxiv.org/abs/2403.18187)]**Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model** \
[[Website](https://arxiv.org/abs/2409.16689)]

## Text Generation
⭐⭐**TextDiffuser: Diffusion Models as Text Painters** \
[[NeurIPS 2023](https://neurips.cc/virtual/2023/poster/70636)]
[[Website](https://arxiv.org/abs/2305.10855)]
[[Project](https://jingyechen.github.io/textdiffuser/)]
[[Code](https://github.com/microsoft/unilm/tree/master/textdiffuser)]⭐⭐**TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering** \
[[ECCV 2024 Oral](https://arxiv.org/abs/2311.16465)]
[[Project](https://jingyechen.github.io/textdiffuser2/)]
[[Code](https://github.com/microsoft/unilm/tree/master/textdiffuser-2)]**GlyphControl: Glyph Conditional Control for Visual Text Generation** \
[[NeurIPS 2023](https://neurips.cc/virtual/2023/poster/70191)]
[[Website](https://arxiv.org/abs/2305.18259)]
[[Code](https://github.com/AIGText/GlyphControl-release)]**DiffUTE: Universal Text Editing Diffusion Model** \
[[NeurIPS 2023](https://neurips.cc/virtual/2023/poster/71364)]
[[Website](https://arxiv.org/abs/2305.10825)]
[[Code](https://github.com/chenhaoxing/DiffUTE)]**Word-As-Image for Semantic Typography** \
[[SIGGRAPH 2023](https://arxiv.org/abs/2303.01818)]
[[Project](https://wordasimage.github.io/Word-As-Image-Page/)]
[[Code](https://github.com/Shiriluz/Word-As-Image)]**Kinetic Typography Diffusion Model** \
[[ECCV 2024](https://arxiv.org/abs/2407.10476)]
[[Project](https://seonmip.github.io/kinety/)]
[[Code](https://github.com/SeonmiP/KineTy)]**Dynamic Typography: Bringing Text to Life via Video Diffusion Prior** \
[[Website](https://arxiv.org/abs/2404.11614)]
[[Project](https://animate-your-word.github.io/demo/)]
[[Code](https://github.com/zliucz/animate-your-word)]**JoyType: A Robust Design for Multilingual Visual Text Creation** \
[[Website](https://arxiv.org/abs/2409.17524)]
[[Project](https://jdh-algo.github.io/JoyType/)]
[[Code](https://github.com/jdh-algo/JoyType)]**UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.04884)]
[[Project](https://udifftext.github.io/)]
[[Code](https://github.com/ZYM-PKU/UDiffText)]**One-Shot Diffusion Mimicker for Handwritten Text Generation** \
[[ECCV 2024](https://arxiv.org/abs/2409.04004)]
[[Code](https://github.com/dailenson/One-DM)]**DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/02357.pdf)]
[[Code](https://github.com/shreygithub/DCDM)]**HFH-Font: Few-shot Chinese Font Synthesis with Higher Quality, Faster Speed, and Higher Resolution** \
[[SIGGRAPH Asia 2024](https://arxiv.org/abs/2410.06488)]
[[Code](https://github.com/grovessss/HFH-Font)]**Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model** \
[[AAAI 2024](https://arxiv.org/abs/2312.12232)]
[[Code](https://github.com/ecnuljzhang/brush-your-text)]**FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning** \
[[AAAI 2024](https://arxiv.org/abs/2312.12142)]
[[Code](https://github.com/yeungchenwa/FontDiffuser)]**Text Image Inpainting via Global Structure-Guided Diffusion Models** \
[[AAAI 2024](https://arxiv.org/abs/2401.14832)]
[[Code](https://github.com/blackprotoss/GSDM)]**Ambigram generation by a diffusion model** \
[[ICDAR 2023](https://arxiv.org/abs/2306.12049)]
[[Code](https://github.com/univ-esuty/ambifusion)]**Scene Text Image Super-resolution based on Text-conditional Diffusion Models** \
[[WACV 2024](https://arxiv.org/abs/2311.09759)]
[[Code](https://github.com/toyotainfotech/stisr-tcdm)]**Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling** \
[[ECCV 2024](https://arxiv.org/abs/2409.13431)]
[[Code](https://github.com/wzx99/TMIM)]**First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending** \
[[ECAI 2024](https://arxiv.org/abs/2410.10168)]
[[Code](https://github.com/Zhenhang-Li/GlyphOnly)]**VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.01738)]
[[Code](https://github.com/Carlofkl/VitaGlyph)]**Visual Text Generation in the Wild** \
[[Website](https://arxiv.org/abs/2407.14138)]
[[Code](https://github.com/alibabaresearch/advancedliteratemachinery)]**Deciphering Oracle Bone Language with Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.00684)]
[[Code](https://github.com/guanhaisu/OBSD)]**High Fidelity Scene Text Synthesis** \
[[Website](https://arxiv.org/abs/2405.14701)]
[[Code](https://github.com/CodeGoat24/DreamText)]**TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition** \
[[Website](https://arxiv.org/abs/2412.01137)]
[[Code](https://github.com/YesianRohn/TextSSR)]**AnyText: Multilingual Visual Text Generation And Editing** \
[[Website](https://arxiv.org/abs/2311.03054)]
[[Code](https://github.com/tyxsspa/AnyText)]**AnyText2: Visual Text Generation and Editing With Customizable Attributes** \
[[Website](https://arxiv.org/abs/2411.15245)]
[[Code](https://github.com/tyxsspa/AnyText2)]**Few-shot Calligraphy Style Learning** \
[[Website](https://arxiv.org/abs/2404.17199)]
[[Code](https://github.com/kono-dada/xysffusion)]**GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models** \
[[Website](https://arxiv.org/abs/2407.02252)]
[[Code](https://github.com/OPPO-Mente-Lab/GlyphDraw2)]**DiffusionPen: Towards Controlling the Style of Handwritten Text Generation** \
[[Website](https://arxiv.org/abs/2409.06065)]
[[Code](https://github.com/koninik/DiffusionPen)]**AmbiGen: Generating Ambigrams from Pre-trained Diffusion Model** \
[[Website](https://arxiv.org/abs/2312.02967)]
[[Project](https://raymond-yeh.com/AmbiGen/)]**UniVG: Towards UNIfied-modal Video Generation** \
[[Website](https://arxiv.org/abs/2401.09084)]
[[Project](https://univg-baidu.github.io/)]**FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation** \
[[Website](https://arxiv.org/abs/2406.08392)]
[[Project](https://font-studio.github.io/)]**DECDM: Document Enhancement using Cycle-Consistent Diffusion Models** \
[[WACV 2024](https://arxiv.org/abs/2311.09625)]**SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.01062)]**AnyTrans: Translate AnyText in the Image with Large Scale Models** \
[[Website](https://arxiv.org/abs/2406.11432)]**ARTIST: Improving the Generation of Text-rich Images by Disentanglement** \
[[Website](https://arxiv.org/abs/2406.12044)]**Improving Text Generation on Images with Synthetic Captions** \
[[Website](https://arxiv.org/abs/2406.00505)]**CustomText: Customized Textual Image Generation using Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.12531)]**VecFusion: Vector Font Generation with Diffusion** \
[[Website](https://arxiv.org/abs/2312.10540)]**Typographic Text Generation with Off-the-Shelf Diffusion Model** \
[[Website](https://arxiv.org/abs/2402.14314)]**Font Style Interpolation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.14311)]**Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation** \
[[Website](https://arxiv.org/abs/2403.16422)]**DiffCJK: Conditional Diffusion Model for High-Quality and Wide-coverage CJK Character Generation** \
[[Website](https://arxiv.org/abs/2404.05212)]**CLII: Visual-Text Inpainting via Cross-Modal Predictive Interaction** \
[[Website](https://arxiv.org/abs/2407.16204)]**Zero-Shot Paragraph-level Handwriting Imitation with Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2409.00786)]**Text Image Generation for Low-Resource Languages with Dual Translation Learning** \
[[Website](https://arxiv.org/abs/2409.17747)]**Decoupling Layout from Glyph in Online Chinese Handwriting Generation** \
[[Website](https://arxiv.org/abs/2410.02309)]**Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training** \
[[Website](https://arxiv.org/abs/2410.04439)]**TextMaster: Universal Controllable Text Edit** \
[[Website](https://arxiv.org/abs/2410.09879)]**Towards Visual Text Design Transfer Across Languages** \
[[Website](https://arxiv.org/abs/2410.18823)]**DiffSTR: Controlled Diffusion Models for Scene Text Removal** \
[[Website](https://arxiv.org/abs/2410.21721)]**TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images** \
[[Website](https://arxiv.org/abs/2411.00355)]**TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models** \
[[Website](https://arxiv.org/abs/2411.02437)]**Conditional Text-to-Image Generation with Reference Guidance** \
[[Website](https://arxiv.org/abs/2411.16713)]**Type-R: Automatically Retouching Typos for Text-to-Image Generation** \
[[Website](https://arxiv.org/abs/2411.18159)]**AMO Sampler: Enhancing Text Rendering with Overshooting** \
[[Website](https://arxiv.org/abs/2411.19415)]**FonTS: Text Rendering with Typography and Style Controls** \
[[Website](https://arxiv.org/abs/2412.00136)]

## Super Resolution
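
A common baseline for the diffusion-based methods below is the off-the-shelf Stable Diffusion x4 upscaler in diffusers; a minimal sketch with placeholder file names:

```python
import torch
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = load_image("cat_128.png")  # placeholder low-resolution input
upscaled = pipe(prompt="a photo of a white cat", image=low_res).images[0]
upscaled.save("cat_512.png")  # 4x upscaled output
```
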
**ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting** \
[[NeurIPS 2023 spotlight](https://nips.cc/virtual/2023/poster/71244)]
[[Website](https://arxiv.org/abs/2307.12348)]
[[Project](https://zsyoaoa.github.io/projects/resshift/)]
[[Code](https://github.com/zsyoaoa/resshift)]**Image Super-Resolution via Iterative Refinement** \
[[TPAMI](https://ieeexplore.ieee.org/document/9887996)]
[[Website](https://arxiv.org/abs/2104.07636)]
[[Project](https://iterative-refinement.github.io/)]
[[Code](https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement)]**DiffIR: Efficient Diffusion Model for Image Restoration**\
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/papers/Xia_DiffIR_Efficient_Diffusion_Model_for_Image_Restoration_ICCV_2023_paper.pdf)]
[[Website](https://arxiv.org/abs/2303.09472)]
[[Code](https://github.com/Zj-BinXia/DiffIR)]**Kalman-Inspired Feature Propagation for Video Face Super-Resolution** \
[[ECCV 2024](https://arxiv.org/abs/2408.05205)]
[[Project](https://jnjaby.github.io/projects/KEEP/)]
[[Code](https://github.com/jnjaby/KEEP)]**HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior** \
[[Website](https://arxiv.org/abs/2411.18662)]
[[Project](https://liyuantsao.github.io/HoliSDiP/)]
[[Code](https://github.com/liyuantsao/HoliSDiP)]**MatchDiffusion: Training-free Generation of Match-cuts** \
[[Website](https://arxiv.org/abs/2411.18677)]
[[Project](https://matchdiffusion.github.io/)]
[[Code](https://github.com/PardoAlejo/MatchDiffusion)]**Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling** \
[[Website](https://arxiv.org/abs/2411.18664)]
[[Project](https://junhahyung.github.io/STGuidance/)]
[[Code](https://github.com/junhahyung/STGuidance)]**AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation** \
[[Website](https://arxiv.org/abs/2404.01717)]
[[Project](https://nju-pcalab.github.io/projects/AddSR/)]
[[Code](https://github.com/NJU-PCALab/AddSR)]**FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution** \
[[Website](https://arxiv.org/abs/2411.18824)]
[[Project](https://jychen9811.github.io/FaithDiff_page/)]
[[Code](https://github.com/JyChen9811/FaithDiff/)]**Exploiting Diffusion Prior for Real-World Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2305.07015)]
[[Project](https://iceclear.github.io/projects/stablesr/)]
[[Code](https://github.com/IceClear/StableSR)]**SinSR: Diffusion-Based Image Super-Resolution in a Single Step** \
[[CVPR 2024](https://arxiv.org/abs/2311.14760)]
[[Code](https://github.com/wyf0912/SinSR)]**CDFormer:When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution** \
[[CVPR 2024](https://arxiv.org/abs/2405.07648)]
[[Code](https://github.com/I2-Multimedia-Lab/CDFormer)]**Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs** \
[[NeurIPS 2024](https://arxiv.org/abs/2409.17778)]
[[Code](https://github.com/QinpengCui/DoSSR)]**SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution** \
[[NeurIPS 2024](https://arxiv.org/abs/2410.05799)]
[[Code](https://github.com/Tang1705/SeeClear-NeurIPS24)]**Iterative Token Evaluation and Refinement for Real-World Super-Resolution** \
[[AAAI 2024](https://arxiv.org/abs/2312.05616)]
[[Code](https://github.com/chaofengc/ITER)]**PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2411.17106)]
[[Code](https://github.com/libozhu03/PassionSR)]**Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2410.04224)]
[[Code](https://github.com/JianzeLi-114/DFOSD)]**Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors** \
[[Website](https://arxiv.org/abs/2409.17058)]
[[Code](https://github.com/ArcticHare105/S3Diff)]**One Step Diffusion-based Super-Resolution with Time-Aware Distillation** \
[[Website](https://arxiv.org/abs/2408.07476)]
[[Code](https://github.com/LearningHx/TAD-SR)]**Hero-SR: One-Step Diffusion for Super-Resolution with Human Perception Priors** \
[[Website](https://arxiv.org/abs/2412.07152)]
[[Code](https://github.com/W-JG/Hero-SR)]**RAP-SR: RestorAtion Prior Enhancement in Diffusion Models for Realistic Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2412.07149)]
[[Code](https://github.com/W-JG/RAP-SR)]**One-Step Effective Diffusion Network for Real-World Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2406.08177)]
[[Code](https://github.com/cswry/OSEDiff)]**Binarized Diffusion Model for Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2406.05723)]
[[Code](https://github.com/zhengchen1999/BI-DiffSR)]**Does Diffusion Beat GAN in Image Super Resolution?** \
[[Website](https://arxiv.org/abs/2405.17261)]
[[Code](https://github.com/yandex-research/gan_vs_diff_sr)]**PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution** \
[[Website](https://arxiv.org/abs/2405.17158)]
[[Code](https://github.com/yongliuy/PatchScaler)]**DeeDSR: Towards Real-World Image Super-Resolution via Degradation-Aware Stable Diffusion** \
[[Website](https://arxiv.org/abs/2404.00661)]
[[Code](https://github.com/bichunyang419/DeeDSR)]**Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach** \
[[Website](https://arxiv.org/abs/2310.12004)]
[[Code](https://github.com/amandaluof/moe_sr)]**OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs** \
[[Website](https://arxiv.org/abs/2412.09465)]
[[Code](https://github.com/yuanzhi-zhu/OFTSR)]**Arbitrary-steps Image Super-resolution via Diffusion Inversion** \
[[Website](https://arxiv.org/abs/2412.09013)]
[[Code](https://github.com/zsyOAOA/InvSR)]**Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization** \
[[Website](https://arxiv.org/abs/2308.14469)]
[[Code](https://github.com/yangxy/PASD)]**DSR-Diff: Depth Map Super-Resolution with Diffusion Model** \
[[Website](https://arxiv.org/abs/2311.09919)]
[[Code](https://github.com/shiyuan7/DSR-Diff)]**Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach** \
[[Website](https://arxiv.org/abs/2412.03017)]
[[Code](https://github.com/csslc/PiSA-SR)]**RFSR: Improving ISR Diffusion Models via Reward Feedback Learning** \
[[Website](https://arxiv.org/abs/2412.03268)]
[[Code](https://github.com/sxpro/RFSR)]**SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2402.17133)]
[[Code](https://github.com/lose4578/SAM-DiffSR)]**XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2403.05049)]
[[Code](https://github.com/qyp2000/XPSR)]**Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution** \
[[Website](https://arxiv.org/abs/2403.16643)]
[[Code](https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion)]**BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2403.10211)]
[[Code](https://github.com/lifengcs/BlindDiff)]**TASR: Timestep-Aware Diffusion Model for Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2412.03355)]
[[Code](https://github.com/SleepyLin/TASR)]**HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models**\
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/papers/Wu_HSR-Diff_Hyperspectral_Image_Super-Resolution_via_Conditional_Diffusion_Models_ICCV_2023_paper.pdf)]
[[Website](https://arxiv.org/abs/2306.12085)]**Text-guided Explorable Image Super-resolution** \
[[CVPR 2024](https://arxiv.org/abs/2403.01124)]**Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder** \
[[CVPR 2024](https://arxiv.org/abs/2403.10255)]**AdaDiffSR: Adaptive Region-aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution** \
[[ECCV 2024](https://arxiv.org/abs/2410.17752)]**Enhancing Hyperspectral Images via Diffusion Model and Group-Autoencoder Super-resolution Network** \
[[AAAI 2024](https://arxiv.org/abs/2402.17285)]**Detail-Enhancing Framework for Reference-Based Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2405.00431)]**You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation** \
[[Website](https://arxiv.org/abs/2401.17258)]**Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2305.15357)]**Dissecting Arbitrary-scale Super-resolution Capability from Pre-trained Diffusion Generative Models** \
[[Website](https://arxiv.org/abs/2306.00714)]**Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning** \
[[Website](https://arxiv.org/abs/2412.06978)]**YODA: You Only Diffuse Areas. An Area-Masked Diffusion Approach For Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2308.07977)]**Domain Transfer in Latent Space (DTLS) Wins on Image Super-Resolution -- a Non-Denoising Model** \
[[Website](https://arxiv.org/abs/2311.02358)]**TDDSR: Single-Step Diffusion with Two Discriminators for Super Resolution** \
[[Website](https://arxiv.org/abs/2410.07663)]**ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2410.13807)]**Image Super-Resolution with Text Prompt Diffusion** \
[[Website](https://arxiv.org/abs/2311.14282)]**DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution** \
[[Website](https://arxiv.org/abs/2311.18508)]**DREAM: Diffusion Rectification and Estimation-Adaptive Models** \
[[Website](https://arxiv.org/abs/2312.00210)]**Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution** \
[[Website](https://arxiv.org/abs/2401.10404)]**Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2403.05808)]**CasSR: Activating Image Power for Real-World Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2403.11451)]**Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution** \
[[Website](https://arxiv.org/abs/2403.17000)]**Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution** \
[[Website](https://arxiv.org/abs/2405.10014)]**ClearSR: Latent Low-Resolution Image Embeddings Help Diffusion-Based Real-World Super Resolution Models See Clearer** \
[[Website](https://arxiv.org/abs/2410.14279)]**Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2411.12072)]**Adversarial Diffusion Compression for Real-World Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2411.13383)]**HF-Diff: High-Frequency Perceptual Loss and Distribution Matching for One-Step Diffusion-Based Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2411.13548)]**Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution** \
[[Website](https://arxiv.org/abs/2412.02960)]**RealOSR: Latent Unfolding Boosting Diffusion-based Real-world Omnidirectional Image Super-Resolution** \
[[Website](https://arxiv.org/abs/2412.09646)]

## Video Generation
**Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators** \
[[ICCV 2023 Oral](https://openaccess.thecvf.com/content/ICCV2023/html/Khachatryan_Text2Video-Zero_Text-to-Image_Diffusion_Models_are_Zero-Shot_Video_Generators_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.13439)]
[[Project](https://text2video-zero.github.io/)]
[[Code](https://github.com/Picsart-AI-Research/Text2Video-Zero)]
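
Text2Video-Zero is also exposed as a diffusers pipeline; a minimal sketch, assuming the `runwayml/stable-diffusion-v1-5` checkpoint (the prompt is illustrative):

```python
import imageio
import torch
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline returns a list of frames as float arrays in [0, 1].
frames = pipe(prompt="a panda playing guitar on Times Square").images
frames = [(f * 255).astype("uint8") for f in frames]
imageio.mimsave("video.mp4", frames, fps=4)
```

**SinFusion: Training Diffusion Models on a Single Image or Video** \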
[[ICML 2023](https://icml.cc/virtual/2023/poster/24630)]
[[Website](https://arxiv.org/abs/2211.11743)]
[[Project](http://yaniv.nikankin.com/sinfusion/)]
[[Code](https://github.com/yanivnik/sinfusion-code)]**Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/papers/Blattmann_Align_Your_Latents_High-Resolution_Video_Synthesis_With_Latent_Diffusion_Models_CVPR_2023_paper.pdf)]
[[Website](https://arxiv.org/abs/2304.08818)]
[[Project](https://research.nvidia.com/labs/toronto-ai/VideoLDM/)]
[[Code](https://github.com/srpkdyy/VideoLDM)]**ZIGMA: A DiT-style Zigzag Mamba Diffusion Model** \
[[ECCV 2024](https://arxiv.org/abs/2403.13802)]
[[Project](https://taohu.me/zigma/)]
[[Code](https://github.com/CompVis/zigma)]**MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation** \
[[NeurIPS 2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/944618542d80a63bbec16dfbd2bd689a-Abstract-Conference.html)]
[[Website](https://arxiv.org/abs/2205.09853)]
[[Project](https://mask-cond-video-diffusion.github.io/)]
[[Code](https://github.com/voletiv/mcvd-pytorch)]**GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/71560)]
[[Website](https://arxiv.org/abs/2309.13274)]
[[Code](https://github.com/iva-mzsun/glober)]**Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator** \
[[NeurIPS 2023](https://nips.cc/virtual/2023/poster/70404)]
[[Website](https://arxiv.org/abs/2309.14494)]
[[Code](https://github.com/SooLab/Free-Bloom)]**Conditional Image-to-Video Generation with Latent Flow Diffusion Models** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Ni_Conditional_Image-to-Video_Generation_With_Latent_Flow_Diffusion_Models_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.13744)]
[[Code](https://github.com/nihaomiao/CVPR23_LFDM)]**FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation** \
[[CVPR 2024](https://arxiv.org/abs/2403.12962)]
[[Project](https://www.mmlab-ntu.com/project/fresco/)]
[[Code](https://github.com/williamyang1991/FRESCO)]**TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2404.16306)]
[[Project](https://merl.com/research/highlights/TI2V-Zero)]
[[Code](https://github.com/merlresearch/TI2V-Zero)]**Video Diffusion Models** \
[[ICLR 2022 workshop](https://openreview.net/forum?id=BBelR2NdDZ5)]
[[Website](https://arxiv.org/abs/2204.03458)]
[[Code](https://github.com/lucidrains/video-diffusion-pytorch)]
[[Project](https://video-diffusion.github.io/)]**PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models** \
[[Website](https://arxiv.org/abs/2312.13964)]
[[Diffusers Doc](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pia)]
[[Project](https://pi-animator.github.io/)]
[[Code](https://github.com/open-mmlab/PIA)]
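
PIA is integrated into diffusers (doc linked above); a minimal sketch, assuming the `openmmlab/PIA-condition-adapter` motion adapter on a personalized SD 1.5 base such as `SG161222/Realistic_Vision_V6.0_B1_noVAE`:

```python
import torch
from diffusers import MotionAdapter, PIAPipeline
from diffusers.utils import export_to_gif, load_image

# Plug the PIA motion adapter into a frozen image checkpoint.
adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V6.0_B1_noVAE",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("cat.png").resize((512, 512))  # placeholder still image
output = pipe(image=image, prompt="a cat in a field, sunny day")
export_to_gif(output.frames[0], "pia_animation.gif")
```

**IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation** \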
[[ECCV 2024](https://arxiv.org/abs/2407.10937)]
[[Project](https://yhzhai.github.io/idol/)]
[[Code](https://github.com/yhZhai/idol)]**EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions** \
[[ECCV 2024](https://arxiv.org/abs/2402.17485)]
[[Project](https://humanaigc.github.io/emote-portrait-alive/)]
[[Code](https://github.com/HumanAIGC/EMO)]**T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design** \
[[Website](https://arxiv.org/abs/2410.05677)]
[[Project](https://t2v-turbo-v2.github.io/)]
[[Code](https://github.com/Ji4chenLi/t2v-turbo)]**Tora: Trajectory-oriented Diffusion Transformer for Video Generation** \
[[Website](https://arxiv.org/abs/2407.21705)]
[[Project](https://ali-videoai.github.io/tora_video/)]
[[Code](https://github.com/ali-videoai/Tora)]**MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling** \
[[Website](https://arxiv.org/abs/2409.16160)]
[[Project](https://menyifang.github.io/projects/MIMO/index.html)]
[[Code](https://github.com/menyifang/MIMO)]**MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence** \
[[Website](https://arxiv.org/abs/2407.16655)]
[[Project](https://aim-uofa.github.io/MovieDreamer/)]
[[Code](https://github.com/aim-uofa/MovieDreamer)]**SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2411.04989)]
[[Project](https://kmcode1.github.io/Projects/SG-I2V/)]
[[Code](https://github.com/Kmcode1/SG-I2V)]**Video Diffusion Alignment via Reward Gradients** \
[[Website](https://arxiv.org/abs/2407.08737)]
[[Project](https://vader-vid.github.io/)]
[[Code](https://github.com/mihirp1998/VADER)]**Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.08701)]
[[Project](https://live2diff.github.io/)]
[[Code](https://github.com/open-mmlab/Live2Diff)]**Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.15642)]
[[Project](https://maxin-cn.github.io/cinemo_project/)]
[[Code](https://github.com/maxin-cn/Cinemo)]**TVG: A Training-free Transition Video Generation Method with Diffusion Models** \
[[Website](https://arxiv.org/abs/2408.13413)]
[[Project](https://sobeymil.github.io/tvg.com/)]
[[Code](https://github.com/SobeyMIL/TVG)]**VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement** \
[[Website](https://arxiv.org/abs/2411.15115)]
[[Project](https://video-repair.github.io/)]
[[Code](https://github.com/daeunni/VideoRepair)]**CamI2V: Camera-Controlled Image-to-Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2410.15957)]
[[Project](https://zgctroy.github.io/CamI2V/)]
[[Code](https://github.com/ZGCTroy/CamI2V)]**Identity-Preserving Text-to-Video Generation by Frequency Decomposition** \
[[Website](https://arxiv.org/abs/2411.17440)]
[[Project](https://pku-yuangroup.github.io/ConsisID/)]
[[Code](https://github.com/PKU-YuanGroup/ConsisID)]**Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning** \
[[Website](https://arxiv.org/abs/2410.24219)]
[[Project](https://pr-ryan.github.io/DEMO-project/)]
[[Code](https://github.com/PR-Ryan/DEMO)]**MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning** \
[[Website](https://arxiv.org/abs/2409.15179)]
[[Project](https://mimaface2024.github.io/mimaface.github.io/)]
[[Code](https://github.com/MIMAFace2024/MIMAFace)]**MotionClone: Training-Free Motion Cloning for Controllable Video Generation** \
[[Website](https://arxiv.org/abs/2406.05338)]
[[Project](https://bujiazi.github.io/motionclone.github.io/)]
[[Code](https://github.com/Bujiazi/MotionClone)]**StableAnimator: High-Quality Identity-Preserving Human Image Animation** \
[[Website](https://arxiv.org/abs/2411.17697)]
[[Project](https://francis-rings.github.io/StableAnimator/)]
[[Code](https://github.com/Francis-Rings/StableAnimator)]**AnimateAnything: Consistent and Controllable Animation for Video Generation** \
[[Website](https://arxiv.org/abs/2411.10836)]
[[Project](https://yu-shaonian.github.io/Animate_Anything/)]
[[Code](https://github.com/yu-shaonian/AnimateAnything)]**GameGen-X: Interactive Open-world Game Video Generation** \
[[Website](https://arxiv.org/abs/2411.00769)]
[[Project](https://gamegen-x.github.io/)]
[[Code](https://github.com/GameGen-X/GameGen-X)]**VEnhancer: Generative Space-Time Enhancement for Video Generation** \
[[Website](https://arxiv.org/abs/2407.07667)]
[[Project](https://vchitect.github.io/VEnhancer-project/)]
[[Code](https://github.com/Vchitect/VEnhancer)]**SF-V: Single Forward Video Generation Model** \
[[Website](https://arxiv.org/abs/2406.04324)]
[[Project](https://snap-research.github.io/SF-V/)]
[[Code](https://github.com/snap-research/SF-V)]**Video Motion Transfer with Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2412.07776)]
[[Project](https://ditflow.github.io/)]
[[Code](https://github.com/ditflow/ditflow)]**SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints** \
[[Website](https://arxiv.org/abs/2412.07760)]
[[Project](https://jianhongbai.github.io/SynCamMaster/)]
[[Code](https://github.com/KwaiVGI/SynCamMaster)]**Pyramidal Flow Matching for Efficient Video Generative Modeling** \
[[Website](https://arxiv.org/abs/2410.05954)]
[[Project](https://pyramid-flow.github.io/)]
[[Code](https://github.com/jy0205/Pyramid-Flow)]**AnchorCrafter: Animate CyberAnchors Saling Your Products via Human-Object Interacting Video Generation** \
[[Website](https://arxiv.org/abs/2411.17383)]
[[Project](https://cangcz.github.io/Anchor-Crafter/)]
[[Code](https://github.com/cangcz/AnchorCrafter)]**Trajectory Attention for Fine-grained Video Motion Control** \
[[Website](https://arxiv.org/abs/2411.19324)]
[[Project](https://xizaoqu.github.io/trajattn/)]
[[Code](https://github.com/xizaoqu/TrajectoryAttntion)]**GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration** \
[[Website](https://arxiv.org/abs/2412.04440)]
[[Project](https://karine-h.github.io/GenMAC/)]
[[Code](https://github.com/Karine-Huang/GenMAC)]**CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion** \
[[Website](https://arxiv.org/abs/2406.05082)]
[[Project](https://wxrui182.github.io/CoNo.github.io/)]
[[Code](https://github.com/wxrui182/CoNo)]**Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation** \
[[Website](https://arxiv.org/abs/2311.17117)]
[[Project](https://humanaigc.github.io/animate-anyone/)]
[[Code](https://github.com/HumanAIGC/AnimateAnyone)]**MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance** \
[[Website](https://arxiv.org/abs/2412.05355)]
[[Project](https://motionshop-diffusion.github.io/)]
[[Code](https://github.com/gemlab-vt/motionshop)]**VideoTetris: Towards Compositional Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2406.04277)]
[[Project](https://videotetris.github.io/)]
[[Code](https://github.com/YangLing0818/VideoTetris)]**T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback** \
[[Website](https://arxiv.org/abs/2405.18750)]
[[Project](https://t2v-turbo.github.io/)]
[[Code](https://github.com/Ji4chenLi/t2v-turbo)]**ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation** \
[[Website](https://arxiv.org/abs/2406.00908)]
[[Project](https://ssyang2020.github.io/zerosmooth.github.io/)]
[[Code](https://github.com/ssyang2020/ZeroSmooth)]**MotionBooth: Motion-Aware Customized Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2406.17758)]
[[Project](https://jianzongwu.github.io/projects/motionbooth/)]
[[Code](https://github.com/jianzongwu/MotionBooth)]**Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation** \
[[Website](https://arxiv.org/abs/2412.01316)]
[[Project](https://presto-video.github.io/)]
[[Code](https://github.com/rhymes-ai/Allegro)]**MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2405.20222)]
[[Project](https://myniuuu.github.io/MOFA_Video/)]
[[Code](https://github.com/MyNiuuu/MOFA-Video)]**MotionDreamer: Zero-Shot 3D Mesh Animation from Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.20155)]
[[Project](https://lukas.uzolas.com/MotionDreamer/)]
[[Code](https://github.com/lukasuz/MotionDreamer)]**MotionCraft: Physics-based Zero-Shot Video Generation** \
[[Website](https://arxiv.org/abs/2405.13557)]
[[Project](https://mezzelfo.github.io/MotionCraft/)]
[[Code](https://github.com/mezzelfo/MotionCraft)]**MotionMaster: Training-free Camera Motion Transfer For Video Generation** \
[[Website](https://arxiv.org/abs/2404.15789)]
[[Project](https://sjtuplayer.github.io/projects/MotionMaster/)]
[[Code](https://github.com/sjtuplayer/MotionMaster)]**Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets** \
[[Website](https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets)]
[[Project](https://stability.ai/news/stable-video-diffusion-open-ai-video-model)]
[[Code](https://github.com/Stability-AI/generative-models)]
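
Stable Video Diffusion is available in diffusers as an image-to-video pipeline; a minimal sketch with a placeholder conditioning image:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = load_image("rocket.png").resize((1024, 576))  # placeholder first frame
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

**Motion Inversion for Video Customization** \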
[[Website](https://arxiv.org/abs/2403.20193)]
[[Project](https://wileewang.github.io/MotionInversion/)]
[[Code](https://github.com/EnVision-Research/MotionInversion)]**MagicAvatar: Multimodal Avatar Generation and Animation** \
[[Website](https://arxiv.org/abs/2308.14748)]
[[Project](https://magic-avatar.github.io/)]
[[Code](https://github.com/magic-research/magic-avatar)]**Progressive Autoregressive Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.08151)]
[[Project](https://desaixie.github.io/pa-vdm/)]
[[Code](https://github.com/desaixie/pa_vdm)]**TrailBlazer: Trajectory Control for Diffusion-Based Video Generation** \
[[Website](https://arxiv.org/abs/2401.00896)]
[[Project](https://hohonu-vicml.github.io/Trailblazer.Page/)]
[[Code](https://github.com/hohonu-vicml/Trailblazer)]**Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos** \
[[Website](https://arxiv.org/abs/2304.01186)]
[[Project](https://follow-your-pose.github.io/)]
[[Code](https://github.com/mayuelala/FollowYourPose)]**Align3R: Aligned Monocular Depth Estimation for Dynamic Videos** \
[[Website](https://arxiv.org/abs/2412.03079)]
[[Project](https://igl-hkust.github.io/Align3R.github.io/)]
[[Code](https://github.com/jiah-cloud/Align3R)]**Breathing Life Into Sketches Using Text-to-Video Priors** \
[[Website](https://arxiv.org/abs/2311.13608)]
[[Project](https://livesketch.github.io/)]
[[Code](https://github.com/yael-vinker/live_sketch)]**Latent Video Diffusion Models for High-Fidelity Long Video Generation** \
[[Website](https://arxiv.org/abs/2211.13221)]
[[Project](https://yingqinghe.github.io/LVDM/)]
[[Code](https://github.com/YingqingHe/LVDM)]**Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance** \
[[Website](https://arxiv.org/abs/2306.00943)]
[[Project](https://doubiiu.github.io/projects/Make-Your-Video/)]
[[Code](https://github.com/VideoCrafter/Make-Your-Video)]**Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising** \
[[Website](https://arxiv.org/abs/2305.18264)]
[[Project](https://g-u-n.github.io/projects/gen-long-video/index.html)]
[[Code](https://github.com/G-U-N/Gen-L-Video)]**Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.13840)]
[[Project](https://controlavideo.github.io/)]
[[Code](https://github.com/Weifeng-Chen/control-a-video)]

**VideoComposer: Compositional Video Synthesis with Motion Controllability** \
[[Website](https://arxiv.org/abs/2306.02018)]
[[Project](https://videocomposer.github.io/)]
[[Code](https://github.com/damo-vilab/videocomposer)]

**DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion** \
[[Website](https://arxiv.org/abs/2304.06025)]
[[Project](https://grail.cs.washington.edu/projects/dreampose/)]
[[Code](https://github.com/johannakarras/DreamPose)]

**LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2309.15103)]
[[Project](https://vchitect.github.io/LaVie-project/)]
[[Code](https://github.com/Vchitect/LaVie)]

**Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2309.15818)]
[[Project](https://showlab.github.io/Show-1/)]
[[Code](https://github.com/showlab/Show-1)]

**LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation** \
[[Website](https://arxiv.org/abs/2310.10769)]
[[Project](https://rq-wu.github.io/projects/LAMP/index.html)]
[[Code](https://github.com/RQ-Wu/LAMP)]

**MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer** \
[[Website](https://arxiv.org/abs/2311.12052)]
[[Project](https://boese0601.github.io/magicdance/)]
[[Code](https://github.com/Boese0601/MagicDance)]

**LLM-GROUNDED VIDEO DIFFUSION MODELS** \
[[Website](https://arxiv.org/abs/2309.17444)]
[[Project](https://llm-grounded-video-diffusion.github.io/)]
[[Code](https://github.com/TonyLianLong/LLM-groundedVideoDiffusion)]

**FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling** \
[[Website](https://arxiv.org/abs/2310.15169)]
[[Project](http://haonanqiu.com/projects/FreeNoise.html)]
[[Code](https://github.com/arthur-qiu/LongerCrafter)]

**VideoCrafter1: Open Diffusion Models for High-Quality Video Generation** \
[[Website](https://arxiv.org/abs/2310.19512)]
[[Project](https://ailab-cvc.github.io/videocrafter/)]
[[Code](https://github.com/AILab-CVC/VideoCrafter)]

**VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.09047)]
[[Project](https://ailab-cvc.github.io/videocrafter2/)]
[[Code](https://github.com/AILab-CVC/VideoCrafter)]

**VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning** \
[[Website](https://arxiv.org/abs/2311.00990)]
[[Project](https://videodreamer23.github.io/)]
[[Code](https://github.com/videodreamer23/videodreamer23.github.io)]

**I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.04145)]
[[Project](https://i2vgen-xl.github.io/)]
[[Code](https://github.com/damo-vilab/i2vgen-xl)]

**FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline** \
[[Website](https://arxiv.org/abs/2311.13073)]
[[Project](https://ai-forever.github.io/kandinsky-video/)]
[[Code](https://github.com/ai-forever/KandinskyVideo)]

**MotionCtrl: A Unified and Flexible Motion Controller for Video Generation** \
[[Website](https://arxiv.org/abs/2312.03641)]
[[Project](https://wzhouxiff.github.io/projects/MotionCtrl/)]
[[Code](https://github.com/TencentARC/MotionCtrl)]

**ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.18834)]
[[Project](https://warranweng.github.io/art.v/)]
[[Code](https://github.com/WarranWeng/ART.V)]

**FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax** \
[[Website](https://arxiv.org/abs/2311.15813)]
[[Project](https://flowzero-video.github.io/)]
[[Code](https://github.com/aniki-ly/FlowZero)]

**VideoBooth: Diffusion-based Video Generation with Image Prompts** \
[[Website](https://arxiv.org/abs/2312.00777)]
[[Project](https://vchitect.github.io/VideoBooth-project/)]
[[Code](https://github.com/Vchitect/VideoBooth)]

**MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model** \
[[Website](https://arxiv.org/abs/2311.16498)]
[[Project](https://showlab.github.io/magicanimate/)]
[[Code](https://github.com/magic-research/magic-animate)]

**LivePhoto: Real Image Animation with Text-guided Motion Control** \
[[Website](https://arxiv.org/abs/2312.02928)]
[[Project](https://xavierchen34.github.io/LivePhoto-Page/)]
[[Code](https://github.com/XavierCHEN34/LivePhoto)]

**AnimateZero: Video Diffusion Models are Zero-Shot Image Animators** \
[[Website](https://arxiv.org/abs/2312.03793)]
[[Project](https://vvictoryuki.github.io/animatezero.github.io/)]
[[Code](https://github.com/vvictoryuki/AnimateZero)]

**DreamVideo: Composing Your Dream Videos with Customized Subject and Motion** \
[[Website](https://arxiv.org/abs/2312.04433)]
[[Project](https://dreamvideo-t2v.github.io/)]
[[Code](https://github.com/damo-vilab/i2vgen-xl)]

**Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2312.04483)]
[[Project](https://higen-t2v.github.io/)]
[[Code](https://github.com/damo-vilab/i2vgen-xl)]

**DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.05107)]
[[Project](https://dreamoving.github.io/dreamoving/)]
[[Code](https://github.com/dreamoving/dreamoving-project)]

**Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution** \
[[Website](https://arxiv.org/abs/2312.06640)]
[[Project](https://shangchenzhou.com/projects/upscale-a-video/)]
[[Code](https://github.com/sczhou/Upscale-A-Video)]

**FreeInit: Bridging Initialization Gap in Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.07537)]
[[Project](https://tianxingwu.github.io/pages/FreeInit/)]
[[Code](https://github.com/TianxingWu/FreeInit)]

**Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion** \
[[Website](https://arxiv.org/abs/2312.07133)]
[[Project](https://abdo-eldesokey.github.io/text2ac-zero/)]
[[Code](https://github.com/abdo-eldesokey/text2ac-zero)]

**StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter** \
[[Website](https://arxiv.org/abs/2312.00330)]
[[Project](https://gongyeliu.github.io/StyleCrafter.github.io/)]
[[Code](https://github.com/GongyeLiu/StyleCrafter)]

**A Recipe for Scaling up Text-to-Video Generation with Text-free Videos** \
[[Website](https://arxiv.org/abs/2312.15770)]
[[Project](https://tf-t2v.github.io/)]
[[Code](https://github.com/ali-vilab/i2vgen-xl)]

**FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis** \
[[Website](https://arxiv.org/abs/2312.17681)]
[[Project](https://jeff-liangf.github.io/projects/flowvid/)]
[[Code](https://github.com/Jeff-LiangF/FlowVid)]

**Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions** \
[[Website](https://arxiv.org/abs/2401.01827)]
[[Project](https://showlab.github.io/Moonshot/)]
[[Code](https://github.com/salesforce/LAVIS)]

**Latte: Latent Diffusion Transformer for Video Generation** \
[[Website](https://arxiv.org/abs/2401.03048)]
[[Project](https://maxin-cn.github.io/latte_project/)]
[[Code](https://github.com/maxin-cn/Latte)]

**WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens** \
[[Website](https://arxiv.org/abs/2401.09985)]
[[Project](https://world-dreamer.github.io/)]
[[Code](https://github.com/JeffWang987/WorldDreamer)]

**SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.16933)]
[[Project](https://guoyww.github.io/projects/SparseCtrl/)]
[[Code](https://github.com/guoyww/AnimateDiff#202312-animatediff-v3-and-sparsectrl)]

**Towards A Better Metric for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2401.07781)]
[[Project](https://showlab.github.io/T2VScore/)]
[[Code](https://github.com/showlab/T2VScore)]

**HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.22901)]
[[Project](https://songkey.github.io/hellomeme/)]
[[Code](https://github.com/HelloVision/HelloMeme)]

**AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning** \
[[Website](https://arxiv.org/abs/2402.00769)]
[[Project](https://animatelcm.github.io/)]
[[Code](https://github.com/G-U-N/AnimateLCM)]

**Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation** \
[[Website](https://arxiv.org/abs/2403.13745)]
[[Project](https://be-your-outpainter.github.io/)]
[[Code](https://github.com/G-U-N/Be-Your-Outpainter)]

**UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control** \
[[Website](https://arxiv.org/abs/2403.02332)]
[[Project](https://unified-attention-control.github.io/)]
[[Code](https://github.com/XuweiyiChen/UniCtrl)]

**VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.05438)]
[[Project](https://videoelevator.github.io/)]
[[Code](https://github.com/YBYBZhang/VideoElevator)]

**ID-Animator: Zero-Shot Identity-Preserving Human Video Generation** \
[[Website](https://arxiv.org/abs/2404.15275)]
[[Project](https://id-animator.github.io/)]
[[Code](https://github.com/ID-Animator/ID-Animator)]

**Optical-Flow Guided Prompt Optimization for Coherent Video Generation** \
[[Website](https://arxiv.org/abs/2411.16199)]
[[Project](https://motionprompt.github.io/)]
[[Code](https://github.com/HyelinNAM/MotionPrompt)]

**FlexiFilm: Long Video Generation with Flexible Conditions** \
[[Website](https://arxiv.org/abs/2404.18620)]
[[Project](https://y-ichen.github.io/FlexiFilm-Page/)]
[[Code](https://github.com/Y-ichen/FlexiFilm)]

**FIFO-Diffusion: Generating Infinite Videos from Text without Training** \
[[Website](https://arxiv.org/abs/2405.11473)]
[[Project](https://jjihwan.github.io/projects/FIFO-Diffusion)]
[[Code](https://github.com/jjihwan/FIFO-Diffusion_public)]

**TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2405.04682)]
[[Project](https://talc-mst2v.github.io/)]
[[Code](https://github.com/Hritikbansal/talc)]

**CV-VAE: A Compatible Video VAE for Latent Generative Video Models** \
[[Website](https://arxiv.org/abs/2405.20279)]
[[Project](https://ailab-cvc.github.io/cvvae/index.html)]
[[Code](https://github.com/AILab-CVC/CV-VAE)]

**MVOC: a training-free multiple video object composition method with diffusion models** \
[[Website](https://arxiv.org/abs/2406.15829)]
[[Project](https://sobeymil.github.io/mvoc.com/)]
[[Code](https://github.com/SobeyMIL/MVOC)]

**Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2406.15735)]
[[Project](https://cond-image-leak.github.io/)]
[[Code](https://github.com/thu-ml/cond-image-leakage/)]

**FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2406.16863)]
[[Project](http://haonanqiu.com/projects/FreeTraj.html)]
[[Code](https://github.com/arthur-qiu/FreeTraj)]

**Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation** \
[[Website](https://arxiv.org/abs/2408.15239)]
[[Project](https://svd-keyframe-interpolation.github.io/)]
[[Code](https://github.com/jeanne-wang/svd_keyframe_interpolation)]

**Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction** \
[[Website](https://arxiv.org/abs/2411.14762)]
[[Project](https://huiwon-jang.github.io/coordtok/)]
[[Code](https://github.com/huiwon-jang/CoordTok)]

**AMG: Avatar Motion Guided Video Generation** \
[[Website](https://arxiv.org/abs/2409.01502)]
[[Project](https://zshyang.github.io/amg-website/)]
[[Code](https://github.com/zshyang/amg)]

**DiVE: DiT-based Video Generation with Enhanced Control** \
[[Website](https://arxiv.org/abs/2409.01595)]
[[Project](https://liautoad.github.io/DIVE/)]
[[Code](https://github.com/LiAutoAD/DIVE)]

**MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer** \
[[Website](https://arxiv.org/abs/2408.14975)]
[[Project](https://megactor-ops.github.io/)]
[[Code](https://github.com/megvii-research/MegActor)]

**CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers** \
[[ICLR 2023](https://arxiv.org/abs/2205.15868)]
[[Code](https://github.com/THUDM/CogVideo)]

**UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer** \
[[AAAI 2025](https://arxiv.org/abs/2412.09389)]
[[Code](https://github.com/Delong-liu-bupt/UFO)]

**InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption** \
[[Website](https://arxiv.org/abs/2412.09283)]
[[Code](https://github.com/NJU-PCALab/InstanceCap)]

**MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.01343)]
[[Code](https://github.com/XiaominLi1997/MoTrans)]

**CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer** \
[[Website](https://arxiv.org/abs/2408.06072)]
[[Code](https://github.com/THUDM/CogVideo)]

**Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing** \
[[ICLR 2024](https://arxiv.org/abs/2402.16627)]
[[Code](https://github.com/YangLing0818/ContextDiff)]

**SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces** \
[[ICLR 2024](https://arxiv.org/abs/2403.07711)]
[[Code](https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models)]

**Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing** \
[[Website](https://arxiv.org/abs/2411.16375)]
[[Code](https://github.com/Dawn-LX/CausalCache-VDM)]

**Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach** \
[[Website](https://arxiv.org/abs/2410.03160)]
[[Code](https://github.com/Yaofang-Liu/FVDM)]

**PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2412.00596)]
[[Code](https://github.com/pittisl/PhyT2V)]

**Real-Time Video Generation with Pyramid Attention Broadcast** \
[[Website](https://arxiv.org/abs/2408.12588)]
[[Code](https://github.com/NUS-HPC-AI-Lab/VideoSys)]

**Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model** \
[[Website](https://arxiv.org/abs/2404.01862)]
[[Code](https://github.com/thuhcsi/S2G-MDDiffusion)]

**Diffusion Probabilistic Modeling for Video Generation** \
[[Website](https://arxiv.org/abs/2203.09481)]
[[Code](https://github.com/buggyyang/RVD)]

**DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors** \
[[Website](https://arxiv.org/abs/2310.12190)]
[[Code](https://github.com/AILab-CVC/VideoCrafter)]

**VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation** \
[[Website](https://arxiv.org/abs/2303.08320)]
[[Code](https://github.com/modelscope/modelscope)]

**STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction** \
[[Website](https://arxiv.org/abs/2312.06486)]
[[Code](https://github.com/xiye20/stdiffproject)]

**Vlogger: Make Your Dream A Vlog** \
[[Website](https://arxiv.org/abs/2401.09414)]
[[Code](https://github.com/zhuangshaobin/Vlogger)]

**Magic-Me: Identity-Specific Video Customized Diffusion** \
[[Website](https://arxiv.org/abs/2402.09368)]
[[Code](https://github.com/Zhen-Dong/Magic-Me)]

**VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.06098)]
[[Code](https://github.com/WangWenhao0716/VidProM)]

**EchoReel: Enhancing Action Generation of Existing Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.11535)]
[[Code](https://github.com/liujianzhi/echoreel)]

**StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text** \
[[Website](https://arxiv.org/abs/2403.14773)]
[[Code](https://github.com/Picsart-AI-Research/StreamingT2V)]

**TAVGBench: Benchmarking Text to Audible-Video Generation** \
[[Website](https://arxiv.org/abs/2404.14381)]
[[Code](https://github.com/OpenNLPLab/TAVGBench)]

**OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2409.01199)]
[[Code](https://github.com/PKU-YuanGroup/Open-Sora-Plan)]

**Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation** \
[[Website](https://arxiv.org/abs/2412.04432)]
[[Code](https://github.com/TencentARC/Divot)]

**FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations** \
[[Website](https://arxiv.org/abs/2411.10818)]
[[Code](https://github.com/hmrishavbandy/FlipSketch)]

**IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis** \
[[Website](https://arxiv.org/abs/2410.04171)]
[[Code](https://github.com/xie-lab-ml/IV-mixed-Sampler)]

**REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents** \
[[Website](https://arxiv.org/abs/2411.13552)]
[[Code](https://github.com/microsoft/Reducio-VAE)]

**MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling** \
[[Website](https://arxiv.org/abs/2405.18003)]
[[Code](https://github.com/18445864529/MAVIN)]

**WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2411.17459)]
[[Code](https://github.com/PKU-YuanGroup/WF-VAE)]

**HARIVO: Harnessing Text-to-Image Models for Video Generation** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06938.pdf)]
[[Project](https://kwonminki.github.io/HARIVO/)]

**Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners** \
[[CVPR 2024](https://arxiv.org/abs/2402.17723)]
[[Project](https://yzxing87.github.io/Seeing-and-Hearing/)]

**AtomoVideo: High Fidelity Image-to-Video Generation** \
[[CVPR 2024](https://arxiv.org/abs/2403.01800)]
[[Project](https://atomo-video.github.io/)]

**Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition** \
[[ICLR 2024](https://arxiv.org/abs/2403.14148)]
[[Project](https://sihyun.me/CMD/)]

**TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models** \
[[CVPR 2024](https://arxiv.org/abs/2403.17005)]
[[Project](https://trip-i2v.github.io/TRIP/)]

**ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06174.pdf)]
[[Project](https://gen-l-2.github.io/)]

**TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2407.09012)]
[[Project](https://eccv2024tcan.github.io/)]

**Motion Prompting: Controlling Video Generation with Motion Trajectories** \
[[Website](https://arxiv.org/abs/2412.02700)]
[[Project](https://motion-prompting.github.io/)]

**Mojito: Motion Trajectory and Intensity Control for Video Generation** \
[[Website](https://arxiv.org/abs/2412.08948)]
[[Project](https://sites.google.com/view/mojito-video)]

**OmniCreator: Self-Supervised Unified Generation with Universal Editing** \
[[Website](https://arxiv.org/abs/2412.02114)]
[[Project](https://haroldchen19.github.io/OmniCreator-Page/)]

**DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models** \
[[Website](https://arxiv.org/abs/2412.04446)]
[[Project](https://liyizhuo.com/DiCoDe/)]

**VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation** \
[[Website](https://arxiv.org/abs/2412.02259)]
[[Project](https://cheliosoops.github.io/VGoT/)]

**Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop** \
[[Website](https://arxiv.org/abs/2411.18644)]
[[Project](https://abolfazl-sh.github.io/Scene_co-pilot_site/)]

**LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity** \
[[Website](https://arxiv.org/abs/2412.09856)]
[[Project](https://lineargen.github.io/)]

**Training-free Long Video Generation with Chain of Diffusion Model Experts** \
[[Website](https://arxiv.org/abs/2408.13423)]
[[Project](https://confiner2025.github.io/)]

**Free2Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Model** \
[[Website](https://arxiv.org/abs/2411.17041)]
[[Project](https://kjm981995.github.io/free2guide/)]

**FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention** \
[[Website](https://arxiv.org/abs/2407.19918)]
[[Project](https://freelongvideo.github.io/)]

**CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2406.02509)]
[[Project](https://ir1d.github.io/CamCo/)]

**Hierarchical Patch Diffusion Models for High-Resolution Video Generation** \
[[Website](https://arxiv.org/abs/2406.07792)]
[[Project](https://snap-research.github.io/hpdm/)]

**Mimir: Improving Video Diffusion Models for Precise Text Understanding** \
[[Website](https://arxiv.org/abs/2412.03085)]
[[Project](https://lucaria-academy.github.io/Mimir/)]

**From Slow Bidirectional to Fast Causal Video Generators** \
[[Website](https://arxiv.org/abs/2412.07772)]
[[Project](https://causvid.github.io/)]

**I4VGen: Image as Stepping Stone for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2406.02230)]
[[Project](https://xiefan-guo.github.io/i4vgen/)]

**Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback** \
[[Website](https://arxiv.org/abs/2412.02617)]
[[Project](https://sites.google.com/view/aif-dynamic-t2v/)]

**FrameBridge: Improving Image-to-Video Generation with Bridge Models** \
[[Website](https://arxiv.org/abs/2410.15371)]
[[Project](https://framebridge-demo.github.io/)]

**MarDini: Masked Autoregressive Diffusion for Video Generation at Scale** \
[[Website](https://arxiv.org/abs/2410.20280)]
[[Project](https://mardini-vidgen.github.io/)]

**Boosting Camera Motion Control for Video Diffusion Transformers** \
[[Website](https://arxiv.org/abs/2410.10802)]
[[Project](https://soon-yau.github.io/CameraMotionGuidance/)]

**UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation** \
[[Website](https://arxiv.org/abs/2406.01188)]
[[Project](https://unianimate.github.io/)]

**Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control** \
[[Website](https://arxiv.org/abs/2405.17414)]
[[Project](https://collaborativevideodiffusion.github.io/)]

**Controllable Longer Image Animation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.17306)]
[[Project](https://wangqiang9.github.io/Controllable.github.io/)]

**AniClipart: Clipart Animation with Text-to-Video Priors** \
[[Website](https://arxiv.org/abs/2404.12347)]
[[Project](https://aniclipart.github.io/)]

**Spectral Motion Alignment for Video Motion Transfer using Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.15249)]
[[Project](https://geonyeong-park.github.io/spectral-motion-alignment/)]

**TimeRewind: Rewinding Time with Image-and-Events Video Diffusion** \
[[Website](https://arxiv.org/abs/2403.13800)]
[[Project](https://timerewind.github.io/)]

**VideoPoet: A Large Language Model for Zero-Shot Video Generation** \
[[Website](https://storage.googleapis.com/videopoet/paper.pdf)]
[[Project](https://sites.research.google/videopoet/)]

**PEEKABOO: Interactive Video Generation via Masked-Diffusion** \
[[Website](https://arxiv.org/abs/2312.07509)]
[[Project](https://jinga-lala.github.io/projects/Peekaboo/)]

**Searching Priors Makes Text-to-Video Synthesis Better** \
[[Website](https://arxiv.org/abs/2406.03215)]
[[Project](https://hrcheng98.github.io/Search_T2V/)]

**Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2309.03549)]
[[Project](https://anonymous0x233.github.io/ReuseAndDiffuse/)]

**Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning** \
[[Website](https://arxiv.org/abs/2311.10709)]
[[Project](https://emu-video.metademolab.com/)]

**BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.02813)]
[[Project](https://bivdiff.github.io/)]

**SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation** \
[[Website](https://arxiv.org/abs/2410.23277)]
[[Project](https://slowfast-vgen.github.io/)]

**Imagen Video: High Definition Video Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2210.02303)]
[[Project](https://imagen.research.google/video/)]

**MoVideo: Motion-Aware Video Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.11325)]
[[Project](https://jingyunliang.github.io/MoVideo/)]

**Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training** \
[[Website](https://arxiv.org/abs/2412.06029)]
[[Project](https://latent-reframe.github.io/)]

**Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer** \
[[Website](https://arxiv.org/abs/2311.17009)]
[[Project](https://diffusion-motion-transfer.github.io/)]

**Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning** \
[[Website](https://arxiv.org/abs/2311.17536)]
[[Project](https://github.com/SPengLiang/SmoothVideo)]

**VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model** \
[[Website](https://arxiv.org/abs/2311.17338)]
[[Project](https://videoassembler.github.io/videoassembler/)]

**MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2311.18829)]
[[Project](https://wangyanhui666.github.io/MicroCinema.github.io/)]

**Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.01409)]
[[Project](https://primecai.github.io/generative_rendering/)]

**GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation** \
[[Website](https://arxiv.org/abs/2312.04557)]
[[Project](https://www.shoufachen.com/gentron_website/)]

**Customizing Motion in Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.04966)]
[[Project](https://joaanna.github.io/customizing_motion/)]

**Photorealistic Video Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.06662)]
[[Project](https://walt-video-diffusion.github.io/)]

**DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control** \
[[Website](https://arxiv.org/abs/2410.13830)]
[[Project](https://dreamvideo2.github.io/)]

**VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM** \
[[Website](https://arxiv.org/abs/2401.01256)]
[[Project](https://videodrafter.github.io/)]

**Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.10474)]
[[Project](https://research.nvidia.com/labs/dir/pyoco/)]

**ActAnywhere: Subject-Aware Video Background Generation** \
[[Website](https://arxiv.org/abs/2401.10822)]
[[Project](https://actanywhere.github.io/)]

**Lumiere: A Space-Time Diffusion Model for Video Generation** \
[[Website](https://arxiv.org/abs/2401.12945)]
[[Project](https://lumiere-video.github.io/)]

**InstructVideo: Instructing Video Diffusion Models with Human Feedback** \
[[Website](https://arxiv.org/abs/2312.12490)]
[[Project](https://instructvideo.github.io/)]

**Boximator: Generating Rich and Controllable Motions for Video Synthesis** \
[[Website](https://arxiv.org/abs/2402.01566)]
[[Project](https://boximator.github.io/)]

**Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion** \
[[Website](https://arxiv.org/abs/2402.03162)]
[[Project](https://direct-a-video.github.io/)]

**ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2402.04324)]
[[Project](https://tiger-ai-lab.github.io/ConsistI2V/)]

**Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2403.02827)]
[[Project](https://noise-rectification.github.io/)]

**Audio-Synchronized Visual Animation** \
[[Website](https://arxiv.org/abs/2403.05659)]
[[Project](https://lzhangbj.github.io/projects/asva/asva.html)]

**I2VControl: Disentangled and Unified Video Motion Synthesis Control** \
[[Website](https://arxiv.org/abs/2411.17765)]
[[Project](https://wanquanf.github.io/I2VControl)]

**Mind the Time: Temporally-Controlled Multi-Event Video Generation** \
[[Website](https://arxiv.org/abs/2412.05263)]
[[Project](https://mint-video.github.io/)]

**VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis** \
[[Website](https://arxiv.org/abs/2403.13501)]
[[Project](https://yumengli007.github.io/VSTAR/)]

**S2DM: Sector-Shaped Diffusion Models for Video Generation** \
[[Website](https://arxiv.org/abs/2403.13408)]
[[Project](https://s2dm.github.io/S2DM/)]

**MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2412.05275)]
[[Project](https://motionflow-diffusion.github.io/)]

**AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment** \
[[Website](https://arxiv.org/abs/2404.04946)]
[[Project](https://justinxu0.github.io/AnimateZoo/)]

**Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation** \
[[Website](https://arxiv.org/abs/2405.16393)]
[[Project](https://liujl09.github.io/humanvideo_movingbackground/)]

**Dance Any Beat: Blending Beats with Visuals in Dance Video Generation** \
[[Website](https://arxiv.org/abs/2405.09266)]
[[Project](https://dabfusion.github.io/)]

**PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control** \
[[Website](https://arxiv.org/abs/2405.14582)]
[[Project](https://ml-gsai.github.io/PoseCrafter-demo/)]

**Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer** \
[[Website](https://arxiv.org/abs/2405.17405)]
[[Project](https://human4dit.github.io/)]

**Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation** \
[[Website](https://arxiv.org/abs/2406.01900)]
[[Project](https://follow-your-emoji.github.io/)]

**FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance** \
[[Website](https://arxiv.org/abs/2408.08189)]
[[Project](https://fancyvideo.github.io/)]

**CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities** \
[[Website](https://arxiv.org/abs/2408.13239)]
[[Project](https://customcrafter.github.io/)]

**Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention** \
[[Website](https://arxiv.org/abs/2410.10774)]
[[Project](https://ir1d.github.io/Cavia/)]

**VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide** \
[[Website](https://arxiv.org/abs/2410.04364)]
[[Project](https://videoguide2025.github.io/)]

**MovieCharacter: A Tuning-Free Framework for Controllable Character Video Synthesis** \
[[Website](https://arxiv.org/abs/2410.20974)]
[[Project](https://moviecharacter.github.io/)]

**ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation** \
[[Website](https://arxiv.org/abs/2410.20502)]
[[Project](https://arlont2v.github.io/)]

**Improved Video VAE for Latent Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2411.06449)]
[[Project](https://wpy1999.github.io/IV-VAE/)]

**Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation** \
[[Website](https://arxiv.org/abs/2412.06016)]
[[Project](https://hyeonho99.github.io/track4gen/)]

**DragEntity: Trajectory Guided Video Generation using Entity and Positional Relationships** \
[[ACM MM 2024 Oral](https://arxiv.org/abs/2410.10751)]

**Four-Plane Factorized Video Autoencoders** \
[[Website](https://arxiv.org/abs/2412.04452)]

**Grid Diffusion Models for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2404.00234)]

**SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction** \
[[Website](https://arxiv.org/abs/2310.20700)]

**GenRec: Unifying Video Generation and Recognition with Diffusion Models** \
[[Website](https://arxiv.org/abs/2408.15241)]

**Efficient Continuous Video Flow Model for Video Prediction** \
[[Website](https://arxiv.org/abs/2412.05633)]

**Dual-Stream Diffusion Net for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2308.08316)]

**DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control** \
[[Website](https://arxiv.org/abs/2405.12796)]

**SimDA: Simple Diffusion Adapter for Efficient Video Generation** \
[[Website](https://arxiv.org/abs/2308.09710)]

**VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2305.10874)]

**Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models** \
[[Website](https://arxiv.org/abs/2308.13812)]

**ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2310.07697)]

**LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation** \
[[Website](https://arxiv.org/abs/2311.00353)]

**Optimal Noise pursuit for Augmenting Text-to-Video Generation** \
[[Website](https://arxiv.org/abs/2311.00949)]

**Make Pixels Dance: High-Dynamic Video Generation** \
[[Website](https://arxiv.org/abs/2311.10982)]

**Video-Infinity: Distributed Long Video Generation** \
[[Website](https://arxiv.org/abs/2406.16260)]

**GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning** \
[[Website](https://arxiv.org/abs/2311.12631)]

**Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion** \
[[Website](https://arxiv.org/abs/2311.14343)]

**Decouple Content and Motion for Conditional Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2311.14294)]

**X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention** \
[[Website](https://arxiv.org/abs/2403.15931)]

**F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis** \
[[Website](https://arxiv.org/abs/2312.03459)]

**MTVG : Multi-text Video Generation with Text-to-Video Models** \
[[Website](https://arxiv.org/abs/2312.04086)]

**VideoLCM: Video Latent Consistency Model** \
[[Website](https://arxiv.org/abs/2312.09109)]

**MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion** \
[[Website](https://arxiv.org/abs/2410.07659)]

**MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation** \
[[Website](https://arxiv.org/abs/2401.04468)]

**I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.16693)]

**360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2401.06578)]

**CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects** \
[[Website](https://arxiv.org/abs/2401.09962)]

**Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation** \
[[Website](https://arxiv.org/abs/2401.10150)]

**Training-Free Semantic Video Composition via Pre-trained Diffusion Model** \
[[Website](https://arxiv.org/abs/2401.09195)]

**STIV: Scalable Text and Image Conditioned Video Generation** \
[[Website](https://arxiv.org/abs/2412.07730)]

**Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling** \
[[Website](https://arxiv.org/abs/2401.15977)]

**Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models** \
[[Website](https://arxiv.org/abs/2401.16224)]

**Human Video Translation via Query Warping** \
[[Website](https://arxiv.org/abs/2402.12099)]

**Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation** \
[[Website](https://arxiv.org/abs/2402.13729)]

**Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis** \
[[Website](https://arxiv.org/abs/2402.14797v1)]

**Context-aware Talking Face Video Generation** \
[[Website](https://arxiv.org/abs/2402.18092)]

**Pix2Gif: Motion-Guided Diffusion for GIF Generation** \
[[Website](https://arxiv.org/abs/2403.04634)]

**Intention-driven Ego-to-Exo Video Generation** \
[[Website](https://arxiv.org/abs/2403.09194)]

**AnimateDiff-Lightning: Cross-Model Diffusion Distillation** \
[[Website](https://arxiv.org/abs/2403.12706)]

**Frame by Familiar Frame: Understanding Replication in Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.19593)]

**Matten: Video Generation with Mamba-Attention** \
[[Website](https://arxiv.org/abs/2405.03025)]

**Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.04233)]

**ReVideo: Remake a Video with Motion and Content Control** \
[[Website](https://arxiv.org/abs/2405.13865)]

**VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation** \
[[Website](https://arxiv.org/abs/2405.18156)]

**SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model** \
[[Website](https://arxiv.org/abs/2406.00195)]

**GVDIFF: Grounded Text-to-Video Generation with Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.01921)]

**Mobius: An High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task** \
[[Website](https://arxiv.org/abs/2407.06617)]

**Contrastive Sequential-Diffusion Learning: An approach to Multi-Scene Instructional Video Synthesis** \
[[Website](https://arxiv.org/abs/2407.11814)]

**Multi-sentence Video Grounding for Long Video Generation** \
[[Website](https://arxiv.org/abs/2407.13219)]

**Fine-gained Zero-shot Video Sampling** \
[[Website](https://arxiv.org/abs/2407.21475)]

**Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data** \
[[Website](https://arxiv.org/abs/2408.10119)]

**xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations** \
[[Website](https://arxiv.org/abs/2408.12590)]

**EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation** \
[[Website](https://arxiv.org/abs/2408.13005)]

**Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation** \
[[Website](https://arxiv.org/abs/2408.16506)]

**One-Shot Learning Meets Depth Diffusion in Multi-Object Videos** \
[[Website](https://arxiv.org/abs/2408.16704)]

**Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation** \
[[Website](https://arxiv.org/abs/2409.12532)]

**S2AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance** \
[[Website](https://arxiv.org/abs/2409.15259)]

**JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation** \
[[Website](https://arxiv.org/abs/2409.14149)]

**ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning** \
[[Website](https://arxiv.org/abs/2410.00262)]

**COMUNI: Decomposing Common and Unique Video Signals for Diffusion-based Video Generation** \
[[Website](https://arxiv.org/abs/2410.01718)]

**Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2410.05322)]

**BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way** \
[[Website](https://arxiv.org/abs/2410.06241)]

**LumiSculpt: A Consistency Lighting Control Network for Video Generation** \
[[Website](https://arxiv.org/abs/2410.22979)]

**TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation** \
[[Website](https://arxiv.org/abs/2410.24037)]

**OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2411.10501)]

**Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge** \
[[Website](https://arxiv.org/abs/2411.11343)]

**SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input** \
[[Website](https://arxiv.org/abs/2411.11934)]

**StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart** \
[[Website](https://arxiv.org/abs/2411.14295)]

**VIRES: Video Instance Repainting with Sketch and Text Guidance** \
[[Website](https://arxiv.org/abs/2411.16199)]

**MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation** \
[[Website](https://arxiv.org/abs/2411.18281)]

**Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints** \
[[Website](https://arxiv.org/abs/2411.19381)]

**Fleximo: Towards Flexible Text-to-Human Motion Video Generation** \
[[Website](https://arxiv.org/abs/2411.19459)]

**SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing** \
[[Website](https://arxiv.org/abs/2411.18983)]

**Towards Chunk-Wise Generation for Long Videos** \
[[Website](https://arxiv.org/abs/2411.18668)]

**Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning** \
[[Website](https://arxiv.org/abs/2412.00547)]

**CPA: Camera-pose-awareness Diffusion Transformer for Video Generation** \
[[Website](https://arxiv.org/abs/2412.01429)]

**Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis** \
[[Website](https://arxiv.org/abs/2412.00638)]

**MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation** \
[[Website](https://arxiv.org/abs/2412.05848)]

**Mobile Video Diffusion** \
[[Website](https://arxiv.org/abs/2412.07583)]

**TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation** \
[[Website](https://arxiv.org/abs/2412.10275)]

## Video Editing

**FateZero: Fusing Attentions for Zero-shot Text-based Video Editing** \
[[ICCV 2023 Oral](https://openaccess.thecvf.com/content/ICCV2023/html/QI_FateZero_Fusing_Attentions_for_Zero-shot_Text-based_Video_Editing_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2303.09535)]
[[Project](https://fate-zero-edit.github.io/)]
[[Code](https://github.com/ChenyangQiQi/FateZero)]

**Text2LIVE: Text-Driven Layered Image and Video Editing** \
[[ECCV 2022 Oral](https://arxiv.org/abs/2204.02491)]
[[Project](https://text2live.github.io/)]
[[Code](https://github.com/omerbt/Text2LIVE)]

**Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding** \
[[CVPR 2023](https://arxiv.org/abs/2212.02802)]
[[Project](https://diff-video-ae.github.io/)]
[[Code](https://github.com/man805/Diffusion-Video-Autoencoders)]

**Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation** \
[[ICCV 2023](https://arxiv.org/abs/2212.11565)]
[[Project](https://tuneavideo.github.io/)]
[[Code](https://github.com/showlab/Tune-A-Video)]

**StableVideo: Text-driven Consistency-aware Diffusion Video Editing** \
[[ICCV 2023](https://openaccess.thecvf.com/content/ICCV2023/html/Chai_StableVideo_Text-driven_Consistency-aware_Diffusion_Video_Editing_ICCV_2023_paper.html)]
[[Website](https://arxiv.org/abs/2308.09592)]
[[Code](https://github.com/rese1f/stablevideo)]

**Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models** \
[[ECCV 2024](https://arxiv.org/abs/2407.10285)]
[[Project](https://yangqy1110.github.io/NC-SDEdit/)]
[[Code](https://github.com/yangqy1110/NC-SDEdit/)]

**StableV2V: Stablizing Shape Consistency in Video-to-Video Editing** \
[[Website](https://arxiv.org/abs/2411.11045)]
[[Project](https://alonzoleeeooo.github.io/StableV2V/)]
[[Code](https://github.com/AlonzoLeeeooo/StableV2V)]

**Video-P2P: Video Editing with Cross-attention Control** \
[[Website](https://arxiv.org/abs/2303.04761)]
[[Project](https://video-p2p.github.io/)]
[[Code](https://github.com/ShaoTengLiu/Video-P2P)]

**CoDeF: Content Deformation Fields for Temporally Consistent Video Processing** \
[[Website](https://arxiv.org/abs/2308.07926)]
[[Project](https://qiuyu96.github.io/CoDeF/)]
[[Code](https://github.com/qiuyu96/CoDeF)]

**MagicEdit: High-Fidelity and Temporally Coherent Video Editing** \
[[Website](https://arxiv.org/abs/2308.14749)]
[[Project](https://magic-edit.github.io/)]
[[Code](https://github.com/magic-research/magic-edit)]

**TokenFlow: Consistent Diffusion Features for Consistent Video Editing** \
[[Website](https://arxiv.org/abs/2307.10373)]
[[Project](https://diffusion-tokenflow.github.io/)]
[[Code](https://github.com/omerbt/TokenFlow)]

**ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing** \
[[Website](https://arxiv.org/abs/2305.17098)]
[[Project](https://ml.cs.tsinghua.edu.cn/controlvideo/)]
[[Code](https://github.com/thu-ml/controlvideo)]

**Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts** \
[[Website](https://arxiv.org/abs/2305.08850)]
[[Project](https://make-a-protagonist.github.io/)]
[[Code](https://github.com/Make-A-Protagonist/Make-A-Protagonist)]

**MotionDirector: Motion Customization of Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.08465)]
[[Project](https://showlab.github.io/MotionDirector/)]
[[Code](https://github.com/showlab/MotionDirector)]

**EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing** \
[[Website](https://arxiv.org/abs/2403.16111)]
[[Project](https://knightyxp.github.io/EVA/)]
[[Code](https://github.com/knightyxp/EVA_Video_Edit)]

**RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.04524)]
[[Project](https://rave-video.github.io/)]
[[Code](https://github.com/rehg-lab/RAVE)]

**Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.01107)]
[[Project](https://ground-a-video.github.io/)]
[[Code](https://github.com/Ground-A-Video/Ground-A-Video)]

**MotionEditor: Editing Video Motion via Content-Aware Diffusion** \
[[Website](https://arxiv.org/abs/2311.18830)]
[[Project](https://francis-rings.github.io/MotionEditor/)]
[[Code](https://github.com/Francis-Rings/MotionEditor)]

**VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2312.00845)]
[[Project](https://video-motion-customization.github.io/)]
[[Code](https://github.com/HyeonHo99/Video-Motion-Customization)]

**MagicStick: Controllable Video Editing via Control Handle Transformations** \
[[Website](https://arxiv.org/abs/2312.03047)]
[[Project](https://magic-stick-edit.github.io/)]
[[Code](https://github.com/mayuelala/MagicStick)]

**VidToMe: Video Token Merging for Zero-Shot Video Editing** \
[[Website](https://arxiv.org/abs/2312.10656)]
[[Project](https://vidtome-diffusion.github.io/)]
[[Code](https://github.com/lixirui142/VidToMe)]

**VASE: Object-Centric Appearance and Shape Manipulation of Real Videos** \
[[Website](https://arxiv.org/abs/2401.02473)]
[[Project](https://helia95.github.io/vase-website/)]
[[Code](https://github.com/helia95/VASE)]

**Neural Video Fields Editing** \
[[Website](https://arxiv.org/abs/2312.08882)]
[[Project](https://nvedit.github.io/)]
[[Code](https://github.com/Ysz2022/NVEdit)]

**UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing** \
[[Website](https://arxiv.org/abs/2402.13185v1)]
[[Project](https://jianhongbai.github.io/UniEdit/)]
[[Code](https://github.com/JianhongBai/UniEdit)]

**MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion** \
[[Website](https://arxiv.org/abs/2405.20325)]
[[Project](https://francis-rings.github.io/MotionFollower/)]
[[Code](https://github.com/Francis-Rings/MotionFollower)]

**Vid2Vid-zero: Zero-Shot Video Editing Using Off-the-Shelf Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2303.17599)]
[[Code](https://github.com/baaivision/vid2vid-zero)]

**DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization** \
[[Website](https://arxiv.org/abs/2311.16060)]
[[Code](https://github.com/Jeffery9707/DiffSLVA)]

**LOVECon: Text-driven Training-Free Long Video Editing with ControlNet** \
[[Website](https://arxiv.org/abs/2310.09711)]
[[Code](https://github.com/zhijie-group/LOVECon)]

**Pix2video: Video Editing Using Image Diffusion** \
[[Website](https://arxiv.org/abs/2303.12688)]
[[Code](https://github.com/G-U-N/Pix2Video.pytorch)]

**E-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment** \
[[Website](https://arxiv.org/abs/2408.11481)]
[[Code](https://github.com/littlespray/E-Bench)]

**Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer** \
[[Website](https://arxiv.org/abs/2305.05464)]
[[Code](https://github.com/haha-lisa/style-a-video)]

**Flow-Guided Diffusion for Video Inpainting** \
[[Website](https://arxiv.org/abs/2311.15368)]
[[Code](https://github.com/nevsnev/fgdvi)]

**Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.05519)]
[[Code](https://github.com/sam-motamed/Video-Editing-X-Attention/)]

**Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing** \
[[Website](https://arxiv.org/abs/2405.04496)]
[[Code](https://github.com/yiiizuo/Edit-Your-Motion)]

**COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing** \
[[Website](https://arxiv.org/abs/2406.08850)]
[[Code](https://github.com/wangjiangshan0725/COVE)]

**Shape-Aware Text-Driven Layered Video Editing** \
[[CVPR 2023](https://openaccess.thecvf.com/content/CVPR2023/html/Lee_Shape-Aware_Text-Driven_Layered_Video_Editing_CVPR_2023_paper.html)]
[[Website](https://arxiv.org/abs/2301.13173)]
[[Project](https://text-video-edit.github.io/#)]

**VideoDirector: Precise Video Editing via Text-to-Video Models** \
[[Website](https://arxiv.org/abs/2411.17592)]
[[Project](https://anonymous.4open.science/w/c4KzqAbCaz89o0FeWkdya/)]

**NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing** \
[[Website](https://arxiv.org/abs/2406.06523)]
[[Project](https://koi953215.github.io/NaRCan_page/)]

**Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices** \
[[Website](https://arxiv.org/abs/2405.12211)]
[[Project](https://matankleiner.github.io/slicedit/)]

**DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing** \
[[Website](https://arxiv.org/abs/2310.10624)]
[[Project](https://showlab.github.io/DynVideo-E/)]

**I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.16537)]
[[Project](https://i2vedit.github.io/)]

**FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing** \
[[Website](https://arxiv.org/abs/2310.05922)]
[[Project](https://flatten-video-editing.github.io/)]

**VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing** \
[[Website](https://arxiv.org/abs/2306.08707)]
[[Project](https://videdit.github.io/)]

**DIVE: Taming DINO for Subject-Driven Video Editing** \
[[Website](https://arxiv.org/abs/2412.03347)]
[[Project](https://dino-video-editing.github.io/)]

**VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence** \
[[Website](https://arxiv.org/abs/2312.02087)]
[[Project](https://videoswap.github.io/)]

**Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation** \
[[Website](https://arxiv.org/abs/2306.07954)]
[[Project](https://anonymous-31415926.github.io/)]

**ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning** \
[[Website](https://arxiv.org/abs/2411.05003)]
[[Project](https://generative-video-camera-controls.github.io/)]

**WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing** \
[[ECCV 2024](https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/09682.pdf)]
[[Project](https://ree1s.github.io/wave/)]

**MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance** \
[[Website](https://arxiv.org/abs/2308.10079)]
[[Project](https://medm2023.github.io)]

**Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models** \
[[Website](https://arxiv.org/abs/2402.14780)]
[[Project](https://anonymous-314.github.io/)]

**DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing** \
[[Website](https://arxiv.org/abs/2403.12002)]
[[Project](https://hyeonho99.github.io/dreammotion/)]

**VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing** \
[[Website](https://arxiv.org/abs/2411.15260)]
[[Project](https://inkosizhong.github.io/VIVID/)]

**DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency** \
[[ECCV 2024](https://arxiv.org/abs/2408.07481)]

**Edit Temporal-Consistent Videos with Image Diffusion Model** \
[[Website](https://arxiv.org/abs/2308.09091)]

**Streaming Video Diffusion: Online Video Editing with Diffusion Models** \
[[Website](https://arxiv.org/abs/2405.19726)]

**Cut-and-Paste: Subject-Driven Video Editing with Attention Control** \
[[Website](https://arxiv.org/abs/2311.11697)]

**MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation** \
[[Website](https://arxiv.org/abs/2309.00908)]

**Dreamix: Video Diffusion Models Are General Video Editors** \
[[Website](https://arxiv.org/abs/2302.01329)]

**Towards Consistent Video Editing with Text-to-Image Diffusion Models** \
[[Website](https://arxiv.org/abs/2305.17431)]

**EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints** \
[[Website](https://arxiv.org/abs/2308.10648)]

**CCEdit: Creative and Controllable Video Editing via Diffusion Models** \
[[Website](https://arxiv.org/abs/2309.16496)]

**Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models** \
[[Website](https://arxiv.org/abs/2310.16400)]

**FastBlend: a Powerful Model-Free Toolkit Making Video Stylization Easier** \
[[Website](https://arxiv.org/abs/2311.09265)]

**VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models** \
[[Website](https://arxiv.org/abs/2311.18837)]

**RealCraft: Attention Control as A Solution for Zero-shot Long Video Editing** \
[[Website](https://arxiv.org/abs/2312.12635)]

**Object-Centric Diffusion for Efficient Video Editing** \
[[Website](https://arxiv.org/abs/2401.05735)]

**FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing** \
[[Website](https://arxiv.org/abs/2403.06269)]

**Video Editing via Factorized Diffusion Distillation** \
[[Website](https://arxiv.org/abs/2403.09334)]

**EffiVED:Efficient Video Editing via Text-instruction Diffusion Models** \
[[Website](https://arxiv.org/abs/2403.11568)]

**Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion** \
[[Website](https://arxiv.org/abs/2403.14617)]

**GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models** \
[[Website](https://arxiv.org/abs/2404.12541)]

**Temporally Consistent Object Editing in Videos using Extended Attention** \
[[Website](https://arxiv.org/abs/2406.00272)]

**Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting** \
[[Website](https://arxiv.org/abs/2406.02541)]

**FRAG: Frequency Adapting Group for Diffusion Video Editing** \
[[Website](https://arxiv.org/abs/2406.06044)]

**InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models** \
[[Website](https://arxiv.org/abs/2407.10958)]

**Text-based Talking Video Editing with Cascaded Conditional Diffusion** \
[[Website](https://arxiv.org/abs/2407.14841)]

**Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion** \
[[Website](https://arxiv.org/abs/2408.00458)]

**Blended Latent Diffusion under Attention Control for Real-World Video Editing** \
[[Website](https://arxiv.org/abs/2409.03514)]

**EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models** \
[[Website](https://arxiv.org/abs/2409.09668)]

**DNI: Dilutional Noise Initialization for Diffusion Video Editing** \
[[Website](https://arxiv.org/abs/2409.13037)]

**FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing** \
[[Website](https://arxiv.org/abs/2409.20500)]

**Replace Anyone in Videos** \
[[Website](https://arxiv.org/abs/2409.19911)]

**Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing** \
[[Website](https://arxiv.org/abs/2410.12526)]

**DreamColour: Controllable Video Colour Editing without Training** \
[[Website](https://arxiv.org/abs/2412.05180)]

**MoViE: Mobile Diffusion for Video Editing** \
[[Website](https://arxiv.org/abs/2412.06578)]