https://github.com/YangLing0818/RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
image-editting large-language-models multimodal-large-language-models text-to-image

Host: GitHub
URL: https://github.com/YangLing0818/RPG-DiffusionMaster
Owner: YangLing0818
License: mit
Created: 2024-01-22T01:07:23.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-10-10T02:44:57.000Z (8 months ago)
Last Synced: 2024-10-29T15:34:38.092Z (8 months ago)
Topics: image-editting, large-language-models, multimodal-large-language-models, text-to-image
Language: Jupyter Notebook
Homepage: https://proceedings.mlr.press/v235/yang24ai.html
Size: 46 MB
Stars: 1,684
Watchers: 25
Forks: 97
Open Issues: 42
Metadata Files:
- Readme: README.md
- License: LICENSE

README

## Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs - ICML 2024

This repository contains the official implementation of our [RPG](https://openreview.net/forum?id=DgLFkAPwuZ), accepted by ICML 2024.

> [**Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs**](https://openreview.net/forum?id=DgLFkAPwuZ)
> [Ling Yang](https://yangling0818.github.io/),
> [Zhaochen Yu](https://github.com/BitCodingWalkin),
> [Chenlin Meng](https://cs.stanford.edu/~chenlin/),
> [Minkai Xu](https://minkaixu.com/),
> [Stefano Ermon](https://cs.stanford.edu/~ermon/),
> [Bin Cui](https://cuibinpku.github.io/)
>
**Peking University, Stanford University, Pika Labs**

## Introduction

Overview of our RPG

**Abstract**: RPG is a powerful training-free paradigm that can utilize proprietary MLLMs (e.g., GPT-4, Gemini-Pro) or open-source local MLLMs (e.g., miniGPT-4) as the **prompt recaptioner and region planner** with our **complementary regional diffusion** to achieve SOTA text-to-image generation and editing. Our framework is very flexible and can generalize to arbitrary MLLM architectures and diffusion backbones. RPG is also capable of generating images at very high resolutions. Here is an example:

Text prompt: A beautiful landscape with a river in the middle the left of the river is in the evening and in the winter with a big iceberg and a small village while some people are skating on the river and some people are skiing, the right of the river is in the summer with a volcano in the morning and a small village while some people are playing.

## 🚩 New Updates

**[2025.2]** We enhance RPG with LLMs that possess the strongest reasoning capabilities, including [**DeepSeek-R1**](https://github.com/deepseek-ai/DeepSeek-R1), [**OpenAI o3-mini**](https://openai.com/index/openai-o3-mini/), and [**OpenAI o1**](https://openai.com/index/learning-to-reason-with-llms/), and leverage the powerful diffusion backbone [**IterComp**](https://github.com/YangLing0818/IterComp), to achieve outstanding compositional image generation under complex prompts.

**[2024.10]** We enhance RPG by incorporating a more powerful **composition-aware backbone**, [IterComp](https://arxiv.org/abs/2410.07171), significantly improving performance on compositional generation without additional computational costs. Simply update the model path using the command below to obtain the results:

```python
pipe = RegionalDiffusionXLPipeline.from_pretrained("comin/IterComp", torch_dtype=torch.float16, use_safetensors=True)
```

**[2024.4]** Our codebase is now built on [diffusers](https://github.com/huggingface/diffusers) and supports loading diffusion models from both single-file checkpoints (ckpts) and diffusers-format models. As for diffusion backbones, use **RegionalDiffusionPipeline** for base models like **SD v2.0/2.1** and **SD v1.4/1.5**, and use **RegionalDiffusionXLPipeline** for SDXL.
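The backbone-to-pipeline mapping in the 2024.4 update can be written down as a small lookup. This is only an illustrative sketch: the helper name `pipeline_class_for` and the family labels are hypothetical, and only the two pipeline class names come from this repository.

```python
# Hypothetical helper: maps a backbone family label to the name of the RPG
# pipeline class mentioned in the update note. The labels are assumptions.
PIPELINE_BY_FAMILY = {
    "sd1.4": "RegionalDiffusionPipeline",
    "sd1.5": "RegionalDiffusionPipeline",
    "sd2.0": "RegionalDiffusionPipeline",
    "sd2.1": "RegionalDiffusionPipeline",
    "sdxl": "RegionalDiffusionXLPipeline",
}


def pipeline_class_for(family: str) -> str:
    """Return the pipeline class name for a backbone family label."""
    key = family.strip().lower()
    if key not in PIPELINE_BY_FAMILY:
        raise ValueError(f"Unknown backbone family: {family!r}")
    return PIPELINE_BY_FAMILY[key]


print(pipeline_class_for("SDXL"))
print(pipeline_class_for("sd1.5"))
```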

**[2024.1]** We release our main code along with the demo, supporting different diffusion backbones (**SDXL**, **SD v2.0/2.1**, **SD v1.4/1.5**). Our results can be reproduced using GPT-4 and Gemini-Pro. RPG is also compatible with local MLLMs, and we will continue to improve the results in the future.

## New Features of RPG

### 🔥🔥🔥News: Enhance RPG's regional planning with DeepSeek-R1, o3-mini and o1

2048*1024 Examples

DeepSeek-R1

OpenAI o3-mini
OpenAI o1

A surreal dreamscape where the sky is split into day and night. On the left side, a bright sun shines over golden fields with people flying kites, while on the right side, a deep blue night sky is filled with stars and glowing jellyfish floating in the air. In the center, a giant clock tower stands, with its hands pointing to different times for each side. A person wearing a half-day, half-night cloak is walking down the path that separates the two worlds.

1024*1024 Examples

DeepSeek-R1
OpenAI o3-mini
OpenAI o1

A floating city above the clouds, with golden towers and waterfalls cascading into the mist below. A dragon with shimmering wings soars through the sky, while airships dock at crystal platforms.

A cozy winter cabin in a snowy forest at night. Warm yellow lights glow from the windows, and smoke gently rises from the chimney. A deer stands near the trees, watching as a child builds a snowman. In the sky, the northern lights shimmer above the treetops.

We recommend using [**DeepSeek-R1**](https://github.com/deepseek-ai/DeepSeek-R1) as the regional planner and [**IterComp**](https://github.com/YangLing0818/IterComp) as the base diffusion model to achieve state-of-the-art compositional text-to-image generation results.

### Enhance RPG with IterComp

1024*1024 Examples

A colossal, ancient tree with leaves made of ice towers over a mystical castle. Green trees line both sides, while cascading waterfalls and an ethereal glow adorn the scene. The backdrop features towering mountains and a vibrant, colorful sky.
On the rooftop of a skyscraper in a bustling cyberpunk city, a figure in a trench coat and neon-lit visor stands amidst a garden of bio-luminescent plants, overlooking the maze of flying cars and towering holograms. Robotic birds flit among the foliage, digital billboards flash advertisements in the distance.

Compared with RPG

RPG
RPG with IterComp

Futuristic and prehistoric worlds collide: Dinosaurs roam near a medieval castle, flying cars and advanced skyscrapers dominate the skyline. A river winds through lush greenery, blending ancient and modern civilizations in a surreal landscape.

## Gallery

### 1. Multi-people with complex attribute binding

1024*1024 Examples

A girl with white ponytail and black dress are chatting with a blonde curly hair girl in a white dress in a cafe.
A twin-tail girl wearing a brwon cowboy hat and white shirt printed with apples, and blue denim jeans with knee boots,full body shot.
A couple, the beautiful girl on the left, silver hair, braided ponytail, happy, dynamic, energetic, peaceful, the handsome young man on the right detailed gorgeous face, grin, blonde hair, enchanting
Two beautiful Chinese girls wearing cheongsams are drinking tea in the tea room, and a Chinese Landscape Painting is hanging on the wall, the girl on the left is black ponytail in red cheongsam, the girl on the right is white ponytail in orange cheongsam

2048*1024 Example

From left to right, a blonde ponytail Europe girl in white shirt, a brown curly hair African girl in blue shirt printed with a bird, an Asian young man with black short hair in suit are walking in the campus happily.

### 2. Multi-object with complex relationship

1024*1024 Examples

From left to right, two red apples and an apple printed shirt and an ipad on the wooden floor

Seven white ceramic mugs with different geometric patterns on the marble table while a bunch of rose on the left

Five watermelons arranged in X shape on a wooden table, with the one in the middle being cut, realistic style, top down view.

From left to right ,bathed in soft morning light,a cozy nook features a steaming Starbucks latte on a rustic table beside an elegant vase of blooming roses,while a plush ragdoll cat purrs contentedly nearby,its eyes half-closed in blissful serenity.

2048*1024 Example

A green twintail girl in orange dress is sitting on the sofa while a messy desk under a big window on the left, a lively aquarium is on the top right of the sofa, realistic style

### 3. RPG With ControlNet

Open Pose Example

Open Pose

Text prompt: A beautiful black hair girl with her eyes closed in champagne long sleeved formal dress standing in her bright room with delicate blue vases with pink roses on the left and some white roses, filled with upgraded growth all around on the right.

Depth Map Example

Depth Map

Text prompt: Under the clear starry sky, clear river water flows in the mountains, and the lavender flower sea dances with the wind, a peaceful, beautiful, and harmonious atmosphere.

Canny Edge Example

Canny Edge

Text prompt: From left to right, an acient Chinese city in spring, summer, autumn and winter in four different regions

## Preparations

**1. Set Environment**

```bash
git clone https://github.com/YangLing0818/RPG-DiffusionMaster
cd RPG-DiffusionMaster
conda create -n RPG python==3.9
conda activate RPG
pip install -r requirements.txt
git clone https://github.com/huggingface/diffusers
```

**2. Download Diffusion Models and MLLMs**

To attain SOTA generative capabilities, we mainly employ [SDXL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo), and [Playground v2](https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic) as our base diffusion models. To generate high-fidelity images across various styles, such as photorealism, cartoons, and anime, we incorporate models from [Civitai](https://civitai.com/). For photorealistic images, we recommend [AlbedoBase XL](https://civitai.com/models/140737/albedobase-xl?modelVersionId=281176) and [DreamShaper XL](https://civitai.com/models/112902/dreamshaper-xl?modelVersionId=251662). Moreover, we generalize our paradigm to SD v1.5 and SD v2.1. All checkpoints are accessible in our [Hugging Face repository](https://huggingface.co/BitStarWalkin/RPG_models), with detailed descriptions.

For the Multimodal Large Language Model (MLLM), we recommend GPT-4 or Gemini-Pro, as they not only perform better but also reduce local memory usage. According to our experiments, the minimum VRAM requirement is 10GB with GPT-4; using a local LLM requires more VRAM. For those interested in running MLLMs locally, we suggest deploying [miniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4) or directly using large local LLMs such as [Llama2-13b-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) and [Llama2-70b-chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf).

## Text-to-Image Generation

#### 1. Quick Start

For users with limited computational resources, we provide a simple notebook demo that partitions the image into two equal-sized subregions. With minor changes to a few functions in the diffusers library, you can get good results with base diffusion models such as SD v1.4, v1.5, v2.0, and v2.1, as mentioned in our paper. You can also apply your own configurations and experiment on a graphics card with 8GB of VRAM. For details, see our [Example_Notebook](RegionalDiffusion_playground.ipynb).
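To make the idea of partitioning an image into side-by-side subregions concrete, here is a minimal sketch that turns a list of relative widths into pixel column spans. The function `split_columns` is hypothetical and written only for illustration; it is not the notebook's implementation and does not assume any particular split-ratio format used by the repository.

```python
def split_columns(width: int, ratios: list[float]) -> list[tuple[int, int]]:
    """Split `width` pixels into side-by-side column spans proportional to `ratios`."""
    if width <= 0 or not ratios or any(r <= 0 for r in ratios):
        raise ValueError("width and all ratios must be positive")
    total = sum(ratios)
    spans = []
    start = 0
    cumulative = 0.0
    for i, ratio in enumerate(ratios):
        cumulative += ratio
        # The last region always ends at `width`, so rounding never leaves a gap.
        is_last = i == len(ratios) - 1
        end = width if is_last else round(width * cumulative / total)
        spans.append((start, end))
        start = end
    return spans


# Two equal-sized subregions of a 1024-pixel-wide image.
print(split_columns(1024, [1, 1]))  # [(0, 512), (512, 1024)]
```

Unequal ratios work the same way; for example, `split_columns(1024, [1, 2, 1])` gives a wider middle region.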

#### **2. Regional Diffusion with GPT-4**
Our method can automatically generate output without pre-storing MLLM responses, leveraging Chain-of-Thought reasoning and high-quality in-context examples to obtain satisfactory results. Users only need to specify a few parameters. **Please note that we have two pipelines that support different model architectures: for SD v1.4/1.5/2.0/2.1 models, use RegionalDiffusionPipeline; for SDXL models, use RegionalDiffusionXLPipeline.** For example, to use GPT-4 as the region planner, refer to the code below, which is contained in [RPG.py](RPG.py):

```python
from RegionalDiffusion_base import RegionalDiffusionPipeline
from RegionalDiffusion_xl import RegionalDiffusionXLPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers,DPMSolverMultistepScheduler
from mllm import local_llm,GPT4
import torch
# If you want to load ckpt, initialize with ".from_single_file".
pipe = RegionalDiffusionXLPipeline.from_single_file("path to your ckpt",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
# If you want to use diffusers, initialize with ".from_pretrained".
# pipe = RegionalDiffusionXLPipeline.from_pretrained("path to your diffusers",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config,use_karras_sigmas=True)
pipe.enable_xformers_memory_efficient_attention()
## User input
prompt= ' A handsome young man with blonde curly hair and black suit with a black twintail girl in red cheongsam in the bar.'
para_dict = GPT4(prompt,key='...Put your api-key here...')
## MLLM based split generation results
split_ratio = para_dict['Final split ratio']
regional_prompt = para_dict['Regional Prompt']
negative_prompt = "" # negative_prompt,
images = pipe(
prompt=regional_prompt,
split_ratio=split_ratio, # The ratio of the regional prompt, the number of prompts is the same as the number of regions
batch_size = 1, #batch size
base_ratio = 0.5, # The ratio of the base prompt
base_prompt= prompt,
num_inference_steps=20, # sampling step
height = 1024,
negative_prompt=negative_prompt, # negative prompt
width = 1024,
seed = None,# random seed
guidance_scale = 7.0
).images[0]
images.save("test.png")
```

**prompt** is the original prompt that roughly summarizes the content of the image.

**base_prompt** sets the base prompt for generation, which is a summary of the image. By default, we set base_prompt to the original input prompt.

**base_ratio** is the weight of the base prompt.

Note that **base_prompt** and **base_ratio** are important parameters introduced by RPG.

There are also other common optional parameters:

**guidance_scale** is the classifier-free guidance scale.

**num_inference_steps** is the number of sampling steps used to generate an image.

**seed** sets the random seed to make the generation reproducible.
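One way to read "the weight of the base prompt" is as a linear interpolation between a result driven by the base prompt and a result assembled from the regional prompts. The sketch below illustrates that reading on plain lists of numbers. It is an assumption made for illustration, the function `blend_latents` is hypothetical, and it is not a copy of the repository's code.

```python
def blend_latents(base: list[float], regional: list[float], base_ratio: float) -> list[float]:
    """Element-wise weighted sum: base_ratio * base + (1 - base_ratio) * regional."""
    if not 0.0 <= base_ratio <= 1.0:
        raise ValueError("base_ratio must be in [0, 1]")
    if len(base) != len(regional):
        raise ValueError("base and regional must have the same length")
    return [base_ratio * b + (1.0 - base_ratio) * r for b, r in zip(base, regional)]


# Toy values standing in for latents; real latents are multi-dimensional tensors.
base = [1.0, 1.0, 1.0, 1.0]
regional = [0.0, 2.0, 4.0, 6.0]
print(blend_latents(base, regional, 0.5))  # [0.5, 1.5, 2.5, 3.5]
```

Under this reading, a base_ratio of 0 ignores the base prompt entirely and a base_ratio of 1 ignores the regional prompts.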

After adding your **prompt and api-key**, and setting your **path to downloaded diffusion model**, just run the following command and get the results:

```bash
python RPG.py
```
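The example reads two keys from the planner output, `'Final split ratio'` and `'Regional Prompt'`. Since an LLM response may occasionally be malformed, a small check before calling the pipeline can give a clearer error. The helper `check_planner_output` below is a hypothetical sketch, and the values in `example` are placeholders rather than real planner output.

```python
# Keys read from the planner output in the example code.
REQUIRED_KEYS = ("Final split ratio", "Regional Prompt")


def check_planner_output(para_dict: dict) -> tuple:
    """Return (split_ratio, regional_prompt), raising KeyError if a key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in para_dict]
    if missing:
        raise KeyError(f"Planner output is missing keys: {missing}")
    return para_dict["Final split ratio"], para_dict["Regional Prompt"]


# Placeholder values, used only to exercise the check.
example = {
    "Final split ratio": "<split ratio from the planner>",
    "Regional Prompt": "<regional prompt from the planner>",
}
split_ratio, regional_prompt = check_planner_output(example)
print(split_ratio)
print(regional_prompt)
```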

**FAQ: How to set --base_prompt & --base_ratio properly?**

If you want to generate an image with **multiple entities of the same class** (e.g., two girls, three cats, a man and a girl), you should set **base_prompt** to a prompt that states the number of entities of each class in the image. Another relevant parameter is **base_ratio**, which is the weight of the base prompt. According to our experiments, the final results are better when base_ratio is in [0.35, 0.55]. Here is the image generated by the `python RPG.py` command above; you will get a similar image as long as you use the same random seed:

Text prompt: A handsome young man with blonde curly hair and black suit with a black twintail girl in red cheongsam in the bar.

On the other hand, for an image that includes **multiple entities of different classes**, there is no need to use a base prompt. Here is an example:

```python
from RegionalDiffusion_base import RegionalDiffusionPipeline
from RegionalDiffusion_xl import RegionalDiffusionXLPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers,DPMSolverMultistepScheduler
from mllm import local_llm,GPT4
import torch
# If you want to load ckpt, initialize with ".from_single_file".
pipe = RegionalDiffusionXLPipeline.from_single_file("path to your ckpt",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
# If you want to use diffusers, initialize with ".from_pretrained".
# pipe = RegionalDiffusionXLPipeline.from_pretrained("path to your diffusers",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config,use_karras_sigmas=True)
pipe.enable_xformers_memory_efficient_attention()
prompt= 'From left to right, bathed in soft morning light,a cozy nook features a steaming Starbucks latte on a rustic table beside an elegant vase of blooming roses,while a plush ragdoll cat purrs contentedly nearby,its eyes half-closed in blissful serenity.'
para_dict = GPT4(prompt,key='your key')
split_ratio = para_dict['Final split ratio']
regional_prompt = para_dict['Regional Prompt']
negative_prompt = ""
images = pipe(
prompt=regional_prompt,
split_ratio=split_ratio, # The ratio of the regional prompt, the number of prompts is the same as the number of regions
batch_size = 1, #batch size
base_ratio = 0.5, # The ratio of the base prompt
base_prompt= None, # If the base_prompt is None, the base_ratio will not work
num_inference_steps=20, # sampling step
height = 1024,
negative_prompt=negative_prompt, # negative prompt
width = 1024,
seed = None,# random seed
guidance_scale = 7.0
).images[0]
images.save("test.png")
```

And you will get an image similar to our results:

Text prompt: From left to right, bathed in soft morning light,a cozy nook features a steaming Starbucks latte on a rustic table beside an elegant vase of blooming roses,while a plush ragdoll cat purrs contentedly nearby,its eyes half-closed in blissful serenity.

It is important to know when to use **base_prompt**: if base_prompt and base_ratio are not set properly, the results will not be satisfactory. We conducted an ablation study on the base prompt in our paper, which you can check for more information.
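The guidance in this FAQ can be summarized as a small decision helper. This is a sketch under stated assumptions: the function `base_settings` and its `same_class_entities` flag are hypothetical, while the recommended range [0.35, 0.55] and the rule that base_ratio has no effect when base_prompt is None come from the text and code comments above.

```python
import warnings

# Range reported in the FAQ as giving better results.
RECOMMENDED_BASE_RATIO = (0.35, 0.55)


def base_settings(prompt: str, same_class_entities: bool, base_ratio: float = 0.5) -> dict:
    """Return base_prompt/base_ratio keyword arguments following the FAQ guidance."""
    if not same_class_entities:
        # Entities of different classes: no base prompt, so base_ratio has no effect.
        return {"base_prompt": None, "base_ratio": base_ratio}
    low, high = RECOMMENDED_BASE_RATIO
    if not low <= base_ratio <= high:
        warnings.warn(f"base_ratio={base_ratio} is outside the recommended range [{low}, {high}]")
    # Entities of the same class: use the original prompt as the base prompt.
    return {"base_prompt": prompt, "base_ratio": base_ratio}


print(base_settings("Two girls are chatting in the cafe.", same_class_entities=True))
print(base_settings("A latte, a vase of roses, and a cat.", same_class_entities=False))
```

The returned dictionary could then be unpacked into the pipeline call alongside the other arguments.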

#### **3. Regional Diffusion with local LLMs**

For high-quality results, we recommend using local models with over 13 billion parameters, although this also increases load times and GPU memory use. We have conducted experiments with models of three different sizes. Here we take llama2-13b-chat as an example:

```python
from RegionalDiffusion_base import RegionalDiffusionPipeline
from RegionalDiffusion_xl import RegionalDiffusionXLPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers,DPMSolverMultistepScheduler
from mllm import local_llm,GPT4
import torch
# If you want to use single ckpt, use this pipeline
pipe = RegionalDiffusionXLPipeline.from_single_file("path to your ckpt",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
# If you want to use diffusers, use this pipeline
# pipe = RegionalDiffusionXLPipeline.from_pretrained("path to your diffusers",torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config,use_karras_sigmas=True)
pipe.enable_xformers_memory_efficient_attention()
prompt= 'Two girls are chatting in the cafe.'
para_dict = local_llm(prompt,model_path='path to your model')
split_ratio = para_dict['Final split ratio']
regional_prompt = para_dict['Regional Prompt']
negative_prompt = ""
images = pipe(
prompt=regional_prompt,
split_ratio=split_ratio, # The ratio of the regional prompt, the number of prompts is the same as the number of regions
batch_size = 1, #batch size
base_ratio = 0.5, # The ratio of the base prompt
base_prompt= prompt,
num_inference_steps=20, # sampling step
height = 1024,
negative_prompt=negative_prompt, # negative prompt
width = 1024,
seed = 1234,# random seed
guidance_scale = 7.0
).images[0]
images.save("test.png")
```

In the local version, after adding your prompt and setting the paths to your diffusion model and your local MLLM/LLM, just run the command below to get the results:

```bash
python RPG.py
```
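To get a rough sense of why larger local LLMs need more GPU memory, the weights alone can be estimated at 2 bytes per parameter in 16-bit precision. This is a back-of-the-envelope sketch: it assumes the nominal parameter counts implied by the model names, ignores activations and other runtime overhead, and does not account for quantization, which would lower the numbers.

```python
def fp16_weight_gib(num_params: float) -> float:
    """Approximate memory (GiB) for model weights stored at 2 bytes per parameter."""
    bytes_per_param = 2
    return num_params * bytes_per_param / 1024**3


# Nominal parameter counts taken from the model names (an assumption).
for name, params in [("Llama2-13b-chat", 13e9), ("Llama2-70b-chat", 70e9)]:
    print(f"{name}: ~{fp16_weight_gib(params):.1f} GiB for weights alone")
```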

# 📖BibTeX
```
@inproceedings{yang2024mastering,
title={Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs},
author={Yang, Ling and Yu, Zhaochen and Meng, Chenlin and Xu, Minkai and Ermon, Stefano and Cui, Bin},
booktitle={International Conference on Machine Learning},
year={2024}
}
```

# Acknowledgements
Our RPG is a general MLLM-controlled text-to-image generation/editing framework, which is built upon several solid works. Thanks to [AUTOMATIC1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui), [regional-prompter](https://github.com/hako-mikan/sd-webui-regional-prompter), [SAM](https://github.com/facebookresearch/segment-anything), [diffusers](https://github.com/huggingface/diffusers), and [IA](https://github.com/geekyutao/Inpaint-Anything) for their wonderful work and codebases! We also thank Hugging Face for sharing our [paper](https://huggingface.co/papers/2401.11708).
