Official PyTorch Implementation of DenseDiffusion (ICCV 2023): https://github.com/naver-ai/densediffusion

## Dense Text-to-Image Generation with Attention Modulation
### ICCV 2023 [[Paper](https://arxiv.org/abs/2308.12964)] [[Demo on HF 🤗](https://huggingface.co/spaces/naver-ai/DenseDiffusion)] [[Colab Demo](https://github.com/XandrChris/DenseDiffusionColab)]


> #### Authors &nbsp; [Yunji Kim](https://github.com/YunjiKim)<sup>1</sup>, [Jiyoung Lee](https://lee-jiyoung.github.io)<sup>1</sup>, [Jin-Hwa Kim](http://wityworks.com/)<sup>1</sup>, [Jung-Woo Ha](https://github.com/jungwoo-ha)<sup>1</sup>, [Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/)<sup>2</sup>
> <sup>1</sup>NAVER AI Lab, <sup>2</sup>Carnegie Mellon University

> #### Abstract
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region.
To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout.
We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps.
Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance.
Without requiring additional fine-tuning or datasets, we improve image generation performance on dense captions in terms of both automatic and human evaluation scores.
In addition, we achieve visual results of similar quality to models trained specifically with layout conditions.

> #### Method






Our goal is to improve the text-to-image model's ability to reflect textual and spatial conditions without fine-tuning.
We formally define our condition as a set of $N$ segments ${\lbrace(c_{n},m_{n})\rbrace}^{N}_{n=1}$, where each segment $(c_n,m_n)$ describes a single region.
Here $c_n$ is a non-overlapping part of the full-text caption $c$, and $m_n$ is a binary mask of the corresponding region. Given the input conditions, we modulate the attention maps of all attention layers on the fly so that the object described by $c_n$ is generated in its region $m_n$.
To preserve the pre-trained model's generation capacity, we design the modulation to respect the original range of the attention scores and each segment's area.
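
For intuition, here is a minimal PyTorch sketch of score modulation in this spirit, shown for a cross-attention layer. It is an illustration only; the tensor layout, the area-based scaling, and the weight `w` are assumptions, not the official formula.

```python
import torch

def modulate_scores(scores, region_masks, token_masks, w=1.0):
    # scores:       [batch, num_pixels, num_tokens] pre-softmax cross-attention scores
    # region_masks: [num_segments, num_pixels]  flattened binary maps m_n
    # token_masks:  [num_segments, num_tokens]  1 where a token belongs to segment caption c_n
    # w:            modulation strength (wc for cross-attention, ws for self-attention)
    value_range = scores.max() - scores.min()  # keep edits within the original score range
    out = scores.clone()
    for region, tokens in zip(region_masks, token_masks):
        area = region.float().mean()                          # fraction of pixels in this segment
        inside = region[:, None] * tokens[None, :]            # pixel-token pairs inside m_n
        outside = (1.0 - region)[:, None] * tokens[None, :]   # the same tokens outside m_n
        out = out + w * value_range * (1.0 - area) * inside   # boost, stronger for small segments
        out = out - w * value_range * area * outside          # suppress the segment's tokens elsewhere
    return out
```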

> #### Examples







----

### How to launch a web interface

- Add your Hugging Face Hub access token [here](./gradio_app.py#L77); a hypothetical example of that token line is sketched after the run command below.

- Run the Gradio app.
```
python gradio_app.py
```
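
The token line referenced above will look roughly like the following. The variable name and placeholder are hypothetical; check `gradio_app.py#L77` for the real one.

```python
# Hypothetical illustration only -- the actual variable name lives in gradio_app.py#L77.
# Create a token at https://huggingface.co/settings/tokens and paste it in.
HF_ACCESS_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxx"
```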

----

### Getting Started

- Create the image layout.



- Label each segment with a text prompt.



- Adjust the full text. The default full text is concatenated automatically from each segment's text and usually works well, but refining it can further improve the result.



- Check the generated images, and tune the hyperparameters below if needed (a minimal sketch of the assembled conditions follows this list).

  - `wc` : the degree of attention modulation at cross-attention layers.
  - `ws` : the degree of attention modulation at self-attention layers.
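
The steps above amount to building the segment set $\lbrace(c_{n},m_{n})\rbrace$ from the Method section by hand. Below is a minimal sketch of what those conditions look like in code; the prompts, resolution, masks, and starting values for `wc`/`ws` are all hypothetical, and in practice the Gradio app assembles them from the sketched layout.

```python
import torch

H = W = 64  # working resolution for the masks; an assumption, not a repo constant

segments = [
    ("a snowy mountain in the background", torch.zeros(H, W)),
    ("a red brick house", torch.zeros(H, W)),
]
segments[0][1][: H // 2, :] = 1   # top half of the canvas -> mountain
segments[1][1][H // 2 :, :] = 1   # bottom half -> house

full_text = ", ".join(caption for caption, _ in segments)  # default full text; refine by hand if needed

wc, ws = 1.0, 0.3  # hypothetical starting values for the cross-/self-attention modulation strength
```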



----

### Benchmark

We share the benchmark used in our model development and evaluation [here](./dataset).
The code for preprocessing the segment conditions is in [inference.ipynb](./inference.ipynb).
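
For intuition about what that preprocessing involves, an integer segmentation map can be split into one binary mask per segment roughly as below. This is a simplified sketch; the notebook above is the authoritative version.

```python
import torch

def split_segments(seg_map: torch.Tensor) -> dict[int, torch.Tensor]:
    # seg_map: [H, W] integer map where each value is a segment id.
    # Returns one binary mask m_n per segment id.
    return {int(i): (seg_map == i).float() for i in torch.unique(seg_map)}
```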

---

#### BibTeX
```
@inproceedings{densediffusion,
  title     = {Dense Text-to-Image Generation with Attention Modulation},
  author    = {Kim, Yunji and Lee, Jiyoung and Kim, Jin-Hwa and Ha, Jung-Woo and Zhu, Jun-Yan},
  booktitle = {ICCV},
  year      = {2023}
}
```

---

#### Acknowledgment
The demo was developed with reference to this [source code](https://huggingface.co/spaces/weizmannscience/multidiffusion-region-based). Thanks for the inspiring work! 🙏