https://github.com/endo-yuki-t/MAG
PyTorch implementation of "Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation" [The Visual Computer]
- Host: GitHub
- URL: https://github.com/endo-yuki-t/MAG
- Owner: endo-yuki-t
- License: MIT
- Created: 2023-10-27T06:56:37.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-10-30T03:28:33.000Z (over 1 year ago)
- Last Synced: 2024-08-01T18:37:48.478Z (11 months ago)
- Language: Python
- Size: 787 KB
- Stars: 16
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-diffusion-categorized
README
# Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation
This repository contains our implementation of the following paper:
Yuki Endo: "Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation," accepted to The Visual Computer. [[Project](https://www.cgg.cs.tsukuba.ac.jp/~endo/projects/MAG/)] [[PDF (preprint)](https://arxiv.org/abs/2308.06027)]
## Prerequisites
1. Python 3
2. PyTorch
3. Others (see env.yml)

## Preparation
Download the Stable Diffusion model weight (512-base-ema.ckpt) from [https://huggingface.co/stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) and put it in the checkpoint directory.
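If you prefer to script the download, the same file can be fetched with the huggingface_hub Python client. This is a minimal sketch, not part of this repository; it assumes huggingface_hub is installed (`pip install huggingface_hub`):

```
# Sketch (not part of this repo): download 512-base-ema.ckpt into
# ./checkpoint using the huggingface_hub client.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/stable-diffusion-2-base",
    filename="512-base-ema.ckpt",
    local_dir="./checkpoint",  # matches the --ckpt path used below
)
```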
## Inference
You can generate an image from an input mask and prompt by running the following command:
```
python scripts/txt2img_mag.py --ckpt ./checkpoint/512-base-ema.ckpt --prompt "A furry bear riding on a bike in the city" --mask ./inputs/mask1.png --word_ids_for_mask "[[1,2,3,4],[6,7]]" --outdir ./outputs
```
Here, ***--word_ids_for_mask*** specifies the word indices corresponding to each region in the mask image. For example, if you set it to "[[1,2,3,4],[6,7]]", the first region corresponds to "A" (1), "furry" (2), "bear" (3), and "riding" (4), and the second region corresponds to "a" (6) and "bike" (7). (Index 0 is reserved for the beginning-of-sentence token.) Regions are ordered in reverse order of their BGR color values in the mask image.

You can also specify two additional parameters, ***--alpha*** and ***--lmda***, which control the masked-attention guidance scale and the loss-balancing weight, respectively.
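To double-check which indices to pass, you can enumerate the prompt's words yourself. This is a hypothetical helper, not part of the repository; it assumes the indices simply follow whitespace-split word positions, with index 0 reserved for the beginning-of-sentence token as described above:

```
# Hypothetical helper (not part of this repo): list the 1-based word
# indices that --word_ids_for_mask refers to, assuming whitespace-split
# word positions with index 0 reserved for the beginning-of-sentence token.
prompt = "A furry bear riding on a bike in the city"
for idx, word in enumerate(prompt.split(), start=1):
    print(idx, word)
# -> 1 A, 2 furry, 3 bear, 4 riding, 5 on, 6 a, 7 bike, 8 in, 9 the, 10 city

# The bear region covers words 1-4 and the bike region words 6-7:
word_ids_for_mask = "[[1,2,3,4],[6,7]]"
```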
## Citation
Please cite our paper if you find the code useful:
```
@article{endoTVC2023,
  title   = {Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation},
  author  = {Yuki Endo},
  journal = {The Visual Computer},
  volume  = {40},
  pages   = {6033--6045},
  doi     = {10.1007/s00371-023-03151-y},
  year    = {2023}
}
```

## Acknowledgements
This code heavily borrows from the [Stable Diffusion](https://github.com/Stability-AI/stablediffusion) repository.