https://github.com/endo-yuki-t/MAG
PyTorch implementation of "Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation" [The Visual Computer]
- Host: GitHub
- URL: https://github.com/endo-yuki-t/MAG
- Owner: endo-yuki-t
- License: MIT
- Created: 2023-10-27T06:56:37.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-10-30T03:28:33.000Z (over 1 year ago)
- Last Synced: 2024-08-01T18:37:48.478Z (11 months ago)
- Language: Python
- Size: 787 KB
- Stars: 16
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-diffusion-categorized
README
# Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation
This repository contains our implementation of the following paper:
Yuki Endo: "Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation," accepted to The Visual Computer. [[Project](https://www.cgg.cs.tsukuba.ac.jp/~endo/projects/MAG/)] [[PDF (preprint)](https://arxiv.org/abs/2308.06027)]
## Prerequisites
1. Python 3
2. PyTorch
3. Others (see env.yml)

## Preparation
Download the Stable Diffusion model weight (512-base-ema.ckpt) from [https://huggingface.co/stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) and put it in the checkpoint directory.
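If you prefer to script the download, the same file can be fetched with the huggingface_hub Python client. This is a minimal sketch, not part of this repository; it assumes huggingface_hub is installed (`pip install huggingface_hub`):

```
# Sketch (not part of this repo): download 512-base-ema.ckpt into
# ./checkpoint using the huggingface_hub client.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/stable-diffusion-2-base",
    filename="512-base-ema.ckpt",
    local_dir="./checkpoint",  # matches the --ckpt path used below
)
```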
## Inference
You can generate an image from an input mask and prompt by running the following command:
```
python scripts/txt2img_mag.py --ckpt ./checkpoint/512-base-ema.ckpt --prompt "A furry bear riding on a bike in the city" --mask ./inputs/mask1.png --word_ids_for_mask "[[1,2,3,4],[6,7]]" --outdir ./outputs
```
Here, ***--word_ids_for_mask*** specifies the word indices corresponding to each region in the mask image. For example, if you set it to "[[1,2,3,4],[6,7]]", the first region corresponds to "A" (1), "furry" (2), "bear" (3), and "riding" (4), and the second region corresponds to "a" (6) and "bike" (7). (Index 0 is reserved for the beginning-of-sentence token.) Regions are ordered in reverse order of their BGR color values in the mask image.

You can also specify two additional parameters, ***--alpha*** and ***--lmda***, which control the masked-attention guidance scale and the loss-balancing weight, respectively.
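To double-check which indices to pass, you can enumerate the prompt's words yourself. This is a hypothetical helper, not part of the repository; it assumes the indices simply follow whitespace-split word positions, with index 0 reserved for the beginning-of-sentence token as described above:

```
# Hypothetical helper (not part of this repo): list the 1-based word
# indices that --word_ids_for_mask refers to, assuming whitespace-split
# word positions with index 0 reserved for the beginning-of-sentence token.
prompt = "A furry bear riding on a bike in the city"
for idx, word in enumerate(prompt.split(), start=1):
    print(idx, word)
# -> 1 A, 2 furry, 3 bear, 4 riding, 5 on, 6 a, 7 bike, 8 in, 9 the, 10 city

# The bear region covers words 1-4 and the bike region words 6-7:
word_ids_for_mask = "[[1,2,3,4],[6,7]]"
```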
## Citation
Please cite our paper if you find the code useful:
```
@article{endoTVC2023,
  title   = {Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation},
  author  = {Yuki Endo},
  journal = {The Visual Computer},
  volume  = {40},
  pages   = {6033--6045},
  doi     = {10.1007/s00371-023-03151-y},
  year    = {2023}
}
```

## Acknowledgements
This code heavily borrows from the [Stable Diffusion](https://github.com/Stability-AI/stablediffusion) repository.