(CVPR 2024) 🧩 TokenCompose: Text-to-Image Diffusion with Token-level Supervision
- Host: GitHub
- URL: https://mlpc-ucsd.github.io/TokenCompose/
- Owner: mlpc-ucsd
- License: apache-2.0
- Created: 2023-12-03T21:43:01.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-06-25T15:26:03.000Z (4 months ago)
- Last Synced: 2024-08-01T18:35:27.088Z (3 months ago)
- Topics: artificial-intelligence, computer-vision, diffusion-models, generative-ai, image-generation, latent-diffusion, machine-learning, multimodal, stable-diffusion, text-to-image
- Language: Jupyter Notebook
- Homepage: https://mlpc-ucsd.github.io/TokenCompose/
- Size: 134 MB
- Stars: 100
- Watchers: 3
- Forks: 2
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project:
- awesome-diffusion-categorized
README
# 🧩 TokenCompose: Text-to-Image Diffusion with Token-level Supervision
Zirui Wang¹,³ · Zhizhou Sha²,³ · Zheng Ding³ · Yilin Wang²,³ · Zhuowen Tu³

¹Princeton University · ²Tsinghua University · ³University of California, San Diego
CVPR 2024
Project done while Zirui Wang, Zhizhou Sha and Yilin Wang interned at UC San Diego.
Project Page | arXiv | X (Twitter)

### Updates
*If you use our method and/or model for your research project, we are happy to provide a cross-reference here in the updates.* :)

[04/04/2024] 🔥 Our training methodology is incorporated into [CoMat](https://arxiv.org/abs/2404.03653), which shows enhanced text-to-image attribute assignments.
[02/26/2024] 🔥 TokenCompose is accepted to CVPR 2024!
[02/20/2024] 🔥 TokenCompose is used as a base model in the [RealCompo](https://arxiv.org/abs/2402.12908) paper for enhanced compositionality.

https://github.com/mlpc-ucsd/TokenCompose/assets/59942464/93feea16-4eac-49c3-b286-ee390a325b17
A Stable Diffusion model finetuned with token-level consistency terms for enhanced multi-category instance composition and photorealism.
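At a high level, the token-level supervision ties each grounded noun token to its segmentation region during finetuning; the results table below compares the finetuned model with other methods. The following snippet is only a conceptual sketch of such a grounding term (the function name, loss form, and tensor shapes are our own illustration, not the paper's exact objectives):

```python
import torch

def token_grounding_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Conceptual sketch only: encourage a grounded token's cross-attention
    mass to fall inside that token's segmentation mask.
    attn_map: (H, W) non-negative attention for one token; mask: (H, W) binary {0, 1}.
    This illustrates the general idea, not the paper's exact objectives."""
    attn = attn_map / (attn_map.sum() + 1e-8)   # normalize to a spatial distribution
    inside = (attn * mask).sum()                # attention mass inside the mask
    return 1.0 - inside                         # penalize mass that falls outside

# Toy example with random tensors in place of real attention maps and masks.
attn = torch.rand(64, 64)
mask = (torch.rand(64, 64) > 0.7).float()
print(token_grounding_loss(attn, mask).item())
```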
Results on multi-category instance composition (Object Accuracy and MultiGen MG2–MG5 on COCO and ADE20K), photorealism (FID on COCO and Flickr30K), and efficiency (latency); values are mean±std:

| Method | Object Accuracy | COCO MG2 | COCO MG3 | COCO MG4 | COCO MG5 | ADE20K MG2 | ADE20K MG3 | ADE20K MG4 | ADE20K MG5 | FID (COCO) | FID (Flickr30K) | Latency |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SD 1.4 | 29.86 | 90.72±1.33 | 50.74±0.89 | 11.68±0.45 | 0.88±0.21 | 89.81±0.40 | 53.96±1.14 | 16.52±1.13 | 1.89±0.34 | 20.88 | 71.46 | 7.54±0.17 |
| Composable | 27.83 | 63.33±0.59 | 21.87±1.01 | 3.25±0.45 | 0.23±0.18 | 69.61±0.99 | 29.96±0.84 | 6.89±0.38 | 0.73±0.22 | - | 75.57 | 13.81±0.15 |
| Layout | 43.59 | 93.22±0.69 | 60.15±1.58 | 19.49±0.88 | 2.27±0.44 | 96.05±0.34 | 67.83±0.90 | 21.93±1.34 | 2.35±0.41 | - | 74.00 | 18.89±0.20 |
| Structured | 29.64 | 90.40±1.06 | 48.64±1.32 | 10.71±0.92 | 0.68±0.25 | 89.25±0.72 | 53.05±1.20 | 15.76±0.86 | 1.74±0.49 | 21.13 | 71.68 | 7.74±0.17 |
| Attn-Exct | 45.13 | 93.64±0.76 | 65.10±1.24 | 28.01±0.90 | 6.01±0.61 | 91.74±0.49 | 62.51±0.94 | 26.12±0.78 | 5.89±0.40 | - | 71.68 | 25.43±4.89 |
| TokenCompose (Ours) | 52.15 | 98.08±0.40 | 76.16±1.04 | 28.81±0.95 | 3.28±0.48 | 97.75±0.34 | 76.93±1.09 | 33.92±1.47 | 6.21±0.62 | 20.19 | 71.13 | 7.56±0.14 |
## 🆕 Models
| Stable Diffusion Version | Checkpoint 1 | Checkpoint 2 |
|:------------------------:|:------------:|:------------:|
| v1.4 | [TokenCompose_SD14_A](https://huggingface.co/mlpc-lab/TokenCompose_SD14_A) | [TokenCompose_SD14_B](https://huggingface.co/mlpc-lab/TokenCompose_SD14_B) |
| v2.1 | [TokenCompose_SD21_A](https://huggingface.co/mlpc-lab/TokenCompose_SD21_A) | [TokenCompose_SD21_B](https://huggingface.co/mlpc-lab/TokenCompose_SD21_B) |

Our finetuned models do not contain any extra modules and can be used directly in a standard diffusion model library (e.g., Hugging Face Diffusers) by replacing the pretrained U-Net with our finetuned U-Net in a plug-and-play manner. We provide a [demo Jupyter notebook](notebooks/example_usage.ipynb) that uses our model checkpoint to generate images.
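For example, here is a minimal sketch of the plug-and-play U-Net swap, assuming the checkpoint repo follows the standard Diffusers layout with a `unet/` subfolder (adjust the dtype and device to your setup):

```python
import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# Assumption: only the finetuned U-Net is loaded from the TokenCompose checkpoint
# and plugged into a stock SD 1.4 pipeline; VAE and text encoder stay pretrained.
unet = UNet2DConditionModel.from_pretrained(
    "mlpc-lab/TokenCompose_SD14_A", subfolder="unet", torch_dtype=torch.float32
)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float32
).to("cuda")

image = pipe("A cat and a wine glass").images[0]
image.save("cat_and_wine_glass_unet_swap.png")
```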
You can also use the following code to download our checkpoints and generate images:
```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "mlpc-lab/TokenCompose_SD14_A"
device = "cuda"

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)
pipe = pipe.to(device)

prompt = "A cat and a wine glass"
image = pipe(prompt).images[0]
image.save("cat_and_wine_glass.png")
```

## 📊 MultiGen
See [MultiGen](multigen/readme.md) for details.
MultiGen results (MG2–MG5) on COCO and ADE20K, mean±std:

| Method | COCO MG2 | COCO MG3 | COCO MG4 | COCO MG5 | ADE20K MG2 | ADE20K MG3 | ADE20K MG4 | ADE20K MG5 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SD 1.4 | 90.72±1.33 | 50.74±0.89 | 11.68±0.45 | 0.88±0.21 | 89.81±0.40 | 53.96±1.14 | 16.52±1.13 | 1.89±0.34 |
| Composable | 63.33±0.59 | 21.87±1.01 | 3.25±0.45 | 0.23±0.18 | 69.61±0.99 | 29.96±0.84 | 6.89±0.38 | 0.73±0.22 |
| Layout | 93.22±0.69 | 60.15±1.58 | 19.49±0.88 | 2.27±0.44 | 96.05±0.34 | 67.83±0.90 | 21.93±1.34 | 2.35±0.41 |
| Structured | 90.40±1.06 | 48.64±1.32 | 10.71±0.92 | 0.68±0.25 | 89.25±0.72 | 53.05±1.20 | 15.76±0.86 | 1.74±0.49 |
| Attn-Exct | 93.64±0.76 | 65.10±1.24 | 28.01±0.90 | 6.01±0.61 | 91.74±0.49 | 62.51±0.94 | 26.12±0.78 | 5.89±0.40 |
| Ours | 98.08±0.40 | 76.16±1.04 | 28.81±0.95 | 3.28±0.48 | 97.75±0.34 | 76.93±1.09 | 33.92±1.47 | 6.21±0.62 |
## 💻 Environment Setup
For those who want to use our codebase to **train your own diffusion models with token-level objectives**, follow the instructions below:
```bash
conda create -n TokenCompose python=3.8.5
conda activate TokenCompose
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
```

We have verified the environment setup with these specific package versions, but we expect it to also work with newer versions.
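A quick sanity check (not part of the original instructions) to confirm the pinned versions resolved and CUDA is visible:

```python
import torch
import torchvision

# Expect 1.13.1 / 0.14.1 with the pinned environment; True if the CUDA 11.7 build sees a GPU.
print(torch.__version__, torchvision.__version__, torch.cuda.is_available())
```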
## 🛠️ Dataset Setup
If you want to use your own data, please refer to [preprocess_data](preprocess_data/readme.md) for details.
If you want to use our training data as examples or for research purposes, please follow the instructions below:
### 1. Setup the COCO Image Data
```bash
cd train/data
# download COCO train2017
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
rm train2017.zip
bash coco_data_setup.sh
```

After this step, you should have the following structure under the `train/data` directory:
```
train/data/
coco_gsam_img/
train/
000000000142.jpg
000000000370.jpg
...
```

### 2. Setup Token-wise Grounded Segmentation Maps
Download COCO segmentation data from [Google Drive](https://drive.google.com/file/d/16uoQpfZ0O-NW92HuaCaFU8K4cGHHbv4R/view?usp=drive_link) and put it under `train/data` directory.
After this step, you should have the following structure under the `train/data` directory:
```
train/data/
coco_gsam_img/
train/
000000000142.jpg
000000000370.jpg
...
coco_gsam_seg.tar
```

Then, run the following command to unzip the segmentation data:
```bash
cd train/data
tar -xvf coco_gsam_seg.tar
rm coco_gsam_seg.tar
```

After the setup, you should have the following structure under the `train/data` directory:
```
train/data/
coco_gsam_img/
train/
000000000142.jpg
000000000370.jpg
...
coco_gsam_seg/
000000000142/
mask_000000000142_bananas.png
mask_000000000142_bread.png
...
000000000370/
mask_000000000370_bananas.png
mask_000000000370_bread.png
...
...
```
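As an illustration of how the token-wise grounding is encoded in this layout, the following sketch (our own helper, not part of the released codebase) lists each image's masks and recovers the grounded token from the `mask_<image_id>_<token>.png` filenames:

```python
from pathlib import Path
from typing import Dict

# Hypothetical helper: pair a COCO image id with its token-wise segmentation masks,
# assuming the directory layout shown above.
DATA_ROOT = Path("train/data")

def token_masks(image_id: str) -> Dict[str, Path]:
    seg_dir = DATA_ROOT / "coco_gsam_seg" / image_id
    masks = {}
    for mask_path in sorted(seg_dir.glob(f"mask_{image_id}_*.png")):
        token = mask_path.stem.split("_", 2)[-1]  # e.g. "bananas"
        masks[token] = mask_path
    return masks

print(token_masks("000000000142"))  # e.g. {'bananas': PosixPath(...), 'bread': PosixPath(...)}
```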
## 📈 Training

We use wandb to log training curves and visualizations. Log in to wandb before running the scripts:
```bash
wandb login
```
Then, to train TokenCompose, use the following command:

```bash
cd train
bash scripts/train.sh
```

The results will be saved under the `train/results` directory.
## 🏷️ License
This repository is released under the [Apache 2.0](LICENSE) license.
## 🙏 Acknowledgement
Our code is built upon [diffusers](https://github.com/huggingface/diffusers), [prompt-to-prompt](https://github.com/google/prompt-to-prompt), [VISOR](https://github.com/microsoft/VISOR), [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything), and [CLIP](https://github.com/openai/CLIP). We thank all of these authors for open-sourcing their code and for their great contributions to the community.
## 📝 Citation
If you find our work useful, please consider citing:
```bibtex
@InProceedings{Wang2024TokenCompose,
author = {Wang, Zirui and Sha, Zhizhou and Ding, Zheng and Wang, Yilin and Tu, Zhuowen},
title = {TokenCompose: Text-to-Image Diffusion with Token-level Supervision},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {8553-8564}
}
```