https://github.com/CaraJ7/CoMat
Official code for 💫CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
- Host: GitHub
- URL: https://github.com/CaraJ7/CoMat
- Owner: CaraJ7
- Created: 2024-04-04T01:49:00.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-30T17:00:36.000Z (9 months ago)
- Last Synced: 2024-08-01T18:35:15.393Z (5 months ago)
- Language: Python
- Homepage: https://caraj7.github.io/comat/
- Size: 7.67 MB
- Stars: 115
- Watchers: 17
- Forks: 3
- Open Issues: 5
Metadata Files:
- Readme: README.md
# 💫CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Official repository for the paper "[CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching](https://arxiv.org/pdf/2404.03653.pdf)".
For more details, please refer to the project page: [https://caraj7.github.io/comat/](https://caraj7.github.io/comat/).
[[Webpage](https://caraj7.github.io/comat/)] [[Paper](https://arxiv.org/pdf/2404.03653.pdf)]
## 🔥 News
- **[2024.09.26]** CoMat has been accepted to NeurIPS 2024!
- **[2024.04.30]** 🔥 We released the training code of CoMat.
- **[2024.04.05]** We released our paper on [arXiv](https://arxiv.org/pdf/2404.03653.pdf).
## About CoMat
We propose 💫CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We use an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit the tokens it ignored.
![demo](fig/demo.png)
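
The concept matching idea can be illustrated with a small sketch. It assumes that a captioning model has already scored each prompt token given the generated image, and it takes the negative log-likelihood of the prompt as one plausible alignment measure. The function names and the example probabilities are hypothetical and are not taken from the CoMat code base.

```python
import math
from typing import List, Sequence, Tuple


def concept_matching_loss(token_logprobs: Sequence[float]) -> float:
    """Negative log-likelihood of the prompt under a captioning model.

    token_logprobs[i] is assumed to be log p(token_i | image, earlier tokens),
    as produced by a (hypothetical) image captioner.
    """
    return -sum(token_logprobs)


def least_matched_tokens(
    tokens: Sequence[str], token_logprobs: Sequence[float], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k tokens the captioner finds least likely given the image."""
    if len(tokens) != len(token_logprobs):
        raise ValueError("tokens and token_logprobs must have the same length")
    ranked = sorted(zip(tokens, token_logprobs), key=lambda pair: pair[1])
    return ranked[:k]


# Made-up example: the captioner barely "sees" the word "blue" in the image.
tokens = ["a", "red", "cube", "and", "a", "blue", "sphere"]
logprobs = [math.log(p) for p in [0.9, 0.8, 0.9, 0.9, 0.9, 0.1, 0.7]]

loss = concept_matching_loss(logprobs)
weakest = least_matched_tokens(tokens, logprobs, k=1)
print(f"loss={loss:.3f}, least matched token={weakest[0][0]}")
```

A low-probability token such as "blue" in this example is a candidate for a concept that the generated image ignored. A loss of this kind grows as such tokens become less likely.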
## 🔨Usage
### Install
Install the requirements first. We verified the environment with the versions listed in `requirements.txt`, but newer versions should also work.
```bash
pip install -r requirements.txt
```

The Attribute Concentration module requires [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) to find the masks of the entities. Run the following commands to install Grounded-SAM.
```bash
mkdir seg_model
cd seg_model
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
mv Grounded-Segment-Anything gsam
cd gsam/GroundingDINO
pip install -e .
```

### Training
We currently support [SD1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) and [SDXL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). Other Stable Diffusion v1 models, e.g., [SD1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4), should also work, since they share the same architecture as SD1.5.
#### SD1.5
SD1.5 can be used for training directly.
First, generate the latents used by the Fidelity Preservation module.
```bash
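# Note: $(dirname $0) resolves relative to the calling script. If you paste this
# into an interactive shell, point PYTHONPATH at the repository root instead.
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \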
python tools/gan_gt_generate.py \
--prompt-path merged_data/abc5k_hrs10k_t2icompall_20k.txt \
--save-prompt-path train_data_sd15/gan_gt_data.jsonl \
--model-path runwayml/stable-diffusion-v1-5 \
--model-type sd_1_5
```

Then start training:
```bash
bash scripts/sd15.sh
```

#### SDXL
We recommend first fine-tuning the UNet of SDXL at a resolution of 512\*512 to speed up convergence, since the original SDXL generates poor-quality images at 512\*512.
For the fine-tuning data, we use SDXL itself to generate 1024\*1024 images from the training prompts. We then resize the generated images to 512\*512 and fine-tune SDXL on them for only 100 steps, using the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py) from diffusers. We will release the fine-tuned UNet later.
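
The resizing step could be sketched as follows. This is only an illustration under stated assumptions: the README does not say how the authors stored or named the generated images. The directory names, the PNG format, and both helper functions are hypothetical. The `metadata.jsonl` file with `file_name` and `text` fields follows the Hugging Face `imagefolder` dataset convention, which is assumed to be how the diffusers script reads a local image directory. Resizing requires Pillow.

```python
import json
from pathlib import Path
from typing import Iterable, List


def resize_images(src_dir: Path, dst_dir: Path, size: int = 512) -> List[str]:
    """Resize every PNG in src_dir to size x size and save it under dst_dir."""
    from PIL import Image  # Pillow; imported lazily so write_metadata works without it

    dst_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for path in sorted(src_dir.glob("*.png")):
        with Image.open(path) as img:
            img.resize((size, size), Image.LANCZOS).save(dst_dir / path.name)
        names.append(path.name)
    return names


def write_metadata(dst_dir: Path, file_names: Iterable[str], prompts: Iterable[str]) -> Path:
    """Write metadata.jsonl pairing each image file with its training prompt."""
    file_names, prompts = list(file_names), list(prompts)
    if len(file_names) != len(prompts):
        raise ValueError("need exactly one prompt per image")
    dst_dir.mkdir(parents=True, exist_ok=True)
    out = dst_dir / "metadata.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for name, prompt in zip(file_names, prompts):
            f.write(json.dumps({"file_name": name, "text": prompt}) + "\n")
    return out


# Hypothetical usage, assuming sdxl_1024/ holds one PNG per prompt in sorted order:
# names = resize_images(Path("sdxl_1024"), Path("sdxl_512"))
# write_metadata(Path("sdxl_512"), names, prompts)
```

The resulting directory could then be passed to the diffusers script with `--train_data_dir`, together with `--resolution 512` and `--max_train_steps 100` to match the 100-step schedule described above.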
Next, generate the latents with the fine-tuned UNet.
```bash
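# Note: $(dirname $0) resolves relative to the calling script. If you paste this
# into an interactive shell, point PYTHONPATH at the repository root instead.
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \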
python tools/gan_gt_generate.py \
--prompt-path merged_data/abc5k_hrs10k_t2icompall_20k.txt \
--save-prompt-path train_data_sdxl/gan_gt_data.jsonl \
--unet-path FINETUNED_UNET_PATH \
--model-path stabilityai/stable-diffusion-xl-base-1.0 \
--model-type sdxl_unet
```

Finally, start training:
```bash
bash scripts/sdxl.sh
```

## TODO
- [ ] Release the checkpoints.
- [x] Release training code in April.
## :white_check_mark: Citation
If you find **CoMat** useful for your research and applications, please kindly cite using this BibTeX:
```latex
@article{jiang2024comat,
title={CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching},
author={Jiang, Dongzhi and Song, Guanglu and Wu, Xiaoshi and Zhang, Renrui and Shen, Dazhong and Zong, Zhuofan and Liu, Yu and Li, Hongsheng},
journal={arXiv preprint arXiv:2404.03653},
year={2024}
}
```

## Thanks
We would like to thank [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything), [TokenCompose](https://github.com/mlpc-ucsd/TokenCompose), and [Diffusers](https://github.com/huggingface/diffusers).