{"id":13488636,"url":"https://github.com/ziqihuangg/Collaborative-Diffusion","last_synced_at":"2025-03-28T01:36:55.421Z","repository":{"id":149081738,"uuid":"617443880","full_name":"ziqihuangg/Collaborative-Diffusion","owner":"ziqihuangg","description":"Collaborative Diffusion (CVPR 2023)","archived":false,"fork":false,"pushed_at":"2023-11-28T19:47:58.000Z","size":4459,"stargazers_count":392,"open_issues_count":6,"forks_count":31,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-08-01T18:38:40.318Z","etag":null,"topics":["aigc","diffusion-models","face-editing","face-generation","gen-ai","image-editing","image-generation","latent-diffusion-models","multi-modality","stable-diffusion"],"latest_commit_sha":null,"homepage":"https://ziqihuangg.github.io/projects/collaborative-diffusion.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ziqihuangg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-22T12:07:45.000Z","updated_at":"2024-08-01T12:35:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"c386cba2-2f6d-4bfa-a8f6-99128495dff1","html_url":"https://github.com/ziqihuangg/Collaborative-Diffusion","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ziqihuangg%2FCollaborative-Diffusion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ziqihuangg%2FCollaborative-Diffusion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ziqihuangg%2FCollaborative-Dif
fusion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ziqihuangg%2FCollaborative-Diffusion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ziqihuangg","download_url":"https://codeload.github.com/ziqihuangg/Collaborative-Diffusion/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222333976,"owners_count":16968058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aigc","diffusion-models","face-editing","face-generation","gen-ai","image-editing","image-generation","latent-diffusion-models","multi-modality","stable-diffusion"],"created_at":"2024-07-31T18:01:19.326Z","updated_at":"2025-03-28T01:36:55.408Z","avatar_url":"https://github.com/ziqihuangg.png","language":"Python","funding_links":[],"categories":["Additional conditions"],"sub_categories":[],"readme":"# Collaborative Diffusion (CVPR 2023)\n\n\u003c!-- [![arXiv](https://img.shields.io/badge/arXiv-2311.99999-b31b1b.svg)](https://arxiv.org/abs/2311.99999) --\u003e\n[![Paper](https://img.shields.io/badge/cs.CV-Paper-b31b1b?logo=arxiv\u0026logoColor=red)](https://arxiv.org/abs/2304.10530)\n[![Project 
Page](https://img.shields.io/badge/Project-Website-green?logo=googlechrome\u0026logoColor=green)](https://ziqihuangg.github.io/projects/collaborative-diffusion.html)\n[![Video](https://img.shields.io/badge/YouTube-Video-c4302b?logo=youtube\u0026logoColor=red)](https://www.youtube.com/watch?v=inLK4c8sNhc)\n[![Visitor](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fziqihuangg%2FCollaborative-Diffusion\u0026count_bg=%23FFA500\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=visitors\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n\nThis repository contains the implementation of the following paper:\n\u003e **Collaborative Diffusion for Multi-Modal Face Generation and Editing**\u003cbr\u003e\n\u003e [Ziqi Huang](https://ziqihuangg.github.io/), [Kelvin C.K. Chan](https://ckkelvinchan.github.io/), [Yuming Jiang](https://yumingj.github.io/), [Ziwei Liu](https://liuziwei7.github.io/)\u003cbr\u003e\nIEEE/CVF Conference on Computer Vision and Pattern Recognition (**CVPR**), 2023\n\nFrom [MMLab@NTU](https://www.mmlab-ntu.com/) affiliated with S-Lab, Nanyang Technological University\n\n\n## :open_book: Overview\n\u003c!-- ![overall_structure](./assets/fig_teaser.jpg) --\u003e\n\u003cimg src=\"./assets/fig_teaser.jpg\" width=\"100%\"\u003e\n\nWe propose **Collaborative Diffusion**, where users can use multiple modalities to control face generation and editing.\n    *(a) Face Generation*. Given multi-modal controls, our framework synthesizes high-quality images consistent with the input conditions.\n    *(b) Face Editing*. Collaborative Diffusion also supports multi-modal editing of real images with promising identity preservation capability.\n\n\u003cbr\u003e\n\u003cimg src=\"./assets/fig_framework.jpg\" width=\"100%\"\u003e\n\nWe use pre-trained uni-modal diffusion models to perform multi-modal guided face generation and editing. 
At each step of the reverse process (i.e., from timestep t to t − 1), the **dynamic diffuser** predicts the spatially and temporally varying **influence function** to *selectively enhance or suppress the contributions of the given modality*.\n\n## :heavy_check_mark: Updates\n- [10/2023] Collaborative Diffusion now supports [FreeU](https://chenyangsi.top/FreeU/). See [here](https://github.com/ziqihuangg/Collaborative-Diffusion/tree/master/freeu) for how to run Collaborative Diffusion + FreeU.\n- [09/2023] We provide an inference script for face generation driven by a single modality, and the scripts and checkpoints for 256x256 resolution.\n- [09/2023] [Editing code](https://github.com/ziqihuangg/Collaborative-Diffusion#editing) is released.\n- [06/2023] We provide the preprocessed multi-modal annotations [here](https://drive.google.com/drive/folders/1rLcdN-VctJpW4k9AfSXWk0kqxh329xc4?usp=sharing).\n- [05/2023] [Training code](https://github.com/ziqihuangg/Collaborative-Diffusion#training) for Collaborative Diffusion (512x512) released.\n- [04/2023] [Project page](https://ziqihuangg.github.io/projects/collaborative-diffusion.html) and [video](https://www.youtube.com/watch?v=inLK4c8sNhc) available.\n- [04/2023] [arXiv paper](https://arxiv.org/abs/2304.10530) available.\n- [04/2023] Checkpoints for multi-modal face generation (512x512) released.\n- [04/2023] Inference code for multi-modal face generation (512x512) released.\n\n\n## :hammer: Installation\n\n1. Clone repo\n\n   ```bash\n   git clone https://github.com/ziqihuangg/Collaborative-Diffusion\n   cd Collaborative-Diffusion\n   ```\n\n2. Create conda environment.\u003cbr\u003e\nIf you already have an `ldm` environment installed according to [LDM](https://github.com/CompVis/latent-diffusion#requirements), you do not need to go through this step (i.e., step 2). You can simply `conda activate ldm` and jump to step 3.\n\n   ```bash\n    conda env create -f environment.yaml\n    conda activate codiff\n   ```\n\n3. 
Install dependencies\n\n   ```bash\n    pip install transformers==4.19.2 scann kornia==0.6.4 torchmetrics==0.6.0\n    conda install -c anaconda git\n    pip install git+https://github.com/arogozhnikov/einops.git\n   ```\n\n## :arrow_down: Download\n\n### Download Checkpoints\n\n1. Download the pre-trained models from [Google Drive](https://drive.google.com/drive/folders/13MdDea8eI8P4ygeIyfy8krlTb8Ty0mAP?usp=sharing) or [OneDrive](https://entuedu-my.sharepoint.com/:f:/g/personal/ziqi002_e_ntu_edu_sg/ErjBxdNGbyhJtnPLFWxLJkABb1dScdz9T0kCjzYC65y17g?e=cn5F9h).\n\n2. Put the models under `pretrained` as follows:\n    ```\n    Collaborative-Diffusion\n    └── pretrained\n        ├── 256_codiff_mask_text.ckpt\n        ├── 256_mask.ckpt\n        ├── 256_text.ckpt\n        ├── 256_vae.ckpt\n        ├── 512_codiff_mask_text.ckpt\n        ├── 512_mask.ckpt\n        ├── 512_text.ckpt\n        └── 512_vae.ckpt\n    ```\n### Download Datasets\nWe provide the preprocessed data used in this project (see Acknowledgement for the data sources). You need to download them only if you want to reproduce the training of Collaborative Diffusion. You can skip this step if you simply want to use our pre-trained models for inference.\n\n1. Download the preprocessed training data from [here](https://drive.google.com/drive/folders/1rLcdN-VctJpW4k9AfSXWk0kqxh329xc4?usp=sharing).\n\n2. 
Put the datasets under `dataset` as follows:\n    ```\n    Collaborative-Diffusion\n    └── dataset\n        ├── image\n        |   └──image_512_downsampled_from_hq_1024\n        ├── text\n        |   └──captions_hq_beard_and_age_2022-08-19.json\n        ├── mask\n        |   └──CelebAMask-HQ-mask-color-palette_32_nearest_downsampled_from_hq_512_one_hot_2d_tensor\n        └── sketch\n            └──sketch_1x1024_tensor\n    ```\n\nFor more details about the annotations, please refer to [CelebA-Dialog](https://github.com/ziqihuangg/CelebA-Dialog).\n\n## :framed_picture: Generation\n\n### Multi-Modal-Driven Generation\n\nYou can control face generation using text and a segmentation mask.\n1. `mask_path` is the path to the segmentation mask, and `input_text` is the text condition.\n\n    ```bash\n    python generate.py \\\n    --mask_path test_data/512_masks/27007.png \\\n    --input_text \"This man has beard of medium length. He is in his thirties.\"\n    ```\n    ```bash\n    python generate.py \\\n    --mask_path test_data/512_masks/29980.png \\\n    --input_text \"This woman is in her forties.\"\n    ```\n\n2. You can view different types of intermediate outputs by setting the corresponding flags to `1`. For example, to view the *influence functions*, you can set `return_influence_function` to `1`.\n\n    ```bash\n    python generate.py \\\n    --mask_path test_data/512_masks/27007.png \\\n    --input_text \"This man has beard of medium length. He is in his thirties.\" \\\n    --ddim_steps 10 \\\n    --batch_size 1 \\\n    --save_z 1 \\\n    --return_influence_function 1 \\\n    --display_x_inter 1 \\\n    --save_mixed 1\n    ```\n    Note that producing intermediate results might consume a lot of GPU memory, so we suggest setting `batch_size` to `1`, and setting `ddim_steps` to a smaller value (e.g., `10`) to save memory and computation time.\n\n3. Our script synthesizes 512x512 resolution by default. 
You can generate 256x256 images by changing the config and ckpt:\n    ```bash\n    python generate.py \\\n    --mask_path test_data/256_masks/29980.png \\\n    --input_text \"This woman is in her forties.\" \\\n    --config_path \"configs/256_codiff_mask_text.yaml\" \\\n    --ckpt_path \"pretrained/256_codiff_mask_text.ckpt\" \\\n    --save_folder \"outputs/inference_256_codiff_mask_text\"\n    ```\n\n\n### Text-to-Face Generation\n1. Give the text prompt and generate the face image:\n\n    ```bash\n    python text2image.py \\\n    --input_text \"This man has beard of medium length. He is in his thirties.\"\n    ```\n    ```bash\n    python text2image.py \\\n    --input_text \"This woman is in her forties.\"\n    ```\n\n2. Our script synthesizes 512x512 resolution by default. You can generate 256x256 images by changing the config and ckpt:\n    ```bash\n    python text2image.py \\\n    --input_text \"This man has beard of medium length. He is in his thirties.\" \\\n    --config_path \"configs/256_text.yaml\" \\\n    --ckpt_path \"pretrained/256_text.ckpt\" \\\n    --save_folder \"outputs/256_text2image\"\n    ```\n\n\n### Mask-to-Face Generation\n1. Give the face segmentation mask and generate the face image:\n\n    ```bash\n    python mask2image.py \\\n    --mask_path \"test_data/512_masks/29980.png\"\n    ```\n    ```bash\n    python mask2image.py \\\n    --mask_path \"test_data/512_masks/27007.png\"\n    ```\n\n2. Our script synthesizes 512x512 resolution by default. You can generate 256x256 images by changing the config and ckpt:\n    ```bash\n    python mask2image.py \\\n    --mask_path \"test_data/256_masks/29980.png\" \\\n    --config_path \"configs/256_mask.yaml\" \\\n    --ckpt_path \"pretrained/256_mask.ckpt\" \\\n    --save_folder \"outputs/256_mask2image\"\n    ```\n\n\n\n## :art: Editing\nYou can edit a face image according to target mask and target text. We achieve this by collaborating multiple uni-modal edits. 
We use [Imagic](https://imagic-editing.github.io/) to perform the uni-modal edits.\n\n1. Perform text-based editing.\n    ```bash\n    python editing/imagic_edit_text.py\n    ```\n\n2. Perform mask-based editing. Note that we adapted Imagic (the text-based method) to mask-based editing.\n    ```bash\n    python editing/imagic_edit_mask.py\n    ```\n\n3. Collaborate the text-based edit and the mask-based edit using Collaborative Diffusion.\n\n    ```bash\n    python editing/collaborative_edit.py\n    ```\n\n\n\n\n\n## :runner: Training\n\n\nWe provide the entire training pipeline, including training the VAE, uni-modal diffusion models, and our proposed dynamic diffusers.\n\nIf you are only interested in training dynamic diffusers, you can use our provided checkpoints for VAE and uni-modal diffusion models. Simply skip steps 1 and 2 and go directly to step 3.\n\n1. **Train VAE.**\n\n    LDM compresses images to the VAE latents to save computational cost, and later trains UNet diffusion models on the VAE latents. This step is to reproduce the `pretrained/512_vae.ckpt`.\n\n    ```bash\n    python main.py \\\n    --logdir 'outputs/512_vae' \\\n    --base 'configs/512_vae.yaml' \\\n    -t  --gpus 0,1,2,3,\n    ```\n\n2. **Train the uni-modal diffusion models.**\n\n    (1) Train the text-to-image model. This step is to reproduce the `pretrained/512_text.ckpt`.\n    ```bash\n    python main.py \\\n    --logdir 'outputs/512_text' \\\n    --base 'configs/512_text.yaml' \\\n    -t  --gpus 0,1,2,3,\n    ```\n    (2) Train the mask-to-image model. This step is to reproduce the `pretrained/512_mask.ckpt`.\n    ```bash\n    python main.py \\\n    --logdir 'outputs/512_mask' \\\n    --base 'configs/512_mask.yaml' \\\n    -t  --gpus 0,1,2,3,\n    ```\n\n\n3. **Train the dynamic diffusers.**\n\n    The dynamic diffusers are the meta-networks that determine how the uni-modal diffusion models collaborate. 
This step is to reproduce the `pretrained/512_codiff_mask_text.ckpt`.\n    ```bash\n    python main.py \\\n    --logdir 'outputs/512_codiff_mask_text' \\\n    --base 'configs/512_codiff_mask_text.yaml' \\\n    -t  --gpus 0,1,2,3,\n    ```\n\n\n## :fountain_pen: Citation\n\n   If you find our repo useful for your research, please consider citing our paper:\n\n   ```bibtex\n    @InProceedings{huang2023collaborative,\n        author = {Huang, Ziqi and Chan, Kelvin C.K. and Jiang, Yuming and Liu, Ziwei},\n        title = {Collaborative Diffusion for Multi-Modal Face Generation and Editing},\n        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n        year = {2023},\n    }\n   ```\n\n\n## :purple_heart: Acknowledgement\n\nThe codebase is maintained by [Ziqi Huang](https://ziqihuangg.github.io/).\n\nThis project is built on top of [LDM](https://github.com/CompVis/latent-diffusion). We trained on data provided by [CelebA-HQ](https://github.com/tkarras/progressive_growing_of_gans), [CelebA-Dialog](https://github.com/ziqihuangg/CelebA-Dialog), [CelebAMask-HQ](https://mmlab.ie.cuhk.edu.hk/projects/CelebA/CelebAMask_HQ.html), and [MM-CelebA-HQ-Dataset](https://github.com/IIGROUP/MM-CelebA-HQ-Dataset). We also make use of the [Imagic implementation](https://github.com/justinpinkney/stable-diffusion/blob/main/notebooks/imagic.ipynb).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fziqihuangg%2FCollaborative-Diffusion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fziqihuangg%2FCollaborative-Diffusion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fziqihuangg%2FCollaborative-Diffusion/lists"}