https://github.com/pix2pixzero/pix2pix-zero

Zero-shot Image-to-Image Translation [SIGGRAPH 2023]

Host: GitHub
URL: https://github.com/pix2pixzero/pix2pix-zero
Owner: pix2pixzero
License: mit
Created: 2023-02-07T00:52:27.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-10-16T20:29:57.000Z (about 1 year ago)
Last Synced: 2025-04-13T19:41:12.407Z (7 months ago)
Language: Python
Homepage: https://pix2pixzero.github.io/
Size: 33.3 MB
Stars: 1,112
Watchers: 31
Forks: 80
Open Issues: 32
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-diffusion-categorized - [Code
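StarryDivineSky - pix2pixzero/pix2pix-zero - zero is a zero-shot image-to-image translation project based on the SIGGRAPH 2023 paper. It performs image style transfer without training; the core idea is to carry out the translation by finding the information shared between the source and target images. Its distinguishing feature is its zero-shot capability: it can be applied to new image pairs without any pre-training or fine-tuning. It learns intrinsic image representations in a self-supervised way and uses attention mechanisms to align the features of the source and target images. The project provides code and pre-trained models so that users can experiment with and apply it easily. pix2pix-zero is suited to a range of image editing tasks, such as style transfer, image inpainting, and image colorization. Its working principle involves key techniques such as feature alignment, attention mechanisms, and image reconstruction. The project aims to provide a more flexible and more general image translation method. (Image style / resource transfer and download)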

README

# pix2pix-zero

### [**paper**](https://arxiv.org/abs/2302.03027) | [**website**](https://pix2pixzero.github.io/) | [**demo**](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo) 

#### **Quick start:** [**Edit images**](#getting-started) | [**Gradio (locally hosted)**](#gradio-demo)

This is the authors' reimplementation of "Zero-shot Image-to-Image Translation" using the diffusers library.

The results in the paper are based on the [CompVis](https://github.com/CompVis/stable-diffusion) library; that version of the code will be released later.

**[New!]** Demo with ability to generate custom directions released on Hugging Face! 


**[New!]** Code for editing real and synthetic images released!

We propose pix2pix-zero, a diffusion-based image-to-image approach that allows users to specify the edit direction on the fly (e.g., cat to dog). Our method can directly use pre-trained [Stable Diffusion](https://github.com/CompVis/stable-diffusion) to edit real and synthetic images while preserving the input image's structure. Our method is training-free and prompt-free, as it requires neither manual text prompting for each input image nor costly fine-tuning for each task.

**TL;DR**: no finetuning required, no text input needed, input structure preserved.

--- 

### Corresponding Manuscript

[Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) 


[Gaurav Parmar](https://gauravparmar.com/),

[Krishna Kumar Singh](http://krsingh.cs.ucdavis.edu/),

[Richard Zhang](https://richzhang.github.io/),

[Yijun Li](https://yijunmaverick.github.io/),

[Jingwan Lu](https://research.adobe.com/person/jingwan-lu/),

[Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/)


CMU and Adobe


SIGGRAPH, 2023

---

## Results

All our results are based on the [stable-diffusion-v1-4](https://github.com/CompVis/stable-diffusion) model. Please see the website for more results.

The top row for each of the results below shows editing of real images, and the bottom row shows synthetic image editing.

## Real Image Editing

## Synthetic Image Editing

## Method Details

Given an input image, we first generate text captions using [BLIP](https://github.com/salesforce/LAVIS) and apply regularized DDIM inversion to obtain our inverted noise map. Then, we obtain reference cross-attention maps that correspond to the structure of the input image by denoising, guided with the CLIP embeddings of our generated text (c). Next, we denoise with edited text embeddings, while enforcing a loss to match the current cross-attention maps with the reference cross-attention maps.
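
The guided denoising step described above can be written compactly. The following is a sketch under assumed notation, not a statement of what the code computes exactly: $M_t^{\text{ref}}$ denotes the reference cross-attention maps at timestep $t$, $M_t^{\text{edit}}$ the maps obtained with the edited text embedding $c_{\text{edit}}$, $\epsilon_\theta$ the denoising network, and $\lambda_{xa}$ a guidance weight (presumably the quantity set by the `--xa_guidance` flag). The choice of norm is also an assumption.

```latex
\begin{aligned}
\mathcal{L}_{xa} &= \left\lVert M_t^{\text{edit}} - M_t^{\text{ref}} \right\rVert_2, \\
\Delta x_t &= \nabla_{x_t}\, \mathcal{L}_{xa}, \\
\hat{\epsilon} &= \epsilon_\theta\!\left(x_t - \lambda_{xa}\,\Delta x_t,\; t,\; c_{\text{edit}}\right).
\end{aligned}
```

Read this way, a larger $\lambda_{xa}$ pulls the edited attention maps closer to the reference ones, which is consistent with the advice to raise the guidance value when the input structure is not retained.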

## Getting Started

**Environment Setup**

- We provide a [conda env file](environment.yml) that contains all the required dependencies

  ```
  conda env create -f environment.yml
  ```

- Following this, you can activate the conda environment with the command below. 

  ```
  conda activate pix2pix-zero
  ```

**Real Image Translation**

- First, run the inversion command below to obtain the input noise that reconstructs the image. 

  The command below will save the inversion in the results folder as `output/test_cat/inversion/cat_1.pt` 

  and the BLIP-generated prompt as `output/test_cat/prompt/cat_1.txt`

    ```
    python src/inversion.py \
        --input_image "assets/test_images/cats/cat_1.png" \
        --results_folder "output/test_cat"
    ```

- Next, we can perform image editing with the editing direction as shown below.

  The command below will save the edited image as `output/test_cat/edit/cat_1.png`

    ```
    python src/edit_real.py \
        --inversion "output/test_cat/inversion/cat_1.pt" \
        --prompt "output/test_cat/prompt/cat_1.txt" \
        --task_name "cat2dog" \
        --results_folder "output/test_cat/"
    ```
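
The two steps above can be scripted when several images need the same edit. The sketch below only builds the two command lines and prints them. It assumes, by generalizing from the single example in this section, that the inversion and prompt are written to `<results_folder>/inversion/<name>.pt` and `<results_folder>/prompt/<name>.txt`. The helper name `real_edit_commands` is hypothetical and not part of the repository.

```python
import shlex
from pathlib import Path


def real_edit_commands(input_image, results_folder, task_name):
    """Build the inversion and editing commands for one real image.

    Assumption: output files follow the layout shown in this section,
    i.e. <results_folder>/inversion/<stem>.pt and <results_folder>/prompt/<stem>.txt.
    """
    stem = Path(input_image).stem  # "cat_1" for ".../cat_1.png"
    results = Path(results_folder)
    inversion_file = (results / "inversion" / f"{stem}.pt").as_posix()
    prompt_file = (results / "prompt" / f"{stem}.txt").as_posix()

    invert = [
        "python", "src/inversion.py",
        "--input_image", str(input_image),
        "--results_folder", results.as_posix(),
    ]
    edit = [
        "python", "src/edit_real.py",
        "--inversion", inversion_file,
        "--prompt", prompt_file,
        "--task_name", task_name,
        "--results_folder", results.as_posix(),
    ]
    return invert, edit


# Print the commands for the example image used in this section.
for command in real_edit_commands(
    "assets/test_images/cats/cat_1.png", "output/test_cat", "cat2dog"
):
    print(shlex.join(command))
```

To actually run the steps, each list can be passed to `subprocess.run(command, check=True)` from the repository root, so that the edit only starts if the inversion succeeded.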

**Editing Synthetic Images**

- Similarly, we can edit the synthetic images generated by Stable Diffusion with the following command.

    ```
    python src/edit_synthetic.py \
        --results_folder "output/synth_editing" \
        --prompt_str "a high resolution painting of a cat in the style of van gogh" \
        --task "cat2dog"
    ```

### **Gradio demo**

- We also provide a UI for testing our method that is built with gradio. This demo also supports generating new directions on the fly! Running the following command in a terminal will launch the demo: 

    ```
    python app_gradio.py
    ```

- This demo is also hosted on HuggingFace [here](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo).

### **Tips and Debugging**

  - **Controlling the Image Structure:**


    The `--xa_guidance` flag controls the amount of cross-attention guidance to be applied when performing the edit. If the output edited image does not retain the structure from the input, increasing the value will typically address the issue. We recommend changing the value in increments of 0.05. 

  - **Improving Image Quality:**


    If the output image quality is low or has some artifacts, using more steps for both the inversion and editing would be helpful. 

    This can be controlled with the `--num_ddim_steps` flag. 

  - **Reducing the VRAM Requirements:**


    VRAM requirements can be reduced by running in lower precision, which is enabled by setting the `--use_float_16` flag.
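
The tips above can be combined into a small sweep. The sketch below generates guidance values in increments of 0.05 and builds one `src/edit_real.py` command per value, each with its own results folder so outputs do not overwrite one another. The starting value of 0.10, the per-value folder naming, and the helper names are assumptions made for illustration; the script's actual default for `--xa_guidance` is not stated here.

```python
import shlex


def guidance_values(start, count, step=0.05):
    """Return `count` guidance strengths rising from `start` in fixed increments."""
    return [round(start + i * step, 2) for i in range(count)]


def edit_command(xa_guidance, num_ddim_steps=None, use_float_16=False):
    """Build one edit_real.py command using the flags described in the tips."""
    command = [
        "python", "src/edit_real.py",
        "--inversion", "output/test_cat/inversion/cat_1.pt",
        "--prompt", "output/test_cat/prompt/cat_1.txt",
        "--task_name", "cat2dog",
        # Hypothetical folder naming: one folder per guidance value.
        "--results_folder", f"output/test_cat_xa_{xa_guidance:.2f}",
        "--xa_guidance", f"{xa_guidance:.2f}",
    ]
    if num_ddim_steps is not None:
        command += ["--num_ddim_steps", str(num_ddim_steps)]
    if use_float_16:
        command.append("--use_float_16")
    return command


# Assumed starting point of 0.10; adjust to whatever value you are currently using.
for value in guidance_values(start=0.10, count=4):
    print(shlex.join(edit_command(value, use_float_16=True)))
```

If `--num_ddim_steps` is changed for editing, the inversion should be rerun with the same number of steps, since the tip recommends more steps for both stages.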




**Finding Custom Edit Directions**


 - We provide some pre-computed directions in the assets [folder](assets/embeddings_sd_1.4).

   To generate new edit directions, users can first generate two files containing a large number of sentences (~1000) and then run the command as shown below. 

    ```
    python src/make_edit_direction.py \
        --file_source_sentences sentences/apple.txt \
        --file_target_sentences sentences/orange.txt \
        --output_folder assets/embeddings_sd_1.4
    ```

- After running the above command, you can set the flag `--task apple2orange` for the new edit.
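
For intuition about what an edit direction is: the paper describes it as the difference between the mean text embedding of the target sentences and that of the source sentences. The toy sketch below illustrates only that arithmetic with plain Python lists. The short vectors stand in for CLIP text embeddings, and the function names are hypothetical; this is not the code in `src/make_edit_direction.py`.

```python
from statistics import fmean


def mean_embedding(embeddings):
    """Average a list of equal-length embedding vectors component-wise."""
    return [fmean(column) for column in zip(*embeddings)]


def edit_direction(source_embeddings, target_embeddings):
    """Mean target embedding minus mean source embedding."""
    source_mean = mean_embedding(source_embeddings)
    target_mean = mean_embedding(target_embeddings)
    return [t - s for s, t in zip(source_mean, target_mean)]


# Toy 2-D vectors standing in for sentence embeddings (illustrative values only).
apple_sentences = [[0.0, 0.0], [2.0, 2.0]]
orange_sentences = [[3.0, 1.0], [5.0, 3.0]]
print(edit_direction(apple_sentences, orange_sentences))  # [3.0, 1.0]
```

Averaging over many sentences is what makes a large sentence file useful: details that vary from sentence to sentence tend to cancel, leaving mostly the source-to-target change.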

## Comparison

Comparisons with different baselines, including SDEdit + word swap, DDIM + word swap, and prompt-to-prompt. Our method successfully applies the edit while preserving the structure of the input image.

### Note:

The original implementation of the regularized DDIM inversion had a bug where the random roll would sometimes not get applied. Please see the updated code [here](https://github.com/pix2pixzero/pix2pix-zero/blob/main/src/utils/ddim_inv.py#L32).
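
To make the role of the random roll concrete, here is a dependency-free sketch of one way an auto-correlation penalty on a noise map can be computed: the map is multiplied by a circularly shifted copy of itself, and the squared mean of that product is accumulated per axis. This is an illustration under assumptions, not the repository's code. The function names, the shift range, and the use of nested lists instead of tensors are all choices made for this example, and the linked `ddim_inv.py` may differ in detail.

```python
import random


def roll(grid, shift, axis):
    """Circularly shift a 2D list along rows (axis=0) or columns (axis=1)."""
    if axis == 0:
        s = shift % len(grid)
        return grid[-s:] + grid[:-s] if s else list(grid)
    s = shift % len(grid[0])
    return [row[-s:] + row[:-s] if s else list(row) for row in grid]


def auto_corr_penalty(noise, rng=None):
    """Sum over both axes of the squared mean product of `noise` and a rolled copy."""
    rng = rng or random.Random()
    height, width = len(noise), len(noise[0])
    penalty = 0.0
    for axis, size in ((0, height), (1, width)):
        # Draw a fresh non-zero shift on every call so the roll is always applied.
        shift = rng.randrange(1, max(2, size // 2 + 1))
        rolled = roll(noise, shift, axis)
        products = [
            a * b
            for row, rolled_row in zip(noise, rolled)
            for a, b in zip(row, rolled_row)
        ]
        penalty += (sum(products) / (height * width)) ** 2
    return penalty


rng = random.Random(0)
white = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(16)]
constant = [[1.0] * 16 for _ in range(16)]
print("white noise:", auto_corr_penalty(white, rng))
print("constant map:", auto_corr_penalty(constant, rng))
```

A constant map is perfectly correlated with every shifted copy of itself and therefore scores 1 per axis, while uncorrelated noise scores close to 0. If the shift were skipped, the map would simply be compared with itself and the penalty would measure its energy rather than its spatial correlation.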
