https://github.com/thudm/cogview4

CogView4, CogView3-Plus and CogView3(ECCV 2024)
https://github.com/thudm/cogview4
eccv2024 high-resolution image-generation text-to-image
Last synced: about 1 year ago
JSON representation
CogView4, CogView3-Plus and CogView3(ECCV 2024)
Host: GitHub
URL: https://github.com/thudm/cogview4
Owner: THUDM
License: apache-2.0
Created: 2024-09-23T06:38:31.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-03-29T07:09:58.000Z (over 1 year ago)
Last Synced: 2025-04-03T03:59:04.575Z (over 1 year ago)
Topics: eccv2024, high-resolution, image-generation, text-to-image
Language: Python
Homepage:
Size: 24.1 MB
Stars: 964
Watchers: 18
Forks: 67
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # CogView4 & CogView3 & CogView-3Plus

[阅读中文版](./README_zh.md)

[日本語で読む](./README_ja.md)









 🤗 HuggingFace Space

  🤖ModelScope Space

 🛠️ZhipuAI MaaS(Faster)




 👋 WeChat Community  📚 CogView3 Paper



![showcase.png](resources/showcase.png)

## Project Updates

- 🔥🔥 ```2025/03/24```: We are launching [CogKit](https://github.com/THUDM/CogKit), a powerful toolkit for fine-tuning and inference of the **CogView4** and **CogVideoX** series, allowing you to fully explore our multimodal generation models.

- ```2025/03/04```: We've adapted and open-sourced the [diffusers](https://github.com/huggingface/diffusers) version

  of **CogView-4** model, which has 6B parameters, supports native Chinese input, and Chinese text-to-image generation.

  You can try it [online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView4).

- ```2024/10/13```: We've adapted and open-sourced the [diffusers](https://github.com/huggingface/diffusers) version of

  **CogView-3Plus-3B** model. You can try

  it [online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space).

- ```2024/9/29```: We've open-sourced **CogView3** and **CogView-3Plus-3B**. **CogView3** is a text-to-image system

  based on cascading diffusion, using a relay diffusion framework. **CogView-3Plus** is a series of newly developed

  text-to-image models based on Diffusion Transformer.

## Project Plan

- [X] Diffusers workflow adaptation

- [X] Cog series fine-tuning kits (coming soon)

- [ ] ControlNet models and training code

## Community Contributions

We have collected some community projects related to this repository here. These projects are maintained by community members, and we appreciate their contributions.

+ [ComfyUI_CogView4_Wrapper](https://github.com/chflame163/ComfyUI_CogView4_Wrapper) - An implementation of the CogView4 project in ComfyUI.

## Model Introduction

### Model Comparison

  

    Model Name

    CogView4

    CogView3-Plus-3B

  

    Resolution

    

            512 <= H, W <= 2048 


            H * W <= 2^{21} 


            H, W \mod 32 = 0

    

  

    Inference Precision

    Only supports BF16, FP32

  

  Encoder

  GLM-4-9B

  T5-XXL

  

    Prompt Language

    Chinese, English

    English

  

  

    Prompt Length Limit

    1024 Tokens

    224 Tokens

  

  

    Download Links

    🤗 HuggingFace
🤖 ModelScope
🟣 WiseModel

    🤗 HuggingFace
🤖 ModelScope
🟣 WiseModel

  

### Memory Usage

DIT models are tested with `BF16` precision and `batchsize=4`, with results shown in the table below:

| Resolution  | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON  Text Encoder 4bit |

|-------------|------------------------------|-----------------------------|-----------------------------------------------------|

| 512 * 512   | 33GB                         | 20GB                        | 13G                                                 |

| 1280 * 720  | 35GB                         | 20GB                        | 13G                                                 |

| 1024 * 1024 | 35GB                         | 20GB                        | 13G                                                 |

| 1920 * 1280 | 39GB                         | 20GB                        | 14G                                                 |

Additionally, we recommend that your device has at least `32GB` of RAM to prevent the process from being killed.

### Model Metrics

We've tested on multiple benchmarks and achieved the following scores:

#### DPG-Bench

| Model        | Overall   | Global    | Entity    | Attribute | Relation  | Other     |

|--------------|-----------|-----------|-----------|-----------|-----------|-----------|

| SDXL         | 74.65     | 83.27     | 82.43     | 80.91     | 86.76     | 80.41     |

| PixArt-alpha | 71.11     | 74.97     | 79.32     | 78.60     | 82.57     | 76.96     |

| SD3-Medium   | 84.08     | 87.90     | **91.01** | 88.83     | 80.70     | 88.68     |

| DALL-E 3      | 83.50     | **90.97** | 89.61     | 88.39     | 90.58     | 89.83     |

| Flux.1-dev   | 83.79     | 85.80     | 86.79     | 89.98     | 90.04     | **89.90** |

| Janus-Pro-7B | 84.19     | 86.90     | 88.90     | 89.40     | 89.32     | 89.48     |

| **CogView4-6B** | **85.13** | 83.85     | 90.35     | **91.17** | **91.14** | 87.29     |

#### GenEval

| Model           | Overall  | Single Obj. | Two Obj. | Counting | Colors   | Position | Color attribution |

|-----------------|----------|-------------|----------|----------|----------|----------|-------------------|

| SDXL            | 0.55     | 0.98        | 0.74     | 0.39     | 0.85     | 0.15     | 0.23              |

| PixArt-alpha    | 0.48     | 0.98        | 0.50     | 0.44     | 0.80     | 0.08     | 0.07              |

| SD3-Medium      | 0.74     | **0.99**    | **0.94** | 0.72     | 0.89     | 0.33     | 0.60              |

| DALL-E 3        | 0.67     | 0.96        | 0.87     | 0.47     | 0.83     | 0.43     | 0.45              |

| Flux.1-dev      | 0.66     | 0.98        | 0.79     | **0.73** | 0.77     | 0.22     | 0.45              |

| Janus-Pro-7B    | **0.80** | **0.99**    | 0.89     | 0.59     | **0.90** | **0.79** | **0.66**          |

| **CogView4-6B** | 0.73     | **0.99**    | 0.86     | 0.66     | 0.79     | 0.48     | 0.58              |

#### T2I-CompBench

| Model           | Color      | Shape      | Texture    | 2D-Spatial | 3D-Spatial | Numeracy   | Non-spatial Clip | Complex 3-in-1 |

|-----------------|------------|------------|------------|------------|------------|------------|------------------|----------------|

| SDXL            | 0.5879     | 0.4687     | 0.5299     | 0.2133     | 0.3566     | 0.4988     | 0.3119           | 0.3237         |

| PixArt-alpha    | 0.6690     | 0.4927     | 0.6477     | 0.2064     | 0.3901     | 0.5058     | **0.3197**       | 0.3433         |

| SD3-Medium      | **0.8132** | 0.5885     | **0.7334** | **0.3200** | **0.4084** | 0.6174     | 0.3140           | 0.3771         |

| DALL-E 3        | 0.7785     | **0.6205** | 0.7036     | 0.2865     | 0.3744     | 0.5880     | 0.3003           | 0.3773         |

| Flux.1-dev      | 0.7572     | 0.5066     | 0.6300     | 0.2700     | 0.3992     | 0.6165     | 0.3065           | 0.3628         |

| Janus-Pro-7B    | 0.5145     | 0.3323     | 0.4069     | 0.1566     | 0.2753     | 0.4406     | 0.3137           | 0.3806         |

| **CogView4-6B** | 0.7786     | 0.5880     | 0.6983     | 0.3075     | 0.3708     | **0.6626** | 0.3056           | **0.3869**     |

## Chinese Text Accuracy Evaluation

| Model           | Precision  | Recall     | F1 Score   | Pick@4     |

|-----------------|------------|------------|------------|------------|

| Kolors          | 0.6094     | 0.1886     | 0.2880     | 0.1633     |

| **CogView4-6B** | **0.6969** | **0.5532** | **0.6168** | **0.3265** |

## Inference Model

### Prompt Optimization

Although CogView4 series models are trained with lengthy synthetic image descriptions, we strongly recommend using a

large language model to rewrite prompts before text-to-image generation, which will greatly improve generation quality.

We provide an [example script](inference/prompt_optimize.py). We recommend running this script to refine your prompts.

Note that `CogView4` and `CogView3` models use different few-shot examples for prompt optimization. They need to be

distinguished.

```shell

cd inference

python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus" --cogview_version "cogview4"

```

### Inference Model

Run the model `CogView4-6B` with `BF16` precision:

```python

from diffusers import CogView4Pipeline

import torch

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")

# Open it for reduce GPU memory usage

pipe.enable_model_cpu_offload()

pipe.vae.enable_slicing()

pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

image = pipe(

    prompt=prompt,

    guidance_scale=3.5,

    num_images_per_prompt=1,

    num_inference_steps=50,

    width=1024,

    height=1024,

).images[0]

image.save("cogview4.png")

```

For more inference code, please check:

1. For using `BNB int4` to load `text encoder` and complete inference code annotations,

   check [here](inference/cli_demo_cogview4.py).

2. For using `TorchAO int8 or int4` to load `text encoder & transformer` and complete inference code annotations,

   check [here](inference/cli_demo_cogview4_int8.py).

3. For setting up a `gradio` GUI DEMO, check [here](inference/gradio_web_demo.py).

## Fine-tuning

This repository does not contain fine-tuning code, but you can fine-tune using the following two approaches, including both LoRA and SFT:

1. [CogKit](https://github.com/THUDM/CogKit), our officially maintained system-level fine-tuning framework that supports CogView4 and CogVideoX.

2. [finetrainers](https://github.com/a-r-r-o-w/finetrainers), a low-memory solution that enables fine-tuning on a single RTX 4090.

3. If you want to train ControlNet models directly, you can refer to the [training code](https://github.com/huggingface/diffusers/tree/main/examples/cogview4-control) and train your own models.

## License

The code in this repository and the CogView3 models are licensed under [Apache 2.0](./LICENSE).

We welcome and appreciate your code contributions. You can view the contribution

guidelines [here](resources/contribute.md).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/thudm/cogview4

Awesome Lists containing this project

README