{"id":26214314,"url":"https://github.com/thudm/cogview4","last_synced_at":"2025-05-15T02:03:18.037Z","repository":{"id":257808431,"uuid":"861571397","full_name":"THUDM/CogView4","owner":"THUDM","description":"CogView4, CogView3-Plus and CogView3(ECCV 2024)","archived":false,"fork":false,"pushed_at":"2025-03-29T07:09:58.000Z","size":25307,"stargazers_count":964,"open_issues_count":6,"forks_count":67,"subscribers_count":18,"default_branch":"main","last_synced_at":"2025-04-03T03:59:04.575Z","etag":null,"topics":["eccv2024","high-resolution","image-generation","text-to-image"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-23T06:38:31.000Z","updated_at":"2025-04-02T12:06:21.000Z","dependencies_parsed_at":"2024-11-24T09:03:06.691Z","dependency_job_id":"a4eb5a2b-1a91-4385-ad39-d15e6c5402fc","html_url":"https://github.com/THUDM/CogView4","commit_stats":null,"previous_names":["thudm/cogview3","thudm/cogview4"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView4","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView4/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView4/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView4/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
owners/THUDM","download_url":"https://codeload.github.com/THUDM/CogView4/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248166857,"owners_count":21058481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["eccv2024","high-resolution","image-generation","text-to-image"],"created_at":"2025-03-12T10:08:20.057Z","updated_at":"2025-04-10T06:16:59.243Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CogView4 \u0026 CogView3 \u0026 CogView-3Plus\n\n[阅读中文版](./README_zh.md)\n[日本語で読む](./README_ja.md)\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=resources/logo.svg width=\"50%\"/\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://huggingface.co/spaces/THUDM-HF-SPACE/CogView4\"  target=\"_blank\"\u003e 🤗 HuggingFace Space\u003c/a\u003e\n\u003ca href=\"https://modelscope.cn/studios/ZhipuAI/CogView4\" target=\"_blank\"\u003e  🤖ModelScope Space\u003c/a\u003e\n\u003ca href=\"https://zhipuaishengchan.datasink.sensorsdata.cn/t/4z\" target=\"_blank\"\u003e 🛠️ZhipuAI MaaS(Faster)\u003c/a\u003e\n\u003cbr\u003e\n\u003ca href=\"resources/WECHAT.md\" target=\"_blank\"\u003e 👋 WeChat Community\u003c/a\u003e  \u003ca href=\"https://arxiv.org/abs/2403.05121\" target=\"_blank\"\u003e📚 CogView3 Paper\u003c/a\u003e\n\u003c/p\u003e\n\n![showcase.png](resources/showcase.png)\n\n## Project Updates\n\n- 🔥🔥 ```2025/03/24```: We are launching [CogKit](https://github.com/THUDM/CogKit), a powerful toolkit for fine-tuning and 
inference of the **CogView4** and **CogVideoX** series, allowing you to fully explore our multimodal generation models.\n- ```2025/03/04```: We've adapted and open-sourced the [diffusers](https://github.com/huggingface/diffusers) version\n  of the **CogView-4** model, which has 6B parameters and supports native Chinese input and Chinese text-to-image generation.\n  You can try it [online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView4).\n- ```2024/10/13```: We've adapted and open-sourced the [diffusers](https://github.com/huggingface/diffusers) version of\n  the **CogView-3Plus-3B** model. You can try\n  it [online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space).\n- ```2024/9/29```: We've open-sourced **CogView3** and **CogView-3Plus-3B**. **CogView3** is a text-to-image system\n  based on cascading diffusion, using a relay diffusion framework. **CogView-3Plus** is a series of newly developed\n  text-to-image models based on the Diffusion Transformer architecture.\n\n## Project Plan\n\n- [X] Diffusers workflow adaptation\n- [X] Cog series fine-tuning kits (released as [CogKit](https://github.com/THUDM/CogKit))\n- [ ] ControlNet models and training code\n\n## Community Contributions\n\nWe have collected some community projects related to this repository here. 
These projects are maintained by community members, and we appreciate their contributions.\n\n+ [ComfyUI_CogView4_Wrapper](https://github.com/chflame163/ComfyUI_CogView4_Wrapper) - An implementation of the CogView4 project in ComfyUI.\n\n## Model Introduction\n\n### Model Comparison\n\n\u003ctable style=\"border-collapse: collapse; width: 100%;\"\u003e\n  \u003ctr\u003e\n    \u003cth style=\"text-align: center;\"\u003eModel Name\u003c/th\u003e\n    \u003cth style=\"text-align: center;\"\u003eCogView4\u003c/th\u003e\n    \u003cth style=\"text-align: center;\"\u003eCogView3-Plus-3B\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eResolution\u003c/td\u003e\n    \u003ctd colspan=\"2\" style=\"text-align: center;\"\u003e\n            512 \u003c= H, W \u003c= 2048 \u003cbr\u003e\n            H * W \u003c= 2^{21} \u003cbr\u003e\n            H, W \\mod 32 = 0\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eInference Precision\u003c/td\u003e\n    \u003ctd colspan=\"2\" style=\"text-align: center;\"\u003eOnly supports BF16, FP32\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \u003ctd style=\"text-align: center;\"\u003eEncoder\u003c/td\u003e\n  \u003ctd style=\"text-align: center;\"\u003e\u003ca href=\"https://huggingface.co/THUDM/glm-4-9b-hf\" target=\"_blank\"\u003eGLM-4-9B\u003c/a\u003e\u003c/td\u003e\n  \u003ctd style=\"text-align: center;\"\u003e\u003ca href=\"https://huggingface.co/google/t5-v1_1-xxl\" target=\"_blank\"\u003eT5-XXL\u003c/a\u003e\u003c/td\u003e\n\u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003ePrompt Language\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003eChinese, English\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003eEnglish\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003ePrompt Length Limit\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e1024 
Tokens\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e224 Tokens\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eDownload Links\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\u003ca href=\"https://huggingface.co/THUDM/CogView4-6B\"\u003e🤗 HuggingFace\u003c/a\u003e\u003cbr\u003e\u003ca href=\"https://modelscope.cn/models/ZhipuAI/CogView4-6B\"\u003e🤖 ModelScope\u003c/a\u003e\u003cbr\u003e\u003ca href=\"https://wisemodel.cn/models/ZhipuAI/CogView4-6B\"\u003e🟣 WiseModel\u003c/a\u003e\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\u003ca href=\"https://huggingface.co/THUDM/CogView3-Plus-3B\"\u003e🤗 HuggingFace\u003c/a\u003e\u003cbr\u003e\u003ca href=\"https://modelscope.cn/models/ZhipuAI/CogView3-Plus-3B\"\u003e🤖 ModelScope\u003c/a\u003e\u003cbr\u003e\u003ca href=\"https://wisemodel.cn/models/ZhipuAI/CogView3-Plus-3B\"\u003e🟣 WiseModel\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### Memory Usage\n\nDiT models were tested with `BF16` precision and `batchsize=4`, with results shown in the table below:\n\n| Resolution  | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON \u003cbr\u003e Text Encoder 4bit  |\n|-------------|------------------------------|-----------------------------|-----------------------------------------------------|\n| 512 * 512   | 33GB                         | 20GB                        | 13GB                                                |\n| 1280 * 720  | 35GB                         | 20GB                        | 13GB                                                |\n| 1024 * 1024 | 35GB                         | 20GB                        | 13GB                                                |\n| 1920 * 1280 | 39GB                         | 20GB                        | 14GB                                                |\n\nAdditionally, we recommend that your device has at 
least `32GB` of RAM to prevent the process from being killed.\n\n### Model Metrics\n\nWe've tested on multiple benchmarks and achieved the following scores:\n\n#### DPG-Bench\n\n| Model        | Overall   | Global    | Entity    | Attribute | Relation  | Other     |\n|--------------|-----------|-----------|-----------|-----------|-----------|-----------|\n| SDXL         | 74.65     | 83.27     | 82.43     | 80.91     | 86.76     | 80.41     |\n| PixArt-alpha | 71.11     | 74.97     | 79.32     | 78.60     | 82.57     | 76.96     |\n| SD3-Medium   | 84.08     | 87.90     | **91.01** | 88.83     | 80.70     | 88.68     |\n| DALL-E 3      | 83.50     | **90.97** | 89.61     | 88.39     | 90.58     | 89.83     |\n| Flux.1-dev   | 83.79     | 85.80     | 86.79     | 89.98     | 90.04     | **89.90** |\n| Janus-Pro-7B | 84.19     | 86.90     | 88.90     | 89.40     | 89.32     | 89.48     |\n| **CogView4-6B** | **85.13** | 83.85     | 90.35     | **91.17** | **91.14** | 87.29     |\n\n#### GenEval\n\n| Model           | Overall  | Single Obj. | Two Obj. 
| Counting | Colors   | Position | Color attribution |\n|-----------------|----------|-------------|----------|----------|----------|----------|-------------------|\n| SDXL            | 0.55     | 0.98        | 0.74     | 0.39     | 0.85     | 0.15     | 0.23              |\n| PixArt-alpha    | 0.48     | 0.98        | 0.50     | 0.44     | 0.80     | 0.08     | 0.07              |\n| SD3-Medium      | 0.74     | **0.99**    | **0.94** | 0.72     | 0.89     | 0.33     | 0.60              |\n| DALL-E 3        | 0.67     | 0.96        | 0.87     | 0.47     | 0.83     | 0.43     | 0.45              |\n| Flux.1-dev      | 0.66     | 0.98        | 0.79     | **0.73** | 0.77     | 0.22     | 0.45              |\n| Janus-Pro-7B    | **0.80** | **0.99**    | 0.89     | 0.59     | **0.90** | **0.79** | **0.66**          |\n| **CogView4-6B** | 0.73     | **0.99**    | 0.86     | 0.66     | 0.79     | 0.48     | 0.58              |\n\n#### T2I-CompBench\n\n| Model           | Color      | Shape      | Texture    | 2D-Spatial | 3D-Spatial | Numeracy   | Non-spatial Clip | Complex 3-in-1 |\n|-----------------|------------|------------|------------|------------|------------|------------|------------------|----------------|\n| SDXL            | 0.5879     | 0.4687     | 0.5299     | 0.2133     | 0.3566     | 0.4988     | 0.3119           | 0.3237         |\n| PixArt-alpha    | 0.6690     | 0.4927     | 0.6477     | 0.2064     | 0.3901     | 0.5058     | **0.3197**       | 0.3433         |\n| SD3-Medium      | **0.8132** | 0.5885     | **0.7334** | **0.3200** | **0.4084** | 0.6174     | 0.3140           | 0.3771         |\n| DALL-E 3        | 0.7785     | **0.6205** | 0.7036     | 0.2865     | 0.3744     | 0.5880     | 0.3003           | 0.3773         |\n| Flux.1-dev      | 0.7572     | 0.5066     | 0.6300     | 0.2700     | 0.3992     | 0.6165     | 0.3065           | 0.3628         |\n| Janus-Pro-7B    | 0.5145     | 0.3323     | 0.4069     | 0.1566     | 0.2753     | 0.4406    
 | 0.3137           | 0.3806         |\n| **CogView4-6B** | 0.7786     | 0.5880     | 0.6983     | 0.3075     | 0.3708     | **0.6626** | 0.3056           | **0.3869**     |\n\n## Chinese Text Accuracy Evaluation\n\n| Model           | Precision  | Recall     | F1 Score   | Pick@4     |\n|-----------------|------------|------------|------------|------------|\n| Kolors          | 0.6094     | 0.1886     | 0.2880     | 0.1633     |\n| **CogView4-6B** | **0.6969** | **0.5532** | **0.6168** | **0.3265** |\n\n## Inference Model\n\n### Prompt Optimization\n\nAlthough CogView4 series models are trained on lengthy synthetic image descriptions, we strongly recommend using a\nlarge language model to rewrite prompts before text-to-image generation, which will greatly improve generation quality.\n\nWe provide an [example script](inference/prompt_optimize.py). We recommend running this script to refine your prompts.\nNote that `CogView4` and `CogView3` models use different few-shot examples for prompt optimization, so set\n`--cogview_version` to match the model you are using.\n\n```shell\ncd inference\npython prompt_optimize.py --api_key \"Zhipu AI API Key\" --prompt {your prompt} --base_url \"https://open.bigmodel.cn/api/paas/v4\" --model \"glm-4-plus\" --cogview_version \"cogview4\"\n```\n\n### Inference Model\n\nRun the model `CogView4-6B` with `BF16` precision:\n\n```python\nfrom diffusers import CogView4Pipeline\nimport torch\n\npipe = CogView4Pipeline.from_pretrained(\"THUDM/CogView4-6B\", torch_dtype=torch.bfloat16).to(\"cuda\")\n\n# Enable these options to reduce GPU memory usage\npipe.enable_model_cpu_offload()\npipe.vae.enable_slicing()\npipe.vae.enable_tiling()\n\nprompt = \"A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. 
A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background.\"\nimage = pipe(\n    prompt=prompt,\n    guidance_scale=3.5,\n    num_images_per_prompt=1,\n    num_inference_steps=50,\n    width=1024,\n    height=1024,\n).images[0]\n\nimage.save(\"cogview4.png\")\n```\n\nFor more inference code, please check:\n\n1. To load the `text encoder` with `BNB int4`, with fully annotated inference code,\n   check [here](inference/cli_demo_cogview4.py).\n2. To load the `text encoder \u0026 transformer` with `TorchAO int8 or int4`, with fully annotated inference code,\n   check [here](inference/cli_demo_cogview4_int8.py).\n3. To set up a `gradio` GUI demo, check [here](inference/gradio_web_demo.py).\n\n\n## Fine-tuning\n\nThis repository does not contain fine-tuning code, but you can fine-tune using the following approaches, which cover both LoRA and SFT:\n\n1. [CogKit](https://github.com/THUDM/CogKit), our officially maintained system-level fine-tuning framework that supports CogView4 and CogVideoX.\n2. [finetrainers](https://github.com/a-r-r-o-w/finetrainers), a low-memory solution that enables fine-tuning on a single RTX 4090.\n3. If you want to train ControlNet models directly, you can refer to the [training code](https://github.com/huggingface/diffusers/tree/main/examples/cogview4-control) and train your own models.\n\n## License\n\nThe code in this repository and the CogView3 models are licensed under [Apache 2.0](./LICENSE).\n\nWe welcome and appreciate your code contributions. 
You can view the contribution\nguidelines [here](resources/contribute.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogview4","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fcogview4","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogview4/lists"}