{"id":20215989,"url":"https://github.com/thudm/cogview3","last_synced_at":"2025-03-01T00:04:06.790Z","repository":{"id":257808431,"uuid":"861571397","full_name":"THUDM/CogView3","owner":"THUDM","description":"text to image to  generation: CogView3-Plus and CogView3(ECCV 2024)","archived":false,"fork":false,"pushed_at":"2025-01-13T08:41:15.000Z","size":14262,"stargazers_count":284,"open_issues_count":12,"forks_count":20,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-02-21T23:01:47.947Z","etag":null,"topics":["eccv2024","high-resolution","image-generation","text-to-image"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-23T06:38:31.000Z","updated_at":"2025-02-21T14:31:59.000Z","dependencies_parsed_at":"2024-11-24T09:03:06.691Z","dependency_job_id":"a4eb5a2b-1a91-4385-ad39-d15e6c5402fc","html_url":"https://github.com/THUDM/CogView3","commit_stats":null,"previous_names":["thudm/cogview3"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FCogView3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/owners/THUDM","download_url":"https://codeload.github.com/THUDM/CogView3/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241271853,"owners_count":19937089,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["eccv2024","high-resolution","image-generation","text-to-image"],"created_at":"2024-11-14T06:25:47.167Z","updated_at":"2025-03-01T00:04:06.756Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CogView3 \u0026 CogView-3Plus\n\n[Read this in Chinese](./README_zh.md)\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=resources/logo.svg width=\"50%\"/\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\nExperience the CogView3-Plus-3B model online on \u003ca href=\"https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space\" target=\"_blank\"\u003e 🤗 Huggingface Space\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n📚 Check out the \u003ca href=\"https://arxiv.org/abs/2403.05121\" target=\"_blank\"\u003epaper\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    👋 Join our \u003ca href=\"resources/WECHAT.md\" target=\"_blank\"\u003eWeChat\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n📍 Visit \u003ca href=\"https://chatglm.cn/main/gdetail/65a232c082ff90a2ad2f15e2?fr=osm_cogvideox\u0026lang=zh\"\u003eQingyan\u003c/a\u003e and \u003ca href=\"https://open.bigmodel.cn/?utm_campaign=open\u0026_channel_track_key=OWTVNma9\"\u003eAPI Platform\u003c/a\u003e for larger-scale 
commercial video generation models.\n\u003c/p\u003e\n\n## Project Updates\n\n- 🔥🔥 ```2024/10/13```: We have adapted the **CogView3-Plus-3B** model to the [diffusers](https://github.com/huggingface/diffusers) library and open-sourced it. You can [experience it online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space).\n- 🔥 ```2024/9/29```: We have open-sourced **CogView3** and **CogView3-Plus-3B**. **CogView3** is a text-to-image system based on cascaded diffusion, utilizing a relay diffusion framework. **CogView3-Plus** is a series of newly developed text-to-image models based on Diffusion Transformers.\n\n## Model Introduction\n\nCogView3-Plus builds upon CogView3 (ECCV'24) by adopting the latest DiT framework to further improve overall\nperformance. CogView3-Plus uses a Zero-SNR diffusion noise schedule and incorporates a joint text-image attention\nmechanism. Compared to the commonly used MMDiT structure, it effectively reduces training and inference costs while\nmaintaining the model's basic capabilities. 
CogView3-Plus uses a VAE with a latent dimension of 16.\n\nThe table below lists the text-to-image models we currently offer, along with their basic information.\n\n\u003ctable style=\"border-collapse: collapse; width: 100%;\"\u003e\n  \u003ctr\u003e\n    \u003cth style=\"text-align: center;\"\u003eModel Name\u003c/th\u003e\n    \u003cth style=\"text-align: center;\"\u003eCogView3-Base-3B\u003c/th\u003e\n    \u003cth style=\"text-align: center;\"\u003eCogView3-Base-3B-distill\u003c/th\u003e\n    \u003cth style=\"text-align: center;\"\u003eCogView3-Plus-3B\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eModel Description\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003eThe base and relay stage models of CogView3, supporting 512x512 text-to-image generation and 2x super-resolution generation.\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003eThe distilled version of CogView3, sampling with 4 and 1 steps in the two stages (or 8 and 2 steps).\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003eThe DiT-based image generation model, supporting resolutions from 512 to 2048.\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eResolution\u003c/td\u003e\n    \u003ctd colspan=\"2\" style=\"text-align: center;\"\u003e512 * 512\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\n            512 \u003c= H, W \u003c= 2048 \u003cbr\u003e\n            H * W \u003c= 2^{21} \u003cbr\u003e\n            H, W \\mod 32 = 0\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eInference Precision\u003c/td\u003e\n    \u003ctd colspan=\"2\" style=\"text-align: center;\"\u003e\u003cb\u003eFP16 (recommended)\u003c/b\u003e, BF16, FP32\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\u003cb\u003eBF16* (recommended)\u003c/b\u003e, FP16, 
FP32\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eMemory Usage (bs = 4)\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e 17G \u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e 64G \u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e 30G (2048 * 2048) \u003cbr\u003e 20G (1024 * 1024) \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003ePrompt Language\u003c/td\u003e\n    \u003ctd colspan=\"3\" style=\"text-align: center;\"\u003eEnglish*\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eMaximum Prompt Length\u003c/td\u003e\n    \u003ctd colspan=\"2\" style=\"text-align: center;\"\u003e225 Tokens\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e224 Tokens\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eDownload Link (SAT)\u003c/td\u003e\n    \u003ctd colspan=\"3\" style=\"text-align: center;\"\u003e\u003ca href=\"./sat/README.md\"\u003eSAT\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003eDownload Link (Diffusers)\u003c/td\u003e\n    \u003ctd colspan=\"2\" style=\"text-align: center;\"\u003eNot Adapted\u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\n        \u003ca href=\"https://huggingface.co/THUDM/CogView3-Plus-3B\"\u003e🤗 HuggingFace\u003c/a\u003e\u003cbr\u003e\n        \u003ca href=\"https://modelscope.cn/models/ZhipuAI/CogView3-Plus-3B\"\u003e🤖 ModelScope\u003c/a\u003e\u003cbr\u003e\n        \u003ca href=\"https://wisemodel.cn/models/ZhipuAI/CogView3-Plus-3B\"\u003e🟣 WiseModel\u003c/a\u003e\n    \u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n**Data Explanation**\n\n+ All inference tests were conducted on a single A100 GPU with a batch size of 4,\n  using `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to 
save memory.\n+ The models only support English input. Prompts in other languages can be translated into English during the LLM prompt-refinement step.\n+ These tests were run with the `SAT` framework. Many optimizations are not yet complete, and we will work with\n  the community to create a version of the model for the `diffusers` library. Once the `diffusers` repository is\n  supported, we will test using `diffusers`. The release is expected in November 2024.\n\n## Quick Start\n\n### Prompt Optimization\n\nAlthough the CogView3 series models are trained with long image descriptions, we highly recommend rewriting prompts with a\nlarge language model (LLM) before generating images, as this significantly improves generation quality.\n\nWe provide an [example script](prompt_optimize.py). We suggest running it to refine the prompt:\n\n```shell\npython prompt_optimize.py --api_key \"Zhipu AI API Key\" --prompt {your prompt} --base_url \"https://open.bigmodel.cn/api/paas/v4\" --model \"glm-4-plus\"\n```\n\n### Inference Model (Diffusers)\n\nFirst, ensure the `diffusers` library is installed **from source**:\n\n```shell\npip install git+https://github.com/huggingface/diffusers.git\n```\n\nThen, run the following code:\n\n```python\nimport torch\nfrom diffusers import CogView3PlusPipeline\n\n# BF16 is the recommended inference precision for CogView3-Plus-3B (see the table above).\npipe = CogView3PlusPipeline.from_pretrained(\"THUDM/CogView3-Plus-3B\", torch_dtype=torch.bfloat16)\n\n# Reduce GPU memory usage. Model CPU offload moves each submodule to the GPU on demand,\n# so the pipeline is not moved to CUDA manually.\npipe.enable_model_cpu_offload()\npipe.vae.enable_slicing()\npipe.vae.enable_tiling()\n\nprompt = \"A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. 
A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background.\"\nimage = pipe(\n    prompt=prompt,\n    guidance_scale=7.0,\n    num_images_per_prompt=1,\n    num_inference_steps=50,\n    width=1024,\n    height=1024,\n).images[0]\n\nimage.save(\"cogview3.png\")\n```\n\nFor more inference code, please refer to [inference](inference/cli_demo.py). The same folder also contains a simple Gradio web UI.\n\n### Inference Model (SAT)\n\nPlease check the [sat](sat/README.md) tutorial for step-by-step instructions on model inference.\n\n### Open Source Plan\n\nSince the project is in its early stages, we are working on the following:\n\n+ [ ] Fine-tuning the SAT version of CogView3-Plus-3B, including SFT and LoRA fine-tuning\n+ [X] Inference with the Diffusers library version of the CogView3-Plus-3B model\n+ [ ] Fine-tuning the Diffusers library version of the CogView3-Plus-3B model\n+ [ ] Related work for the CogView3-Plus-3B model, including ControlNet and other tasks\n\n## CogView3 (ECCV'24)\n\nOfficial repository for the paper\n[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://arxiv.org/abs/2403.05121).\n\nCogView3 is a novel text-to-image generation system using relay diffusion. It breaks down the process of generating\nhigh-resolution images into multiple stages. In the relay super-resolution stage, Gaussian noise is added to the\nlow-resolution generation results, and the diffusion process starts from these noisy images. Our results show that\nCogView3 outperforms SDXL with a win rate of 77.0%. 
Additionally, through progressive distillation of the diffusion\nmodel, CogView3 can generate comparable results while reducing inference time to only 1/10th of SDXL's.\n\n![CogView3 Showcase](resources/CogView3_showcase.png)\n![CogView3 Pipeline](resources/CogView3_pipeline.jpg)\n\nComparison results from human evaluations:\n\n![CogView3 Evaluation](resources/CogView3_evaluation.png)\n\n## Citation\n\n🌟 If you find our work helpful, feel free to cite our paper and leave a star.\n\n```\n@article{zheng2024cogview3,\n  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},\n  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},\n  journal={arXiv preprint arXiv:2403.05121},\n  year={2024}\n}\n```\n\nWe welcome your contributions! Click [here](resources/contribute.md) for more information.\n\n## Model License\n\nThis codebase is released under the [Apache 2.0 License](LICENSE).\n\nThe CogView3-Base, CogView3-Relay, and CogView3-Plus models (including the UNet module, Transformers module, and VAE\nmodule) are released under the [Apache 2.0 License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogview3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Fcogview3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Fcogview3/lists"}