{"id":13488369,"url":"https://github.com/TencentQQGYLab/ELLA","last_synced_at":"2025-03-28T00:33:46.554Z","repository":{"id":226429273,"uuid":"768660898","full_name":"TencentQQGYLab/ELLA","owner":"TencentQQGYLab","description":"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment","archived":false,"fork":false,"pushed_at":"2024-05-13T08:53:34.000Z","size":13127,"stargazers_count":843,"open_issues_count":16,"forks_count":45,"subscribers_count":41,"default_branch":"main","last_synced_at":"2024-05-15T23:58:48.540Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://ella-diffusion.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TencentQQGYLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-07T13:57:12.000Z","updated_at":"2024-06-14T04:29:31.547Z","dependencies_parsed_at":"2024-06-14T04:29:30.326Z","dependency_job_id":"61b62a58-d61a-4ced-858d-7838ef965ff5","html_url":"https://github.com/TencentQQGYLab/ELLA","commit_stats":null,"previous_names":["ella-diffusion/ella","tencentqqgylab/ella"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentQQGYLab%2FELLA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentQQGYLab%2FELLA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentQQGYLab%2FELLA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TencentQQGYLab%2FELLA/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TencentQQGYLab","download_url":"https://codeload.github.com/TencentQQGYLab/ELLA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245949278,"owners_count":20698913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T18:01:14.536Z","updated_at":"2025-03-28T00:33:41.530Z","avatar_url":"https://github.com/TencentQQGYLab.png","language":"Python","funding_links":[],"categories":["T2I Diffusion Model augmentation","Building","Benchmark","图像生成"],"sub_categories":["LLM Models","Multi-modal","资源传输下载"],"readme":"# ELLA \u0026 EMMA\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\n      \u003ch2\u003e ELLA \u003c/h2\u003e\n      \u003cp\u003e Paper: \u003ca href=\"https://arxiv.org/abs/2403.05135\"\u003eELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment \u003c/a\u003e\u003c/p\u003e\n      \u003cp\u003e Project Website: \u003ca href=\"https://ella-diffusion.github.io/\"\u003eELLA\u003c/a\u003e \u003c/p\u003e\n    \u003c/td\u003e\n    \u003ctd\u003e\n      \u003ch2\u003e EMMA \u003c/h2\u003e\n      \u003cp\u003e Paper: \u003ca href=\"https://arxiv.org/abs/2406.09162\"\u003eEMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts\u003c/a\u003e\u003c/p\u003e\n      \u003cp\u003e Project Website: \u003ca href=\"https://tencentqqgylab.github.io/EMMA/\"\u003eEMMA\u003c/a\u003e \u003c/p\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n## ELLA: Equip Diffusion Models with LLM 
for Enhanced Semantic Alignment\n\n\u003cdiv align=\"center\"\u003e\n\u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://openreview.net/profile?id=~Xiwei_Hu1\"\u003eXiwei Hu*\u003c/a\u003e,\n\u003c/span\u003e\n\u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://wrong.wang/\"\u003eRui Wang*\u003c/a\u003e,\n\u003c/span\u003e\n\u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://openreview.net/profile?id=~Yixiao_Fang1\"\u003eYixiao Fang*\u003c/a\u003e,\n\u003c/span\u003e\n\u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://openreview.net/profile?id=~BIN_FU2\"\u003eBin Fu*\u003c/a\u003e,\n\u003c/span\u003e\n\u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://openreview.net/profile?id=~Pei_Cheng1\"\u003ePei Cheng\u003c/a\u003e,\n\u003c/span\u003e\n\u003cspan class=\"author-block\"\u003e\n    \u003ca href=\"https://www.skicyyu.org/\"\u003eGang Yu\u0026#10022\u003c/a\u003e\n\u003c/span\u003e\n\u003cp\u003e\n* Equal contributions, \u0026#10022 Corresponding Author\n\u003c/p\u003e\n\n\u003cimg src=\"./assets/ELLA-Diffusion.jpg\" width=\"30%\" \u003e \u003cbr/\u003e\n\u003ca href='https://ella-diffusion.github.io/'\u003e\u003cimg src='https://img.shields.io/badge/Project-Page-green'\u003e\u003c/a\u003e\n\u003ca href='https://arxiv.org/abs/2403.05135'\u003e\u003cimg src='https://img.shields.io/badge/arXiv-2403.05135-b31b1b.svg'\u003e\u003c/a\u003e\n\u003c/div\u003e\n\nOfficial code of \"ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment\".\n\u003cp\u003e\n\u003c/p\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/teaser_3img.png\" width=\"100%\"\u003e\n    \u003cimg src=\"./assets/teaser1_raccoon.png\" width=\"100%\"\u003e\n\u003c/div\u003e\n\n## 🌟 Changelog\n\n- **[2024.6.14]** 🔥🔥 EMMA: [Technical Report](https://arxiv.org/abs/2406.09162), [Project Website](https://tencentqqgylab.github.io/EMMA/)\n- **[2024.5.13]** EMMA is coming soon. 
Let's first preview the results of EMMA: [中文版](https://wrong.wang/blog/20240512-emma/), [English Version](https://wrong.wang/blog/20240512-what-is-emma/)\n- **[2024.4.19]** We provide ELLA’s ComfyUI plugin: [TencentQQGYLab/ComfyUI-ELLA](https://github.com/TencentQQGYLab/ComfyUI-ELLA)\n- **[2024.4.11]** Add some results of [EMMA(Efficient Multi-Modal Adapter)](#emma)\n- **[2024.4.9]** 🔥🔥🔥 Release [ELLA-SD1.5](https://huggingface.co/QQGYLab/ELLA/blob/main/ella-sd1.5-tsc-t5xl.safetensors) Checkpoint! Welcome to try! \n- **[2024.3.11]** 🔥 Release DPG-Bench! Welcome to try! \n- **[2024.3.7]** Initial update\n\n\n## 🚀 Usage\n\n### Download\n\nYou can download ELLA models from [QQGYLab/ELLA](https://huggingface.co/QQGYLab/ELLA).\n\n### Quick View\n\n```bash\n# get ELLA-SD1.5 at https://huggingface.co/QQGYLab/ELLA/blob/main/ella-sd1.5-tsc-t5xl.safetensors\n\n# comparing ella-sd1.5 and sd1.5\n# will generate images at `./assets/ella-inference-examples`\npython3 inference.py test --save_folder ./assets/ella-inference-examples --ella_path /path/to/ella-sd1.5-tsc-t5xl.safetensors\n```\n\n### Build a demo for comparing SD1.5 and ELLA-SD1.5\n\n```bash\nGRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8082 python3 ./inference.py demo /path/to/ella-sd1.5-tsc-t5xl.safetensors\n```\n\n### Using ELLA in ComfyUI\n\nWe provide ELLA’s ComfyUI plugin: [TencentQQGYLab/ComfyUI-ELLA](https://github.com/TencentQQGYLab/ComfyUI-ELLA), which supports ControlNet, img2img and more. You are welcome to try it out.\n\n\nThanks to [@ExponentialML](https://github.com/ExponentialML/) and [@kijai](https://github.com/kijai), who offer third-party ComfyUI plugins for ELLA:\n\n1. [ExponentialML/ComfyUI_ELLA](https://github.com/ExponentialML/ComfyUI_ELLA/)\n2. [kijai/ComfyUI-ELLA-wrapper](https://github.com/kijai/ComfyUI-ELLA-wrapper)\n\n## 📙 Notes\n\nELLA is still in its early stages of research, and we have not yet conducted comprehensive testing on all potential applications of ELLA. 
We welcome constructive and friendly suggestions from the community.\n\nHere, we share some tips that we have discovered thus far on how to better utilize ELLA:\n\n### 1. Caption Upscale\n\nELLA was trained using MLLM-annotated synthetic captions. As mentioned in [Improving Image Generation with Better Captions](https://cdn.openai.com/papers/dall-e-3.pdf), performing an \"upsampling\" on the input caption before using ELLA can extract its maximum potential.\n\nWe have discovered that leveraging the In-Context Learning (ICL) capability of LLMs can facilitate a straightforward caption upsampler:\n\nexample instruction:\n\n```\nPlease generate the long prompt version of the short one according to the given examples. Long prompt version should consist of 3 to 5 sentences. Long prompt version must specify the color, shape, texture or spatial relation of the included objects. DO NOT generate sentences that describe any atmosphere!!!\n\nShort: A calico cat with eyes closed is perched upon a Mercedes.\nLong: a multicolored cat perched atop a shiny black car. the car is parked in front of a building with wooden walls and a green fence. the reflection of the car and the surrounding environment can be seen on the car's glossy surface.\n\nShort: A boys sitting on a chair holding a video game remote.\nLong: a young boy sitting on a chair, wearing a blue shirt and a baseball cap with the letter 'm'. he has a red medal around his neck and is holding a white game controller. behind him, there are two other individuals, one of whom is wearing a backpack. to the right of the boy, there's a blue trash bin with a sign that reads 'automatic party'.\n\nShort: A man is on the bank of the water fishing.\nLong: a serene waterscape where a person, dressed in a blue jacket and a red beanie, stands in shallow waters, fishing with a long rod. 
the calm waters are dotted with several sailboats anchored at a distance, and a mountain range can be seen in the background under a cloudy sky.\n\nShort: A kitchen with a cluttered counter and wooden cabinets.\nLong: a well-lit kitchen with wooden cabinets, a black and white checkered floor, and a refrigerator adorned with a floral decal on its side. the kitchen countertop holds various items, including a coffee maker, jars, and fruits.\n\nShort: a racoon holding a shiny red apple over its head\n```\n\nusing: https://huggingface.co/spaces/Qwen/Qwen-72B-Chat-Demo\n\nwe got: \n\na mischievous raccoon standing on its hind legs, holding a bright red apple aloft in its furry paws. the apple shines brightly against the backdrop of a dense forest, with leaves rustling in the gentle breeze. a few scattered rocks can be seen on the ground beneath the raccoon's feet, while a gnarled tree trunk stands nearby.\n\n\n#### Before and After caption upsampling \n\n\noriginal prompt: *a racoon holding a shiny red apple over its head*\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](./assets/ella-sd1.5-notes/racoon_apple.jpg)\n\nQwen-72B refined caption: *a mischievous raccoon standing on its hind legs, holding a bright red apple aloft in its furry paws. the apple shines brightly against the backdrop of a dense forest, with leaves rustling in the gentle breeze. 
a few scattered rocks can be seen on the ground beneath the raccoon's feet, while a gnarled tree trunk stands nearby.*\n\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](./assets/ella-sd1.5-notes/racoon_apple_Qwen-72B-Chat-refined.jpg)\n\n\n\noriginal prompt: *Crocodile in a sweater*\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](./assets/ella-sd1.5-notes/crocodile_sweater.jpg)\n\nGPT4 refined caption: *a large, textured green crocodile lying comfortably on a patch of grass with a cute, knitted orange sweater enveloping its scaly body. Around its neck, the sweater features a whimsical pattern of blue and yellow stripes. In the background, a smooth, grey rock partially obscures the view of a small pond with lily pads floating on the surface.*\n\n\n| SD1.5 | ELLA-SD1.5_fixed_token_length | ELLA-SD1.5_flexible_token_length |\n| ----- | ----------------------------- | -------------------------------- |\n\n![](./assets/ella-sd1.5-notes/crocodile_sweater-gpt4_refined_caption.jpg)\n\n\n### 2. flexible token length\n\nDuring the training of ELLA, long synthetic captions were utilized, with the maximum number of tokens set to 128. When testing ELLA with short captions, in addition to the previously mentioned caption upsampling technique, the \"flexible_token_length\" trick can also be employed. This involves setting the tokenizer's `max_length` as `None`, thereby eliminating any text token padding or truncation. We have observed that this trick can help improve the quality of generated images corresponding to short captions.\n\n### 3. 
ELLA+CLIP for community models\n\nOur testing has revealed that some community models heavily reliant on trigger words may experience significant style loss when utilizing ELLA, primarily because CLIP is not used at all during ELLA inference.\n\nAlthough CLIP was not used during training, we have discovered that it is still possible to concatenate ELLA's input with CLIP's output during inference (Bx77x768 + Bx64x768 -\u003e Bx141x768) as a condition for the UNet. We anticipate that using ELLA in conjunction with CLIP will better integrate with the existing community ecosystem, particularly with CLIP-specific techniques such as Textual Inversion and trigger words.\n\nOur goal is to ensure better compatibility with a wider range of community models; however, we currently do not have a comprehensive set of experiences to share. If you have any suggestions, we would be grateful if you could share them in an issue.\n\n### 4. FlanT5 must run in fp16 mode.\n\nAs described in [issues#23](https://github.com/TencentQQGYLab/ELLA/issues/23), we conducted the vast majority of experiments on V100, which does not support bf16, so we had to use the fp16 T5 for training. We tested and found that the output difference between the fp16 T5 and the bf16 T5 cannot be ignored, resulting in obvious differences in the generated images. \nTherefore, it is recommended to use fp16 T5 for inference.\n\n## 📊 DPG-Bench\n\nThe guideline of DPG-Bench:\n\n1. Generate your images according to our [prompts](./dpg_bench/prompts/).\n    \n    It is recommended to generate 4 images per prompt and grid them to 2x2 format. **Please make sure each generated image's filename is the same as the corresponding prompt's filename.**\n\n2. 
Run the following command to conduct evaluation.\n\n    ```bash\n    bash dpg_bench/dist_eval.sh $YOUR_IMAGE_PATH $RESOLUTION\n    ```\n\nThanks to the excellent work of [DSG](https://github.com/j-min/DSG) sincerely, we follow their instructions to generate questions and answers of DPG-Bench.\n\n\u003ca id=\"emma\"\u003e\u003c/a\u003e\n## 🚧 EMMA - Efficient Multi-Modal Adapter (Work in progress)\n\nAs described in the conclusion section of ELLA's paper  and [issue#15](https://github.com/TencentQQGYLab/ELLA/issues/15),\nwe plan to investigate the integration of\nMLLM with diffusion models, enabling the utilization of interleaved image-text input as a conditional component in the image generation process. Here are some very early results with EMMA-SD1.5, stay tuned.\n\n\u003ctable\u003e\n\u003cthead\u003e\n  \u003ctr\u003e\n    \u003cth\u003eprompt\u003c/th\u003e\n    \u003cth\u003eobject image\u003c/th\u003e\n    \u003cth\u003eresults\u003c/th\u003e\n  \u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eA woman is skiing down a snowy mountain, wearing a bright orange ski suit and goggles.\u003c/td\u003e\n    \u003ctd rowspan=\"3\"\u003e\u003cimg src=\"./assets/emma/emma_c.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_3.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eA woman is playing basketball on an outdoor court, wearing a sleeveless jersey.\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_1.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eA woman is hiking through a dense forest, wearing a green camouflage jacket and carrying a backpack.\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_2.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ea  dog jumping over a vehicle on a snowy 
day\u003c/td\u003e\n    \u003ctd rowspan=\"2\"\u003e\u003cimg src=\"./assets/emma/emma_a.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_6.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ea  dog reading a book with a pink glasses on\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_4.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eA dog standing on a mountaintop, surveying the stunning view. Snow-capped peaks stretch out in the distance, and a river winds its way through the valley below.\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_b.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./assets/emma/emma_5.jpg\" width=\"100%\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n\n## 📝 TODO\n\n- [x] release checkpoint\n- [x] release inference code\n- [x] release DPG-Bench\n\n\n## 💡 Others\n\nWe have also found [LaVi-Bridge](https://arxiv.org/abs/2403.07860), another independent but similar work completed almost concurrently, which offers additional insights not covered by ELLA. The difference between ELLA and LaVi-Bridge can be found in [issue 13](https://github.com/ELLA-Diffusion/ELLA/issues/13). 
We are delighted to welcome other researchers and community users to promote the development of this field.\n\n## 😉 Citation\n\nIf you find **ELLA** useful for your research and applications, please cite us using this BibTeX:\n\n```\n@misc{hu2024ella,\n      title={ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment}, \n      author={Xiwei Hu and Rui Wang and Yixiao Fang and Bin Fu and Pei Cheng and Gang Yu},\n      year={2024},\n      eprint={2403.05135},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencentQQGYLab%2FELLA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTencentQQGYLab%2FELLA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTencentQQGYLab%2FELLA/lists"}