{"id":13488279,"url":"https://github.com/SongweiGe/rich-text-to-image","last_synced_at":"2025-03-28T00:33:28.609Z","repository":{"id":152972807,"uuid":"627644217","full_name":"songweige/rich-text-to-image","owner":"songweige","description":"Rich-Text-to-Image Generation","archived":false,"fork":false,"pushed_at":"2023-10-09T22:04:56.000Z","size":43360,"stargazers_count":759,"open_issues_count":5,"forks_count":63,"subscribers_count":20,"default_branch":"main","last_synced_at":"2024-10-31T00:36:35.207Z","etag":null,"topics":["computer-vision","diffusion-models","pytorch","rich-text","text-to-image-generation"],"latest_commit_sha":null,"homepage":"https://rich-text-to-image.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/songweige.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-13T22:39:42.000Z","updated_at":"2024-10-29T15:58:50.000Z","dependencies_parsed_at":"2024-10-31T00:40:57.832Z","dependency_job_id":null,"html_url":"https://github.com/songweige/rich-text-to-image","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songweige%2Frich-text-to-image","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songweige%2Frich-text-to-image/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songweige%2Frich-text-to-image/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songweige%2Frich-text-to-ima
ge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/songweige","download_url":"https://codeload.github.com/songweige/rich-text-to-image/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245949254,"owners_count":20698911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","diffusion-models","pytorch","rich-text","text-to-image-generation"],"created_at":"2024-07-31T18:01:12.973Z","updated_at":"2025-03-28T00:33:23.574Z","avatar_url":"https://github.com/songweige.png","language":"Python","funding_links":[],"categories":["T2I Diffusion Model augmentation","\u003cspan id=\"image\"\u003eImage\u003c/span\u003e"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e"],"readme":"# Rich-Text-to-Image\n\n### [Project Page](https://rich-text-to-image.github.io/) | [Paper](https://arxiv.org/abs/2304.06720) | [Video](https://youtu.be/ihDbAUh0LXk) | [HuggingFace Demo](https://huggingface.co/spaces/songweig/rich-text-to-image) | [A1111 Extension](https://github.com/songweige/sd-webui-rich-text)\n\n\n**tl;dr:** We use various formatting information from rich text, including font size, color, style, and footnote, to increase control of text-to-image generation. 
Our method enables explicit token reweighting, precise color rendering, local style control, and detailed region synthesis.\n\n\nhttps://github.com/songweige/rich-text-to-image/assets/22885450/ccd186d1-f0fc-4e55-80c0-06afd6cb84c0\n\n\n***Expressive Text-to-Image Generation with Rich Text*** \u003cbr\u003e\n[Songwei Ge](https://songweige.github.io/), [Taesung Park](https://taesung.me/), [Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/), [Jia-Bin Huang](https://jbhuang0604.github.io/)\u003cbr\u003e\nUMD, Adobe, CMU\u003cbr\u003e\nICCV 2023\n\n## Updates\n* [09/26] We begin implementing an [A1111 WebUI extension](https://github.com/songweige/sd-webui-rich-text) that integrates the rich-text editor into text-to-image generation.\n* [09/24] We now support LoRA checkpoints. Please find the demo and the latest code in [this branch](https://github.com/songweige/rich-text-to-image/tree/lora).\n* [08/09] Our method now supports [SD-XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with `--model SDXL`, and various fine-tuned models like [ANIMAGINE-XL](https://huggingface.co/Linaqruf/animagine-xl) with `--model AnimeXL`.\n* [07/14] Our paper is accepted by ICCV 2023.\n* [05/03] We update our approach to obtain more robust and accurate token maps and to better preserve the structure of the plain-text results. The following images are generated by the new method with the prompt taken from [this issue](https://github.com/SongweiGe/rich-text-to-image/issues/9).\n* [04/17] We release the [rich-text-to-image demo](https://huggingface.co/spaces/songweig/rich-text-to-image) on HuggingFace Space. 
Thanks to the [HuggingFace](https://huggingface.co/) team for the help with the demo!\n* [04/13] We release the [rich-text-to-image generation](https://arxiv.org/abs/2304.06720), which leverages the formatting options of a rich-text editor to facilitate controlling the text-to-image generation.\n\n\n## Setup\n\nThis code was tested with Python 3.8 and [PyTorch](https://pytorch.org/) 1.11, and supports [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and [ANIMAGINE-XL](https://huggingface.co/Linaqruf/animagine-xl) through HuggingFace.\n```\ngit clone https://github.com/SongweiGe/rich-text-to-image.git\ncd rich-text-to-image/\nconda env create -f environment.yaml\nconda activate rich-text\npip install git+https://github.com/openai/CLIP.git\n```\n## Usage\nIn general, our pipeline contains two steps. We first input the plain text prompt to the diffusion model and compute the cross-attention maps to associate each token with a spatial region. The rich-text prompts obtained from the editor are stored in JSON format, providing attributes for each token span. We then use a new region-based diffusion process to render each region’s attributes into a globally coherent image. Below we provide the basic usage of various font formats.\n\n### Rich text to JSON\nWe encode the rich text into JSON format and use it as the input to the rich-text conditioned sampling script `sample.py`. 
To automatically generate a JSON string based on rich text, you can use our [rich-text-to-json](https://rich-text-to-image.github.io/rich-text-to-json.html) interface, which is a purely static webpage that can be readily incorporated into any rich-text-based application.\n\n### Rich-text JSON to Image\n![teaser](assets/teaser.jpg)\n\nYou may start generating images with rich-text JSON via our local Gradio demo:\n\n```\npython gradio_app.py\n```\nOr through the command line:\n```\npython sample.py --rich_text_json 'your rich-text json here'\n```\n\n#### Font Color\n\n![color](assets/color.png)\n\nWe use font color to control the precise color of the generated objects. For example, the script below generates \"a Gothic church (with color #fd6c9e) in a sunset with a beautiful landscape in the background.\"\n\n```\npython sample.py --rich_text_json '{\"ops\":[{\"insert\":\"a Gothic \"},{\"attributes\":{\"color\":\"#fd6c9e\"},\"insert\":\"church\"},{\"insert\":\" in a sunset with a beautiful landscape in the background.\\n\"}]}' --num_segments 10 --segment_threshold 0.4 --inject_selfattn 0.5 --inject_background 0.5 --color_guidance_weight 1 --seed 7 --run_dir results/color_example_xl --model SDXL\n```\n\n#### Footnote\n\n![footnote](assets/footnote.png)\n\nWe use footnotes to provide supplementary descriptions for selected text elements. The following script generates a cat wearing sunglasses and a bandana, a difficult case noted in [eDiffi](https://research.nvidia.com/labs/dir/eDiff-I/#comparison_stable_cat_scooter).\n\n```\npython sample.py --rich_text_json '{\"ops\":[{\"insert\":\"A close-up 4k dslr photo of a \"},{\"attributes\":{\"link\":\"A cat wearing sunglasses and a bandana around its neck.\"},\"insert\":\"cat\"},{\"insert\":\" riding a scooter. 
Palm trees in the background.\\n\"}]}' --seed 3 --inject_background 0.5  --inject_selfattn 0.3 --num_segments 5 --run_dir results/footnote_example_xl --model SDXL\n```\n\n#### Font Style\n\n![style](assets/font.png)\n\nJust as the font style distinguishes the styles of individual text elements, we propose using it to define the artistic style of specific areas in the generation. Here is an example script to generate \"a beautiful garden (in the style of Claude Monet) with a snow mountain (in the style of Ukiyo-e) in the background\".\n\n```\npython sample.py --rich_text_json '{\"ops\":[{\"insert\":\"a beautiful\"},{\"attributes\":{\"font\":\"mirza\"},\"insert\":\" garden\"},{\"insert\":\" with a \"},{\"attributes\":{\"font\":\"roboto\"},\"insert\":\"snow mountain\"},{\"insert\":\" in the background\"}]}' --num_segments 10 --segment_threshold 0.5 --inject_background 0.4 --seed 5 --run_dir results/style_example_xl --model SDXL\n```\n\n#### Font Size\n\n![size](assets/size.png)\n\nFont size indicates the weight of each token in the final generation. This is implemented by reweighting the exponentiated attention scores before the softmax normalization at each cross-attention layer. The following example adds more mushroom to a generated pizza:\n\n```\npython sample.py --rich_text_json '{\"ops\": [{\"insert\": \"A pizza with pineapple, pepperoni, and \"}, {\"attributes\": {\"size\": \"60px\"}, \"insert\": \"mushroom\"}, {\"insert\": \" on the top\"}]}' --seed 3 --run_dir results/size_example_xl --model SDXL\n```\n\n## Evaluation\n\n### Local style generation\n\nTo evaluate the capacity to generate certain styles in a local region, we compute the CLIP similarity between each stylized region and its region prompt with the name of that style. 
We provide an evaluation script and compare our method with the AttentionRefine method proposed in [Prompt-to-Prompt](https://github.com/google/prompt-to-prompt):\n```\npython evaluation/benchmark_style.py --save_img --folder eval_style\n```\n\n### Precise color generation\nWe define color names at three difficulty levels to measure the capacity of a method to understand and generate a specific color. We evaluate the color accuracy by computing the average L2 distance between the region and target RGB values. We report the change in distance toward the target color.\n```\npython evaluation/benchmark_color.py --category html --folder eval_color_html\npython evaluation/benchmark_color.py --category rgb --folder eval_color_rgb\npython evaluation/benchmark_color.py --category common --folder eval_color_common\n```\n\n\n## Visualize token maps\n\n![teaser](assets/visualization.png)\n\n\nEvery time the function `get_token_maps()` is called, the resulting segmentation and token maps are also visualized and saved locally for debugging purposes. You can also manually visualize the maps for tokens in a text prompt with the following script.\n\n```\npython visualize_token_maps.py --text_prompt \"a camera on a tripod taking a picture of a cat.\" --token_ids 1 4 10 --num_segments 15 --segment_threshold 0.45 --model SDXL\n```\n\n## Citation\n\n``` bibtex\n@inproceedings{ge2023expressive,\n      title={Expressive text-to-image generation with rich text},\n      author={Ge, Songwei and Park, Taesung and Zhu, Jun-Yan and Huang, Jia-Bin},\n      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},\n      year={2023}\n}\n```\n\n## Acknowledgement\n\nWe thank Mia Tang, Aaron Hertzmann, Nupur Kumari, Gaurav Parmar, Ruihan Gao, and Aniruddha Mahapatra for their helpful discussion and paper reading. 
We thank AK, Radamés Ajna, and other HuggingFace team members for their help and support with the [online demo](https://huggingface.co/spaces/songweig/rich-text-to-image). Our rich-text editor is built on [Quill](https://quilljs.com/). Our model code is built on [huggingface / diffusers](https://github.com/huggingface/diffusers#readme).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSongweiGe%2Frich-text-to-image","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSongweiGe%2Frich-text-to-image","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSongweiGe%2Frich-text-to-image/lists"}