{"id":13456959,"url":"https://github.com/cloneofsimo/paint-with-words-sd","last_synced_at":"2025-04-05T02:03:37.928Z","repository":{"id":62805544,"uuid":"562595985","full_name":"cloneofsimo/paint-with-words-sd","owner":"cloneofsimo","description":"Implementation of Paint-with-words with Stable Diffusion : method from eDiff-I that let you generate image from text-labeled segmentation map.","archived":false,"fork":false,"pushed_at":"2023-03-24T03:38:19.000Z","size":43302,"stargazers_count":642,"open_issues_count":15,"forks_count":50,"subscribers_count":22,"default_branch":"master","last_synced_at":"2025-03-29T01:02:50.773Z","etag":null,"topics":["diffusion","generative-model","stable-diffusion"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cloneofsimo.png","metadata":{"files":{"readme":"README.md","changelog":"change_model_path.py","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-06T20:26:30.000Z","updated_at":"2025-03-24T10:55:37.000Z","dependencies_parsed_at":"2024-07-31T08:25:02.304Z","dependency_job_id":null,"html_url":"https://github.com/cloneofsimo/paint-with-words-sd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloneofsimo%2Fpaint-with-words-sd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloneofsimo%2Fpaint-with-words-sd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloneofsimo%2Fpaint-with-words-sd/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloneofsimo%2Fpaint-with-words-sd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cloneofsimo","download_url":"https://codeload.github.com/cloneofsimo/paint-with-words-sd/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247276159,"owners_count":20912288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion","generative-model","stable-diffusion"],"created_at":"2024-07-31T08:01:30.886Z","updated_at":"2025-04-05T02:03:37.874Z","avatar_url":"https://github.com/cloneofsimo.png","language":"Jupyter Notebook","funding_links":[],"categories":["Spatial Control","Jupyter Notebook"],"sub_categories":[],"readme":"# Paint-with-Words, Implemented with Stable diffusion\n\n## Subtle Control of the Image Generation\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/rabbit_mage.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e Notice how without PwW the cloud is missing.\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/road.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e Notice how without PwW, abandoned city is missing, and road becomes purple as well.\n\n## Shift the object : Same seed, just the segmentation map's positional difference\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/aurora_1_merged.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003c!-- 
#region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/aurora_2_merged.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e \"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed.\"\n\n\u003e Notice how nearly all of the composition remains the same, other than the position of the moon.\n\n---\n\nRecently, researchers from NVIDIA proposed [eDiffi](https://arxiv.org/abs/2211.01324). In the paper, they suggest a method that allows \"painting with words\". In essence, it is like make-a-scene, but it works only by adjusting the cross-attention scores. You can see the results and the detailed method in the paper.\n\nTheir method was not open-sourced. Yet paint-with-words can be implemented with Stable Diffusion, since both share a common cross-attention module. So, I implemented it with Stable Diffusion.\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/paint_with_words_figure.png\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n# Installation\n\n```bash\npip install git+https://github.com/cloneofsimo/paint-with-words-sd.git\n```\n\n# Basic Usage\n\nBefore running, set the variable `HF_TOKEN` in the `.env` file to your Hugging Face token for Stable Diffusion, and call `load_dotenv()`.\n\nPrepare a segmentation map and a color-to-tag-label mapping such as the one below. 
Keys are (R, G, B) tuples, and values are tag labels.\n\n```python\n{\n    (0, 0, 0): \"cat,1.0\",\n    (255, 255, 255): \"dog,1.0\",\n    (13, 255, 0): \"tree,1.5\",\n    (90, 206, 255): \"sky,0.2\",\n    (74, 18, 1): \"ground,0.2\",\n}\n```\n\nEach value must be in the format \"{label},{strength}\", where strength is the additional weight applied to that label's attention score during generation; a larger strength gives the label more effect.\n\n```python\n\nimport dotenv\nfrom PIL import Image\n\nfrom paint_with_words import paint_with_words\n\nsettings = {\n    \"color_context\": {\n        (0, 0, 0): \"cat,1.0\",\n        (255, 255, 255): \"dog,1.0\",\n        (13, 255, 0): \"tree,1.5\",\n        (90, 206, 255): \"sky,0.2\",\n        (74, 18, 1): \"ground,0.2\",\n    },\n    \"color_map_img_path\": \"contents/example_input.png\",\n    \"input_prompt\": \"realistic photo of a dog, cat, tree, with beautiful sky, on sandy ground\",\n    \"output_img_path\": \"contents/output_cat_dog.png\",\n}\n\n\ndotenv.load_dotenv()\n\ncolor_map_image = Image.open(settings[\"color_map_img_path\"]).convert(\"RGB\")\ncolor_context = settings[\"color_context\"]\ninput_prompt = settings[\"input_prompt\"]\n\nimg = paint_with_words(\n    color_context=color_context,\n    color_map_image=color_map_image,\n    input_prompt=input_prompt,\n    num_inference_steps=30,\n    guidance_scale=7.5,\n    device=\"cuda:0\",\n)\n\nimg.save(settings[\"output_img_path\"])\n\n```\n\nThere is a minimal, self-contained working example in `runner.py`. Please have a look!\n\n---\n\n# Weight Scaling\n\nIn the paper, they use $w \\log (1 + \\sigma)  \\max (Q^T K)$ to scale the attention weight. However, a few tests showed that this is not optimal, as found by [CookiePPP](https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4406). 
You can check out the effect of the functions below:\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/compare_std.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e $w' = w \\log (1 + \\sigma)  \\mathrm{std} (Q^T K)$\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/compare_max.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e $w' = w \\log (1 + \\sigma)  \\max (Q^T K)$\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/compare_log2_std.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e $w' = w \\log (1 + \\sigma^2)  \\mathrm{std} (Q^T K)$\n\nYou can define your own weight function and further tweak the configuration by passing the `weight_function` argument to `paint_with_words`.\n\nExample:\n\n```python\nimport math\n\n# Custom weight function: arguments are the label strength w, the noise level sigma, and the attention scores qk.\nw_f = lambda w, sigma, qk: 0.4 * w * math.log(sigma**2 + 1) * qk.std()\n\nimg = paint_with_words(\n    color_context=color_context,\n    color_map_image=color_map_image,\n    input_prompt=input_prompt,\n    num_inference_steps=20,\n    guidance_scale=7.5,\n    device=\"cuda:0\",\n    preloaded_utils=loaded,\n    weight_function=w_f\n)\n```\n\n## More on the weight function (but higher)\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/compare_4_std.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e $w' = w \\log (1 + \\sigma)  \\mathrm{std} (Q^T K)$\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/compare_4_max.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e $w' = w \\log (1 + \\sigma)  \\max (Q^T K)$\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/compare_4_log2_std.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e $w' = w \\log (1 + \\sigma^2)  \\mathrm{std} (Q^T K)$\n\n# Regional-based seeding\n\nFollowing this example, where the random seed for the whole image 
is 0,\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/aurora_1_merged.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003e \"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed.\"\n\nthe random seeds for 'boat', 'moon', and 'mountain' are set to the various values shown in the top row.\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/cmp_regional_based_seeing.png\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nExample:\n\n```python\n\nEXAMPLE_SETTING_4_seed = {\n    \"color_context\": {\n        (7, 9, 182): \"aurora,0.5,-1\",\n        (136, 178, 92): \"full moon,1.5,-1\",\n        (51, 193, 217): \"mountains,0.4,-1\",\n        (61, 163, 35): \"a half-frozen lake,0.3,-1\",\n        (89, 102, 255): \"boat,2.0,2077\",\n    },\n    \"color_map_img_path\": \"contents/aurora_1.png\",\n    \"input_prompt\": \"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed.\",\n    \"output_img_path\": \"contents/aurora_1_seed_output.png\",\n}\n```\nwhere the 3rd item of each context value is the random seed for that object. Use -1 to follow the seed set in the `paint_with_words` function. 
In this example, the random seed of the boat is set to 2077.\n\n# Image inpainting\nFollowing the previous example, the figure below shows the results of image inpainting with paint-with-word.\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/pww_inpainting.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nThe top row shows an example of editing the moon's size by inpainting.\nThe bottom row shows an example of re-synthesizing the moon by inpainting with the original \"input color map\" used for text-to-image paint-with-word.\n\n\nExample:\n\n```python\nimport math\n\nfrom paint_with_words import paint_with_words_inpaint\n\n\nimg = paint_with_words_inpaint(\n    color_context=color_context,\n    color_map_image=color_map_image,\n    init_image=init_image,\n    mask_image=mask_image,\n    input_prompt=input_prompt,\n    num_inference_steps=150,\n    guidance_scale=7.5,\n    device=\"cuda:0\",\n    seed=81,\n    weight_function=lambda w, sigma, qk: 0.15 * w * math.log(1 + sigma) * qk.max(),\n    strength=1.0,\n)\n```\n\nTo run inpainting:\n\n```bash\npython runner_inpaint.py\n```\n\n# Using other Fine-tuned models\n\nIf you are from the Automatic1111 community, you may be used to native LDM checkpoint formats rather than the diffusers checkpoint format. 
Luckily, [this quick script](https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py) allows conversion.\n\n```bash\npython change_model_path.py --checkpoint_path custom_model.ckpt --scheduler_type ddim --dump_path custom_model_diffusion_format\n```\n\nNow, use the converted model in the `paint_with_words` function.\n\n```python\nimport math\n\nfrom diffusers import LMSDiscreteScheduler\n\nfrom paint_with_words import paint_with_words, pww_load_tools\n\nloaded = pww_load_tools(\n    \"cuda:0\",\n    scheduler_type=LMSDiscreteScheduler,\n    local_model_path=\"./custom_model_diffusion_format\"\n)\n#...\nimg = paint_with_words(\n    color_context=color_context,\n    color_map_image=color_map_image,\n    input_prompt=input_prompt,\n    num_inference_steps=30,\n    guidance_scale=7.5,\n    device=\"cuda:0\",\n    weight_function=lambda w, sigma, qk: 0.4 * w * math.log(1 + sigma) * qk.max(),\n    preloaded_utils=loaded\n)\n```\n\n# Example Notebooks\n\nYou can view the minimal working notebook [here](./contents/notebooks/paint_with_words.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1MZfGaY3aQQn5_T-6bkXFE1rI59A2nJlU?usp=sharing)\n\n- [Painting with words](./contents/notebooks/paint_with_words.ipynb)\n\n- [Painting with words + Textual Inversion](./contents/notebooks/paint_with_words_textual_inversion.ipynb)\n\n---\n\n# Gradio interface\n## Paint-with-word\nTo launch the Gradio interface:\n\n```bash\npython gradio_pww.py\n```\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/gradio_demo.png\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nNote that the \"Color context\" should follow the format used in the example in `runner.py`. 
\nFor example, \n\u003e {(7, 9, 182): \"aurora,0.5,-1\",(136, 178, 92): \"full moon,1.5,-1\",(51, 193, 217): \"mountains,0.4,-1\",(61, 163, 35): \"a half-frozen lake,0.3,-1\",(89, 102, 255): \"boat,2.0,2077\",}\n\n### Color content extraction\nOne can extract the color content from the \"Segmentation map\" by expanding the \"Color content option\". \nPress the button \"Extract color content\" to extract the unique colors of the image.\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/gradio_color_content_demo_0.png\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nIn \"Color content option\", the extracted colors are shown, one per item. One can then replace \"obj\" with the object that appears in the prompt. Importantly, don't use \",\" in the object name, as it is the separator of the color content.\n\nClick the button \"Generate color content\" to collect all the contents into the \"Color content\" textbox as the formal input of Paint-with-word.\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/gradio_color_content_demo.png\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nThe same function is supported for Paint-with-word for image inpainting, as shown below.\n\n## Paint-with-word for image inpainting\nTo launch the Gradio interface:\n\n```bash\npython gradio_pww_inpaint.py\n```\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/gradio_inpaint_demo.png\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\n# Paint with Word (PwW) + ControlNet Extension for [AUTOMATIC1111(A1111) stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui)\n\nThis extension provides additional PwW control to ControlNet. 
See [sd-webui-controlnet-pww](https://github.com/lwchen6309/sd-webui-controlnet-pww) for the repo of this module.\n\nThe demo is shown below.\n\n![screencapture-127-0-0-1-7860-2023-03-13-10_56_34](https://user-images.githubusercontent.com/42672685/225545442-bdb481ec-e234-475e-900d-e9340c0c7deb.png)\n\nThe implementation is based on the great [controlnet extension for A1111](https://github.com/Mikubill/sd-webui-controlnet).\n\n## Benchmark of ControlNet + PwW\n\nThe following figure shows the comparison between the ControlNet results and the ControlNet+PwW results for the boat examples. \n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/cn_pww/cn_pww_boat.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nNote that PwW makes the background, e.g. aurora and mountains, more realistic as the weight function scale increases. \n\nThe setups are detailed as follows.\n\nScribble and Segmentation map:\n\n\u003cp float=\"middle\"\u003e\n  \u003cimg src=\"contents/cn_pww/user1.png\" width=\"200\" /\u003e\n  \u003cimg src=\"contents/cn_pww/seg_map1.png\" width=\"200\" /\u003e \n\u003c/p\u003e\n\nPrompts:\n\n\u003e \"A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. 
Highly detailed.\"\n\nColor contents: \n\n\u003e \"{(7, 9, 182): \"aurora@0.5@-1\",(136, 178, 92): \"full moon@1.5@-1\",(51, 193, 217): \"mountains@0.4@-1\",(61, 163, 35): \"a half-frozen lake@0.3@-1\",(89, 102, 255): \"boat@2.0@-1\",}\"\n\nNote that the A1111 extension now uses \"@\" as the separator instead of \",\".\n\n## Assign the material for the specific region in scribble\n\nOne can use PwW to assign a material to a specific region of the scribble; see the results comparing ControlNet and ControlNet+PwW below.\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/cn_pww/cn_pww_turtle.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\n\u003c!-- #region --\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg  src=\"contents/cn_pww/cn_pww_ballon.jpg\"\u003e\n\u003c/p\u003e\n\u003c!-- #endregion --\u003e\n\nNote that the material of the turtle shell specified by PwW is significantly improved, as shown in the right blocks.\nPlease see [sd-webui-controlnet-pww](https://github.com/lwchen6309/sd-webui-controlnet-pww#assign-the-material-for-the-specific-region-in-scribble) for the experimental setups.\n\n## Installation\n\n### (1) Clone the source code to A1111 webui extensions\nOne can install by copying the `pww_controlnet` directory into the extensions directory of the A1111 webui:\n\n```bash\ncp -rf pww_controlnet path/stable-diffusion-webui/extensions/\n```\n\nor simply\n\n```bash\ncd path/stable-diffusion-webui/extensions/\ngit clone git@github.com:lwchen6309/sd-webui-controlnet-pww.git\n```\n\nwhere `path` is the location of the A1111 webui.\n\n### (2) Setup pretrained model of ControlNet\nPlease follow the instructions of the [controlnet extension](https://github.com/Mikubill/sd-webui-controlnet) to get the pretrained models. \n\n#### IMPORTANT: This extension is currently NOT compatible with the [ControlNet extension](https://github.com/Mikubill/sd-webui-controlnet), as reported in [this issue](https://github.com/cloneofsimo/paint-with-words-sd/issues/38). 
Hence, please disable the ControlNet extension before you install ControlNet+PwW.\n\nHowever, one can still make them compatible by following [the installation instructions](https://github.com/lwchen6309/sd-webui-controlnet-pww/tree/fc7b0e4471f1da491d12a2f12f3f0487bb671696#important-this-extension-is-currently-not-compatible-with-controlnet-extension-as-reported-at-this-issue-hence-please-disable-the-controlnet-extension-before-you-install-controlnetpww-this-repo-will-sync-the-latest-controlnet-extension-and-should-therefore-includes-its-original-function).\n\n\n# TODO\n\n- [ ] Make extensive comparisons for different weight scaling functions.\n- [ ] Create word latent-based cross-attention generations.\n- [ ] Check if the statement \"making background weight smaller is better\" is justifiable, by using some standard metrics\n- [x] Create AUTOMATIC1111's interface\n- [x] Create Gradio interface\n- [x] Create tutorial\n- [ ] See if starting with some \"known image latent\" is helpful. If it is, we might as well hard-code some initial latent.\n- [x] Region-based seeding, where we set a seed for each region. Can be simply implemented with an extra argument in `COLOR_CONTEXT`\n- [ ] Sentence-wise text separation. Currently token is the smallest unit that influences cross-attention. This needs to be fixed. (Can be done pretty trivially)\n- [x] Allow different models to be used. Use [this](https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py).\n- [ ] \"negative region\", where we can set some region to \"not\" have some semantics. 
can be done with classifier-free guidance.\n- [x] Img2ImgPaintWithWords -\u003e Img2Img, but with extra text segmentation map for better control\n- [x] InpaintPaintwithWords -\u003e inpaint, but with extra text segmentation map for better control\n- [x] Support for other schedulers\n\n# Acknowledgement\nThanks for the inspiring gradio interface from [ControlNet](https://github.com/lllyasviel/ControlNet)\n\nThanks for the wonderful [A1111 extension of controlnet](https://github.com/Mikubill/sd-webui-controlnet) as the baseline of our implementation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloneofsimo%2Fpaint-with-words-sd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcloneofsimo%2Fpaint-with-words-sd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloneofsimo%2Fpaint-with-words-sd/lists"}