{"id":13562838,"url":"https://github.com/huggingface/diffusion-fast","last_synced_at":"2026-03-18T01:35:50.368Z","repository":{"id":209772735,"uuid":"724908474","full_name":"huggingface/diffusion-fast","owner":"huggingface","description":"Faster generation with text-to-image diffusion models.","archived":false,"fork":false,"pushed_at":"2025-06-28T05:30:12.000Z","size":116,"stargazers_count":229,"open_issues_count":0,"forks_count":15,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-11-13T02:42:01.889Z","etag":null,"topics":["diffusers","diffusion-models","pytorch","sdxl","text-to-image-generation"],"latest_commit_sha":null,"homepage":"https://pytorch.org/blog/accelerating-generative-ai-3/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggingface.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-11-29T03:09:59.000Z","updated_at":"2025-11-04T11:45:52.000Z","dependencies_parsed_at":"2024-06-26T12:24:05.346Z","dependency_job_id":"9a508c5e-cf1b-4deb-ab5e-be249591a428","html_url":"https://github.com/huggingface/diffusion-fast","commit_stats":null,"previous_names":["sayakpaul/sdxl-fast","huggingface/sdxl-fast"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/huggingface/diffusion-fast","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fdiffusion-fast","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fdiffusion-fast/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fdiffusion-fast/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fdiffusion-fast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggingface","download_url":"https://codeload.github.com/huggingface/diffusion-fast/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggingface%2Fdiffusion-fast/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30640295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-18T00:09:27.587Z","status":"ssl_error","status_checked_at":"2026-03-18T00:09:26.123Z","response_time":56,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusers","diffusion-models","pytorch","sdxl","text-to-image-generation"],"created_at":"2024-08-01T13:01:12.710Z","updated_at":"2026-03-18T01:35:50.318Z","avatar_url":"https://github.com/huggingface.png","language":"Python","readme":"# Diffusion, fast\n\nRepository for the blog post: [**Accelerating Generative AI Part III: Diffusion, Fast**](https://pytorch.org/blog/accelerating-generative-ai-3/). You can find a run down of the techniques on the [🤗 Diffusers website](https://huggingface.co/docs/diffusers/main/en/optimization/fp16) too. \n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/final-results-diffusion-fast/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30.png\" width=500\u003e\n\n\u003c/div\u003e\u003cbr\u003e\n\nCheck out the Flux edition here: [huggingface/flux-fast](https://github.com/huggingface/flux-fast/).\n\n\u003e [!WARNING]  \n\u003e This repository relies on the `torchao` package for all things quantization. Since the first version of this repo, the `torchao` package has changed its APIs significantly. More specifically, [this](https://github.com/huggingface/diffusion-fast/blob/f4fa861422d9819226eb2ceac247c85c3547130d/Dockerfile#L30) version was used to obtain the numbers in this repository. For more updated usage of `torchao`, please refer to the [`diffusers-torchao`](https://github.com/sayakpaul/diffusers-torchao) repository.\n\nSummary of the optimizations:\n\n* Running with the bfloat16 precision\n* `scaled_dot_product_attention` (SDPA)\n* `torch.compile`\n* Combining q,k,v projections for attention computation\n* Dynamic int8 quantization \n\nThese techniques are fairly generalizable to other pipelines too, as we show below.\n\nTable of contents:\n\n* [Setup](#setup-🛠️)\n* [Running benchmarking experiments](#running-a-benchmarking-experiment-🏎️)\n* [Code](#improvements-progressively-📈-📊)\n* [Results from other pipelines](#results-from-other-pipelines-🌋)\n\n## Setup 🛠️\n\nWe rely on pure PyTorch for the optimizations. You can refer to the [Dockerfile](./Dockerfile) to get the complete development environment setup. \n\nFor hardware, we used an 80GB 400W A100 GPU with its memory clock set to the maximum rate (1593 in our case).\n\nMeanwhile, these optimizations (BFloat16, SDPA, torch.compile, Combining q,k,v projections) can run on CPU platforms as well, and bring 4x latency improvement to Stable Diffusion XL (SDXL) on 4th Gen Intel® Xeon® Scalable processors.\n\n## Running a benchmarking experiment 🏎️\n\n[`run_benchmark.py`](./run_benchmark.py) is the main script for benchmarking the different optimization techniques. After an experiment has been done, you should expect to see two files:\n\n* A `.csv` file with all the benchmarking numbers.\n* A `.jpeg` image file corresponding to the experiment. \n\nRefer to the [`experiment-scripts/run_sd.sh`](./experiment-scripts/run_sd.sh) for some reference experiment commands. 
\n\n**Notes on running PixArt-Alpha experiments**:\n\n* Use the [`run_experiment_pixart.py`](./run_benchmark_pixart.py) for this.\n* Uninstall the current installation of `diffusers` and re-install it again like so: `pip install git+https://github.com/huggingface/diffusers@fuse-projections-pixart`.\n* Refer to the [`experiment-scripts/run_pixart.sh`](./experiment-scripts/run_pixart.sh) script for some reference experiment commands.\n\n_(Support for PixArt-Alpha is experimental.)_\n\nYou can use the [`prepare_results.py`](./prepare_results.py) script to generate a consolidated CSV file and a plot to visualize the results from it. This is best used after you have run a couple of benchmarking experiments already and have their corresponding CSV files.\n\nThe script also supports CPU platforms, you can refer to the [`experiment-scripts/run_sd_cpu.sh`](./experiment-scripts/run_sd_cpu.sh) for some reference experiment commands. \n\nTo run the script, you need the following dependencies:\n\n* pandas\n* matplotlib\n* seaborn\n\n## Improvements, progressively 📈 📊\n\n\u003cdetails\u003e\n  \u003csummary\u003eBaseline\u003c/summary\u003e\n\n```python\nfrom diffusers import StableDiffusionXLPipeline\n\n# Load the pipeline in full-precision and place its model components on CUDA.\npipe = StableDiffusionXLPipeline.from_pretrained(\n    \"stabilityai/stable-diffusion-xl-base-1.0\"\n).to(\"cuda\")\n\n# Run the attention ops without efficiency.\npipe.unet.set_default_attn_processor()\npipe.vae.set_default_attn_processor()\n\nprompt = \"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k\"\nimage = pipe(prompt, num_inference_steps=30).images[0]\n```\n\nWith this, we're at:\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_0.png\" width=500\u003e\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eBfloat16\u003c/summary\u003e\n\n```python\nfrom diffusers import StableDiffusionXLPipeline\nimport torch\n\npipe = StableDiffusionXLPipeline.from_pretrained(\n\t\"stabilityai/stable-diffusion-xl-base-1.0\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n\n# Run the attention ops without efficiency.\npipe.unet.set_default_attn_processor()\npipe.vae.set_default_attn_processor()\n\nprompt = \"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k\"\nimage = pipe(prompt, num_inference_steps=30).images[0]\n```\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_1.png\" width=500\u003e\n\n\u003c/div\u003e\n\n\u003e 💡 We later ran the experiments in float16 and found out that the recent versions of `torchao` do not incur numerical problems from float16.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003escaled_dot_product_attention\u003c/summary\u003e\n\n```python\nfrom diffusers import StableDiffusionXLPipeline\nimport torch\n\npipe = StableDiffusionXLPipeline.from_pretrained(\n\t\"stabilityai/stable-diffusion-xl-base-1.0\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n\nprompt = \"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k\"\nimage = pipe(prompt, num_inference_steps=30).images[0]\n```\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg 
src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_2.png\" width=500\u003e\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003etorch.compile\u003c/summary\u003e\u003cbr\u003e\n\nFirst, configure some compiler flags:\n\n```python\nfrom diffusers import StableDiffusionXLPipeline\nimport torch\n\n# Set the following compiler flags to make things go brrr.\ntorch._inductor.config.conv_1x1_as_mm = True\ntorch._inductor.config.coordinate_descent_tuning = True\ntorch._inductor.config.epilogue_fusion = False\ntorch._inductor.config.coordinate_descent_check_all_directions = True\n```\n\nThen load the pipeline:\n\n```python\npipe = StableDiffusionXLPipeline.from_pretrained(\n    \"stabilityai/stable-diffusion-xl-base-1.0\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n```\n\nCompile and perform inference:\n\n```python\n# Compile the UNet and VAE.\npipe.unet.to(memory_format=torch.channels_last)\npipe.vae.to(memory_format=torch.channels_last)\npipe.unet = torch.compile(pipe.unet, mode=\"max-autotune\", fullgraph=True)\npipe.vae.decode = torch.compile(pipe.vae.decode, mode=\"max-autotune\", fullgraph=True)\n\nprompt = \"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k\"\n\n# First call to `pipe` will be slow, subsequent ones will be faster.\nimage = pipe(prompt, num_inference_steps=30).images[0]\n```\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_3.png\" width=500\u003e\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eCombining attention projection matrices\u003c/summary\u003e\u003cbr\u003e\n\n```python\nfrom diffusers import StableDiffusionXLPipeline\nimport torch\n\n# Configure the compiler flags.\ntorch._inductor.config.conv_1x1_as_mm = True\ntorch._inductor.config.coordinate_descent_tuning = True\ntorch._inductor.config.epilogue_fusion = False\ntorch._inductor.config.coordinate_descent_check_all_directions = True\n\npipe = StableDiffusionXLPipeline.from_pretrained(\n    \"stabilityai/stable-diffusion-xl-base-1.0\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n\n# Combine attention projection matrices.\npipe.fuse_qkv_projections()\n\n# Compile the UNet and VAE.\npipe.unet.to(memory_format=torch.channels_last)\npipe.vae.to(memory_format=torch.channels_last)\npipe.unet = torch.compile(pipe.unet, mode=\"max-autotune\", fullgraph=True)\npipe.vae.decode = torch.compile(pipe.vae.decode, mode=\"max-autotune\", fullgraph=True)\n\nprompt = \"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k\"\n\n# First call to `pipe` will be slow, subsequent ones will be faster.\nimage = pipe(prompt, num_inference_steps=30).images[0]\n```\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_4.png\" width=500\u003e\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eDynamic quantization\u003c/summary\u003e\u003cbr\u003e\n\nStart by setting the compiler flags (this time, we have two new):\n\n```python\nfrom diffusers import StableDiffusionXLPipeline\nimport torch\n\nfrom torchao.quantization import apply_dynamic_quant, swap_conv2d_1x1_to_linear\n\n# Compiler flags. 
<details>
  <summary>Dynamic quantization</summary><br>

Start by setting the compiler flags (this time, there are two new ones):

```python
from diffusers import StableDiffusionXLPipeline
import torch

from torchao.quantization import apply_dynamic_quant, swap_conv2d_1x1_to_linear

# Compiler flags. The last two are new.
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True
```

Then write the filtering functions that decide which modules get dynamic quantization:

```python
def dynamic_quant_filter_fn(mod, *args):
    # Quantize Linear layers, skipping the shapes listed below.
    return (
        isinstance(mod, torch.nn.Linear)
        and mod.in_features > 16
        and (mod.in_features, mod.out_features)
        not in [
            (1280, 640),
            (1920, 1280),
            (1920, 640),
            (2048, 1280),
            (2048, 2560),
            (2560, 1280),
            (256, 128),
            (2816, 1280),
            (320, 640),
            (512, 1536),
            (512, 256),
            (512, 512),
            (640, 1280),
            (640, 1920),
            (640, 320),
            (640, 5120),
            (640, 640),
            (960, 320),
            (960, 640),
        ]
    )


def conv_filter_fn(mod, *args):
    # Target only the 1x1 convs with 128 input or output channels.
    return (
        isinstance(mod, torch.nn.Conv2d)
        and mod.kernel_size == (1, 1)
        and 128 in [mod.in_channels, mod.out_channels]
    )
```

Then we're ready for inference:

```python
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Combine the attention projection matrices.
pipe.fuse_qkv_projections()

# Change the memory layout.
pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

# Swap the pointwise convs with linears.
swap_conv2d_1x1_to_linear(pipe.unet, conv_filter_fn)
swap_conv2d_1x1_to_linear(pipe.vae, conv_filter_fn)

# Apply dynamic quantization.
apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)

# Compile.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]
```

<div align="center">

<img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_5.png" width=500>

</div>

</details>
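"Dynamic" here means the activation scales are computed on the fly from each incoming tensor, while the weights are quantized ahead of time. Below is a from-scratch sketch of symmetric dynamic int8 quantization for a single matmul; it illustrates the concept only (torchao runs this through fused int8 kernels), and all shapes are arbitrary.

```python
# A from-scratch sketch of dynamic int8 quantization for one matmul.
# Conceptual only: torchao uses fused int8 kernels, not this eager-mode math.
import torch

def quantize_per_row(t: torch.Tensor):
    # Symmetric int8 quantization with one scale per row.
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.clamp(torch.round(t / scale), -128, 127).to(torch.int8), scale

w = torch.randn(1280, 640)             # weights: quantized once, ahead of time
w_int8, w_scale = quantize_per_row(w)

x = torch.randn(16, 640)               # activations: arrive at runtime
x_int8, x_scale = quantize_per_row(x)  # the "dynamic" step, done per call

# Integer matmul, then rescale the accumulator back to floating point.
acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32).t()
y = acc.to(torch.float32) * x_scale * w_scale.t()

print(torch.nn.functional.mse_loss(y, x @ w.t()))  # small quantization error
```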
href=\"https://huggingface.co/runwayml/stable-diffusion-v1-5\"\u003erunwayml/stable-diffusion-v1-5\u003c/a\u003e\u003c/sup\u003e\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003ePixart-Alpha\u003c/summary\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/final-results-diffusion-fast/PixArt-%24%5Calpha%24%2C_Batch_Size%3A_1%2C_Steps%3A_30.png\" width=500\u003e\n\u003cbr\u003e\u003csup\u003e\u003ca href=\"https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS\"\u003ePixArt-alpha/PixArt-XL-2-1024-MS\u003c/a\u003e\u003c/sup\u003e\n\n\u003c/div\u003e\n\n\u003c/details\u003e\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fdiffusion-fast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggingface%2Fdiffusion-fast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggingface%2Fdiffusion-fast/lists"}