{"id":15032728,"url":"https://github.com/chengzeyi/stable-fast","last_synced_at":"2025-04-11T00:52:23.816Z","repository":{"id":200659288,"uuid":"706023626","full_name":"chengzeyi/stable-fast","owner":"chengzeyi","description":"https://wavespeed.ai/ Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.","archived":false,"fork":false,"pushed_at":"2025-03-27T08:07:20.000Z","size":419,"stargazers_count":1246,"open_issues_count":63,"forks_count":78,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-04-11T00:52:12.596Z","etag":null,"topics":["cuda","deeplearnng","diffusers","inference-engines","openai-triton","performance-optimizations","pytorch","stable-diffusion","stable-video-diffusion","torch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chengzeyi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-17T06:49:59.000Z","updated_at":"2025-04-10T04:24:33.000Z","dependencies_parsed_at":"2023-12-11T07:43:03.903Z","dependency_job_id":"d37a38cd-739c-4d10-9c59-eac5ddfd81fc","html_url":"https://github.com/chengzeyi/stable-fast","commit_stats":{"total_commits":220,"total_committers":2,"mean_commits":110.0,"dds":0.004545454545454519,"last_synced_commit":"61269318a52092b12134adbfbc000d004ef7c286"},"previous_names":["chengzeyi/stable-fast"],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chengzeyi%2Fstable-fast","tags_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/chengzeyi%2Fstable-fast/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chengzeyi%2Fstable-fast/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chengzeyi%2Fstable-fast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chengzeyi","download_url":"https://codeload.github.com/chengzeyi/stable-fast/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248322609,"owners_count":21084336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","deeplearnng","diffusers","inference-engines","openai-triton","performance-optimizations","pytorch","stable-diffusion","stable-video-diffusion","torch"],"created_at":"2024-09-24T20:19:15.640Z","updated_at":"2025-04-11T00:52:23.789Z","avatar_url":"https://github.com/chengzeyi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀Stable Fast\n\n[Blazing Fast FLUX-dev with LoRAs](https://wavespeed.ai/models/wavespeed-ai/flux-dev-lora)\n\n[Blazing Fast Wan 2.1 T2V with LoRAs](https://wavespeed.ai/models/wavespeed-ai/wan-2.1/t2v-480p)\n\n[Blazing Fast Wan 2.1 I2V with LoRAs](https://wavespeed.ai/models/wavespeed-ai/wan-2.1/i2v-480p)\n\n## 🎉Important Announcement🎉\n\nAfter one year of delay, I am happy to announce I plan to build a new project [Comfy-WaveSpeed](https://github.com/chengzeyi/Comfy-WaveSpeed) to provide the fastest inference speed for all models running with `ComfyUI`.\nIt's just started and I hope it 
will be a great project👏. Please follow it and give me feedback👍!\n\n[![wheels](https://github.com/chengzeyi/stable-fast/actions/workflows/wheels.yml/badge.svg?branch=main)](https://github.com/chengzeyi/stable-fast/actions/workflows/wheels.yml)\n[![Upload Python Package](https://github.com/chengzeyi/stable-fast/actions/workflows/python-publish.yml/badge.svg)](https://github.com/chengzeyi/stable-fast/actions/workflows/python-publish.yml)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chengzeyi/stable-fast-colab/blob/main/stable_fast_colab.ipynb)\n\n__NOTE__\n\nActive development on `stable-fast` has been paused. I am currently working on a new `torch._dynamo` based project targeting new models such as `stable-cascade`, `SD3` and `Sora`-like models.\nIt will be faster and more flexible, and will support more hardware backends than just `CUDA`.\n\nFeel free to contact me.\n\n[Discord Channel](https://discord.gg/kQFvfzM4SJ)\n\n`stable-fast` achieves SOTA inference performance on __ALL__ kinds of diffuser models, even with the latest `StableVideoDiffusionPipeline`.\nAnd unlike `TensorRT` or `AITemplate`, which take dozens of minutes to compile a model, `stable-fast` only takes a few seconds to compile a model.\n`stable-fast` also supports `dynamic shape`, `LoRA` and `ControlNet` out of the 
box.\n\n[![](https://mermaid.ink/img/pako:eNpVUsFu2zAM_RVCQJCLnchOncQ-DBjWHXbYpc1hWNUDbdO2AFsyLKVNFvjfR8VFhwkwQT5LfOQjb6KyNYlCxHGsTGVNo9tCGeBzuX7rcPIfUTjvuvZdAUcp_2Ed6bbzBaQZg_ckq9VNG83Qbe07GmhdwLqxEzm_nmFerZS5XKuQOS7JI3R20n-s8dgr47XvCZQ46YGA30BLhib02rRgDYEesCUuw3fw_AjJJosgS9ILf-xIcJ5GF_FNeDr9ggd5lEowW4wX7eBFiTc0uu8RvJ2qTomIme7uprLDqHtaoK8_TjSMPfqPmImb3r4vwYmMs9PTCfKN3CQL5jyWPcUNOq_EqzLXhVCJgdm0I1a1dkqAhDj-AqkM8ilT4gQvyTE_RJBkiWSbZEe2Uu4iyPMkmOxVRGKgaUBd84xuQXauOaiqRMFuiY5CjzPfw7O3z1dTicJPZ4rEeay5h0eN7YTD_-D3WnPjomiwdwz2Fmvi8Cb8dQzL0GrnOeOyDgE_Tz3DnfejK7bb8HvT8hTOZZBu63Qd5tm95fvtPt0fMd3R_rDDbLerqzLJj036kDT1QSYpinmOxIjmt7XDZwF0r-fnson3hZz_AskE0h8?type=png)](https://mermaid.live/edit#pako:eNpVUsFu2zAM_RVCQJCLnchOncQ-DBjWHXbYpc1hWNUDbdO2AFsyLKVNFvjfR8VFhwkwQT5LfOQjb6KyNYlCxHGsTGVNo9tCGeBzuX7rcPIfUTjvuvZdAUcp_2Ed6bbzBaQZg_ckq9VNG83Qbe07GmhdwLqxEzm_nmFerZS5XKuQOS7JI3R20n-s8dgr47XvCZQ46YGA30BLhib02rRgDYEesCUuw3fw_AjJJosgS9ILf-xIcJ5GF_FNeDr9ggd5lEowW4wX7eBFiTc0uu8RvJ2qTomIme7uprLDqHtaoK8_TjSMPfqPmImb3r4vwYmMs9PTCfKN3CQL5jyWPcUNOq_EqzLXhVCJgdm0I1a1dkqAhDj-AqkM8ilT4gQvyTE_RJBkiWSbZEe2Uu4iyPMkmOxVRGKgaUBd84xuQXauOaiqRMFuiY5CjzPfw7O3z1dTicJPZ4rEeay5h0eN7YTD_-D3WnPjomiwdwz2Fmvi8Cb8dQzL0GrnOeOyDgE_Tz3DnfejK7bb8HvT8hTOZZBu63Qd5tm95fvtPt0fMd3R_rDDbLerqzLJj036kDT1QSYpinmOxIjmt7XDZwF0r-fnson3hZz_AskE0h8)\n\n[![](https://mermaid.ink/img/pako:eNpFUk1v2zAM_SuEgCAXu3HsfLg-7LL22MsaDMWqHmiLtgXYUmAxWTLD_310UrQ8SY8S3-MjR1V5Q6pQcRxrV3lX26bQDiQu158tDvx5m-OvNdwWkCfJN9aSbVouIE0FvBVZLEbrrEDjklvqaVnAsvYDBV5OMC0W2l2u1Vw5LokRWj_Yf94xdtqx5Y5Aq4PtCeQPNORoQLauAe8IztaQFxncwuvvp_jtEME6STeX7X4XQbqFesCeQiRv4dfhDTZJnmglfDFebIB3rc7obNchsB-qVqtIuG7Hh8r3R9vRHQqMZUdxjYG1-tDuev8vCRKHTNAKEojjH0IuTWtX4gDveRbBPolgk3-oSPU09GiNGDvOXgnNbIVWhRxLDDTLmuQdnti_Xl2lCh5OFKnT0SDTk8VGWlFFjV34Qp-NFbFfYOfRkFxHxdfjPMLGBpaS9yHO-GnoBG6Zj6FYreb0QyPencq53VWwZp5Ce37crXbpLsc0o90-w22WmapcP-Z1ulnXZp-sU1TTFKkjuj_ef6uim56X-_7c1mj6D1vdvkY?type=png)](https://mermaid.live/edit#pako:eNpFUk1v2zAM_SuEgCAXu3HsfLg-7LL22MsaDMWqHmiLtgXYUmAxWTLD_310
UrQ8SY8S3-MjR1V5Q6pQcRxrV3lX26bQDiQu158tDvx5m-OvNdwWkCfJN9aSbVouIE0FvBVZLEbrrEDjklvqaVnAsvYDBV5OMC0W2l2u1Vw5LokRWj_Yf94xdtqx5Y5Aq4PtCeQPNORoQLauAe8IztaQFxncwuvvp_jtEME6STeX7X4XQbqFesCeQiRv4dfhDTZJnmglfDFebIB3rc7obNchsB-qVqtIuG7Hh8r3R9vRHQqMZUdxjYG1-tDuev8vCRKHTNAKEojjH0IuTWtX4gDveRbBPolgk3-oSPU09GiNGDvOXgnNbIVWhRxLDDTLmuQdnti_Xl2lCh5OFKnT0SDTk8VGWlFFjV34Qp-NFbFfYOfRkFxHxdfjPMLGBpaS9yHO-GnoBG6Zj6FYreb0QyPencq53VWwZp5Ce37crXbpLsc0o90-w22WmapcP-Z1ulnXZp-sU1TTFKkjuj_ef6uim56X-_7c1mj6D1vdvkY)\n\n| Model       | torch | torch.compile | AIT  | oneflow | TensorRT | __stable-fast__ |\n| ----------- | ----- | ------------- | ---- | ------- | -------- | --------------- |\n| SD 1.5 (ms) | 1897  | 1510          | 1158 | 1003    | 991      | __995__         |\n| SVD-XT (s)  | 83    | 70            |      |         |          | __47__          |\n\n__NOTE__: During benchmarking, `TensorRT` is tested with `static batch size` and `CUDA Graph enabled` while `stable-fast` is running with dynamic shape.\n\n- [🚀Stable Fast](#stable-fast)\n  - [Introduction](#introduction)\n    - [What is this?](#what-is-this)\n    - [Differences With Other Acceleration Libraries](#differences-with-other-acceleration-libraries)\n  - [Installation](#installation)\n    - [Install Prebuilt Wheels](#install-prebuilt-wheels)\n    - [Install From Source](#install-from-source)\n  - [Usage](#usage)\n    - [Optimize StableDiffusionPipeline](#optimize-stablediffusionpipeline)\n    - [Optimize LCM Pipeline](#optimize-lcm-pipeline)\n    - [Optimize StableVideoDiffusionPipeline](#optimize-stablevideodiffusionpipeline)\n    - [Dynamically Switch LoRA](#dynamically-switch-lora)\n    - [Model Quantization](#model-quantization)\n    - [Some Common Methods To Speed Up PyTorch](#some-common-methods-to-speed-up-pytorch)\n  - [Performance Comparison](#performance-comparison)\n    - [RTX 4080 (512x512, batch size 1, fp16, in WSL2)](#rtx-4080-512x512-batch-size-1-fp16-in-wsl2)\n    - [H100](#h100)\n    - [A100](#a100)\n  - 
[Compatibility](#compatibility)\n  - [Troubleshooting](#troubleshooting)\n\n## Introduction\n\n### What is this?\n\n`stable-fast` is an ultra lightweight inference optimization framework for __HuggingFace Diffusers__ on __NVIDIA GPUs__.\n`stable-fast` provides super fast inference optimization by utilizing some key techniques and features:\n\n- __CUDNN Convolution Fusion__: `stable-fast` implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of `Conv + Bias + Add + Act` computation patterns.\n- __Low Precision \u0026 Fused GEMM__: `stable-fast` implements a series of fused GEMM operators that compute with `fp16` precision, which is faster than PyTorch's defaults (read \u0026 write in `fp16` but compute in `fp32`).\n- __Fused Linear GEGLU__: `stable-fast` is able to fuse `GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)` into one CUDA kernel.\n- __NHWC \u0026 Fused GroupNorm__: `stable-fast` implements a highly optimized fused NHWC `GroupNorm + Silu` operator with OpenAI's `Triton`, which eliminates the need for memory format permutation operators.\n- __Fully Traced Model__: `stable-fast` improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline/StableVideoDiffusionPipeline` can be traced and converted to __TorchScript__. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports __ControlNet__ and __LoRA__.\n- __CUDA Graph__: `stable-fast` can capture the `UNet`, `VAE` and `TextEncoder` into CUDA Graph format, which can reduce the CPU overhead when the batch size is small. 
This implementation also supports dynamic shapes.\n- __Fused Multihead Attention__: `stable-fast` just uses xformers and makes it compatible with __TorchScript__.\n\nMy next goal is to keep `stable-fast` as one of the fastest inference optimization frameworks for `diffusers` and also\nprovide both speedup and VRAM reduction for `transformers`.\nIn fact, I already use `stable-fast` to optimize LLMs and achieve a significant speedup.\nBut I still need to do some work to make it more stable and easier to use and to provide a stable user interface.\n\n### Differences With Other Acceleration Libraries\n\n- __Fast__: `stable-fast` is specially optimized for __HuggingFace Diffusers__. It achieves high performance across many libraries. It also compiles very quickly, within only a few seconds. It is significantly faster than `torch.compile`, `TensorRT` and `AITemplate` in compilation time.\n- __Minimal__: `stable-fast` works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructure and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.\n- __Maximum Compatibility__: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`. 
And it even supports the latest `StableVideoDiffusionPipeline` out of the box!\n\n## Installation\n\n__NOTE__: `stable-fast` is currently only tested on `Linux` and `WSL2 in Windows`.\nYou need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).\n\nI only test `stable-fast` with `torch\u003e=2.1.0`, `xformers\u003e=0.0.22` and `triton\u003e=2.1.0` on `CUDA 12.1` and `Python 3.10`.\nOther versions might build and run successfully but that's not guaranteed.\n\n### Install Prebuilt Wheels\n\nDownload the wheel corresponding to your system from the [Releases Page](https://github.com/chengzeyi/stable-fast/releases) and install it with `pip3 install \u003cwheel file\u003e`.\n\nCurrently both __Linux__ and __Windows__ wheels are available.\n\n```bash\n# Change cu121 to your CUDA version and \u003cwheel file\u003e to the path of the wheel file.\n# And make sure the wheel file is compatible with your PyTorch version.\npip3 install --index-url https://download.pytorch.org/whl/cu121 \\\n    'torch\u003e=2.1.0' 'xformers\u003e=0.0.22' 'triton\u003e=2.1.0' 'diffusers\u003e=0.19.3' \\\n    '\u003cwheel file\u003e'\n```\n\n### Install From Source\n\n```bash\n# Make sure you have CUDNN/CUBLAS installed.\n# https://developer.nvidia.com/cudnn\n# https://developer.nvidia.com/cublas\n\n# Install PyTorch with CUDA and other packages first.\n# Windows users: Triton might not be available; you can skip it.\n# NOTE: 'wheel' is required or you will hit a `No module named 'torch'` error when building.\npip3 install wheel 'torch\u003e=2.1.0' 'xformers\u003e=0.0.22' 'triton\u003e=2.1.0' 'diffusers\u003e=0.19.3'\n\n# (Optional) Makes the build much faster.\npip3 install ninja\n\n# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.\n# You can also install the latest stable release from PyPI.\n# pip3 install -v -U stable-fast\npip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast\n# (this can take 
dozens of minutes)\n```\n\n__NOTE__: Any usage outside `sfast.compilers` is not guaranteed to be backward compatible.\n\n__NOTE__: To get the best performance, `xformers` and OpenAI's `triton\u003e=2.1.0` need to be installed and enabled.\nYou might need to build `xformers` from source to make it compatible with your `PyTorch`.\n\n## Usage\n\n### Optimize StableDiffusionPipeline\n\n`stable-fast` is able to optimize `StableDiffusionPipeline` and `StableDiffusionXLPipeline` directly.\n\n```python\nimport time\nimport torch\nfrom diffusers import (StableDiffusionPipeline,\n                       EulerAncestralDiscreteScheduler)\nfrom sfast.compilers.diffusion_pipeline_compiler import (compile,\n                                                         CompilationConfig)\n\ndef load_model():\n    model = StableDiffusionPipeline.from_pretrained(\n        'runwayml/stable-diffusion-v1-5',\n        torch_dtype=torch.float16)\n\n    model.scheduler = EulerAncestralDiscreteScheduler.from_config(\n        model.scheduler.config)\n    model.safety_checker = None\n    model.to(torch.device('cuda'))\n    return model\n\nmodel = load_model()\n\nconfig = CompilationConfig.Default()\n# xformers and Triton are suggested for achieving best performance.\ntry:\n    import xformers\n    config.enable_xformers = True\nexcept ImportError:\n    print('xformers not installed, skip')\ntry:\n    import triton\n    config.enable_triton = True\nexcept ImportError:\n    print('Triton not installed, skip')\n# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.\n# But it can increase the amount of GPU memory used.\n# For StableVideoDiffusionPipeline it is not needed.\nconfig.enable_cuda_graph = True\n\nmodel = compile(model, config)\n\nkwarg_inputs = dict(\n    prompt=\n    '(masterpiece:1,2), best quality, masterpiece, best detailed face, a beautiful girl',\n    height=512,\n    width=512,\n    num_inference_steps=30,\n    num_images_per_prompt=1,\n)\n\n# 
NOTE: Warm it up.\n# The initial calls will trigger compilation and might be very slow.\n# After that, it should be very fast.\nfor _ in range(3):\n    output_image = model(**kwarg_inputs).images[0]\n\n# Let's see it!\n# Note: Progress bar might work incorrectly due to the async nature of CUDA.\nbegin = time.time()\noutput_image = model(**kwarg_inputs).images[0]\nprint(f'Inference time: {time.time() - begin:.3f}s')\n\n# Let's view it in terminal!\nfrom sfast.utils.term_image import print_image\n\nprint_image(output_image, max_width=80)\n```\n\nRefer to [examples/optimize_stable_diffusion_pipeline.py](examples/optimize_stable_diffusion_pipeline.py) for more details.\n\nYou can check this Colab to see how it works on a T4 GPU: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chengzeyi/stable-fast-colab/blob/main/stable_fast_colab.ipynb)\n\n### Optimize LCM Pipeline\n\n`stable-fast` is able to optimize the newest `latent consistency model` pipeline and achieve a significant speedup.\n\nRefer to [examples/optimize_lcm_lora.py](examples/optimize_lcm_lora.py) for more details about how to optimize a normal SD model with LCM LoRA.\nRefer to [examples/optimize_lcm_pipeline.py](examples/optimize_lcm_pipeline.py) for more details about how to optimize the standalone LCM model.\n\n### Optimize StableVideoDiffusionPipeline\n\n`stable-fast` is able to optimize the newest `StableVideoDiffusionPipeline` and achieve a `2x` speedup.\n\nRefer to [examples/optimize_stable_video_diffusion_pipeline.py](examples/optimize_stable_video_diffusion_pipeline.py) for more details.\n\n### Dynamically Switch LoRA\n\nSwitching LoRA dynamically is supported but you need to do some extra work.\nIt is possible because the compiled graph and `CUDA Graph` share the same\nunderlying data (pointers) with the original UNet model. 
So all you need to do\nis to update the original UNet model's parameters in place.\n\nThe following code assumes you have already loaded a LoRA and compiled the model,\nand you want to switch to another LoRA.\n\nIf you don't enable CUDA graph and keep `preserve_parameters = True`, things could be much easier.\nThe following code might not even be needed.\n\n```python\n# load_state_dict with assign=True requires torch \u003e= 2.1.0\n\ndef update_state_dict(dst, src):\n    for key, value in src.items():\n        # Do an in-place copy.\n        # As the traced forward function shares the same underlying data (pointers),\n        # this modification will be reflected in the traced forward function.\n        dst[key].copy_(value)\n\n# Switch \"another\" LoRA into UNet\ndef switch_lora(unet, lora):\n    # Store the original UNet parameters\n    state_dict = unet.state_dict()\n    # Load another LoRA into unet\n    unet.load_attn_procs(lora)\n    # Copy the current UNet parameters in place into the original UNet parameters\n    update_state_dict(state_dict, unet.state_dict())\n    # Load the original UNet parameters back.\n    # We use assign=True because we still want to hold the references\n    # of the original UNet parameters\n    unet.load_state_dict(state_dict, assign=True)\n\nswitch_lora(compiled_model.unet, lora_b_path)\n```\n\n### Model Quantization\n\n`stable-fast` extends PyTorch's `quantize_dynamic` functionality and provides a dynamically quantized linear operator on the CUDA backend.\nBy enabling it, you could get a slight VRAM reduction for `diffusers` and a significant VRAM reduction for `transformers`,\nand might also get a speedup (not always).\n\nFor `SD XL`, a VRAM reduction of `2GB` is expected with an image size of `1024x1024`.\n\n```python\ndef quantize_unet(m):\n    from diffusers.utils import USE_PEFT_BACKEND\n    assert USE_PEFT_BACKEND\n    m = torch.quantization.quantize_dynamic(m, {torch.nn.Linear},\n                                            
dtype=torch.qint8,\n                                            inplace=True)\n    return m\n\nmodel.unet = quantize_unet(model.unet)\nif hasattr(model, 'controlnet'):\n    model.controlnet = quantize_unet(model.controlnet)\n```\n\nRefer to [examples/optimize_stable_diffusion_pipeline.py](examples/optimize_stable_diffusion_pipeline.py) for more details.\n\n### Some Common Methods To Speed Up PyTorch\n\n```bash\n# TCMalloc is highly suggested to reduce CPU overhead\n# https://github.com/google/tcmalloc\nLD_PRELOAD=/path/to/libtcmalloc.so python3 ...\n```\n\n```python\nimport packaging.version\nimport torch\n\nif packaging.version.parse(torch.__version__) \u003e= packaging.version.parse('1.12.0'):\n    torch.backends.cuda.matmul.allow_tf32 = True\n```\n\n## Performance Comparison\n\nPerformance varies greatly across different hardware/software/platform/driver configurations.\nIt is very hard to benchmark accurately, and preparing the environment for benchmarking is also a hard job.\nI have tested on some platforms before but the results may still be inaccurate.\nNote that when benchmarking, the progress bar shown by `tqdm` may be inaccurate because of the asynchronous nature of CUDA.\nTo solve this problem, I use `CUDA Event` to measure iterations per second accurately.\n\n`stable-fast` is expected to work better on newer GPUs and newer CUDA versions.\n__On older GPUs, the performance increase might be limited.__\n__During benchmarking, the progress bar might work incorrectly because of the asynchronous nature of CUDA.__\n\n### RTX 4080 (512x512, batch size 1, fp16, in WSL2)\n\nThis is my personal gaming PC😄. 
It has a more powerful CPU than those from cloud server providers.\n\n| Framework                                | SD 1.5        | SD XL (1024x1024) | SD 1.5 ControlNet |\n| ---------------------------------------- | ------------- | ----------------- | ----------------- |\n| Vanilla PyTorch (2.1.0)                  | 29.5 it/s     | 4.6 it/s          | 19.7 it/s         |\n| torch.compile (2.1.0, max-autotune)      | 40.0 it/s     | 6.1 it/s          | 21.8 it/s         |\n| AITemplate                               | 44.2 it/s     |                   |                   |\n| OneFlow                                  | 53.6 it/s     |                   |                   |\n| AUTO1111 WebUI                           | 17.2 it/s     | 3.6 it/s          |                   |\n| AUTO1111 WebUI (with SDPA)               | 24.5 it/s     | 4.3 it/s          |                   |\n| TensorRT (AUTO1111 WebUI)                | 40.8 it/s     |                   |                   |\n| TensorRT Official Demo                   | 52.6 it/s     |                   |                   |\n| __stable-fast (with xformers \u0026 Triton)__ | __51.6 it/s__ | __9.1 it/s__      | __36.7 it/s__     |\n\n### H100\n\nThanks for __@Consceleratus__ and __@harishp__'s help, I have tested speed on H100.\n\n| Framework                                | SD 1.5         | SD XL (1024x1024) | SD 1.5 ControlNet |\n| ---------------------------------------- | -------------- | ----------------- | ----------------- |\n| Vanilla PyTorch (2.1.0)                  | 54.5 it/s      | 14.9 it/s         | 35.8 it/s         |\n| torch.compile (2.1.0, max-autotune)      | 66.0 it/s      | 18.5 it/s         |                   |\n| __stable-fast (with xformers \u0026 Triton)__ | __104.6 it/s__ | __21.6 it/s__     | __72.6 it/s__     |\n\n### A100\n\nThanks for __@SuperSecureHuman__ and __@jon-chuang__'s help, benchmarking on A100 is available now.\n\n| Framework                                | SD 1.5        | SD 
XL (1024x1024) | SD 1.5 ControlNet |\n| ---------------------------------------- | ------------- | ----------------- | ----------------- |\n| Vanilla PyTorch (2.1.0)                  | 35.6 it/s     | 8.7 it/s          | 25.1 it/s         |\n| torch.compile (2.1.0, max-autotune)      | 41.9 it/s     | 10.0 it/s         |                   |\n| __stable-fast (with xformers \u0026 Triton)__ | __61.8 it/s__ | __11.9 it/s__     | __41.1 it/s__     |\n\n## Compatibility\n\n| Model                               | Supported |\n| ----------------------------------- | --------- |\n| Hugging Face Diffusers (1.5/2.1/XL) | Yes       |\n| With ControlNet                     | Yes       |\n| With LoRA                           | Yes       |\n| Latent Consistency Model            | Yes       |\n| SDXL Turbo                          | Yes       |\n| Stable Video Diffusion              | Yes       |\n\n| Functionality                       | Supported |\n| ----------------------------------- | --------- |\n| Dynamic Shape                       | Yes       |\n| Text to Image                       | Yes       |\n| Image to Image                      | Yes       |\n| Image Inpainting                    | Yes       |\n\n| UI Framework                        | Supported | Link                                                                    |\n| ----------------------------------- | --------- | ----------------------------------------------------------------------- |\n| AUTOMATIC1111                       | WIP       |                                                                         |\n| SD Next                             | Yes       | [`SD Next`](https://github.com/vladmandic/automatic)                    |\n| ComfyUI                             | Yes       | [`ComfyUI_stable_fast`](https://github.com/gameltb/ComfyUI_stable_fast) |\n\n| Operating System                    | Supported |\n| ----------------------------------- | --------- |\n| Linux                               
| Yes       |\n| Windows                             | Yes       |\n| Windows WSL                         | Yes       |\n\n## Troubleshooting\n\nRefer to [doc/troubleshooting.md](doc/troubleshooting.md) for more details.\n\nAnd you can join the [Discord Channel](https://discord.gg/kQFvfzM4SJ) to ask for help.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchengzeyi%2Fstable-fast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchengzeyi%2Fstable-fast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchengzeyi%2Fstable-fast/lists"}