https://github.com/D-Ogi/ComfyUI-Attention-Optimizer
Automatically benchmark and optimize attention in diffusion models. 1.5-2x speedup on RTX 4090.
attention comfyui comfyui-custom-node diffusion flash-attention flux optimization performance sageattention stable-diffusion
- Host: GitHub
- URL: https://github.com/D-Ogi/ComfyUI-Attention-Optimizer
- Owner: D-Ogi
- License: mit
- Created: 2026-01-24T23:02:20.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-02-09T12:56:32.000Z (5 months ago)
- Last Synced: 2026-02-09T17:49:01.206Z (5 months ago)
- Topics: attention, comfyui, comfyui-custom-node, diffusion, flash-attention, flux, optimization, performance, sageattention, stable-diffusion
- Language: Python
- Size: 16.6 KB
- Stars: 25
- Watchers: 0
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-comfyui - **ComfyUI-Attention-Optimizer** - 2x speedup on RTX 4090, up to 4x on video models. (Workflows (4120) sorted by GitHub Stars)
# ComfyUI Attention Optimizer
**Automatically benchmark and optimize the attention mechanism in diffusion models for maximum generation speed.**
## Why This Matters
### The Problem
Modern diffusion models (SDXL, Flux, WAN, LTX-V, Hunyuan Video) are built around **transformer blocks**. Their core operation, **attention**, computes relationships between all elements in the image/video latent space. This is:
- **The most expensive operation** - attention takes 40-70% of total generation time
- **O(n²) complexity** - cost grows quadratically with resolution/frames
- **GPU-dependent** - different GPUs perform best with different implementations
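The quadratic growth can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes each token covers a 16x16 pixel patch (a hypothetical value; the real ratio depends on the model's VAE and patch size) and counts the entries of the attention score matrices for one layer.

```python
def attention_score_count(seq_len: int, num_heads: int = 24) -> int:
    """Entries in one layer's attention score matrices (heads x n x n)."""
    return num_heads * seq_len * seq_len


def latent_tokens(width_px: int, height_px: int, px_per_token: int = 16) -> int:
    """Token count for an image, assuming each token covers a square pixel patch.

    px_per_token=16 is an illustrative assumption, not a value read from any model.
    """
    return (width_px // px_per_token) * (height_px // px_per_token)


small = latent_tokens(1024, 1024)   # 64 * 64 tokens
large = latent_tokens(2048, 2048)   # 128 * 128 tokens
ratio = attention_score_count(large) / attention_score_count(small)
print(f"{small} -> {large} tokens, {ratio:.0f}x more attention scores")
# prints "4096 -> 16384 tokens, 16x more attention scores"
```

Doubling the resolution in each dimension gives 4x the tokens but 16x the attention scores, which is why backend choice matters most at high resolutions and for video.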
### The Solution
Multiple optimized attention backends exist:
- **PyTorch SDPA** - built-in, always available
- **Flash Attention** - CUDA kernels, memory efficient
- **SageAttention** - INT8 quantization, up to 2-4x faster
- **xFormers** - memory efficient attention
**But which one is fastest for YOUR specific GPU and model?**
This plugin **benchmarks all available backends** and **automatically applies the fastest one**.
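As a rough illustration of the benchmark-and-select idea (not the plugin's actual code), the sketch below times a set of callables and returns the fastest. The names `time_call` and `pick_fastest` are hypothetical, and the workloads are CPU stand-ins. A GPU benchmark would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import statistics
import time
from typing import Callable, Dict, Tuple


def time_call(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock time of fn() in milliseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def pick_fastest(candidates: Dict[str, Callable[[], object]]) -> Tuple[str, Dict[str, float]]:
    """Return (best_name, timings_ms). Candidates that raise are skipped."""
    timings: Dict[str, float] = {}
    for name, fn in candidates.items():
        try:
            timings[name] = time_call(fn)
        except Exception:
            continue  # treat a failing backend as unavailable
    best = min(timings, key=timings.get)
    return best, timings


# Stand-in workloads; a real benchmark would call each attention backend.
candidates = {
    "slow_backend": lambda: sum(range(200_000)),
    "fast_backend": lambda: sum(range(1_000)),
    "broken_backend": lambda: 1 / 0,
}
best, timings = pick_fastest(candidates)
print(best, {name: round(ms, 3) for name, ms in timings.items()})
```

Using the median rather than the mean keeps one slow outlier run from skewing the choice.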
## Real-World Speedups
Tested on RTX 4090 with head_dim=128 (SDXL, Flux):
| Backend | Time | Speedup |
|---------|------|---------|
| PyTorch SDPA | 5.0ms | 1.0x (baseline) |
| Flash Attention | 5.4ms | 0.93x |
| **SageAttention** | **2.7ms** | **1.9x** |
**Result: 1.9x faster attention calls** just by switching the attention backend. The end-to-end gain depends on how much of each step is spent in attention.
For video models (WAN, Hunyuan) with longer sequences, speedups can reach **2-4x**.
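The table times individual attention calls. One rough way to translate a per-call speedup into an end-to-end estimate is Amdahl's law, using the 40-70% attention share quoted earlier. This is a back-of-the-envelope sketch under that assumption, not a measurement.

```python
def end_to_end_speedup(attention_fraction: float, attention_speedup: float) -> float:
    """Amdahl's law: only the attention share of the runtime gets faster."""
    remaining = 1.0 - attention_fraction
    return 1.0 / (remaining + attention_fraction / attention_speedup)


for fraction in (0.4, 0.7):
    overall = end_to_end_speedup(fraction, 1.9)
    print(f"attention share {fraction:.0%}: {overall:.2f}x overall")
# prints:
# attention share 40%: 1.23x overall
# attention share 70%: 1.50x overall
```

The larger the attention share (long sequences, video models), the closer the overall gain gets to the per-call figure.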
## Installation
### Option 1: ComfyUI Manager (Recommended)
1. Open ComfyUI Manager
2. Click **"Install via Git URL"**
3. Paste: `https://github.com/D-Ogi/ComfyUI-Attention-Optimizer.git`
4. Restart ComfyUI
### Option 2: Manual Installation
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/D-Ogi/ComfyUI-Attention-Optimizer.git
```
Restart ComfyUI.
### Optional: Install Optimized Backends
The plugin works out-of-the-box with PyTorch SDPA. For better performance, install additional backends:
```bash
# SageAttention - recommended for RTX 30xx/40xx (1.5-2x speedup)
pip install sageattention
# Flash Attention - alternative for Ampere+ GPUs
pip install flash-attn
# xFormers - memory efficient option
pip install xformers
```
> **Note:** On Windows, Flash Attention requires building from source or using prebuilt wheels.
> SageAttention is easier to install and often faster on consumer GPUs.
## Usage
### Basic Usage
1. Add **"Attention Optimizer"** node to your workflow (category: `model_patches`)
2. Connect your model to the `model` input
3. Run - it benchmarks once, caches results, and auto-applies the fastest backend
### How It Works
```
┌─────────────────┐     ┌──────────────────────────┐     ┌─────────────┐
│ Load Checkpoint │────▶│   Attention Optimizer    │────▶│  KSampler   │
└─────────────────┘     │                          │     └─────────────┘
                        │ 1. Detect model params   │
                        │ 2. Check cache           │
                        │ 3. Benchmark (if needed) │
                        │ 4. Clone model & apply   │
                        │    attention override    │
                        └──────────────────────────┘
```
**First run:** Benchmarks all backends (~5-10 seconds), saves to cache.
**Subsequent runs:** Loads from cache (instant), applies optimal backend.
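The first-run versus later-run behaviour follows a common cache-or-compute pattern. A minimal sketch, assuming an in-memory dict as the cache and a hypothetical `run_benchmark` callable that returns the winning backend name:

```python
from typing import Callable, Dict


def get_best_backend(
    cache: Dict[str, str],
    key: str,
    run_benchmark: Callable[[], str],
    force_refresh: bool = False,
) -> str:
    """Return the cached winner for key, benchmarking only when needed."""
    if not force_refresh and key in cache:
        return cache[key]          # later runs: no benchmark
    best = run_benchmark()         # first run, or a forced refresh
    cache[key] = best
    return best


benchmark_calls = []


def fake_benchmark() -> str:
    """Stand-in for a real benchmark; records that it was called."""
    benchmark_calls.append(1)
    return "sage_auto"


demo_cache: Dict[str, str] = {}
first = get_best_backend(demo_cache, "demo-key", fake_benchmark)
second = get_best_backend(demo_cache, "demo-key", fake_benchmark)
third = get_best_backend(demo_cache, "demo-key", fake_benchmark, force_refresh=True)
print(first, second, third, "benchmarks run:", len(benchmark_calls))
# prints "sage_auto sage_auto sage_auto benchmarks run: 2"
```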
### Node Inputs
| Input | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | MODEL | required | The diffusion model to optimize |
| `attention_backend` | dropdown | `auto` | `auto` = benchmark & select best, or force specific backend |
| `force_refresh` | bool | False | Re-run benchmark even if cached |
| `auto_apply` | bool | True | Apply the selected backend to this model |
| `seq_len` | int | 8192 | Sequence length for benchmark |
| `num_heads` | int | 24 | Number of attention heads |
### Node Outputs
| Output | Type | Description |
|--------|------|-------------|
| `model` | MODEL | Cloned model with optimized attention applied |
| `best_attention` | STRING | Name of applied backend |
| `kjnodes_mode` | STRING | Compatible mode for KJNodes PatchSageAttention |
| `impl_type` | STRING | Implementation type (cuda/triton/pytorch) |
| `speedup` | FLOAT | Speedup vs PyTorch SDPA baseline |
| `time_ms` | FLOAT | Time per attention call in milliseconds |
| `head_dim` | INT | Detected head dimension from model |
| `report` | STRING | Full benchmark report text |
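For readers unfamiliar with ComfyUI custom nodes, the two tables above could be declared roughly as follows. This is a hypothetical sketch using ComfyUI's usual node conventions (`INPUT_TYPES`, `RETURN_TYPES`, `RETURN_NAMES`, `FUNCTION`, `CATEGORY`), not the plugin's actual source. The class name, the method name `optimize`, and the `min`/`max` bounds are assumptions.

```python
BACKENDS = [
    "auto", "pytorch", "xformers", "sage_auto", "sage_cuda", "sage_triton",
    "sage_fp8_cuda", "sage_fp8_cuda_fast", "sage3", "flash",
]


class AttentionOptimizerSketch:
    """Hypothetical declaration only; benchmarking and patching are omitted."""

    CATEGORY = "model_patches"
    FUNCTION = "optimize"  # assumed method name
    RETURN_TYPES = ("MODEL", "STRING", "STRING", "STRING",
                    "FLOAT", "FLOAT", "INT", "STRING")
    RETURN_NAMES = ("model", "best_attention", "kjnodes_mode", "impl_type",
                    "speedup", "time_ms", "head_dim", "report")

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "model": ("MODEL",),
                "attention_backend": (BACKENDS, {"default": "auto"}),
                "force_refresh": ("BOOLEAN", {"default": False}),
                "auto_apply": ("BOOLEAN", {"default": True}),
                # min/max bounds below are illustrative assumptions
                "seq_len": ("INT", {"default": 8192, "min": 256, "max": 131072}),
                "num_heads": ("INT", {"default": 24, "min": 1, "max": 128}),
            }
        }

    def optimize(self, model, attention_backend, force_refresh,
                 auto_apply, seq_len, num_heads):
        raise NotImplementedError("omitted in this sketch")


# ComfyUI discovers nodes through this mapping in the package's __init__.py.
NODE_CLASS_MAPPINGS = {"AttentionOptimizerSketch": AttentionOptimizerSketch}
```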
## Supported Backends
| Backend | Implementation | Best For |
|---------|---------------|----------|
| `pytorch` | PyTorch SDPA | Always available, baseline |
| `xformers` | xFormers CUDA | Memory efficiency |
| `sage_auto` | SageAttention auto | General use (auto-selects best variant) |
| `sage_cuda` | SageAttention CUDA | RTX 30xx/40xx |
| `sage_triton` | SageAttention Triton | When CUDA kernel unavailable |
| `sage_fp8_cuda` | SageAttention FP8 | Maximum speed, slight quality trade-off |
| `sage_fp8_cuda_fast` | SageAttention FP8++ | Even faster FP8 |
| `sage3` | SageAttention 3 | RTX 50xx (Blackwell) only |
| `flash` | Flash Attention 2 | H100, A100, RTX 30xx/40xx |
## Model Compatibility
| Model | Status | Notes |
|-------|--------|-------|
| SDXL | ✅ Full | head_dim=128, SageAttention optimal |
| SD 1.5 | ✅ Full | head_dim=64 |
| SD 3 | ✅ Full | |
| Flux | ✅ Full | Per-model attention override |
| LTX-V | ✅ Full | head_dim=160 |
| WAN 2.1/2.2 | ✅ Full | Per-model attention override |
| Hunyuan Video | ✅ Full | Per-model attention override |
| Cosmos | ✅ Full | Per-model attention override |
| SeedVR2 | ❌ N/A | Uses own attention system, not affected |
## GPU Recommendations
| GPU | Recommended Backend | Expected Speedup |
|-----|---------------------|------------------|
| RTX 4090/4080 | `sage_auto` or `sage_fp8_cuda_fast` | 1.5-2.0x |
| RTX 3090/3080 | `sage_auto` or `flash` | 1.3-1.8x |
| RTX 50xx (Blackwell) | `sage3` | 2-4x |
| H100/A100 | `flash` | 1.5-2.0x |
| AMD (ROCm) | `pytorch` | 1.0x (baseline) |
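The NVIDIA rows of this table can be expressed as a lookup on CUDA compute capability. The sketch below is an illustration with a hypothetical `recommend_backend` helper. It takes the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns, and falls back to `pytorch` for anything the table does not cover.

```python
from typing import Tuple


def recommend_backend(capability: Tuple[int, int]) -> str:
    """Map a CUDA compute capability to the backend suggested in the table."""
    major, minor = capability
    if major >= 12:                              # RTX 50xx (Blackwell), 12.x
        return "sage3"
    if (major, minor) in ((8, 0), (9, 0)):       # A100 = 8.0, H100 = 9.0
        return "flash"
    if (major, minor) in ((8, 6), (8, 9)):       # RTX 30xx = 8.6, RTX 40xx = 8.9
        return "sage_auto"
    return "pytorch"                             # anything not in the table


for name, capability in [("RTX 4090", (8, 9)), ("RTX 3090", (8, 6)),
                         ("A100", (8, 0)), ("H100", (9, 0)), ("RTX 5090", (12, 0))]:
    print(f"{name}: {recommend_backend(capability)}")
```

A lookup like this is only a starting point: the benchmark on the actual machine remains the deciding factor.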
## Example Benchmark Report
```
=================================================================
BENCHMARK REPORT
=================================================================
dtype: float16 | head_dim: 128 | seq_len: 8192 | CUDA: 12.4 | Triton: 3.0.0
SageAttention: v2.1.1
>>> BEST: sage_fp8_cuda_fast (1.89x speedup) <<<
impl: cuda | kjnodes mode: sageattn_qk_int8_pv_fp8_cuda++
Results (fastest first):
-----------------------------------------------------------------
[v] sage_fp8_cuda_fast 2.671ms 1.89x (cuda) <<<
[v] sage_auto 2.679ms 1.88x (auto)
[v] sage_fp8_cuda 3.100ms 1.63x (cuda)
[v] sage_triton 3.446ms 1.47x (triton)
[v] sage_cuda 3.947ms 1.28x (cuda)
[v] pytorch 5.049ms 1.00x (pytorch)
[v] xformers 5.194ms 0.97x (cuda/triton)
[v] flash 5.430ms 0.93x (cuda)
[ ] sage3 --- (N/A) Not installed
-----------------------------------------------------------------
[v] = validated (tested underlying library directly)
=================================================================
```
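The speedup column in this report is the baseline (`pytorch`) time divided by each backend's time. The sketch below parses the validated rows with a regular expression and recomputes the ratios. The pattern is written against the sample above and is an assumption about the report layout, not a documented format.

```python
import re

REPORT_ROWS = """
[v] sage_fp8_cuda_fast 2.671ms 1.89x (cuda) <<<
[v] sage_auto 2.679ms 1.88x (auto)
[v] sage_fp8_cuda 3.100ms 1.63x (cuda)
[v] sage_triton 3.446ms 1.47x (triton)
[v] sage_cuda 3.947ms 1.28x (cuda)
[v] pytorch 5.049ms 1.00x (pytorch)
[v] xformers 5.194ms 0.97x (cuda/triton)
[v] flash 5.430ms 0.93x (cuda)
"""

ROW_PATTERN = re.compile(r"\[v\]\s+(\S+)\s+([\d.]+)ms\s+([\d.]+)x")

# name -> (time in ms, reported speedup)
rows = {
    match.group(1): (float(match.group(2)), float(match.group(3)))
    for match in ROW_PATTERN.finditer(REPORT_ROWS)
}

baseline_ms = rows["pytorch"][0]
for name, (time_ms, reported) in rows.items():
    recomputed = baseline_ms / time_ms
    print(f"{name:<20} reported {reported:.2f}x  recomputed {recomputed:.2f}x")
```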
## Technical Details
### Why Different Backends?
**PyTorch SDPA** dispatches to one of PyTorch's built-in kernels (a fused flash-style kernel, a memory-efficient kernel, or a plain math fallback). General purpose, always works.
**Flash Attention** fuses the attention operations into a single CUDA kernel, which cuts memory traffic. Great for long sequences.
**SageAttention** quantizes Q/K to INT8, reducing memory and compute. Works best for head_dim ≤ 128.
**xFormers** offers memory-efficient attention kernels similar to Flash Attention.
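To give a feel for what INT8 quantization of Q/K means, here is a deliberately simplified sketch in plain Python: one scale per vector, symmetric rounding, integer dot product, one float rescale. It only illustrates the arithmetic idea. SageAttention's real kernels are considerably more involved, and the helper names here are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-vector quantization: returns (int8 codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(q_row: List[float], k_row: List[float]) -> float:
    """Approximate q . k with integer products and a single float rescale."""
    q_codes, q_scale = quantize_int8(q_row)
    k_codes, k_scale = quantize_int8(k_row)
    integer_sum = sum(a * b for a, b in zip(q_codes, k_codes))
    return integer_sum * q_scale * k_scale


q = [0.12, -0.50, 0.33, 0.90]
k = [0.40, 0.10, -0.25, 0.75]
exact = sum(a * b for a, b in zip(q, k))
approx = int8_dot(q, k)
print(f"exact={exact:.4f} approx={approx:.4f} error={abs(exact - approx):.4f}")
```

The integer products are where the speed comes from on hardware with fast INT8 units. The small rounding error is the quality trade-off.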
### head_dim Matters
Models have different attention head dimensions:
- **SD 1.5:** head_dim=64
- **SDXL, Flux:** head_dim=128
- **LTX-V:** head_dim=160
SageAttention works best with head_dim ≤ 128. For larger dimensions, SDPA or Flash Attention may be faster.
### Cache System
Benchmark results are cached in `benchmark_db.json` based on:
- Model hash (architecture + weights)
- head_dim
- seq_len / num_heads parameters
Cache is per-machine - different GPUs will have different optimal backends.
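A cache keyed on these fields could look like the sketch below. The key format and the JSON layout are hypothetical, chosen only to illustrate the idea. The actual structure of `benchmark_db.json` is not documented here. The example writes to a temporary directory so it does not touch a real cache file.

```python
import json
import os
import tempfile
from typing import Dict


def make_cache_key(model_hash: str, head_dim: int, seq_len: int, num_heads: int) -> str:
    """Combine the fields that influence the benchmark into one string key."""
    return f"{model_hash}|hd{head_dim}|seq{seq_len}|h{num_heads}"


def load_cache(path: str) -> Dict[str, dict]:
    """Return the stored results, or an empty dict if no cache file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def save_cache(path: str, cache: Dict[str, dict]) -> None:
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle, indent=2)


db_path = os.path.join(tempfile.mkdtemp(), "benchmark_db.json")
key = make_cache_key("abc123", head_dim=128, seq_len=8192, num_heads=24)  # "abc123" is a placeholder hash

results = load_cache(db_path)                       # empty on first run
results[key] = {"best": "sage_fp8_cuda_fast", "time_ms": 2.671, "speedup": 1.89}
save_cache(db_path, results)

print(key, "->", load_cache(db_path)[key]["best"])
# prints "abc123|hd128|seq8192|h24 -> sage_fp8_cuda_fast"
```

Because any change to `seq_len` or `num_heads` produces a different key, changing those inputs triggers a fresh benchmark rather than reusing a stale result.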
## Troubleshooting
### "Backend X not available"
Install the missing package:
```bash
pip install sageattention # for sage_*
pip install flash-attn # for flash
pip install xformers # for xformers
```
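To see which of these packages the current Python environment can import, a quick check with the standard library works. This sketch only tests importability (using the packages' import names `torch`, `sageattention`, `flash_attn`, `xformers`). It does not verify that a backend's kernels actually run on your GPU. Run it with the same Python interpreter that ComfyUI uses.

```python
import importlib.util
from typing import Dict

# backend family -> top-level module name to look for
BACKEND_MODULES = {
    "pytorch": "torch",
    "sage": "sageattention",
    "flash": "flash_attn",
    "xformers": "xformers",
}


def installed_backends() -> Dict[str, bool]:
    """Report which backend packages are importable, without importing them."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in BACKEND_MODULES.items()
    }


status = installed_backends()
for name, available in status.items():
    print(f"{name:<10} {'installed' if available else 'missing'}")
```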
### No speedup observed
1. Check if `auto_apply` is enabled
2. Try `force_refresh=True` to re-benchmark
3. Check console for `[Benchmark] Applied: X` message
### Model not affected
Some models (like SeedVR2) use their own attention implementation and won't be affected by this plugin. Check the compatibility table above.
## License
MIT License - see [LICENSE](LICENSE)
## Credits
- [SageAttention](https://github.com/thu-ml/SageAttention) - THU-ML
- [Flash Attention](https://github.com/Dao-AILab/flash-attention) - Dao-AILab
- [xFormers](https://github.com/facebookresearch/xformers) - Meta