https://github.com/thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
- Host: GitHub
- URL: https://github.com/thu-ml/SageAttention
- Owner: thu-ml
- License: apache-2.0
- Created: 2024-10-03T09:33:18.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-07T18:19:04.000Z (10 months ago)
- Last Synced: 2024-12-13T17:20:30.111Z (10 months ago)
- Topics: attention, cuda, inference-acceleration, llm, quantization, triton, video-generation
- Language: Cuda
- Homepage:
- Size: 11.1 MB
- Stars: 683
- Watchers: 18
- Forks: 34
- Open Issues: 25
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
Awesome Lists containing this project
- awesome-cuda-and-hpc - SageAttention : Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models. "SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration". (**[arXiv 2024](https://arxiv.org/abs/2410.02367)**). "SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization". (**[arXiv 2024](https://arxiv.org/abs/2411.10958)**). (Frameworks)
README
# SageAttention
This repository provides the official implementation of SageAttention, SageAttention2, and SageAttention2++, which achieve substantial speedups on most GPUs without losing accuracy across all models in a plug-and-play way.
**SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration**
Paper: https://arxiv.org/abs/2410.02367
Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen

**SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization**
Paper: https://arxiv.org/abs/2411.10958
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen

**SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training**
Paper: https://arxiv.org/abs/2505.11594
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
*Note: [SageAttention2++](https://arxiv.org/pdf/2505.21136) achieves higher speed while maintaining the same accuracy.*

## Current Features
+ Optimized kernels for **Ampere, Ada and Hopper GPUs.**
+ INT8 quantization and smoothing for $QK^\top$ with support for varying granularities (see the conceptual sketch after this list).
+ FP8 quantization for $PV$, and FP16 accumulator for FP8/FP16 $PV$.
+ Two-level accumulation strategy for $PV$ to improve accuracy in FP8 MMA and WGMMA.
+ Support `torch.compile` with non-cudagraphs mode and distributed inference.
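To make the quantization and smoothing bullets above concrete, here is a purely conceptual sketch. The block size, granularity, and helper names are illustrative assumptions; the repository's real kernels fuse these steps inside Triton/CUDA.

```python
# Conceptual sketch only: K smoothing + per-block INT8 quantization.
# Block size and helper names are illustrative, not the repository's kernels.
import torch

def smooth_and_quantize_int8(q: torch.Tensor, k: torch.Tensor, block: int = 128):
    """q, k: (batch, heads, seq_len, head_dim) in FP16/BF16; assumes seq_len % block == 0."""
    # Smoothing: subtract K's per-channel mean over the sequence. softmax(QK^T) is
    # invariant to this shift (it adds the same constant to every score in a row),
    # but it removes outliers before quantization.
    k_mean = k.mean(dim=-2, keepdim=True)
    k = k - k_mean

    def per_block_int8(x: torch.Tensor):
        b, h, n, d = x.shape
        xb = x.reshape(b, h, n // block, block, d)
        scale = xb.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-6) / 127.0
        x_int8 = (xb / scale).round().clamp(-127, 127).to(torch.int8)
        return x_int8, scale

    (q_int8, q_scale), (k_int8, k_scale) = per_block_int8(q), per_block_int8(k)
    return q_int8, q_scale, k_int8, k_scale, k_mean
```

At attention time, $QK^\top$ can then be computed with INT8 matrix multiplies and rescaled by `q_scale * k_scale` before the softmax; the FP8 $PV$ path mentioned above is handled analogously with its own scales and accumulators.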
## Project Updates

- [2025-07-21]: Early access to the SageAttention3 code is available on [HuggingFace](https://huggingface.co/jt-zhang/SageAttention3); you will need to fill out a detailed form and await approval.
- [2025-07-01]: The code of [SageAttention2++](https://arxiv.org/pdf/2505.21136) is released in this repository. We would still greatly appreciate it if you could take a moment to fill out the form on [HuggingFace](https://huggingface.co/jt-zhang/SageAttention2_plus). Thank you very much!

- [2025-06-19]: A Sparse Attention API based on SageAttention V1 is available [here](https://github.com/jt-zhang/Sparse_SageAttention_API); it can compute attention with any block-sparse pattern very fast.
- [2025-05-02]: 🎉SageAttention2 and [SpargeAttn](https://github.com/thu-ml/SpargeAttn) are accepted by ICML 2025!
- [2025-02-25]: 🔥 We release [SpargeAttn](https://github.com/thu-ml/SpargeAttn), a sparse attention method based on SageAttention2 that can accelerate any model without training.
- [2025-02-15]: 🔥 The compilation code is updated to support the RTX 5090! On the RTX 5090, SageAttention reaches 560 TOPS, 2.7x faster than FlashAttention2!
- [2025-01-28]: 🔥⚡SageAttention is now available on Hopper GPUs (H100, H800, H20)! It matches the speed of FlashAttention3-FP8 but offers **much better accuracy!**

| **FlashAttention2** | **FlashAttention3** | **FlashAttention3-FP8** | **SageAttention** |
|---------------------|---------------------|-------------------------|-------------------|
| **25'34''**         | **17'32''**         | **12'14''**             | **12'07''**       |

*Results for [CogVideoX1.5-5B](https://huggingface.co/THUDM/CogVideoX1.5-5B) on NVIDIA H20 GPU*


- [2025-01-24]: 🎉SageAttention is accepted by ICLR 2025!
- [2024-12-20]: 🔥Update the [SageAttention2 Paper](https://arxiv.org/abs/2411.10958).
- [2024-12-20]: 🔥Release SageAttention 2.0.1 Beta! In this version, we introduce a new feature: per-thread quantization, which offers finer granularity while maintaining hardware efficiency.
- [2024-11-21]: 🔥SageAttention 2.0.0 beta is released! SageAttention now has measured speedups on L20, L40, A100, A800, A6000, RTX 3090, and RTX 4090 GPUs.
- [2024-11-12]: Support for `sageattn_varlen` is available now.
- [2024-11-11]: Support for different sequence lengths between `q` and `k,v`, `(batch_size, head_num, seq_len, head_dim)` or `(batch_size, seq_len, head_num, head_dim)` input shapes, and `group-query attention` is available now.

## Installation
### Base environment
+ `python>=3.9`, `torch>=2.3.0`, `triton>=3.0.0`
+ `CUDA` (a quick environment check snippet follows this list):
+ `>=12.8` for Blackwell or SageAttention2++
+ `>=12.4` for fp8 support on Ada
+ `>=12.3` for fp8 support on Hopper
+ `>=12.0` for Ampere
+ `flash-attn` for benchmarking
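The following is a small, hedged convenience snippet (not part of the package) for checking which of the requirements above your environment meets:

```python
# Convenience check for the base-environment requirements above (illustrative only).
import torch
import triton

print("torch  :", torch.__version__)
print("triton :", triton.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: sm_{major}{minor}")
# Roughly: sm_80/sm_86 = Ampere, sm_89 = Ada, sm_90 = Hopper, sm_100/sm_120 = Blackwell.
# Note: compiling from source uses the system toolkit; also check `nvcc --version`.
```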
### Install Package

For SageAttention V1 in Triton (slower than SageAttention V2/V2++/V3), refer to [SageAttention-1](https://github.com/thu-ml/SageAttention/tree/sageattention-1) and install using pip: `pip install sageattention==1.0.6`
To use SageAttention 2.2.0 (containing SageAttention2++), please **compile from source**:
```
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # parallel compiling (Optional)
python setup.py install # or pip install -e .
```

To benchmark the speed against FlashAttention3, please compile FlashAttention3 from source:
```
git clone https://github.com/Dao-AILab/flash-attention.git --recursive
cd flash-attention
git checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7 # 2.7.2 release
cd hopper
python setup.py install
```

## How to Use
```python
from sageattention import sageattn
attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```
+ `q, k, v` are **FP16/BF16** dtype with the shape `(batch_size, head_num, seq_len, head_dim)` using default `tensor_layout="HND"`. For shape `(batch_size, seq_len, head_num, head_dim)`, set `tensor_layout="NHD"`.
+ `is_causal` determines the use of a causal mask.

### Available APIs:
+ `sageattn`: Automatically selects the optimal kernel based on the GPU to achieve a good performance-accuracy trade-off.
+ `sageattn_qk_int8_pv_fp16_triton`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend.
+ `sageattn_qk_int8_pv_fp16_cuda`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using CUDA backend.
+ `sageattn_qk_int8_pv_fp8_cuda`: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend. (Note that setting `pv_accum_dtype=fp32+fp16` corresponds to SageAttention2++.)
+ `sageattn_qk_int8_pv_fp8_cuda_sm90`: INT8 quantization for $QK^\top$ and FP8 for $PV$ using CUDA backend, specifically optimized for Hopper GPUs.
+ `sageattn_varlen`: INT8 quantization for $QK^\top$ and FP16 for $PV$ using Triton backend, with support for varying sequence lengths within the same batch.

For optimal speed and accuracy on custom devices and models, we strongly recommend referring to [this file](./sageattention/core.py) for detailed guidance.
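As a quick, hedged sanity check of the basic API (the tensor sizes below are arbitrary and chosen only for illustration), you can compare `sageattn` against PyTorch's reference attention:

```python
# Hedged sanity check: compare sageattn against PyTorch's reference attention.
# Sizes are illustrative only; sageattn expects FP16/BF16 CUDA tensors.
import torch
import torch.nn.functional as F
from sageattention import sageattn

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Default "HND" layout: (batch_size, head_num, seq_len, head_dim).
out_sage = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
out_ref = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print("max abs diff:", (out_sage - out_ref).abs().max().item())
```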
> **Note:**
> Support for different sequence lengths between `q` and `k,v` and `group-query attention` is available.

### Plug-and-play Example
We can replace `scaled_dot_product_attention` easily.
We will take [CogVideoX](https://huggingface.co/THUDM/CogVideoX-2b) as an example. Add the following code and run:
```diff
import torch.nn.functional as F
+ from sageattention import sageattn
+ F.scaled_dot_product_attention = sageattn
```
Specifically,
```bash
cd example
python cogvideox-2b.py --compile --attention_type sage
```

**You can get a lossless video in** `./example` **faster than by using** `python cogvideox-2b.py --compile`. More examples and guidance can be found under the `example/` directory.
> **Note:** Not all models work with `F.scaled_dot_product_attention = sageattn`. In general, you should replace the original attention by modifying the `Attention` class of the target model. For image and video models, we suggest only replacing the attention in the DiT (see `example/mochi.py` for details).
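For models where the global patch is too broad, a hedged sketch of a narrower replacement might look like the following. The `MyAttention` module is a hypothetical stand-in for the target model's attention class, not code from this repository, and it assumes FP16/BF16 weights and inputs on a CUDA device (e.g. `model.half().cuda()`).

```python
# Hypothetical sketch: route a single attention module through sageattn
# instead of patching F.scaled_dot_product_attention globally.
import torch
from sageattention import sageattn

class MyAttention(torch.nn.Module):  # placeholder for the target model's attention class
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim, bias=False)
        self.proj = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # "NHD" layout: (batch_size, seq_len, head_num, head_dim).
        q, k, v = (t.reshape(b, n, self.num_heads, self.head_dim) for t in (q, k, v))
        out = sageattn(q, k, v, tensor_layout="NHD", is_causal=False)
        return self.proj(out.reshape(b, n, -1))
```

For diffusers-based pipelines the same idea applies to the library's attention processors; the `example/` directory in this repository shows concrete integrations.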
### Kernel Benchmarking
We provide a benchmarking script to compare the speed of different kernels, including SageAttention, FlashAttention2, and FlashAttention3. Please refer to the `benchmark/` directory for more details.
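If you only want a rough standalone timing on your own hardware, the following hedged sketch (not the repository's benchmark script; the shape is chosen arbitrarily for illustration) times `sageattn` against PyTorch's `scaled_dot_product_attention` with CUDA events:

```python
# Standalone timing sketch (illustrative only; not the repository's benchmark script).
import torch
import torch.nn.functional as F
from sageattention import sageattn

def time_ms(fn, iters=50, warmup=10):
    """Average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

q = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

print(f"sageattn   : {time_ms(lambda: sageattn(q, k, v, tensor_layout='HND', is_causal=False)):.3f} ms")
print(f"torch sdpa : {time_ms(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=False)):.3f} ms")
```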
## Performance
### Speed of Kernels

`8+8` means the kernel with INT8 quantization for $QK^\top$ and FP8 quantization for $PV$. `8+16` uses FP16 with FP16 accumulator for $PV$.
> **Note:** The TOPS results refer only to the Attention Kernel, excluding the quantization and smoothing.
### End-to-end Performance
#### **End-to-End Accuracy:**



#### **End-to-End Speedup:**

*Note: SageAttention2++ achieves higher speed.*

## Citation
**If you use this code or find our work valuable, please cite:**
```
@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
@inproceedings{zhang2024sageattention2,
  title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}
@article{zhang2025sageattention2++,
  title={SageAttention2++: A More Efficient Implementation of SageAttention2},
  author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.21136},
  year={2025}
}
@article{zhang2025sageattention3,
  title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
  journal={arXiv preprint arXiv:2505.11594},
  year={2025}
}
```