{"id":23047836,"url":"https://github.com/thu-ml/SageAttention","last_synced_at":"2025-08-15T01:33:00.459Z","repository":{"id":259402597,"uuid":"867007699","full_name":"thu-ml/SageAttention","owner":"thu-ml","description":"Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without lossing end-to-end metrics across various models.","archived":false,"fork":false,"pushed_at":"2024-12-07T18:19:04.000Z","size":11624,"stargazers_count":683,"open_issues_count":25,"forks_count":34,"subscribers_count":18,"default_branch":"main","last_synced_at":"2024-12-13T17:20:30.111Z","etag":null,"topics":["attention","cuda","inference-acceleration","llm","quantization","triton","video-generation"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-03T09:33:18.000Z","updated_at":"2024-12-13T16:21:47.000Z","dependencies_parsed_at":"2024-10-25T05:00:08.728Z","dependency_job_id":"4c136bf8-9afa-4812-ae5c-42b085a366ed","html_url":"https://github.com/thu-ml/SageAttention","commit_stats":null,"previous_names":["thu-ml/sageattention"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FSageAttention","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FSageAttention/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%
2FSageAttention/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2FSageAttention/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thu-ml","download_url":"https://codeload.github.com/thu-ml/SageAttention/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229885900,"owners_count":18139382,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","cuda","inference-acceleration","llm","quantization","triton","video-generation"],"created_at":"2024-12-15T22:37:23.964Z","updated_at":"2025-08-15T01:33:00.450Z","avatar_url":"https://github.com/thu-ml.png","language":"Cuda","funding_links":[],"categories":["Frameworks","Cuda"],"sub_categories":[],"readme":"# SageAttention\n\u003c!-- We are continuously updating more features. 
You could **Star** and **Watch** our repository to stay updated.\n\n--- --\u003e\nThis repository provides the official implementation of SageAttention, SageAttention2, and SageAttention2++, which achieve surprising speedups on most GPUs without losing accuracy across all models in a plug-and-play way.\n\n**SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration**  \nPaper: https://arxiv.org/abs/2410.02367  \nJintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen\n\n**SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization**  \nPaper: https://arxiv.org/abs/2411.10958  \nJintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen\n\n**SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training**  \nPaper: https://arxiv.org/abs/2505.11594  \nJintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen\n\n![Local Image](./assets/2.png)\n*Note: [SageAttention2++](https://arxiv.org/pdf/2505.21136) achieves higher speed while maintaining the same accuracy.*\n\n## Current Features\n\u003c!-- This is a beta release of SageAttention2. We welcome any feedback on accuracy, performance issues, bugs, feature requests, or suggestions. Please feel free to open an issue or launch a pull request! 
--\u003e\n\n+ Optimized kernels for **Ampere, Ada and Hopper GPUs.**\n+ INT8 quantization and smoothing for $QK^\\top$ with support for varying granularities.\n+ FP8 quantization for $PV$, and FP16 accumulator for FP8/FP16 $PV$.\n+ Two-level accumulation strategy for $PV$ to improve accuracy in FP8 MMA and WGMMA.\n+ Support for `torch.compile` in non-cudagraphs mode and for distributed inference.\n\n\n## Project Updates\n- [2025-07-21]: Early access to the SageAttention3 code is available on [HuggingFace](https://huggingface.co/jt-zhang/SageAttention3), where you'll need to fill out a form in detail and await approval.\n- [2025-07-01]: The code of [SageAttention2++](https://arxiv.org/pdf/2505.21136) is released in this repository. We would still greatly appreciate it if you could take a moment to fill out the form on [HuggingFace](https://huggingface.co/jt-zhang/SageAttention2_plus). Thank you very much!\n\n![Local Image](./assets/5090_sageattn2++.png)\n\n![Local Image](./assets/4090_sageattn2++.png)\n\n- [2025-06-19]: [This repository](https://github.com/jt-zhang/Sparse_SageAttention_API) provides a Sparse Attention API based on SageAttention V1, which can compute attention with any block-sparse pattern very fast.\n- [2025-05-02]: 🎉SageAttention2 and [SpargeAttn](https://github.com/thu-ml/SpargeAttn) are accepted by ICML 2025! \n- [2025-02-25]: 🔥 We release [SpargeAttn](https://github.com/thu-ml/SpargeAttn), a sparse attention method based on SageAttention2, which can accelerate any model without training.\n- [2025-02-15]: 🔥 The compilation code is updated to support RTX5090! On RTX5090, SageAttention reaches 560T, 2.7x faster than FlashAttention2!\n- [2025-01-28]: 🔥⚡SageAttention is now available on Hopper GPUs (H100, H800, H20)! 
It matches the speed of FlashAttention3-FP8 but offers **much better accuracy!**\n\n| **FlashAttention2** | **FlashAttention3** | **FlashAttention3-FP8** | **SageAttention** |\n|----------------------|----------------------|----------------------|----------------------|\n| ![FlashAttention2](assets/cogvideox1.5_fa2_example.gif) | ![FlashAttention3](assets/cogvideox1.5_fa3_example.gif)  | ![FlashAttention3-FP8](assets/cogvideox1.5_fa3fp8_example.gif) | ![SageAttention](assets/cogvideox1.5_sage_example.gif) |\n| **25'34''** | **17'32''** | **12'14''** | **12'07''** |\n\n*Results for [CogVideoX1.5-5B](https://huggingface.co/THUDM/CogVideoX1.5-5B) on NVIDIA H20 GPU*\n\n![Local Image](./assets/H100_hd128.png)\n\n![Local Image](./assets/H20_hd128.png)\n\n- [2025-01-24]: 🎉SageAttention is accepted by ICLR 2025! \n- [2024-12-20]: 🔥Update the [SageAttention2 Paper](https://arxiv.org/abs/2411.10958).   \n\n- [2024-12-20]: 🔥Release SageAttention 2.0.1 Beta! In this version, we introduce a new feature: per-thread quantization, which offers finer granularity while maintaining hardware efficiency.\n- [2024-11-21]: 🔥SageAttention 2.0.0 beta is released! 
Now SageAttention has measured speedups on L20, L40, A100, A800, A6000, RTX3090, and RTX4090.\n- [2024-11-12]: Support for `sageattn_varlen` is available now.\n- [2024-11-11]: Support for different sequence lengths between `q` and `k,v`, `(batch_size, head_num, seq_len, head_dim)` or `(batch_size, seq_len, head_num, head_dim)` input shapes, and `group-query attention` is available now.\n\n\n## Installation\n### Base environment\n+ `python\u003e=3.9`, `torch\u003e=2.3.0`, `triton\u003e=3.0.0`\n+ `CUDA`:\n  + `\u003e=12.8` for Blackwell or SageAttention2++\n  + `\u003e=12.4` for fp8 support on Ada\n  + `\u003e=12.3` for fp8 support on Hopper\n  + `\u003e=12.0` for Ampere\n+ `flash-attn` for benchmarking\n\n### Install Package\n\nFor SageAttention V1 in Triton (slower than SageAttention V2/V2++/V3), refer to [SageAttention-1](https://github.com/thu-ml/SageAttention/tree/sageattention-1) and install using pip: `pip install sageattention==1.0.6`\n\nTo use SageAttention 2.2.0 (containing SageAttention2++), please **compile from source**:\n```\ngit clone https://github.com/thu-ml/SageAttention.git\ncd SageAttention \nexport EXT_PARALLEL=4 NVCC_APPEND_FLAGS=\"--threads 8\" MAX_JOBS=32 # parallel compiling (Optional)\npython setup.py install  # or pip install -e .\n```\n\nTo benchmark the speed against FlashAttention3, please compile FlashAttention3 from source:\n```\ngit clone https://github.com/Dao-AILab/flash-attention.git --recursive\ncd flash-attention  # enter the cloned repository before checking out the commit\ngit checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7 # 2.7.2 release\ncd hopper\npython setup.py install\n```\n\n## How to Use\n```python\nfrom sageattention import sageattn\nattn_output = sageattn(q, k, v, tensor_layout=\"HND\", is_causal=False)\n```\n+ `q, k, v` are **FP16/BF16** tensors of shape `(batch_size, head_num, seq_len, head_dim)` under the default `tensor_layout=\"HND\"`. For shape `(batch_size, seq_len, head_num, head_dim)`, set `tensor_layout=\"NHD\"`. 
\n+ `is_causal` determines the use of a causal mask.\n\n### Available APIs:\n+ `sageattn`: Automatically selects the optimal kernel based on the GPU to achieve a good performance-accuracy trade-off.\n+ `sageattn_qk_int8_pv_fp16_triton`: INT8 quantization for $QK^\\top$ and FP16 for $PV$ using Triton backend.\n+ `sageattn_qk_int8_pv_fp16_cuda`: INT8 quantization for $QK^\\top$ and FP16 for $PV$ using CUDA backend.\n+ `sageattn_qk_int8_pv_fp8_cuda`: INT8 quantization for $QK^\\top$ and FP8 for $PV$ using CUDA backend. (Note that setting `pv_accum_dtype=fp32+fp16` corresponds to SageAttention2++.)\n+ `sageattn_qk_int8_pv_fp8_cuda_sm90`: INT8 quantization for $QK^\\top$ and FP8 for $PV$ using CUDA backend, specifically optimized for Hopper GPUs.\n+ `sageattn_varlen`: INT8 quantization for $QK^\\top$ and FP16 for $PV$ using Triton backend. Supports varying sequence lengths within the same batch.\n\nFor optimal speed and accuracy on custom devices and models, we strongly recommend referring to [this file](./sageattention/core.py) for detailed guidance.\n\n\u003e **Note:**\nSupport for different sequence lengths between `q` and `k,v` and `group-query attention` is available.\n\n\n### Plug-and-play Example\n\nWe can replace `scaled_dot_product_attention` easily. \nWe will take [CogvideoX](https://huggingface.co/THUDM/CogVideoX-2b) as an example:\n\nAdd the following code and run:\n```diff\nimport torch.nn.functional as F\n\n+ from sageattention import sageattn\n+ F.scaled_dot_product_attention = sageattn\n\n```\n\nSpecifically,\n\n```bash\ncd example\npython cogvideox-2b.py --compile --attention_type sage\n```\n\n**You can get a lossless video in** `./example` **faster than by using** `python cogvideox-2b.py --compile`. More examples and guidance can be found under the `example/` directory.\n\n\u003e **Note:** Not all models work with `F.scaled_dot_product_attention = sageattn`. 
Technically, you should replace the original Attention by modifying the `Attention Class` of the target model. For image and video models, we suggest only replacing the attention in DiT (see `example/mochi.py` for detail).\n\n### Kernel Benchmarking\nWe provide a benchmarking script to compare the speed of different kernels including SageAttention, FlashAttention2 and FlashAttention3. Please refer to the `benchmark/` directory for more details.\n \n## Performance\n### Speed of Kernels\n\n`8+8` means the kernel with INT8 quantization for $QK^\\top$ and FP8 quantization for $PV$. `8+16` uses FP16 with FP16 accumulator for $PV$.\n\n![Local Image](./assets/5090_sageattn2++.png)\n\n![Local Image](./assets/4090_sageattn2++.png)\n\n![Local Image](./assets/4090_hd128.png)\n\n![Local Image](./assets/L20_hd128.png)\n\n![Local Image](./assets/H100_hd128.png)\n\n![Local Image](./assets/H20_hd128.png)\n\n![Local Image](./assets/A100_hd128.png)\n\n![Local Image](./assets/3090_hd128.png)\n\n\u003e **Note:** The TOPS results refer only to the Attention Kernel, excluding the quantization and smoothing.\n\n### End-to-end Performance\n#### **End-to-End Accuracy:**\n\n![Local Image](./assets/22.png)\n\n![Local Image](./assets/23.png)\n\n![Local Image](./assets/24.png)\n\n![Local Image](./assets/25.png)\n\n#### **End-to-End Speedup:**\n\n![Local Image](./assets/26.png)\n*Note: SageAttention2++ achieves higher speed.*\n\n## Citation\n**If you use this code or find our work valuable, please cite:**\n```\n@inproceedings{zhang2025sageattention,\n  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration}, \n  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},\n  booktitle={International Conference on Learning Representations (ICLR)},\n  year={2025}\n}\n@inproceedings{zhang2024sageattention2,\n  title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},\n  author={Zhang, 
Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},\n  booktitle={International Conference on Machine Learning (ICML)},\n  year={2025}\n}\n@article{zhang2025sageattention2++,\n  title={Sageattention2++: A more efficient implementation of sageattention2},\n  author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},\n  journal={arXiv preprint arXiv:2505.21136},\n  year={2025}\n}\n@article{zhang2025sageattention3,\n  title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},\n  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},\n  journal={arXiv preprint arXiv:2505.11594},\n  year={2025}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2FSageAttention","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-ml%2FSageAttention","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2FSageAttention/lists"}