{"id":17632687,"url":"https://github.com/bruce-lee-ly/decoding_attention","last_synced_at":"2025-08-19T14:15:01.383Z","repository":{"id":254302176,"uuid":"842468267","full_name":"Bruce-Lee-LY/decoding_attention","owner":"Bruce-Lee-LY","description":"Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.","archived":false,"fork":false,"pushed_at":"2025-06-11T14:25:06.000Z","size":888,"stargazers_count":40,"open_issues_count":0,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-08-07T05:31:48.650Z","etag":null,"topics":["cuda","cuda-core","decoding-attention","flash-attention","flashinfer","flashmla","gpu","gqa","inference","large-language-model","llm","mha","mla","mqa","multi-head-attention","nvidia"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bruce-Lee-LY.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-08-14T12:14:54.000Z","updated_at":"2025-07-25T10:28:06.000Z","dependencies_parsed_at":"2024-08-22T17:15:40.640Z","dependency_job_id":"94c4fe60-ca95-4278-aaf8-436d955ad8a3","html_url":"https://github.com/Bruce-Lee-LY/decoding_attention","commit_stats":null,"previous_names":["bruce-lee-ly/decoding_attention"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Bruce-Lee-LY/decoding_attention","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fdecoding_attention","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fdecoding_attention/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fdecoding_attention/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fdecoding_attention/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bruce-Lee-LY","download_url":"https://codeload.github.com/Bruce-Lee-LY/decoding_attention/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fdecoding_attention/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271166354,"owners_count":24710465,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-19T02:00:09.176Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cuda-core","decoding-attention","flash-attention","flashinfer","flashmla","gpu","gqa","inference","large-language-model","llm","mha","mla","mqa","multi-head-attention","nvidia"],"created_at":"2024-10-23T01:45:07.067Z","updated_at":"2025-08-19T14:15:01.339Z","avatar_url":"https://github.com/Bruce-Lee-LY.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Decoding Attention\nDecoding Attention is specially optimized for Multi-Head Attention (MHA), Multi-Query Attention 
(MQA), Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA) using CUDA cores for the decoding stage of LLM inference. It draws mainly on OpenPPL and Flash Attention, addresses the low tensor core utilization of Flash Attention in the decoding stage of LLM inference, and supports more types of attention and KV cache quantization optimization. In some LLM inference decoding scenarios, Decoding Attention performs better than Flash Decoding (Flash Attention) and FlashInfer. Decoding Attention also supports variable-length and ALiBi inference scenarios. The computation is as follows, where tensors Q, K, V and O are in FP16 or BF16 precision.\n```\nO = Softmax(Q * K^T) * V\n```\n\n![dmha](./media/images/dmha.png)\n\n# Support\n- Variable Length: Variable KV length inference\n- ALiBi: Attention with linear biases inference\n\n# Environment\n- OS: Linux\n- CMake Version: \u003e= 3.16\n- GCC Version: \u003e= 5.0\n- CUDA Version: \u003e= 11.4\n- Others: gflags, ccache, pytest\n```\nsudo apt-get install libgflags-dev ccache\npip install pytest\n```\n\n# Clone\n```\ngit clone https://github.com/Bruce-Lee-LY/decoding_attention.git\n```\n\n# CPP API\n## Build\n### NVIDIA A100\n```\ncd decoding_attention\n./build_cpp.sh -a 80 -t Release -b OFF\n./build_cpp.sh -a 80 -t Debug -b OFF\n```\n\n### RTX3080Ti / RTX3090 / RTX A6000\n```\ncd decoding_attention\n./build_cpp.sh -a 86 -t Release -b OFF\n./build_cpp.sh -a 86 -t Debug -b OFF\n```\n\n### L20 / L40S\n```\ncd decoding_attention\n./build_cpp.sh -a 89 -t Release -b OFF\n./build_cpp.sh -a 89 -t Debug -b OFF\n```\n\n### H20 / H800\n```\ncd decoding_attention\n./build_cpp.sh -a 90 -t Release -b OFF\n./build_cpp.sh -a 90 -t Debug -b OFF\n```\n\n## Test\n```\n./run_cpp.sh\n```\n\n## Benchmark\n```\n./run_cpp.sh\n```\n\n## Performance\nProcess the C++ results in the log and plot them as a line chart.\n\n```\ncd tools/performance/cpp\n./performance.sh\n```\n\n# Python 
API\n## Install\n```\ncd decoding_attention\n./install_python.sh\n```\n\n## Test\n```\n./run_python.sh\n```\n\n## Benchmark\n```\n./run_python.sh\n```\n\n## Performance\nProcess the Python results in the log and plot them as a line chart.\n\n```\ncd tools/performance/python\n./performance.sh\n```\n\n### MHA Running on RTX3090\n- CUDA Version: 12.1\n- Head Num: 32\n- Head Dim: 128\n- Data Type: FP16\n\n#### Seq Len\nDecoding Attention performs better when the sequence length is below 1536, while Flash Decoding (Flash Attention) and FlashInfer perform better when the sequence length is above 1536.\n- Batch Size: 1\n- Seq Q: 1\n- Seq K: Seq Len\n\n![seq_throughput](./performance/RTX3090/seq_throughput.png)\n\n#### Batch Size\nRegardless of batch size, Decoding Attention performs better than Flash Decoding (Flash Attention) and FlashInfer.\n- Batch Size: Batch Size\n- Seq Q: 1\n- Seq K: 128\n\n![batch_throughput](./performance/RTX3090/batch_throughput.png)\n\n### MLA Running on H20\n- CUDA Version: 12.4\n- Head Num: 128\n- Head Num K: 1\n- Head Dim: 576\n- Head Dim V: 512\n- Data Type: FP16\n\n#### Seq Len\n- Batch Size: 1\n- Seq Q: 1\n- Seq K: Seq Len\n\n![seq_throughput](./performance/H20/seq_throughput.png)\n![seq_bandwidth](./performance/H20/seq_bandwidth.png)\n\n#### Batch Size\n- Batch Size: Batch Size\n- Seq Q: 1\n- Seq K: 4096\n\n![batch_throughput](./performance/H20/batch_throughput.png)\n![batch_bandwidth](./performance/H20/batch_bandwidth.png)\n\n# Reference\n- [ppl.llm.kernel.cuda](https://github.com/OpenPPL/ppl.llm.kernel.cuda)\n- [flash-attention](https://github.com/Dao-AILab/flash-attention): v2.6.3\n- [flashinfer](https://github.com/Bruce-Lee-LY/flashinfer): v0.1.6\n- [FlashMLA](https://github.com/deepseek-ai/FlashMLA)\n\n# TODO\n- Kernel Optimization\n- KV Cache Quantization: 
FP8, Int8, Int4\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbruce-lee-ly%2Fdecoding_attention","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbruce-lee-ly%2Fdecoding_attention","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbruce-lee-ly%2Fdecoding_attention/lists"}