{"id":20513400,"url":"https://github.com/bruce-lee-ly/flash_attention_inference","last_synced_at":"2025-08-25T18:47:04.801Z","repository":{"id":253116337,"uuid":"679281575","full_name":"Bruce-Lee-LY/flash_attention_inference","owner":"Bruce-Lee-LY","description":"Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.","archived":false,"fork":false,"pushed_at":"2025-02-27T00:36:03.000Z","size":2082,"stargazers_count":35,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-27T14:01:37.143Z","etag":null,"topics":["cuda","cutlass","flash-attention","flash-attention-2","gpu","inference","large-language-model","llm","mha","multi-head-attention","nvidia","tensor-core"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bruce-Lee-LY.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-16T13:46:07.000Z","updated_at":"2025-02-27T00:36:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"ca018303-245c-45da-9a55-1cb97fea58aa","html_url":"https://github.com/Bruce-Lee-LY/flash_attention_inference","commit_stats":{"total_commits":8,"total_committers":1,"mean_commits":8.0,"dds":0.0,"last_synced_commit":"eded2509adca83b52379217829dbad6decf6e7a0"},"previous_names":["bruce-lee-ly/flash_attention_inference"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY
%2Fflash_attention_inference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fflash_attention_inference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fflash_attention_inference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fflash_attention_inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bruce-Lee-LY","download_url":"https://codeload.github.com/Bruce-Lee-LY/flash_attention_inference/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248799905,"owners_count":21163400,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cutlass","flash-attention","flash-attention-2","gpu","inference","large-language-model","llm","mha","multi-head-attention","nvidia","tensor-core"],"created_at":"2024-11-15T21:10:51.087Z","updated_at":"2025-04-13T23:50:36.350Z","avatar_url":"https://github.com/Bruce-Lee-LY.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Flash Attention Inference\nPerformance of the C++ interface of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios. The computation is given by the expression below, where tensors Q, K, V and O are all FP16. Code unrelated to inference, such as backward, dropout, bf16 and torch dependencies, has been removed from Flash Attention, so it can be easily integrated into LLM inference programs. 
In addition, Flash Attention and Flash Attention v2 have been modified to support three inference scenarios: group query attention (GQA) / multi query attention (MQA), hybrid prefill and decoding, and attention with linear biases (ALiBi).\n```\nO = Softmax(Q * K^T) * V\n```\n\n![fmha](./media/images/fmha.png)\n\n# Support\n- GQA/MQA Inference: Group query attention / multi query attention inference\n- Hybrid Inference: Hybrid inference combining prefill and decoding\n- ALiBi Inference: Attention with linear biases inference\n\n# Compile\n## Environment\n- OS: Linux\n- CMake Version: \u003e= 3.16\n- GCC Version: \u003e= 5.0\n- CUDA Version: \u003e= 11.4\n- Others: gflags, ccache\n```\nsudo apt-get install libgflags-dev ccache\n```\n\n## Clone\n```\ngit clone https://github.com/Bruce-Lee-LY/flash_attention_inference.git\n```\n\n## Build\n### NVIDIA A100\n```\ncd flash_attention_inference\n./build.sh -a 80 -t Release -b OFF\n./build.sh -a 80 -t Debug -b OFF\n```\n\n### RTX3080Ti / RTX3090 / RTX A6000\n```\ncd flash_attention_inference\n./build.sh -a 86 -t Release -b OFF\n./build.sh -a 86 -t Debug -b OFF\n```\n\n# Run Sample\n```\n./run_sample.sh\n```\n\n# Performance\nProcess the data in the log and plot it as a line chart.\n\n```\ncd tools/performance\n./performance.sh\n```\n\n## RTX3090\n- CUDA Version: 11.8\n- Head Num: 32\n- Head Dim: 128\n\n### Prefill\n#### Seq Len\nThe two kernels perform similarly for short sequences, while Flash Attention v2 performs better for long sequences, where the gain can reach about 60%.\n- Batch Size: 1\n- Seq Q: Seq Len\n- Seq K: Seq Len\n\n![prefill_seq_throughput](./performance/RTX3090/prefill_seq_throughput.png)\n\n#### Batch Size\nWhen the batch size is small, Flash Attention v2 performs better. 
When the batch size is large, the performance of the two kernels is comparable.\n- Batch Size: Batch Size\n- Seq Q: 128\n- Seq K: 128\n\n![prefill_batch_throughput](./performance/RTX3090/prefill_batch_throughput.png)\n\n### Decoding\n#### Seq Len\nThe two kernels perform similarly for short sequences, while Flash Attention performs better for long sequences, where the gain can reach about 100%.\n- Batch Size: 1\n- Seq Q: 1\n- Seq K: Seq Len\n\n![decoding_seq_throughput](./performance/RTX3090/decoding_seq_throughput.png)\n\n#### Batch Size\nFlash Attention outperforms Flash Attention v2 regardless of batch size.\n- Batch Size: Batch Size\n- Seq Q: 1\n- Seq K: 128\n\n![decoding_batch_throughput](./performance/RTX3090/decoding_batch_throughput.png)\n\n### Hybrid\nRegardless of the ratio of prefill to decoding, Flash Attention and Flash Attention v2 perform similarly.\n- Batch Size: 100\n- Seq Q: 128\n- Seq K: 128\n\n![hybrid_throughput](./performance/RTX3090/hybrid_throughput.png)\n\n# Reference\n## [flash-attention](https://github.com/Dao-AILab/flash-attention)\n- Flash Attention: v1.0.9\n- Flash Attention v2: v2.1.0\n\n## [cutlass](https://github.com/NVIDIA/cutlass)\n- cutlass: v3.1.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbruce-lee-ly%2Fflash_attention_inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbruce-lee-ly%2Fflash_attention_inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbruce-lee-ly%2Fflash_attention_inference/lists"}