{"id":21389449,"url":"https://github.com/feifeibear/chituattention","last_synced_at":"2025-08-28T04:39:02.427Z","repository":{"id":260590599,"uuid":"881764541","full_name":"feifeibear/ChituAttention","owner":"feifeibear","description":"Quantized Attention on GPU","archived":false,"fork":false,"pushed_at":"2024-11-22T05:49:25.000Z","size":6467,"stargazers_count":44,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-13T15:44:01.908Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/feifeibear.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-01T07:10:48.000Z","updated_at":"2025-05-22T14:01:50.000Z","dependencies_parsed_at":"2025-07-13T15:43:44.053Z","dependency_job_id":null,"html_url":"https://github.com/feifeibear/ChituAttention","commit_stats":null,"previous_names":["feifeibear/chituattention"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/feifeibear/ChituAttention","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FChituAttention","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FChituAttention/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FChituAttention/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FChituAttention/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/feifeibear","download_url":"https://codeload.github.com/feifeibear/ChituAttention/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FChituAttention/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272439623,"owners_count":24935422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-28T02:00:10.768Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-22T12:26:35.881Z","updated_at":"2025-08-28T04:39:02.382Z","avatar_url":"https://github.com/feifeibear.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Chitu (赤兔) Attention: A Quantized Attention Library Suite\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./resource/chitu.png\" alt=\"Chitu Logo\" width=\"33%\"\u003e\n\u003c/p\u003e\n\nChituAttention is a comprehensive library of quantized Attention implementations. The name \"Chitu\" (赤兔, meaning \"Red Hare\") comes from a legendary swift horse in Chinese history, symbolizing speed and efficiency.\n\nThis library is designed for Attention computations for long sequences, where Attention calculations typically dominate the computational time compared to subsequent FFN operations. 
By quantizing the softmax(QK^T)V computation, ChituAttention achieves significant speedups.\n\nWe've collected implementations from two leading Quantized Attention repositories, namely [thu-ml/SageAttention](https://github.com/thu-ml/SageAttention) and [INT-FlashAttention](https://github.com/INT-FlashAttention2024/INT-FlashAttention), and unified their interfaces with that of FlashAttention. Chitu is designed to integrate seamlessly with the sequence parallel processing in the [feifeibear/Long-Context-Attention](https://github.com/feifeibear/long-context-attention) repository.\n\n## Why a Separate Repository?\n\nWe maintain this dedicated repository because both SageAttention and Int8FlashAttention are currently in their demonstration phases, making it challenging to merge our proposed improvements. Additionally, our extensive testing allows us to:\n- Present objective evaluation results, rather than cherry-picked results from papers\n- Rapidly introduce new features, such as distributed computing support\n- Address bugs promptly\n- Ensure robust performance across a wider range of use cases\n- Apply the kernels in Sequence Parallel USP\n\n## Known Issues\n\n- The LSE (log-sum-exp) returned by SageAttention differs substantially from the one returned by FlashAttention V2.\n\n## Installation\n\n```bash\npip install .\n```\n\n## Performance\n\nFA is FlashAttention, Sage is SageAttention, Int8 is Int8FlashAttention.\n\n\u003cdiv align=\"center\"\u003e\n\n### 1xL40 Performance (L40, float16)\n\n| Sequence Length | Method | Max Diff | Mean Diff | Latency (sec) |\n|----------------|--------|-----------|-----------|---------------|\n| 10K            | FA     | 0.00E+00  | 0.00E+00  | 1.57         |\n|                | Sage   | 2.20E-03  | 1.64E-04  | 1.08         |\n|                | Int8   | 1.95E-02  | 5.02E-04  | 2.74         |\n| 100K           | FA     | 0.00E+00  | 0.00E+00  | 154.68       |\n|                | Sage   | 6.10E-04  | 5.21E-05  | 111.55       |\n|                | Int8   | 7.39E-03  | 2.63E-04  | 262.83       |\n| 
1M             | FA     | 0.00E+00  | 0.00E+00  | 21055.36     |\n|                | Sage   | 1.74E-04  | 1.66E-05  | 12723.77     |\n|                | Int8   | 3.97E-03  | 1.34E-04  | 38920.12     |\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n### 8xL40 Performance with DeepSpeed-Ulysses\n\nRun on 8xL40 PCIe Gen4 GPUs in float16 format. The sequence length shown is the local sequence length. The global sequence length is 8 times the local sequence length.\n\n| Sequence Length | Method | Max Diff | Mean Diff | Latency (fp16) |\n|----------------|--------|-----------|-----------|----------------|\n| 10K            | FA     | 6.10E-05  | 2.56E-06  | 14.79         |\n|                | Sage   | 1.45E-03  | 1.64E-04  | 19.72         |\n|                | Int8   | 1.04E-02  | 4.99E-04  | 37.01         |\n| 100K           | FA     | 0.00E+00  | 0.00E+00  | 126.33        |\n|                | Sage   | 5.34E-04  | 5.20E-05  | 117.05        |\n|                | Int8   | 8.48E-03  | 2.64E-04  | 158.41        |\n| 1M             | FA     | 0.00E+00  | 0.00E+00  | 4726.77       |\n|                | Sage   | 1.10E-02  | 1.66E-05  | 3165.62       |\n|                | Int8   | 4.03E-03  | 1.34E-04  | 6414.29       |\n\nBased on these results, we can conclude that SageAttention has a lower mean error than Int8FlashAttention in every L40 configuration, and a lower max error in all but the 8xL40 1M case. SageAttention also achieves lower latency than FlashAttention on a single L40 and, on 8xL40, at local sequence lengths of 100K and 1M (it is slower at 10K). 
Int8FlashAttention not only shows noticeable errors but also fails to provide acceleration benefits.\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n### 1xA100 Performance (A100 NVLink, float16)\n\n| Sequence Length | Method | Max Diff | Mean Diff | Latency (sec) |\n|----------------|--------|-----------|-----------|---------------|\n| 10K            | FA     | 0.00E+00  | 0.00E+00  | 1.59         |\n|                | Sage   | 2.14E-03  | 1.65E-04  | 2.24         |\n|                | Int8   | 1.58E-02  | 5.02E-04  | 5.90         |\n| 100K           | FA     | 0.00E+00  | 0.00E+00  | 115.09       |\n|                | Sage   | 6.26E-04  | 5.22E-05  | 110.37       |\n|                | Int8   | 8.69E-03  | 2.63E-04  | 342.92       |\n| 1M             | FA     | 0.00E+00  | 0.00E+00  | 12539.97     |\n|                | Sage   | 1.72E-04  | 1.66E-05  | 11034.73     |\n|                | Int8   | 3.82E-03  | 1.34E-04  | 33801.61     |\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n### 8xA100\n\nRun on 8xA100 NVLink GPUs in float16 format. The sequence length shown is the local sequence length. 
The global sequence length is 8 times the local sequence length.\n\n| Sequence Length | Method | Max Diff | Mean Diff | Latency (fp16) |\n|----------------|--------|-----------|-----------|----------------|\n| 10K            | FA     | 6.10E-05  | 2.80E-06  | 0.93          |\n|                | Sage   | 1.43E-03  | 1.64E-04  | 1.69          |\n|                | Int8   | 1.33E-02  | 5.01E-04  | 2.47          |\n| 100K           | FA     | 0.00E+00  | 0.00E+00  | 16.71         |\n|                | Sage   | 8.24E-04  | 5.20E-05  | 16.74         |\n|                | Int8   | 7.64E-03  | 2.63E-04  | 44.01         |\n| 1M             | FA     | 0.00E+00  | 0.00E+00  | 1505.21       |\n|                | Sage   | 1.66E-04  | 1.66E-05  | 1391.92       |\n|                | Int8   | 3.29E-03  | 1.34E-04  | 4018.76       |\n\nOn A100, SageAttention has no significant latency advantage over FA and is even slower on \"short\" sequences (10K).\n\n\u003c/div\u003e\n\n## Citations\n\n**SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration**  \nPaper: https://arxiv.org/abs/2410.02367  \nJintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen\n\n**Int-FlashAttention: Enabling Flash Attention for Int8 Quantization**  \nPaper: https://arxiv.org/pdf/2409.16997v2  \nShimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fchituattention","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffeifeibear%2Fchituattention","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fchituattention/lists"}