{"id":13562925,"url":"https://github.com/proger/accelerated-scan","last_synced_at":"2025-12-25T07:06:19.998Z","repository":{"id":216466338,"uuid":"741400326","full_name":"proger/accelerated-scan","owner":"proger","description":"Accelerated First Order Parallel Associative Scan","archived":false,"fork":false,"pushed_at":"2024-08-20T05:17:42.000Z","size":1024,"stargazers_count":187,"open_issues_count":9,"forks_count":8,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-09-04T20:02:03.146Z","etag":null,"topics":["cuda","cumulative-sum","recurrent-neural-networks","state-space-model","torch"],"latest_commit_sha":null,"homepage":"https://twitter.com/darkproger/status/1745041586394648975","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/proger.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-10T10:13:27.000Z","updated_at":"2025-08-12T20:50:53.000Z","dependencies_parsed_at":"2024-01-14T00:20:32.964Z","dependency_job_id":"99ab8ed7-bcf8-46f9-b3b6-95bce840718d","html_url":"https://github.com/proger/accelerated-scan","commit_stats":{"total_commits":48,"total_committers":3,"mean_commits":16.0,"dds":0.5833333333333333,"last_synced_commit":"db7145f89a442ff0cef076d07f2ccc93c403b1c2"},"previous_names":["proger/accelerated-scan"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/proger/accelerated-scan","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proger%2Faccelerated-scan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/proger%2Faccelerated-scan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proger%2Faccelerated-scan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proger%2Faccelerated-scan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/proger","download_url":"https://codeload.github.com/proger/accelerated-scan/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proger%2Faccelerated-scan/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28022940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-25T02:00:05.988Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cumulative-sum","recurrent-neural-networks","state-space-model","torch"],"created_at":"2024-08-01T13:01:13.468Z","updated_at":"2025-12-25T07:06:19.992Z","avatar_url":"https://github.com/proger.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Accelerated Scan\n\n[![PyPI Version](https://img.shields.io/pypi/v/accelerated-scan.svg)](https://pypi.python.org/pypi/accelerated-scan) [![DOI](https://zenodo.org/badge/741400326.svg)](https://zenodo.org/doi/10.5281/zenodo.10600962)\n\n\nThis package implements the fastest [first-order parallel 
associative scan](https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf) on the GPU for both the forward and [backward](https://arxiv.org/abs/1709.04057) passes.\n\nThe scan efficiently solves first-order recurrences of the form `x[t] = gate[t] * x[t-1] + token[t]`, common in state space models and linear RNNs.\n\nThe `accelerated_scan.warp` C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest GPU communication primitives available\non each level of hierarchy: [warp shuffles](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/) within warps of 32 threads and shared memory (SRAM) between warps within a thread block. One sequence per channel dimension is confined to one thread block.\n\nThe derivation of [Chunked Scan](https://proger.github.io/posts/scan/chunk.html) has been used to extend the tree-level Blelloch algorithm to the block level.\n\nA similar implementation is available in `accelerated_scan.scalar` using Triton's `tl.associative_scan` primitive. It [requires at least Triton 2.2 for its `enable_fp_fusion` flag](https://twitter.com/darkproger/status/1742663555835363635).\n\nQuick Start:\n\n```bash\npip install accelerated-scan\n```\n\n```python\nimport torch\nfrom accelerated_scan.scalar import scan # uses tl.associative_scan in chunks\n#from accelerated_scan.warp import scan # a pure c++ kernel, faster than cub\n#from accelerated_scan.ref import scan # reference torch implementation\n\nbatch_size, dim, seqlen = 1, 512, 131072\nforget = 0.999 + 0.001 * torch.rand(batch_size, dim, seqlen, device=\"cuda\")\ninputs = torch.rand(batch_size, dim, seqlen, device=\"cuda\")\n\nout = scan(forget, inputs)\n```\n\nTo ensure numerical equivalence, a reference implementation for trees is provided in Torch. 
It can be sped up using `torch.compile`.\n\n## Benchmarks:\n\n![bench.png](bench.png)\n\nSee more benchmarks in nanokitchen: https://github.com/proger/nanokitchen\n\n\nForward-only speed for inputs of shape (8, 1536, seqlen), accelerated-scan version 0.2.0:\n```\n   SEQUENCE_LENGTH  accelerated_scan.triton (triton 2.2.0)  accelerated_scan.ref  accelerated_scan.warp\n0            128.0                                0.027382              0.380874               0.026844\n1            256.0                                0.049104              0.567916               0.048593\n2            512.0                                0.093008              1.067906               0.092923\n3           1024.0                                0.181856              2.048471               0.183581\n4           2048.0                                0.358250              3.995369               0.355414\n5           4096.0                                0.713511              7.897022               0.714536\n6           8192.0                                1.433052             15.698944               1.411390\n7          16384.0                                3.260965             31.305046               2.817152\n8          32768.0                               31.459671             62.557182               5.645697\n9          65536.0                               66.787331            125.208572              11.297921\n```\n\n## Notes on Precision\n\nWhen gates and tokens are sampled uniformly from 0..1, the limited precision of bfloat16 dominates the error (compared to the reference implementation):\n\n![max-abs-error.png](max-abs-error.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproger%2Faccelerated-scan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fproger%2Faccelerated-scan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproger%2Faccelerated-scan/lists"}