{"id":28100709,"url":"https://github.com/mit-han-lab/spatten","last_synced_at":"2025-05-13T18:38:19.171Z","repository":{"id":204619881,"uuid":"711291697","full_name":"mit-han-lab/spatten","owner":"mit-han-lab","description":"[HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning","archived":false,"fork":false,"pushed_at":"2024-08-27T19:21:28.000Z","size":2077,"stargazers_count":60,"open_issues_count":1,"forks_count":5,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-08-27T21:25:43.105Z","etag":null,"topics":["attention","hardware-acceleration","llm-inference","rtl","spinalhdl"],"latest_commit_sha":null,"homepage":"https://hanlab.mit.edu/projects/spatten","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mit-han-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-28T19:37:41.000Z","updated_at":"2024-08-23T08:08:38.000Z","dependencies_parsed_at":null,"dependency_job_id":"ab38a79b-cfe3-434b-8380-4bd47a5211c3","html_url":"https://github.com/mit-han-lab/spatten","commit_stats":null,"previous_names":["mit-han-lab/spatten-llm","mit-han-lab/spatten"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fspatten","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fspatten/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fspatten/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/mit-han-lab%2Fspatten/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mit-han-lab","download_url":"https://codeload.github.com/mit-han-lab/spatten/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254004775,"owners_count":21998121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","hardware-acceleration","llm-inference","rtl","spinalhdl"],"created_at":"2025-05-13T18:38:18.539Z","updated_at":"2025-05-13T18:38:19.145Z","avatar_url":"https://github.com/mit-han-lab.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"\n# SpAtten: Sparse Attention with Token Pruning and Head Pruning in Large Language Models\n\n\n[[paper](https://arxiv.org/abs/2012.09852)] [[slides](https://www.dropbox.com/s/z189gu92h7uy7yt/SpAtten-for-long-video-no-animation.pdf?dl=0)] [[video](https://www.youtube.com/watch?v=Cln8hFxM9Do)] [[website](https://hanlab.mit.edu/projects/spatten)] \n\n\u003c!-- ![schemes](figures/schemes.png) --\u003e\n\n\n## TL;DR\nWe propose sparse attention (SpAtten) with **KV token pruning, local V pruning, head pruning, and KV progressive quantization** to improve LLM efficiency.\n\n## News\n- SpAtten and SpAtten-Chip won the 1st Place Award at 2023 DAC University Demo.\n- SpAtten is spotlighted on [MIT Homepage](http://mit.edu/spotlight/streamlining-sentence-analysis).\n- SpAtten is covered by [MIT News](https://news.mit.edu/2021/language-learning-efficiency-0210).\n- [2023/10] SpAtten-LLM and SpAtten hardware 
released.\n\n\n## Abstract\nWe present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel KV token pruning to prune away unimportant tokens in the sentence. We also propose head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose KV progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction.\n\n### Token pruning for classification task:\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"460\" src=\"assets/corrected-teaser.png\"\u003e\n\u003c/p\u003e\n\n\n\n### Token pruning for generation task:\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"560\" src=\"assets/fig_gpt.jpeg\"\u003e\n\u003c/p\u003e\n\n\n\n## SpAtten Usage\n\n### Environment Setup\n\n```bash\nconda create -yn spatten python=3.8\nconda activate spatten\n\npip install torch torchvision torchaudio\npip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece\n\npython setup.py develop\n```\n\n### Run SpAtten Llama Chatbot\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python run_spatten_llama.py  --enable_spatten\n```\n\n## SpAtten Hardware Usage\nThis repo also contains the RTL-level simulation model of SpAtten in `spatten_hardware/hardware/` for accurate performance evaluation on generative models like GPT-2 and a fast behavior model in `spatten_hardware/simulator` for quick evaluation on 
BERT.\n\n### Running RTL simulation for SpAtten\n#### Prerequisites\n- [Verilator](https://www.veripool.org/verilator/) version [v4.218](https://github.com/verilator/verilator/releases/tag/v4.218)\n\n  Note that there is a known [issue](https://github.com/verilator/verilator/issues/4424) with the latest Verilator that may cause random assertion failures when the simulation starts. Use v4.218 as a workaround.\n- [SBT](https://www.scala-sbt.org/)\n- C/C++ build tools for Verilator and Ramulator: `gcc` and `g++` version 12 or later, plus `cmake`\n- Workload information in CSV format. Examples are provided in `hardware/workloads`.\n\n#### Quick Start\nBuild ramulator2:\n```\n$ cd spatten_hardware/hardware/third_party/ramulator2\n$ mkdir build\n$ cd build\n$ cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo\n$ make\n$ cd ../../../..\n```\nBuild the Verilog (DPI) interface for Ramulator:\n```\n$ cd hardware/dpi\n$ make\n$ cd ../../..\n```\nUse the Python script to run the SpAtten simulation with a workload file:\n```\npython3 run_spatten_hardware.py hardware/workloads/summary-gpt2-small-wikitext2-per8.csv\n```\nThe evaluation results are located in the working directory at `spatten.workdir/summary.txt`.\n\n### SpAtten Hardware Architecture\n![spatten arch](https://assets-global.website-files.com/64f4e81394e25710d22d042e/6515ab835deaead9f35609ac_spatten_arch.jpeg)\n\nSpAtten uses a specialized pipeline to support efficient attention and focuses on memory traffic optimizations for decoding models like GPT-2 and LLMs. 
\n\nThis repo contains the following major SpAtten modules; the main pipeline is implemented in [SpAttenController.scala](./spatten_hardware/hardware/src/main/scala/spatten/SpAttenController.scala). Numbers in parentheses refer to the figure.\n\n- A parallelized top-k unit (10) that dynamically decides the values to fetch: [TopK.scala](./spatten_hardware/hardware/src/main/scala/spatten/TopK.scala), which uses [QuickSelect.scala](./spatten_hardware/hardware/src/main/scala/spatten/utils/QuickSelect.scala) to choose the k-th largest element from the attention probabilities\n- A matrix fetcher ((3) and (6)) that loads the key/value matrix from DRAM and converts the bitwidth when necessary: [MatrixFetcher.scala](./spatten_hardware/hardware/src/main/scala/spatten/MatrixFetcher.scala)\n- The Q\\*K (7) and Prob\\*V (11) units and the corresponding key/value buffers: [DotProduct.scala](./spatten_hardware/hardware/src/main/scala/spatten/DotProduct.scala), [MultiplyValue.scala](./spatten_hardware/hardware/src/main/scala/spatten/MultiplyValue.scala), [Buffer.scala](./spatten_hardware/hardware/src/main/scala/spatten/Buffer.scala), [BufferManager.scala](./spatten_hardware/hardware/src/main/scala/spatten/BufferManager.scala)\n- A progressive quantization module (9) that decides whether to load the LSBs of keys: [RequantDecision.scala](./spatten_hardware/hardware/src/main/scala/spatten/RequantDecision.scala)\n\n\n## TODOs\nWe will release the code and data soon; please stay tuned.\n\n- [ ] Release core code of SpAtten, including Llama-2, MPT, Falcon, and Pythia.\n- [ ] Release SpAtten perplexity evaluation code.\n- [ ] Release SpAtten Llama Chatbot demo.\n- [ ] Release a Docker image for hardware simulation.\n\n\n## Citation\n\nIf you find SpAtten useful or relevant to your project and research, please kindly cite our paper:\n\n```bibtex\n@article{wang2021spatten,\n        title={SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning},\n        author={Wang, Hanrui and Zhang, 
Zhekai and Han, Song},\n        journal={HPCA},\n        year={2021}\n        }\n```\n\u003c!-- \n```bibtex\n@article{wang2021spattenllm,\n        title={SpAtten-LLM: Sparse Attention with Token Pruning and Head Pruning in Large Language Models},\n        author={Wang, Hanrui and Xiao, Guangxuan and Yang, Shang and Tang, Haotian, and Zhang, Zhekai and Han, Song},\n        journal={Technical Report},\n        year={2023}\n        }\n``` --\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-han-lab%2Fspatten","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmit-han-lab%2Fspatten","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-han-lab%2Fspatten/lists"}