{"id":19631065,"url":"https://github.com/fasterdecoding/teal","last_synced_at":"2025-04-06T09:07:19.449Z","repository":{"id":255263679,"uuid":"848716186","full_name":"FasterDecoding/TEAL","owner":"FasterDecoding","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-15T04:48:08.000Z","size":65710,"stargazers_count":122,"open_issues_count":2,"forks_count":5,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-30T08:12:10.371Z","etag":null,"topics":["llm","llm-inference","sparsity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FasterDecoding.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-28T09:19:51.000Z","updated_at":"2025-03-29T04:33:28.000Z","dependencies_parsed_at":"2025-02-27T15:35:42.925Z","dependency_job_id":null,"html_url":"https://github.com/FasterDecoding/TEAL","commit_stats":null,"previous_names":["fasterdecoding/teal"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FTEAL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FTEAL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FTEAL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FTEAL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FasterDecoding","download_url":"https://codeload.github.com/FasterDecod
ing/TEAL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247457800,"owners_count":20941906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","llm-inference","sparsity"],"created_at":"2024-11-11T12:07:44.752Z","updated_at":"2025-04-06T09:07:19.425Z","avatar_url":"https://github.com/FasterDecoding.png","language":"Python","readme":"# Training-Free Activation Sparsity in Large Language Models\n\n[[Paper](https://www.arxiv.org/abs/2408.14690)][[Blog](https://www.together.ai/blog/teal-training-free-activation-sparsity-in-large-language-models)]\n\nTEAL induces up to 40-50% model-wide activation sparsity in modern LLMs with minimal degradation, resulting in up to a 1.53-1.8x speedup in single-batch decoding.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"figures/clickbait.png\" width=\"500\" height=\"auto\"/\u003e\n\u003c/div\u003e\n\nThe current release supports:\n- FP16 inference for Llama-2/3 models using uniform sparsities\n- Accuracy evaluation for Llama-2/3 and Mistral models using uniform and block-wise greedy sparsities\n\n## News\n\n- [01/2025] 🔥 TEAL is accepted to ICLR 2025 as a Spotlight!\n- [08/2024] 🔥 arXiv release!\n\n## Abstract\n\nActivation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory-movement required for matrix\nmultiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. 
Some approaches are tailored towards\nolder models with ReLU-based sparsity, while others require extensive continued\npre-training on up to hundreds of billions of tokens. This paper describes TEAL\n(**T**raining-Fre**e** **A**ctivation Sparsity in **L**LMs), a simple training-free method that\napplies magnitude-based activation sparsity to hidden states throughout the entire\nmodel. TEAL achieves 40-50% model-wide sparsity with minimal performance\ndegradation across the Llama-2, Llama-3, and Mistral families, with sizes varying\nfrom 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock\ndecoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity.\nTEAL is compatible with weight quantization, enabling further efficiency gains.\n\n## Contents\n\n- [Install](#install)\n- [Inference Usage](#inference-usage)\n- [Accuracy Usage](#accuracy-usage)\n- [Citation](#citation)\n\n## Install\n\n1. Clone the repo and navigate to TEAL:\n\n```bash\ngit clone https://github.com/FasterDecoding/TEAL\ncd TEAL\n```\n\n2. Set up the environment:\n\n```bash\nconda create -yn teal python=3.11\nconda activate teal\n\npip install -e .\n```\n\n3. (Optional) To calibrate thresholds for your own models or to run accuracy evaluations, install the evaluation dependencies:\n\n  ```bash\n  pip install -e \".[eval]\"\n  ```\n\n## Inference Usage\n\nWe provide calibrated thresholds for Llama-2/3 and Mistral models in the `models/` folder.\n\n1. Navigate to gpt-fast:\n\n```bash\ncd gpt-fast\n```\n\n2. Download model weights and convert them to gpt-fast format (`scripts/prepare.sh`):\n\n```bash\npython scripts/download.py --repo_id meta-llama/Llama-2-7b-hf --path $SAVE_PATH \u0026\u0026 python scripts/convert_hf_checkpoint.py --checkpoint_dir $SAVE_PATH/meta-llama/Llama-2-7b-hf\n```\n\n3. 
Run dense inference (`scripts/base_run.sh`):\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python generate.py \\\n    --compile \\\n    --checkpoint_path $SAVE_PATH/meta-llama/Llama-2-7b-hf/model.pth \\\n    --interactive\n```\n\n4. Run sparse inference (`scripts/run.sh`):\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python generate.py \\\n    --compile \\\n    --checkpoint_path $SAVE_PATH/meta-llama/Llama-2-7b-hf/model.pth \\\n    --hist_path ../models/Llama-2-7B/histograms \\\n    --sparsity 0.5 \\\n    --interactive\n```\n\nTo benchmark inference speed, remove `--interactive`.\n\nPlease treat the current inference implementation as a proof of concept! There are a few limitations:\n- Only FP16 is supported, as Triton does not currently support BF16 `atomic_add`.\n- Block-wise greedy sparsities are not yet supported (expect this very soon!).\n- Quantized sparse kernels are not yet supported (though we would love a PR!).\n- Speculative decoding is untested.\n\n## Accuracy Usage\n\n1. Navigate to TEAL:\n\n```bash\ncd TEAL\n```\n\n2. Construct histograms for threshold calibration (`scripts/grab_acts.bash`):\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python teal/grab_acts.py \\\n  --model_name meta-llama/Llama-2-7b-hf \\\n  --output_path $OUTPUT_PATH\n```\n\n3. Run the perplexity test (`scripts/ppl_test.bash`):\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python teal/ppl_test.py \\\n  --model_name meta-llama/Llama-2-7b-hf \\\n  --teal_path $OUTPUT_PATH \\\n  --sparsity 0.5\n```\n\n4. 
(Optional) Run block-wise greedy optimization (`scripts/greedyopt.bash`):\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python teal/greedyopt.py \\\n  --model_name meta-llama/Llama-2-7b-hf \\\n  --model_type Llama-2-7B \\\n  --teal_path $OUTPUT_PATH \\\n  --target_sparsity 0.9 \\\n  --base_step_size 0.05 \\\n  --last_fraction 0.25\n```\n\nThen evaluate perplexity with the optimized block-wise sparsities by passing `--greedy_flag`:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python teal/ppl_test.py \\\n  --model_name meta-llama/Llama-2-7b-hf \\\n  --teal_path $OUTPUT_PATH \\\n  --sparsity 0.5 \\\n  --greedy_flag\n```\n\n## Citation\n\nIf you find TEAL useful, please consider citing:\n\n```bibtex\n@misc{liu2024trainingfreeactivationsparsitylarge,\n      title={Training-Free Activation Sparsity in Large Language Models},\n      author={James Liu and Pragaash Ponnusamy and Tianle Cai and Han Guo and Yoon Kim and Ben Athiwaratkun},\n      year={2024},\n      eprint={2408.14690},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2408.14690},\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffasterdecoding%2Fteal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffasterdecoding%2Fteal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffasterdecoding%2Fteal/lists"}