https://github.com/IntelLabs/SCAP
Statistical Calibrated Activation Pruning
- Host: GitHub
- URL: https://github.com/IntelLabs/SCAP
- Owner: IntelLabs
- License: apache-2.0
- Created: 2024-10-18T15:25:38.000Z
- Default Branch: main
- Last Pushed: 2025-05-13T05:46:27.000Z
- Last Synced: 2025-05-13T06:31:46.284Z
- Language: Python
- Size: 29.3 KB
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
- Awesome-Activation-Sparsification
README
# Statistical Calibrated Activation Pruning (SCAP)
This repo contains the reference code for "[Post-Training Statistical Calibration for Higher Activation Sparsity](https://arxiv.org/abs/2412.07174)".
If you find our work useful in your research, please consider citing our paper:
```bibtex
@InProceedings{chua2024scap,
title = {Post-Training Statistical Calibration for Higher Activation Sparsity},
author = {Chua, Vui Seng and Pan, Yujie and Jain, Nilesh},
booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
year = {2024},
volume = {262},
series = {Proceedings of Machine Learning Research}
}
```
## Abstract
We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5× additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability.
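The Mode-Centering idea in the abstract can be illustrated with a minimal sketch. This is not the repo's implementation: the function names `calibrate` and `prune`, the quantile rule, and the use of plain Python lists are assumptions made for illustration. The sketch estimates a center (zero, or the median as a stand-in for the mode), picks a magnitude threshold that hits a target sparsity on calibration samples, and zeroes the centered activations that fall below it.

```python
import statistics


def calibrate(samples, sparsity, mode="median"):
    """Return (center, threshold) for a target sparsity (hypothetical helper).

    `mode` selects the center: "zero" means no centering, "median" uses the
    sample median as a simple estimate of the distribution's mode.
    """
    if mode == "zero":
        center = 0.0
    elif mode == "median":
        center = statistics.median(samples)
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    # Sort the centered magnitudes and take the sparsity-quantile as threshold.
    deviations = sorted(abs(x - center) for x in samples)
    k = min(int(sparsity * len(deviations)), len(deviations) - 1)
    return center, deviations[k]


def prune(x, center, threshold):
    """Center the activations, then zero entries with magnitude below threshold."""
    return [(v - center) if abs(v - center) >= threshold else 0.0 for v in x]
```

When the activations cluster around a nonzero value, centering first yields a much smaller threshold for the same sparsity, so fewer large-magnitude values are discarded. Because `W x = W (x - m) + W m` for a constant shift `m`, the shift can in principle be folded into a bias term; the KDE-peak estimator is omitted here.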
## Setup
Create a virtual environment and install the pinned dependencies:
```bash
# recommended python version: 3.10.13
python -m venv ./scap_env
source ./scap_env/bin/activate
# install torch
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu121
# install dependencies
pip install transformers==4.44.0 datasets==2.21.0 accelerate tqdm rich seaborn matplotlib wheel \
git+https://github.com/EleutherAI/lm-evaluation-harness.git@906ef948dc8dbb4c84e1bb0f2861b1aba30ab533
# install gemv kernel
pip install triton "git+https://github.com/ScalingIntelligence/CATS.git@0bda7708b835f20c59f4dd59d3d32b0c5f2f6376#egg=flash_gemv&subdirectory=flash_gemv"
```
## Reproducing the results
### 1. Run calibration
Get the calibrated thresholds of SCAP for each model and sparsity config.
```bash
bash scripts/01.calibration.bash
```
_You can skip this calibration step: the following model configs are already uploaded to the repo._
| Model ID                  | Config in the bash script                  | Up/gate sparsity           | Down sparsity               |
| ------------------------- | ------------------------------------------ | -------------------------- | --------------------------- |
| meta-llama/Llama-2-7b-hf | up,zero,0.35,gate,zero,0.35,down,zero,0.55 | 35% without mode centering | 55% without mode centering |
| mistralai/Mistral-7B-v0.1 | up,zero,0.3,gate,zero,0.3,down,zero,0.7 | 30% without mode centering | 70% without mode centering |
| mosaicml/mpt-7b | down,kde,0.5 | / | 50% with _kde peak_ as mode |
| tiiuae/falcon-7b | down,median,0.5 | / | 50% with _median_ as mode |
The resulting `calibrated_thresholds.json` file in the `results/scap/` folder records the mode and threshold for each FFN layer specified in the config.
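Judging from the table above, each config string appears to be a flat, comma-separated list of `(projection, mode, sparsity)` triples. The sketch below parses that format under this assumption; `parse_sparsity_config` is a hypothetical helper, not a function from the repo.

```python
def parse_sparsity_config(config):
    """Parse 'up,zero,0.35,down,zero,0.55' into {projection: (mode, sparsity)}.

    Assumes the string is a sequence of (projection, mode, sparsity) triples,
    as the table rows suggest. Hypothetical helper for illustration only.
    """
    fields = config.split(",")
    if len(fields) % 3 != 0:
        raise ValueError("expected a multiple of three comma-separated fields")
    parsed = {}
    for i in range(0, len(fields), 3):
        projection, mode, sparsity = fields[i], fields[i + 1], float(fields[i + 2])
        parsed[projection] = (mode, sparsity)
    return parsed


# Example with the Llama-2-7b row from the table.
llama_config = parse_sparsity_config("up,zero,0.35,gate,zero,0.35,down,zero,0.55")
```

Read this way, `zero` means no mode centering, while `median` and `kde` name the estimator used for the mode.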
### 2. Evaluation on zero-shot tasks
Evaluate the model on the zero-shot tasks listed in the paper: `winogrande`, `piqa`, `sciq`, `hellaswag`, `boolq`, `arc_easy`, and `arc_challenge`.
Results are written to the `results/scap/` folder.
```bash
bash scripts/02.evaluate_zero_shot_tasks.bash
```
The resulting `evaluation_results.json` file contains (1) the evaluation metrics for each task and (2) the average actual input sparsity for each layer.
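For intuition, "actual input sparsity" can be read as the fraction of exactly-zero entries in a layer's input, averaged over the evaluated inputs. The sketch below shows that reading with plain Python lists; the function names are hypothetical and the repo's own bookkeeping may differ.

```python
def input_sparsity(activations):
    """Fraction of exactly-zero entries in one activation vector."""
    return sum(1 for v in activations if v == 0.0) / len(activations)


def average_sparsity(batches):
    """Mean of the per-vector sparsity over all observed input vectors."""
    per_vector = [input_sparsity(b) for b in batches]
    return sum(per_vector) / len(per_vector)
```

The measured value can differ from the target sparsity in the config, because thresholds are fixed at calibration time while evaluation inputs follow a somewhat different distribution.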
### 3. Inference with sparse kernel
This demo runs inference on SCAP-optimized models with the sparse GEMV kernel.
```bash
bash scripts/03.inference_demo.bash
```
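To see why input sparsity can speed up decoding, consider the matrix-vector product `y = W x`: every zero entry of `x` makes the matching column of `W` irrelevant, so it need not be read or multiplied. The pure-Python sketch below only illustrates that idea under this assumption; it is not the Triton `flash_gemv` kernel installed in Setup, and the function names are hypothetical.

```python
def dense_gemv(W, x):
    """Reference dense matrix-vector product over all columns."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sparse_gemv(W, x):
    """Matrix-vector product that visits only columns where x is nonzero."""
    active = [j for j, v in enumerate(x) if v != 0.0]
    return [sum(row[j] * x[j] for j in active) for row in W]
```

With 50% input sparsity, the sparse variant touches roughly half of the weight columns, which matters most in memory-bound decoding where loading `W` dominates the cost.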
## Acknowledgement
This work is built atop [CATS](https://github.com/ScalingIntelligence/CATS), which we believe also extends from [DejaVu](https://github.com/FMInference/DejaVu). Credits go to the original authors of these projects.