{"id":13754425,"url":"https://github.com/mit-han-lab/smoothquant","last_synced_at":"2025-04-08T09:08:43.950Z","repository":{"id":64840124,"uuid":"567389569","full_name":"mit-han-lab/smoothquant","owner":"mit-han-lab","description":"[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models","archived":false,"fork":false,"pushed_at":"2024-07-12T03:11:08.000Z","size":6966,"stargazers_count":1374,"open_issues_count":69,"forks_count":167,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-04-01T07:51:20.424Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2211.10438","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mit-han-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-17T17:27:49.000Z","updated_at":"2025-04-01T04:08:48.000Z","dependencies_parsed_at":"2023-02-10T23:15:39.984Z","dependency_job_id":"6249c026-538b-4084-9355-18d24f66c073","html_url":"https://github.com/mit-han-lab/smoothquant","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fsmoothquant","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fsmoothquant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fsmoothquant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Fsmoothquant/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mit-han-lab","download_url":"https://codeload.github.com/mit-han-lab/smoothquant/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247809962,"owners_count":20999816,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:59.151Z","updated_at":"2025-04-08T09:08:43.915Z","avatar_url":"https://github.com/mit-han-lab.png","language":"Python","funding_links":[],"categories":["其他_NLP自然语言处理","Summary","Python"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models \n[[paper](https://arxiv.org/abs/2211.10438)] [[slides](assets/SmoothQuant.pdf)][[video](https://youtu.be/U0yvqjdMfr0)]\n\n![intuition](figures/intuition.png)\n\n## News\n\n- [2024/05] SmoothQuant enables INT8 model inference on [AMD Instinct MI300X using Composable Kernel](https://rocm.blogs.amd.com/software-tools-optimization/ck-int8-gemm-sq/README.html).\n- [2024/03] We show SmoothQuant can enable W8A8 quantization for Llama-1/2/3, Falcon, Mistral, and Mixtral models with negligible loss. 
[Results](https://github.com/mit-han-lab/smoothquant?tab=readme-ov-file#perplexity-results-on-llama-123-falcon-mistral-and-mixtral-with-w8a8-quantization).\n- [2024/01] SmoothQuant is integrated into Microsoft's [ONNX Runtime](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/smooth_quant).\n- [2023/11] SmoothQuant is integrated into [Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers).\n- [2023/10] SmoothQuant is integrated into NVIDIA [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/).\n- [2023/03] SmoothQuant is integrated into Intel [Neural-Compressor](https://github.com/intel/neural-compressor).\n\n## Abstract\n\nLarge language models (LLMs) show excellent performance but are compute- and memory-intensive.\nQuantization can reduce memory and accelerate inference.\nHowever, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware.\nWe propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.\nBased on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation.\nSmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT-175B, BLOOM-176B, GLM-130B, and MT-NLG 530B. 
SmoothQuant\nhas better hardware efficiency than existing techniques.\nWe demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy.\nWe integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework,\nand achieve faster inference speed with half the number of GPUs compared to FP16, enabling the serving of a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.\n\n## Installation\n\n```bash\nconda create -n smoothquant python=3.8\nconda activate smoothquant\npip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113\npip install transformers==4.36.0 accelerate datasets zstandard\n\npython setup.py install\n```\n\n## Usage\n\n### SmoothQuant INT8 Inference for PyTorch\n\nWe implement SmoothQuant INT8 inference for PyTorch with [CUTLASS](https://github.com/NVIDIA/cutlass) INT8 GEMM kernels, which are wrapped as PyTorch modules in [torch-int](https://github.com/Guangxuan-Xiao/torch-int). Please install [torch-int](https://github.com/Guangxuan-Xiao/torch-int) before running the SmoothQuant PyTorch INT8 inference.\n\nWe implement the quantized OPT model class in [smoothquant/opt.py](smoothquant/opt.py), which uses INT8 linear layers and bundles quantization scales. We provide the already smoothed and quantized OPT model at [https://huggingface.co/mit-han-lab/opt-[MODEL-SIZE]-smoothquant](https://huggingface.co/mit-han-lab/opt-[MODEL-SIZE]-smoothquant), where `[MODEL-SIZE]` can be `125m`, `1.3B`, `2.7B`, `6.7B`, `13B`, `30b`, and `66b`. 
You can load the INT8 model with the following code:\n\n```python\nfrom smoothquant.opt import Int8OPTForCausalLM\nmodel = Int8OPTForCausalLM.from_pretrained(\"mit-han-lab/opt-30b-smoothquant\")\n```\n\nYou can also check [generate_act_scales.py](examples/generate_act_scales.py) and [export_int8_model.py](examples/export_int8_model.py) to see how we smooth, quantize, and export INT8 models.\n\nIn [examples/smoothquant_opt_real_int8_demo.ipynb](examples/smoothquant_opt_real_int8_demo.ipynb), we use the OPT-30B model to demonstrate the latency and memory advantages of SmoothQuant. We demonstrate on OPT-30B because it is the largest model for which we can run both FP16 and INT8 inference on a single A100 GPU. For larger models requiring multiple GPUs, we recommend using the [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) implementation of SmoothQuant.\n\n### Activation Channel Scales and Calibration\n\nWe provide the activation channel scales for Llama, Mistral, Mixtral, Falcon, OPT, and BLOOM models in [act_scales/](act_scales/). We obtained those scales from 512 random sentences in the Pile validation set. You can use the OPT demo ([examples/smoothquant_opt_demo.ipynb](examples/smoothquant_opt_demo.ipynb)) and Llama demo ([examples/smoothquant_llama_demo.ipynb](examples/smoothquant_llama_demo.ipynb)) to test smoothing and quantizing those models.\n\nWe also provide the script to get the activation channel scales for your models. Please refer to [examples/generate_act_scales.py](examples/generate_act_scales.py). 
You can use the following command to get the scales for your models:\n\n```bash\npython examples/generate_act_scales.py \\\n    --model-name \u003cmodel_name_or_path\u003e \\\n    --output-path \u003coutput_act_scales_file_path\u003e \\\n    --num-samples \u003cnum_samples\u003e \\\n    --seq-len \u003csequence_length\u003e \\\n    --dataset-path \u003cpath_to_the_calibration_dataset\u003e\n```\n\n### Demo on OPT-13B with W8A8 Fake Quantization\n\nIn [examples/smoothquant_opt_demo.ipynb](examples/smoothquant_opt_demo.ipynb), we use OPT-13B as an example to demonstrate that SmoothQuant can match FP16 accuracy with INT8 inference, while the naive baseline cannot. We simulate INT8 inference with FP16 ([smoothquant/fake_quant.py](smoothquant/fake_quant.py)), i.e., fake quantization.\n\n### Perplexity Results on Llama-1/2/3, Falcon, Mistral, and Mixtral with W8A8 Quantization\n\nWe provide a script to evaluate the language modeling perplexity of OPT, BLOOM, Llama, Falcon, Mistral, and Mixtral models with W8A8 simulated quantization. Please refer to [smoothquant/ppl_eval.py](smoothquant/ppl_eval.py). 
You can use the following command to evaluate the models:\n\n```bash\npython smoothquant/ppl_eval.py \\\n    --model_path \u003cmodel_name_or_path\u003e \\\n    --act_scales_path \u003cact_scales_file_path\u003e \\\n    --smooth \\\n    --alpha \u003calpha\u003e \\\n    --quantize\n```\n\nResults:\n\n| Model        | Method                              | PPL   | Alpha |\n| ------------ | ----------------------------------- | ----- | ----- |\n| Llama-2-7B   | FP16                                | 5.474 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L1)  | 5.515 | 0.85  |\n| Llama-2-13B  | FP16                                | 4.950 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L9)  | 4.929 | 0.85  |\n| Llama-2-70B  | FP16                                | 3.320 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L17) | 3.359 | 0.9   |\n| Llama-3-8B   | FP16                                | 6.138 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L58) | 6.258 | 0.85  |\n| Llama-3-70B  | FP16                                | 2.857 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L66) | 2.982 | 0.85  |\n| Mistral-7B   | FP16                                | 5.253 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L25) | 5.277 | 0.8   |\n| Mixtral-8x7B | FP16                                | 3.842 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L33) | 3.893 | 0.8   |\n| Falcon-7B    | FP16                                | 6.590 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L41) | 6.629 | 0.6   |\n| Falcon-40B   | FP16                                | 5.228 |       |\n|              | [SQ W8A8](examples/ppl_eval.sh#L49) | 5.255 | 0.7   |\n\nFor measured speedup, we recommend using the NVIDIA [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md#int8-smoothquant-w8a8) implementation of SmoothQuant.\n\n## Results\n\n- SmoothQuant migrates **part of** the 
quantization difficulties from activations to weights, which smooths out the systematic outliers in activations, making both weights and activations **easy to quantize**.\n\n![migrate](figures/migrate.jpg)\n\n- SmoothQuant can achieve W8A8 quantization of LLMs (e.g., OPT-175B) **without degrading performance**.\n\n![accuracy](figures/accuracy.png)\n\n- SmoothQuant can achieve **faster inference** compared to FP16 when integrated into PyTorch, whereas the previous work LLM.int8() does not accelerate inference (it is usually slower).\n\n![torch_latency_mem](figures/torch_latency_mem.png)\n\n- We also integrate SmoothQuant into the state-of-the-art serving framework [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), achieving **faster** inference speed using only **half the number of GPUs** compared to FP16 (1 instead of 2 for OPT-66B, 4 instead of 8 for OPT-175B).\n\n![ft_latency_mem](figures/ft_latency_mem.png)\n\n## Citation\n\nIf you find SmoothQuant useful or relevant to your research, please kindly cite our paper:\n\n```bibtex\n@InProceedings{xiao2023smoothquant,\n    title = {{S}mooth{Q}uant: Accurate and Efficient Post-Training Quantization for Large Language Models},\n    author = {Xiao, Guangxuan and Lin, Ji and Seznec, Mickael and Wu, Hao and Demouth, Julien and Han, Song},\n    booktitle = {Proceedings of the 40th International Conference on Machine Learning},\n    year = {2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-han-lab%2Fsmoothquant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmit-han-lab%2Fsmoothquant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-han-lab%2Fsmoothquant/lists"}