https://github.com/vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
- Host: GitHub
- URL: https://github.com/vllm-project/llm-compressor
- Owner: vllm-project
- License: apache-2.0
- Created: 2024-06-20T20:13:34.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-13T22:54:40.000Z (8 months ago)
- Last Synced: 2025-05-14T00:35:35.735Z (8 months ago)
- Topics: compression, quantization, sparsity
- Language: Python
- Homepage:
- Size: 22.6 MB
- Stars: 1,340
- Watchers: 20
- Forks: 127
- Open Issues: 84
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - vllm-project/llm-compressor - vllm-project/llm-compressor is a library compatible with the Hugging Face Transformers framework, focused on optimizing the deployment efficiency of large language models (LLMs) through a variety of compression algorithms, with specific adaptation for the high-performance vLLM inference framework. Its core goal is to substantially reduce model memory footprint and inference latency via compression techniques such as quantization, pruning, and knowledge distillation, without significantly degrading model quality, making models better suited to resource-constrained devices and production environments. It works by starting from the original model weights, extracting the essential information algorithmically, and reconstructing a lightweight version that preserves the model's core capabilities. For example, quantization converts parameters from floating point to low-precision values, pruning removes redundant connections, and knowledge distillation improves a small model through a teacher-student setup. The library offers deep integration with vLLM, so compressed models can be deployed quickly into inference services and, combined with vLLM's batching and cache optimizations, achieve higher throughput. The project emphasizes compatibility with mainstream compression algorithms: users can apply compression through configuration alone, without modifying the original model code, and can extend it with custom compression strategies. Detailed documentation and examples help developers quickly evaluate different compression schemes. In this way, llm-compressor gives researchers and engineers an efficient, flexible tool for balancing model quality against deployment cost, especially in scenarios that require deploying large models with limited resources. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
- awesome-repositories - vllm-project/llm-compressor - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM (Python)
README
# LLM Compressor
`llmcompressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including:
* Comprehensive set of quantization algorithms for weight-only and activation quantization
* Seamless integration with Hugging Face models and repositories
* `safetensors`-based file format compatible with `vllm`
* Large model support via `accelerate`
**✨ Read the announcement blog [here](https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/)! ✨**
### Supported Formats
* Activation Quantization: W8A8 (int8 and fp8)
* Mixed Precision: W4A16, W8A16
* 2:4 Semi-structured and Unstructured Sparsity
### Supported Algorithms
* Simple PTQ
* GPTQ
* SmoothQuant
* SparseGPT
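As a rough illustration, the 2:4 sparsity listed above can be applied with SparseGPT through the same `oneshot` flow used for quantization in the Quick Tour below. The import path and parameter names here are assumptions based on the bundled sparsity examples, so treat this as a sketch rather than a verbatim recipe:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier  # assumed import path

# Prune every Linear layer to 50% sparsity with a 2:4 mask
# (at most two non-zero weights per group of four), keeping lm_head dense.
recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",
    targets="Linear",
    ignore=["lm_head"],
)

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-2of4-sparse",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```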
### When to Use Which Optimization
Please refer to [docs/schemes.md](./docs/schemes.md) for detailed information about available optimization schemes and their use cases.
## Installation
```bash
pip install llmcompressor
```
## Get Started
### End-to-End Examples
Applying quantization with `llmcompressor`:
* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
* [Weight only quantization to `int4`](examples/quantization_w4a16/README.md)
* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
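As a taste of the fp8 path, the snippet below sketches a calibration-free flow. It assumes the `QuantizationModifier` with an `FP8_DYNAMIC` scheme (weights quantized ahead of time, activations quantized dynamically at runtime); see the linked fp8 example for the authoritative version:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic-activation quantization needs no calibration dataset:
# weights are converted to fp8 offline, activations are scaled at runtime.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)
```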
### User Guides
Deep dives into advanced usage of `llmcompressor`:
* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)
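In short, the `accelerate` path loads the model across whatever devices are available before handing it to `llmcompressor`. The sketch below assumes that `oneshot` accepts an already-loaded Hugging Face model object and that `device_map="auto"` sharding behaves as described in the linked guide; the 70B checkpoint name is only a placeholder:

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Shard a model that does not fit on a single GPU across all visible
# devices (spilling to CPU if needed), then quantize it with oneshot.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder large checkpoint
    device_map="auto",
    torch_dtype="auto",
)

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-70B-Instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```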
## Quick Tour
Let's quantize `TinyLlama` with 8-bit weights and activations using the `GPTQ` and `SmoothQuant` algorithms.
Note that the model can be swapped for a local or remote HF-compatible checkpoint and the `recipe` may be changed to target different quantization algorithms or formats.
### Apply Quantization
Quantization is applied by selecting an algorithm and calling the `oneshot` API.
```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot
# Select quantization algorithm. In this case, we:
# * apply SmoothQuant to make the activations easier to quantize
# * quantize the weights to int8 with GPTQ (static per channel)
# * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply quantization using the built-in open_platypus dataset.
# * See the examples (and the sketch after this block) for how to pass a custom calibration set
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
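The calibration data does not have to be one of the built-in datasets. As a sketch (assuming `oneshot` also accepts a pre-tokenized `datasets.Dataset`, and using `ultrachat_200k` purely as an illustrative source), a custom calibration set can be prepared like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Pull 512 chat samples, render them with the model's chat template,
# and tokenize so each row carries input_ids / attention_mask.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(
    lambda sample: tokenizer(
        tokenizer.apply_chat_template(sample["messages"], tokenize=False),
        max_length=2048,
        truncation=True,
    ),
    remove_columns=ds.column_names,
)

# Reuse the SmoothQuant + GPTQ recipe defined above, swapping the
# named dataset for our own tokenized one.
oneshot(
    model=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8-custom-calib",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```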
### Inference with vLLM
The checkpoints created by `llmcompressor` can be loaded and run in `vllm`:
Install:
```bash
pip install vllm
```
Run:
```python
from vllm import LLM
model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
output = model.generate("My name is")
```
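For more control over decoding, `generate` also accepts vLLM's `SamplingParams`; a small extension of the snippet above:

```python
from vllm import LLM, SamplingParams

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() returns one RequestOutput per prompt; the decoded text
# lives on the first completion of each.
for request_output in model.generate(["My name is"], sampling):
    print(request_output.outputs[0].text)
```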
## Questions / Contribution
- If you have any questions or requests, open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation.
- We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md).