https://github.com/vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
- Host: GitHub
- URL: https://github.com/vllm-project/llm-compressor
- Owner: vllm-project
- License: apache-2.0
- Created: 2024-06-20T20:13:34.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-13T22:54:40.000Z (8 months ago)
- Last Synced: 2025-05-14T00:35:35.735Z (8 months ago)
- Topics: compression, quantization, sparsity
- Language: Python
- Homepage:
- Size: 22.6 MB
- Stars: 1,340
- Watchers: 20
- Forks: 127
- Open Issues: 84
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - vllm-project/llm-compressor - vllm-project/llm-compressor is a library compatible with the Hugging Face Transformers framework, focused on optimizing the deployment efficiency of large language models (LLMs) through a variety of compression algorithms, with specific adaptation for the high-performance vLLM inference framework. Its core goal is to substantially reduce model memory footprint and inference latency via compression techniques such as quantization, pruning, and knowledge distillation, without significantly degrading model quality, making models better suited to resource-constrained devices and production environments. It works by starting from the original model weights, extracting the essential information algorithmically, and reconstructing a lightweight version that preserves the model's core capabilities. For example, quantization converts parameters from floating point to low-precision values, pruning removes redundant connections, and knowledge distillation improves a small model through a teacher-student setup. The library offers deep integration with vLLM, so compressed models can be deployed quickly into inference services and, combined with vLLM's batching and cache optimizations, achieve higher throughput. The project emphasizes compatibility with mainstream compression algorithms: users can apply compression through configuration alone, without modifying the original model code, and can extend it with custom compression strategies. Detailed documentation and examples help developers quickly evaluate different compression schemes. In this way, llm-compressor gives researchers and engineers an efficient, flexible tool for balancing model quality against deployment cost, especially in scenarios that require deploying large models with limited resources. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
- awesome-repositories - vllm-project/llm-compressor - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM (Python)
README
# LLM Compressor
`llmcompressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including:
* Comprehensive set of quantization algorithms for weight-only and activation quantization
* Seamless integration with Hugging Face models and repositories
* `safetensors`-based file format compatible with `vllm`
* Large model support via `accelerate`
**✨ Read the announcement blog [here](https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/)! ✨**
### Supported Formats
* Activation Quantization: W8A8 (int8 and fp8)
* Mixed Precision: W4A16, W8A16
* 2:4 Semi-structured and Unstructured Sparsity
### Supported Algorithms
* Simple PTQ
* GPTQ
* SmoothQuant
* SparseGPT
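As a rough illustration, the 2:4 sparsity listed above can be applied with SparseGPT through the same `oneshot` flow used for quantization in the Quick Tour below. The import path and parameter names here are assumptions based on the bundled sparsity examples, so treat this as a sketch rather than a verbatim recipe:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier  # assumed import path

# Prune every Linear layer to 50% sparsity with a 2:4 mask
# (at most two non-zero weights per group of four), keeping lm_head dense.
recipe = SparseGPTModifier(
    sparsity=0.5,
    mask_structure="2:4",
    targets="Linear",
    ignore=["lm_head"],
)

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-2of4-sparse",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```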
### When to Use Which Optimization
Please refer to [docs/schemes.md](./docs/schemes.md) for detailed information about available optimization schemes and their use cases.
## Installation
```bash
pip install llmcompressor
```
## Get Started
### End-to-End Examples
Applying quantization with `llmcompressor`:
* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
* [Weight only quantization to `int4`](examples/quantization_w4a16/README.md)
* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
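As a taste of the fp8 path, the snippet below sketches a calibration-free flow. It assumes the `QuantizationModifier` with an `FP8_DYNAMIC` scheme (weights quantized ahead of time, activations quantized dynamically at runtime); see the linked fp8 example for the authoritative version:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic-activation quantization needs no calibration dataset:
# weights are converted to fp8 offline, activations are scaled at runtime.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)
```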
### User Guides
Deep dives into advanced usage of `llmcompressor`:
* [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)
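In short, the `accelerate` path loads the model across whatever devices are available before handing it to `llmcompressor`. The sketch below assumes that `oneshot` accepts an already-loaded Hugging Face model object and that `device_map="auto"` sharding behaves as described in the linked guide; the 70B checkpoint name is only a placeholder:

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Shard a model that does not fit on a single GPU across all visible
# devices (spilling to CPU if needed), then quantize it with oneshot.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder large checkpoint
    device_map="auto",
    torch_dtype="auto",
)

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
    output_dir="Meta-Llama-3-70B-Instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```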
## Quick Tour
Let's quantize `TinyLlama` with 8-bit weights and activations using the `GPTQ` and `SmoothQuant` algorithms.
Note that the model can be swapped for a local or remote HF-compatible checkpoint and the `recipe` may be changed to target different quantization algorithms or formats.
### Apply Quantization
Quantization is applied by selecting an algorithm and calling the `oneshot` API.
```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot
# Select quantization algorithm. In this case, we:
# * apply SmoothQuant to make the activations easier to quantize
# * quantize the weights to int8 with GPTQ (static per channel)
# * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply quantization using the built-in open_platypus dataset.
# * See the examples (and the sketch after this block) for how to pass a custom calibration set
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
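The calibration data does not have to be one of the built-in datasets. As a sketch (assuming `oneshot` also accepts a pre-tokenized `datasets.Dataset`, and using `ultrachat_200k` purely as an illustrative source), a custom calibration set can be prepared like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Pull 512 chat samples, render them with the model's chat template,
# and tokenize so each row carries input_ids / attention_mask.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(
    lambda sample: tokenizer(
        tokenizer.apply_chat_template(sample["messages"], tokenize=False),
        max_length=2048,
        truncation=True,
    ),
    remove_columns=ds.column_names,
)

# Reuse the SmoothQuant + GPTQ recipe defined above, swapping the
# named dataset for our own tokenized one.
oneshot(
    model=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8-custom-calib",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```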
### Inference with vLLM
The checkpoints created by `llmcompressor` can be loaded and run in `vllm`:
Install:
```bash
pip install vllm
```
Run:
```python
from vllm import LLM
model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
output = model.generate("My name is")
```
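For more control over decoding, `generate` also accepts vLLM's `SamplingParams`; a small extension of the snippet above:

```python
from vllm import LLM, SamplingParams

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() returns one RequestOutput per prompt; the decoded text
# lives on the first completion of each.
for request_output in model.generate(["My name is"], sampling):
    print(request_output.outputs[0].text)
```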
## Questions / Contribution
- If you have any questions or requests, open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation.
- We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md).