https://github.com/onnx/neural-compressor
Model compression for ONNX
- Host: GitHub
- URL: https://github.com/onnx/neural-compressor
- Owner: onnx
- License: apache-2.0
- Created: 2024-04-25T17:01:59.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-12T13:20:49.000Z (over 1 year ago)
- Last Synced: 2024-10-30T02:03:54.769Z (over 1 year ago)
- Topics: deep-learning, model-compression, model-pruning, onnx, onnxruntime, quantization
- Language: Python
- Size: 2.34 MB
- Stars: 72
- Watchers: 5
- Forks: 9
- Open Issues: 10
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: docs/CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
Neural Compressor
===========================
An open-source Python library supporting popular model compression techniques for ONNX
[GitHub](https://github.com/onnx/neural-compressor) | [Releases](https://github.com/onnx/neural-compressor/releases) | [License](https://github.com/onnx/neural-compressor/blob/master/LICENSE)
---
Neural Compressor provides popular model compression techniques inherited from [Intel Neural Compressor](https://github.com/intel/neural-compressor), but focused on ONNX model quantization, such as SmoothQuant and weight-only quantization, through [ONNX Runtime](https://onnxruntime.ai/). In particular, the tool offers the following key features, typical examples, and open collaborations:
* Support for a wide range of Intel hardware, such as [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html) and AI PCs
* Validation of popular LLMs such as [Llama2](./examples/nlp/huggingface_model/text_generation/), [Llama3](./examples/nlp/huggingface_model/text_generation/), and [Qwen2](./examples/nlp/huggingface_model/text_generation/), as well as broad models such as [BERT-base](./examples/nlp/bert/quantization) and [ResNet50](./examples/image_recognition/resnet50/quantization/ptq_static) from popular model hubs such as [Hugging Face](https://huggingface.co/) and the [ONNX Model Zoo](https://github.com/onnx/models#models), by leveraging automatic [accuracy-driven](./docs/design.md#workflow) quantization strategies (see the sketch after this list)
* Collaboration with software platforms such as [Microsoft Olive](https://github.com/microsoft/Olive) and the open AI ecosystem, including [Hugging Face](https://huggingface.co/blog/intel), [ONNX](https://github.com/onnx/models#models), and [ONNX Runtime](https://github.com/microsoft/onnxruntime)
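As a mental model for the accuracy-driven strategy referenced above: the tuner searches candidate quantization configurations and accepts a quantized model only if its accuracy stays within a tolerated drop from the FP32 baseline. A minimal sketch of that loop, where `quantize_fn` and `evaluate_fn` are hypothetical user-supplied callables standing in for the library's tuning machinery:

```python
# Conceptual sketch of accuracy-driven quantization; quantize_fn and
# evaluate_fn are hypothetical stand-ins, not part of the library's API.
def accuracy_driven_search(fp32_model, candidate_configs, quantize_fn, evaluate_fn, max_drop=0.01):
    baseline = evaluate_fn(fp32_model)
    for cfg in candidate_configs:
        q_model = quantize_fn(fp32_model, cfg)
        if baseline - evaluate_fn(q_model) <= max_drop:
            return q_model  # first candidate meeting the accuracy target
    return None  # no candidate stayed within max_drop
```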
## Installation
### Install from source
```Shell
git clone https://github.com/onnx/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
pip install .
```
> **Note**:
> Further installation methods can be found in the [Installation Guide](./docs/installation_guide.md).
## Getting Started
Setting up the environment:
```bash
pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx
```
After successfully installing these packages, try your first quantization program.
> **Note**: Until the formal PyPI release, please install from source.
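A quick import check confirms the environment resolved correctly; `onnxruntime` should report 1.17.0 or newer:

```python
import onnx
import onnxruntime
import onnx_neural_compressor  # raises ImportError if the install failed

print("onnx", onnx.__version__, "| onnxruntime", onnxruntime.__version__)
```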
### Weight-Only Quantization (LLMs)
The following example demonstrates weight-only quantization on LLMs; when multiple devices are available, the most efficient one is selected automatically.
Run the example:
```python
from onnx_neural_compressor.quantization import matmul_nbits_quantizer
# `model` is the FP32 ONNX model to quantize (e.g. an onnx.ModelProto loaded beforehand)
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig()
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model,
    n_bits=4,  # quantize weights to 4 bits
    block_size=32,  # one scale per block of 32 weights
    is_symmetric=True,  # symmetric range around zero
    algo_config=algo_config,
)
quant.process()
best_model = quant.model
```
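As a rough mental model of the parameters above: RTN (round-to-nearest) assigns one scale per block of 32 weights and rounds each weight to the nearest of the 2^4 symmetric levels. A conceptual numpy sketch of that per-block math (an illustration only, not the library's internal implementation):

```python
import numpy as np

def rtn_block(w, n_bits=4):
    # Symmetric round-to-nearest: one scale per block, zero-point fixed at 0.
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit
    scale = np.abs(w).max() / qmax
    if scale == 0:  # guard against an all-zero block
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(w.dtype)  # values the quantized MatMul effectively sees

def rtn(weight, block_size=32):
    out = np.empty_like(weight)
    for i in range(0, weight.size, block_size):
        out[i : i + block_size] = rtn_block(weight[i : i + block_size])
    return out

w = np.random.randn(256).astype(np.float32)
print("max abs error:", np.abs(w - rtn(w)).max())
```

Smaller blocks track local weight statistics more closely at the cost of storing more scales.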
### Static Quantization
```python
from onnx_neural_compressor.quantization import quantize, config
from onnx_neural_compressor import data_reader
class DataReader(data_reader.CalibrationDataReader):
    def __init__(self):
        self.encoded_list = []
        # Append calibration samples (dicts of input name -> numpy array) here.
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        return next(self.iter_next, None)

    def rewind(self):
        self.iter_next = iter(self.encoded_list)


# `model` and `output_model_path` are the input ONNX model and the output path.
data_reader = DataReader()
qconfig = config.StaticQuantConfig(calibration_data_reader=data_reader)
quantize(model, output_model_path, qconfig)
```
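In practice, `encoded_list` must hold calibration samples as dicts mapping input names to numpy arrays. One way to build such a reader is sketched below, using random float32 tensors purely to smoke-test the pipeline; real calibration should use representative data (and match each input's actual dtype):

```python
import numpy as np
import onnxruntime
from onnx_neural_compressor import data_reader

class RandomCalibrationDataReader(data_reader.CalibrationDataReader):
    def __init__(self, model_path, num_samples=8):
        # Read input names/shapes from the model; dynamic dims fall back to 1.
        sess = onnxruntime.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.encoded_list = []
        for _ in range(num_samples):
            feed = {
                inp.name: np.random.rand(
                    *[d if isinstance(d, int) else 1 for d in inp.shape]
                ).astype(np.float32)
                for inp in sess.get_inputs()
            }
            self.encoded_list.append(feed)
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        return next(self.iter_next, None)

    def rewind(self):
        self.iter_next = iter(self.encoded_list)
```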
## Documentation
* Overview: Architecture, Workflow, Examples
* Features: Quantization, SmoothQuant, Weight-Only Quantization (INT8/INT4), Layer-Wise Quantization
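Among the features above, SmoothQuant deserves a note: it rescales each input channel to migrate activation outliers into the weights before quantization, leaving the layer's output mathematically unchanged. A conceptual numpy sketch of the per-channel smoothing factor from the SmoothQuant formulation (not the library's code):

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    # x: activations (tokens, in_features); w: weights (in_features, out_features).
    # Per input channel j: s_j = max|x_j|**alpha / max|w_j|**(1 - alpha).
    s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)
    # (x / s) @ (s[:, None] * w) == x @ w, but activation outliers are flattened.
    return x / s, w * s[:, None]

x = np.random.randn(16, 64) * np.random.lognormal(sigma=2, size=64)  # outlier-heavy channels
w = np.random.randn(64, 32)
x_s, w_s = smooth(x, w)
print(np.allclose(x @ w, x_s @ w_s))  # the matrix product is unchanged
```

Larger `alpha` shifts more of the quantization difficulty from activations onto the weights.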
## Additional Content
* [Contribution Guidelines](./docs/CONTRIBUTING.md)
* [Security Policy](SECURITY.md)
## Communication
- [GitHub Issues](https://github.com/onnx/neural-compressor/issues): bug reports, feature requests, and questions.
- [Email](mailto:inc.maintainers@intel.com): contact the maintainers to propose research ideas or collaborations on model compression techniques.