https://github.com/intel/auto-round
Advanced Quantization Algorithm for LLMs/VLMs.
- Host: GitHub
- URL: https://github.com/intel/auto-round
- Owner: intel
- License: apache-2.0
- Created: 2024-01-04T02:41:51.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-17T09:31:45.000Z (9 months ago)
- Last Synced: 2025-04-17T20:02:03.034Z (9 months ago)
- Topics: awq, gptq, int4, neural-compressor, quantization, rounding
- Language: Python
- Homepage:
- Size: 10.4 MB
- Stars: 431
- Watchers: 12
- Forks: 33
- Open Issues: 18
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
- StarryDivineSky - intel/auto-round
README
Advanced Quantization Algorithm for LLMs
---
## What is AutoRound?
AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).
It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging **sign-gradient descent** and providing broad hardware compatibility.
See our [paper](https://arxiv.org/pdf/2309.05516) for more details. For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).
## What's New
* [2025/12] The **SignRoundV2** paper is available. Turn on `enable_alg_ext` and use the **AutoScheme** API for mixed-precision quantization to reproduce the results: [*Paper*](http://arxiv.org/abs/2512.04746), [*Notes for evaluating LLaMA models*](./docs/alg_202508.md).
* [2025/11] AutoRound has landed in **LLM-Compressor**: [*Usage*](https://github.com/vllm-project/llm-compressor/tree/main/examples/autoround/README.md), [*vLLM blog*](https://blog.vllm.ai/2025/12/09/intel-autoround-llmc.html), [*RedHat blog*](https://developers.redhat.com/articles/2025/12/09/advancing-low-bit-quantization-llms-autoround-x-llm-compressor), [*X post*](https://x.com/vllm_project/status/1998710451312771532), [*Intel blog*](https://community.intel.com/t5/Blogs/Products-and-Solutions/HPC/Advancing-Low-Bit-Quantization-for-LLMs-AutoRound-x-LLM/post/1729336), [*LinkedIn*](https://www.linkedin.com/posts/vllm-project_advancing-lowbit-quantization-for-llms-activity-7404478053768441856-ru8f/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAapNW8BLnAdCAr57GOwSCJXjf76ZvOEOAg), [*WeChat*](https://mp.weixin.qq.com/s/l5WA-1_4ipffQN6GOH2Iqg), [*Zhihu*](https://zhuanlan.zhihu.com/p/1982167638315664412).
* [2025/11] An **enhanced GGUF** quantization algorithm is available via `--enable_alg_ext`: [*Accuracy*](./docs/gguf_alg_ext_acc.md).
* [2025/10] AutoRound has been integrated into **SGLang**: [*Usage*](https://docs.sglang.io/advanced_features/quantization.html#using-auto-round), [*LMSYS Blog*](https://lmsys.org/blog/2025-11-13-AutoRound/), [*X post*](https://x.com/lmsysorg/status/1991977019220148650?s=20), [*Intel blog*](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/AutoRound-Meets-SGLang-Enabling-Quantized-Model-Inference-with/post/1727196), [*Linkedin*](https://www.linkedin.com/feed/update/urn:li:activity:7397742859354857472).
* [2025/10] A **mixed-precision** algorithm is available to generate schemes in minutes: [*Usage*](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme), [*Accuracy*](./docs/auto_scheme_acc.md).
* [2025/09] **MXFP4** and **NVFP4** dtypes are available: [*Accuracy*](./docs/mxnv_acc.md).
* [2025/08] An **improved INT2** algorithm is available via `--enable_alg_ext`: [*Accuracy*](./docs/alg_202508.md).
* [2025/07] **GGUF** format is supported: [*Usage*](./docs/step_by_step.md#gguf-format).
* [2025/05] AutoRound has been integrated into **vLLM**: [*Usage*](https://docs.vllm.ai/en/latest/features/quantization/auto_round/), [*Medium blog*](https://medium.com/@NeuralCompressor/accelerating-vllm-and-sglang-deployment-using-autoround-45fdc0b2683e), [*Xiaohongshu*](https://www.xiaohongshu.com/explore/69396bc6000000000d03e473?note_flow_source=wechat&xsec_token=CB6G3F_yM99q8XfusvyRlJqm8Db4Es2k0kYIHdIUiSQ9g=).
* [2025/05] AutoRound has been integrated into **Transformers**: [*Blog*](https://huggingface.co/blog/autoround).
* [2025/03] The INT2-mixed **DeepSeek-R1** model (~200GB) retains 97.9% accuracy: [*Model*](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).
## ✨ Key Features
✅ **Superior Accuracy**
Delivers strong performance even at 2–3 bits ([example models](https://huggingface.co/collections/OPEA/2-3-bits-67a5f0bc6b49d73c01b4753b)), with leading results at 4 bits ([benchmark](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)).
✅ **Ecosystem Integration**
Seamlessly works with **Transformers, vLLM, SGLang**, and more.
✅ **Multiple Formats Export**
Supports **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility. Details are shown in [export formats](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#supported-export-formats).
✅ **Fast Mixed Bits/Dtypes Scheme Generation**
Automatically generates a scheme in minutes, with about 1.1X–1.5X of the model's BF16 RAM size as overhead. Accuracy [results](./docs/auto_scheme_acc.md) and [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).
✅ **Optimized Round-to-Nearest Mode**
Use `--iters 0` for fast quantization with some accuracy drop at 4 bits. Details are shown in [opt_rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#opt-rtn-mode).
✅ **Affordable Quantization Cost**
Quantize 7B models in about 10 minutes on a single GPU. Details are shown in [quantization costs](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs).
✅ **10+ VLMs Support**
Out-of-the-box quantization for 10+ vision-language models ([example models](https://huggingface.co/collections/OPEA/vlms-autoround-675bc712fdd6a55ebaf11bfa), [support matrix](https://github.com/intel/auto-round/tree/main/auto_round/mllm#support-matrix)).
✅ **Multiple Recipes**
Choose from `auto-round-best`, `auto-round`, and `auto-round-light` to suit your needs. Details are shown in [quantization recipes](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#recipe-recommendation).
✅ **Advanced Utilities**
Includes [multi-GPU quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#devicemulti-gpu-setting-in-quantization), [multiple calibration datasets](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#default-dataset), and support for [10+ runtime backends](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#specify-inference-backend).
✅ **Beyond Weight-Only Quantization**
We are actively expanding support for additional data types such as **MXFP**, NVFP, W8A8, and more.
## Installation
### Install from PyPI
```bash
# CPU/Intel GPU/CUDA
pip install auto-round
# HPU
pip install auto-round-lib
```
### Build from Source
```bash
# CPU/Intel GPU/CUDA
pip install .
# HPU
python setup.py install lib
```
## Model Quantization (CPU/Intel GPU/Gaudi/CUDA)
### CLI Usage
The full list of supported arguments is available by running `auto-round -h` in the terminal.
> **ModelScope is supported for model downloads; simply set `AR_USE_MODELSCOPE=1`.**
```bash
auto-round \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--format "auto_round" \
--output_dir ./tmp_autoround
```
We offer two additional recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively. Details are as follows.
#### Other Recipes
```bash
# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--low_gpu_mem_usage
```
```bash
# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16"
```
In conclusion, we recommend using **auto-round for W4A16 and auto-round-best with `enable_alg_ext` for W2A16**. However, you may adjust the
configuration to suit your specific requirements and available resources.
### API Usage
```python
from auto_round import AutoRound
# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"
# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")
# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)
# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)
# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
```
#### Important Hyperparameters
##### Quantization Scheme & Configuration
- **`scheme` (str|dict|AutoScheme)**: The predefined quantization scheme, e.g. `W4A16`, `MXFP4`, `NVFP4`, `GGUF:Q4_K_M`. For MXFP4/NVFP4, we recommend exporting to the LLM-Compressor format.
- **`bits` (int)**: Number of bits for quantization (default is `None`). If not None, it will override the scheme setting.
- **`group_size` (int)**: Size of the quantization group (default is `None`). If not None, it will override the scheme setting.
- **`sym` (bool)**: Whether to use symmetric quantization (default is `None`). If not None, it will override the scheme setting.
- **`layer_config` (dict)**: Configuration for a layer-wise scheme (default is `None`), mainly for customized mixed schemes.
##### Algorithm Settings
- **`enable_alg_ext` (bool)**: [Experimental Feature] Only for `iters>0`. Enables algorithm variants for specific schemes (e.g., MXFP4/W2A16) that can bring notable improvements (see the sketch after this list). Default is `False`.
- **`disable_opt_rtn` (bool)**: Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is `False` (improved RTN enabled).
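For illustration, here is a minimal sketch of the W2A16 recommendation from the CLI section, assuming `enable_alg_ext` is passed directly to the `AutoRound` constructor as listed above:

```python
from auto_round import AutoRound

# Sketch only: pair the W2A16 scheme with the experimental algorithm extension.
# Expect noticeably longer tuning time than the default W4A16 recipe.
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W2A16", enable_alg_ext=True)
ar.quantize_and_save(output_dir="./qmodel_w2a16", format="auto_round")
```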
##### Tuning Process Parameters
- **`iters` (int)**: Number of tuning iterations (default is `200`). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.
- **`lr` (float)**: The learning rate for the rounding values (default is `None`). When None, it is set to `1.0/iters` automatically.
- **`batch_size` (int)**: Batch size for training (default is `8`). 4 is also commonly used.
- **`enable_deterministic_algorithms` (bool)**: Whether to enable deterministic algorithms for reproducibility (default is `False`).
##### Calibration Dataset
- **`dataset` (str|list|tuple|torch.utils.data.DataLoader)**: The dataset for tuning (default is `"NeelNanda/pile-10k"`). Supports local JSON files and dataset combinations, e.g. `"./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test"`.
- **`nsamples` (int)**: Number of samples for tuning (default is `128`).
- **`seqlen` (int)**: Data length of the sequence for tuning (default is `2048`).
##### Device/Speed Configuration
- **`enable_torch_compile` (bool)**: We typically recommend setting it to `True` for faster quantization and lower resource usage, provided no exception is raised.
- **`low_gpu_mem_usage` (bool)**: Whether to offload intermediate features to the CPU at the cost of ~20% more tuning time (default is `False`).
- **`low_cpu_mem_usage` (bool)**: [Experimental Feature] Whether to enable immediate saving to reduce RAM usage (default is `False`).
- **`device_map` (str|dict|int)**: The device to be used for tuning, e.g., `auto`, `cpu`, `cuda`, `0,1,2` (default is `0`). When using `auto`, it will try to use all available GPUs.
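The sketch below wires the parameters above into a single call; the specific values are illustrative rather than a tuned recipe, and only arguments documented in this section are used:

```python
from auto_round import AutoRound

ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W4A16",
    iters=50,                      # fewer tuning iterations for speed
    lr=5e-3,                       # recommended pairing with iters=50
    batch_size=4,
    dataset="NeelNanda/pile-10k",  # default calibration dataset
    nsamples=128,
    seqlen=2048,
    enable_torch_compile=True,     # usually faster with lower resource usage
    low_gpu_mem_usage=False,
    device_map="auto",             # try to use all available GPUs
)
ar.quantize_and_save(output_dir="./qmodel_w4a16")
```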
### Adaptive Schemes (Experimental Feature)
AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes.
Please refer to the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for more details on AutoScheme.
~~~python
from auto_round import AutoRound, AutoScheme
model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}
# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
~~~
#### Important Hyperparameters of AutoScheme
- **`avg_bits` (float)**: Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.
- **`options` (str | list[str] | list[QuantizationScheme])**: Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., `"W4A16,W2A16"`), a list of strings (e.g., `["W4A16", "W2A16"]`), or a list of `QuantizationScheme` objects.
- **`ignore_scale_zp_bits` (bool)**: Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: `False`).
- **`shared_layers` (Iterable[Iterable[str]], optional)**: Only supported in API usage. Defines groups of layers that share quantization settings.
- **`batch_size` (int, optional)**: Only supported in API usage. Can be set to `1` to reduce VRAM usage at the expense of longer tuning time. These API-only options are illustrated in the sketch below.
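As an illustration of the API-only options, the sketch below builds on the example above; the `shared_layers` patterns are hypothetical and must be adapted to the target model's module names:

```python
from auto_round import AutoRound, AutoScheme

scheme = AutoScheme(
    avg_bits=4.5,                 # target average bit-width across quantized layers
    options=["W4A16", "W8A16"],   # candidate schemes (a comma-separated string also works)
    ignore_scale_zp_bits=True,    # exclude scale/zero-point bits from the average
    shared_layers=[               # hypothetical group sharing one quantization setting
        ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
    ],
    batch_size=1,                 # lower VRAM usage at the cost of longer tuning
)
ar = AutoRound(model="Qwen/Qwen3-8B", scheme=scheme, iters=200)  # 200 iters for non-GGUF schemes
ar.quantize_and_save(output_dir="./qmodel_autoscheme")
```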
### API Usage for VLMs
If you encounter issues during quantization, try setting `iters=0` (to enable RTN) and `group_size=32` for better results.
**This feature is experimental and may be subject to changes**.
By default, AutoRound only quantizes the text module of VLMs and uses `NeelNanda/pile-10k` for calibration. To quantize the entire model, enable `quant_nontext_module` by setting it to `True`, though support for this feature is limited. For more information, please refer to the AutoRound [readme](./auto_round/mllm/README.md). A sketch with this option enabled follows the basic example below.
```python
from auto_round import AutoRound
# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
```
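If the defaults cause problems, the sketch below applies the fallback settings suggested above (`iters=0`, `group_size=32`) and enables `quant_nontext_module`; it assumes these options are accepted directly by the `AutoRound` constructor:

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
ar = AutoRound(
    model_name_or_path,
    scheme="W4A16",
    quant_nontext_module=True,  # also quantize non-text modules (limited support)
    iters=0,                    # RTN mode, per the troubleshooting tip above
    group_size=32,              # smaller groups to help recover accuracy
)
ar.quantize_and_save(output_dir="./qmodel_vlm")
```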
## Model Inference
### vLLM (CPU/Intel GPU/CUDA)
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### SGLang (Intel GPU/CUDA)
**Please note that support for MoE models and vision-language models is currently limited.**
```python
import sglang as sgl
llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
"Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
### Transformers (CPU/Intel GPU/Gaudi/CUDA)
AutoRound supports 10+ backends and automatically selects the best available one based on the installed libraries, prompting the user to install additional libraries when a better backend is found.
**Please avoid manually moving the quantized model to a different device** (e.g., `model.to('cpu')`) during inference, as this may cause unexpected exceptions.
Support for the Gaudi device is limited.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
## Publications & Events
[Publication List](./docs/publication_list.md).
## Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
## Support Us
If you find AutoRound helpful, please ⭐ star the repo and share it with your community!