{"id":27545768,"url":"https://github.com/intel/auto-round","last_synced_at":"2026-02-12T08:14:23.050Z","repository":{"id":216769801,"uuid":"738776129","full_name":"intel/auto-round","owner":"intel","description":"Advanced Quantization Algorithm for LLMs/VLMs. ","archived":false,"fork":false,"pushed_at":"2025-04-17T09:31:45.000Z","size":10952,"stargazers_count":431,"open_issues_count":18,"forks_count":33,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-17T20:02:03.034Z","etag":null,"topics":["awq","gptq","int4","neural-compressor","quantization","rounding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/intel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-04T02:41:51.000Z","updated_at":"2025-04-17T17:37:07.000Z","dependencies_parsed_at":"2024-02-23T09:31:01.968Z","dependency_job_id":"6d4aae6a-9664-47ab-999d-de2033167278","html_url":"https://github.com/intel/auto-round","commit_stats":null,"previous_names":["intel/auto-round"],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fauto-round","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fauto-round/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fauto-round/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/intel%2Fauto-round/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/intel","download_url":"https://codeload.github.com/intel/auto-round/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249470090,"owners_count":21277694,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awq","gptq","int4","neural-compressor","quantization","rounding"],"created_at":"2025-04-19T01:02:40.937Z","updated_at":"2026-02-12T08:14:23.043Z","avatar_url":"https://github.com/intel.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/imgs/AutoRound.png\" alt=\"AutoRound Overview\" width=\"20%\"\u003e\n\u003c/p\u003e\n\n\n\u003ch3\u003e Advanced Quantization Algorithm for LLMs\u003c/h3\u003e\n\n[![python](https://img.shields.io/badge/python-3.10%2B-blue)](https://github.com/intel/auto-round)\n[![version](https://img.shields.io/badge/release-0.9.7-green)](https://github.com/intel/auto-round)\n[![license](https://img.shields.io/badge/license-Apache%202-9C27B0)](https://github.com/intel/auto-round/blob/main/LICENSE)\n\u003ca href=\"https://huggingface.co/Intel\"\u003e\n\u003cimg alt=\"Model Checkpoints\" src=\"https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00\"\u003e\n\u003c/a\u003e\n\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;English | [简体中文](README_CN.md)\n\n[User Guide](./docs/step_by_step.md) | [用户指南](./docs/step_by_step_CN.md)\u0026nbsp;\u0026nbsp; \n\n---\n\u003cdiv align=\"left\"\u003e\n\n## 🚀 What is AutoRound?\n\nAutoRound is an advanced quantization toolkit 
designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). \nIt achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging **sign-gradient descent** and providing broad hardware compatibility. \nSee our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](http://arxiv.org/abs/2512.04746) for more details. For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/imgs/autoround_overview.png\" alt=\"AutoRound Overview\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n\n## 🆕 What's New\n\n* [2025/12] The **SignRoundV2** paper is available. Turn on `enable_alg_ext` and use the **AutoScheme** API for mixed-precision quantization to reproduce the results: [*Paper*](http://arxiv.org/abs/2512.04746), [*Notes for evaluating LLaMA models*](./docs/alg_202508.md).\n\n* [2025/11] AutoRound has landed in **LLM-Compressor**: [*Usage*](https://github.com/vllm-project/llm-compressor/tree/main/examples/autoround/README.md), [*vLLM blog*](https://blog.vllm.ai/2025/12/09/intel-autoround-llmc.html), [*RedHat blog*](https://developers.redhat.com/articles/2025/12/09/advancing-low-bit-quantization-llms-autoround-x-llm-compressor), [*X post*](https://x.com/vllm_project/status/1998710451312771532), [*Intel blog*](https://community.intel.com/t5/Blogs/Products-and-Solutions/HPC/Advancing-Low-Bit-Quantization-for-LLMs-AutoRound-x-LLM/post/1729336), [*LinkedIn*](https://www.linkedin.com/posts/vllm-project_advancing-lowbit-quantization-for-llms-activity-7404478053768441856-ru8f/?utm_source=share\u0026utm_medium=member_desktop\u0026rcm=ACoAAAapNW8BLnAdCAr57GOwSCJXjf76ZvOEOAg), [*WeChat*](https://mp.weixin.qq.com/s/l5WA-1_4ipffQN6GOH2Iqg), [*Zhihu*](https://zhuanlan.zhihu.com/p/1982167638315664412).\n\n* [2025/11] An **enhanced GGUF** quantization algorithm is available via `--enable_alg_ext`: [*Accuracy*](./docs/gguf_alg_ext_acc.md).\n\n* [2025/10] AutoRound 
has been integrated into **SGLang**: [*Usage*](https://docs.sglang.io/advanced_features/quantization.html#using-auto-round), [*LMSYS Blog*](https://lmsys.org/blog/2025-11-13-AutoRound/), [*X post*](https://x.com/lmsysorg/status/1991977019220148650?s=20), [*Intel blog*](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/AutoRound-Meets-SGLang-Enabling-Quantized-Model-Inference-with/post/1727196), [*LinkedIn*](https://www.linkedin.com/feed/update/urn:li:activity:7397742859354857472).\n\n* [2025/10] A **mixed precision** algorithm is available to generate schemes in minutes: [*Usage*](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme), [*Accuracy*](./docs/auto_scheme_acc.md).\n\n* [2025/09] **MXFP4** and **NVFP4** dtypes are available: [*Accuracy*](./docs/mxnv_acc.md).\n\n* [2025/08] An **improved INT2** algorithm is available via `--enable_alg_ext`: [*Accuracy*](./docs/alg_202508.md).\n\n* [2025/07] **GGUF** format is supported: [*Usage*](./docs/step_by_step.md#gguf-format). 
\n\n* [2025/05] AutoRound has been integrated into **vLLM**: [*Usage*](https://docs.vllm.ai/en/latest/features/quantization/auto_round/), [*Medium blog*](https://medium.com/@NeuralCompressor/accelerating-vllm-and-sglang-deployment-using-autoround-45fdc0b2683e), [*Xiaohongshu*](https://www.xiaohongshu.com/explore/69396bc6000000000d03e473?note_flow_source=wechat\u0026xsec_token=CB6G3F_yM99q8XfusvyRlJqm8Db4Es2k0kYIHdIUiSQ9g=).\n\n* [2025/05] AutoRound has been integrated into **Transformers**: [*Blog*](https://huggingface.co/blog/autoround).\n\n* [2025/03] The INT2-mixed **DeepSeek-R1** model (~200GB) retains 97.9% accuracy: [*Model*](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).\n\n\n## ✨ Key Features\n\n\n✅ **Superior Accuracy**\nDelivers strong performance even at 2–3 bits ([example models](https://huggingface.co/collections/OPEA/2-3-bits-67a5f0bc6b49d73c01b4753b)), with leading results at 4 bits ([benchmark](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)).\n\n✅ **Ecosystem Integration**\nSeamlessly works with **Transformers, vLLM, SGLang** and more.\n\n✅ **Multiple Export Formats**\nSupports **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility. Details are shown in [export formats](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#supported-export-formats).\n\n✅ **Fast Mixed Bits/Dtypes Scheme Generation**\nAutomatically configures schemes in minutes, with about 1.1X–1.5X the model’s BF16 RAM size as overhead. Accuracy [results](./docs/auto_scheme_acc.md) and [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme).\n\n✅ **Optimized Round-to-Nearest Mode**\nUse `--iters 0` for fast quantization with some accuracy drop at 4 bits. Details are shown in [opt_rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#opt-rtn-mode).\n\n✅ **Affordable Quantization Cost**\nQuantize 7B models in about 10 minutes on a single GPU. 
Details are shown in [quantization costs](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs).\n\n✅ **10+ VLMs Support**\nOut-of-the-box quantization for 10+ vision-language models: [example models](https://huggingface.co/collections/OPEA/vlms-autoround-675bc712fdd6a55ebaf11bfa), [support matrix](https://github.com/intel/auto-round/tree/main/auto_round/mllm#support-matrix).\n\n✅ **Multiple Recipes**\nChoose from `auto-round-best`, `auto-round`, and `auto-round-light` to suit your needs. Details are shown in [quantization recipes](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#recipe-recommendation).\n\n✅ **Advanced Utilities**\nIncludes [multi-GPU quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#devicemulti-gpu-setting-in-quantization), [multiple calibration datasets](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#default-dataset), and support for [10+ runtime backends](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#specify-inference-backend).\n\n✅ **Beyond Weight-Only Quantization**\nWe are actively expanding support for additional datatypes such as **MXFP**, NVFP, W8A8, and more.\n\n\n## Installation\n\n### Install from PyPI\n\n```bash\n# CPU(Xeon)/GPU(CUDA)\npip install auto-round\n\n# HPU(Gaudi)\n# install inside the hpu docker container, e.g. 
vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest\npip install auto-round-hpu\n\n# XPU(Intel GPU)\npip install torch --index-url https://download.pytorch.org/whl/xpu\npip install auto-round\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eBuild from Source\u003c/summary\u003e\n\n  ```bash\n  # CPU(Xeon)/GPU(CUDA)\n  pip install .\n\n  # HPU(Gaudi)\n  python setup.py install hpu\n\n  # XPU(Intel GPU)\n  pip install torch --index-url https://download.pytorch.org/whl/xpu\n  pip install .\n  ```\n\n\u003c/details\u003e\n\n## Model Quantization (CPU/Intel GPU/Gaudi/CUDA)\n\n\u003e If you encounter issues during quantization, try using pure RTN mode (`iters=0`, `disable_opt_rtn=True`). Additionally, using `group_size=32` or mixed bits is recommended for better results.\n\n### CLI Usage\nThe full list of supported arguments is provided by calling `auto-round -h` in the terminal.\n\n\u003e **ModelScope is supported for model downloads; simply set `AR_USE_MODELSCOPE=1`.**\n\n\n```bash\nauto-round \\\n    --model Qwen/Qwen3-0.6B \\\n    --scheme \"W4A16\" \\\n    --format \"auto_round\" \\\n    --output_dir ./tmp_autoround\n```\n\n\nWe offer two additional recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively. 
Details are as follows.\n\u003cdetails\u003e\n  \u003csummary\u003eOther Recipes\u003c/summary\u003e\n\n  ```bash\n# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower\nauto-round-best \\\n    --model Qwen/Qwen3-0.6B \\\n    --scheme \"W4A16\" \\\n    --low_gpu_mem_usage \n  ```\n\n  ```bash\n# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2\nauto-round-light \\\n    --model Qwen/Qwen3-0.6B \\\n    --scheme \"W4A16\" \n\n  ```\n\n  \u003c!-- ```bash\nauto-round-fast \\\n# Fast and low memory, 2-3X speedup, slight accuracy drop at W4G128\n    --model Qwen/Qwen3-0.6B \\\n    --bits 4 \\\n    --group_size 128 \\\n  ``` --\u003e\n\n\u003c/details\u003e\n\nIn conclusion, we recommend using **auto-round for W4A16 and auto-round-best with `enable_alg_ext` for W2A16**. However, you may adjust the\nconfiguration to suit your specific requirements and available resources.\n\n### API Usage\n\n```python\nfrom auto_round import AutoRound\n\n# Load a model (supports FP8/BF16/FP16/FP32)\nmodel_name_or_path = \"Qwen/Qwen3-0.6B\"\n\n# Available schemes: \"W2A16\", \"W3A16\", \"W4A16\", \"W8A16\", \"NVFP4\", \"MXFP4\" (no real kernels), \"GGUF:Q4_K_M\", etc.\nar = AutoRound(model_name_or_path, scheme=\"W4A16\")\n\n# Highest accuracy (4–5× slower).\n# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.\n# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)\n\n# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.\n# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)\n\n# Supported formats: \"auto_round\" (default), \"auto_gptq\", \"auto_awq\", \"llm_compressor\", \"gguf:q4_k_m\", etc.\nar.quantize_and_save(output_dir=\"./qmodel\", format=\"auto_round\")\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eImportant Hyperparameters\u003c/summary\u003e\n\n##### Quantization Scheme \u0026 Configuration\n- **`scheme` (str|dict|AutoScheme)**: The predefined 
quantization scheme name, e.g. `W4A16`, `MXFP4`, `NVFP4`, `GGUF:Q4_K_M`. For MXFP4/NVFP4, we recommend exporting to LLM-Compressor format.\n- **`bits` (int)**: Number of bits for quantization (default is `None`). If not None, it will override the scheme setting.\n- **`group_size` (int)**: Size of the quantization group (default is `None`). If not None, it will override the scheme setting.\n- **`sym` (bool)**: Whether to use symmetric quantization (default is `None`). If not None, it will override the scheme setting.\n- **`layer_config` (dict)**: Configuration for a layer-wise scheme (default is `None`), mainly for customized mixed schemes.\n\n##### Algorithm Settings\n- **`enable_alg_ext` (bool)**: [Experimental Feature] Only for `iters\u003e0`. Enables algorithm variants for specific schemes (e.g., MXFP4/W2A16) that can bring notable improvements. Default is `False`.\n\n- **`disable_opt_rtn` (bool|None)**: Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is `None`. If None, it defaults to `False` in most cases to improve accuracy, but may be set to `True` due to known issues.\n\n##### Tuning Process Parameters\n- **`iters` (int)**: Number of tuning iterations (default is `200`). Common values: 0 (RTN mode), 50 (with `lr=5e-3` recommended), 1000. Higher values increase accuracy but slow down tuning.\n- **`lr` (float)**: The learning rate for the rounding values (default is `None`). When None, it will be set to `1.0/iters` automatically.\n- **`batch_size` (int)**: Batch size for training (default is `8`). 4 is also commonly used.\n- **`enable_deterministic_algorithms` (bool)**: Whether to enable deterministic algorithms for reproducibility (default is `False`).\n\n##### Calibration Dataset\n- **`dataset` (str|list|tuple|torch.utils.data.DataLoader)**: The dataset for tuning (default is `\"NeelNanda/pile-10k\"`). Supports local JSON files and dataset combinations, e.g. 
`\"./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test\"`.\n- **`nsamples` (int)**: Number of samples for tuning (default is `128`).\n- **`seqlen` (int)**: Data length of the sequence for tuning (default is `2048`).\n\n##### Device/Speed Configuration\n- **`enable_torch_compile` (bool)**: If no exception is raised, typically we recommend setting it to True for faster quantization with lower resource.\n- **`low_gpu_mem_usage` (bool)**: Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is `False`).\n- **`low_cpu_mem_usage` (bool)**: [Experimental Feature]Whether to enable saving immediately to reduce ram usage (default is `True`).\n- **`device_map` (str|dict|int)**: The device to be used for tuning, e.g., `auto`, `cpu`, `cuda`, `0,1,2` (default is `0`). When using `auto`, it will try to use all available GPUs.\n\n\u003c/details\u003e\n\n### Supported Schemes\n\u003cdetails\u003e\n\u003e Gray indicates the absence of a kernel or the presence of only an inefficient/reference kernel. 
BF16 is mainly for AutoScheme.\n\n| Format             | Supported Schemes                                                                                                                                                         |\n|:-------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| **auto_round**     | W4A16 (Recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, `MXFP4`, `MXFP8`, `MXFP4_RCEIL`, `MXFP8_RCEIL`, `NVFP4`, `FPW8A16`, `FP8_STATIC`, `BF16`                    |\n| **auto_awq**       | W4A16 (Recommended), BF16                                                                                                                                                 |\n| **auto_gptq**      | W4A16 (Recommended), W2A16, W3A16, W8A16, W2A16G64, W2A16G32, BF16                                                                                                        |\n| **llm_compressor** | NVFP4 (Recommended), `MXFP4`, `MXFP8`, `FPW8A16`, `FP8_STATIC`                                                                                                            |\n| **gguf**           | GGUF:Q4_K_M (Recommended), GGUF:Q2_K_S, GGUF:Q3_K_S, GGUF:Q3_K_M, GGUF:Q3_K_L, GGUF:Q4_K_S, GGUF:Q5_K_S, GGUF:Q5_K_M, GGUF:Q6_K, GGUF:Q4_0, GGUF:Q4_1, GGUF:Q5_0, GGUF:Q5_1, GGUF:Q8_0 |\n| **fake**           | `all schemes (only for research)`                                                                                                                                         |\n\u003c/details\u003e\n\n\n### Adaptive Schemes (Experimental Feature)\nAutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes.\nPlease refer to the [user guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme) for more details on AutoScheme.\n~~~python\nfrom auto_round import AutoRound, 
AutoScheme\n\nmodel_name = \"Qwen/Qwen3-8B\"\navg_bits = 3.0\nscheme = AutoScheme(avg_bits=avg_bits, options=(\"GGUF:Q2_K_S\", \"GGUF:Q4_K_S\"), ignore_scale_zp_bits=True)\nlayer_config = {\"lm_head\": \"GGUF:Q6_K\"}\n\n# Change iters to 200 for non-GGUF schemes\nar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)\nar.quantize_and_save()\n~~~\n\n\u003cdetails\u003e\n\u003csummary\u003eImportant Hyperparameters of AutoScheme\u003c/summary\u003e\n\n\n##### AutoScheme Hyperparameters\n\n- **`avg_bits` (float)**: Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.\n- **`options` (str | list[str] | list[QuantizationScheme])**: Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., `\"W4A16,W2A16\"`), a list of strings (e.g., `[\"W4A16\", \"W2A16\"]`), or a list of `QuantizationScheme` objects.\n- **`ignore_scale_zp_bits` (bool)**: Only supported in API usage. Determines whether to exclude the bits of scales and zero-points from the average bit-width calculation (default: `False`).\n- **`shared_layers` (Iterable[Iterable[str]], optional)**: Only supported in API usage. Defines groups of layers that share quantization settings.\n- **`batch_size` (int, optional)**: Only supported in API usage. Can be set to `1` to reduce VRAM usage at the expense of longer tuning time.\n\n\u003c/details\u003e\n\n### API Usage for VLMs\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n**This feature is experimental and may be subject to change**.\n\nBy default, AutoRound quantizes only the text module of VLMs and uses `NeelNanda/pile-10k` for calibration. To\nquantize the entire model, you can set `quant_nontext_module` to `True`, though support for this feature\nis limited. 
For more information, please refer to the AutoRound [readme](./auto_round/mllm/README.md).\n\n```python\nfrom auto_round import AutoRound\n\n# Load the model\nmodel_name_or_path = \"Qwen/Qwen2.5-VL-7B-Instruct\"\n# Quantize the model\nar = AutoRound(model_name_or_path, scheme=\"W4A16\")\noutput_dir = \"./qmodel\"\nar.quantize_and_save(output_dir)\n```\n\n\u003c/details\u003e\n\n\n\n## Model Inference\n\n### vLLM (CPU/Intel GPU/CUDA)\n```python\nfrom vllm import LLM, SamplingParams\n\nprompts = [\n    \"Hello, my name is\",\n]\nsampling_params = SamplingParams(temperature=0.6, top_p=0.95)\nmodel_name = \"Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound\"\nllm = LLM(model=model_name)\n\noutputs = llm.generate(prompts, sampling_params)\n\nfor output in outputs:\n    prompt = output.prompt\n    generated_text = output.outputs[0].text\n    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")\n```\n\n\n### SGLang (Intel GPU/CUDA)\n**Please note that support for MoE models and vision-language models is currently limited.**\n\n```python\nimport sglang as sgl\n\nllm = sgl.Engine(model_path=\"Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound\")\nprompts = [\n    \"Hello, my name is\",\n]\nsampling_params = {\"temperature\": 0.6, \"top_p\": 0.95}\n\noutputs = llm.generate(prompts, sampling_params)\nfor prompt, output in zip(prompts, outputs):\n    print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")\n```\n\n\n### Transformers (CPU/Intel GPU/Gaudi/CUDA)\n\nAutoRound supports 10+ backends and automatically selects the best available one based on the installed libraries, prompting the user to\ninstall additional libraries when a better backend is found.\n\n**Please avoid manually moving the quantized model to a different device** (e.g., `model.to('cpu')`) during inference, as\nthis may cause unexpected exceptions.\n\nSupport for the Gaudi device is limited.\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = 
\"Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound\"\nmodel = AutoModelForCausalLM.from_pretrained(model_name, device_map=\"auto\", torch_dtype=\"auto\")\ntokenizer = AutoTokenizer.from_pretrained(model_name)\ntext = \"There is a girl who likes adventure,\"\ninputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\nprint(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))\n```\n\n\n## Publications \u0026 Events\n[SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs](https://arxiv.org/abs/2512.04746) (2025.12 paper)\n\n[Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLM](https://aclanthology.org/2024.findings-emnlp.662/) (2023.09 paper)\n\n[TEQ: Trainable Equivalent Transformation for Quantization of LLMs](https://arxiv.org/abs/2310.10944) (2023.10 paper)\n\n[Effective Post-Training Quantization for Large Language Models](https://medium.com/intel-analytics-software/effective-post-training-quantization-for-large-language-models-with-enhanced-smoothquant-approach-93e9d104fb98) (2023.04 blog)\n\nCheck out [Full Publication List](./docs/publication_list.md).\n\n## Acknowledgement\nSpecial thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.\n\n\n## 🌟 Support Us\nIf you find AutoRound helpful, please ⭐ star the repo and share it with your community!\n\n","funding_links":[],"categories":["others","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintel%2Fauto-round","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fintel%2Fauto-round","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintel%2Fauto-round/lists"}