{"id":48845490,"url":"https://github.com/dropbox/hqq","last_synced_at":"2026-04-15T05:02:51.224Z","repository":{"id":209205418,"uuid":"715778092","full_name":"dropbox/hqq","owner":"dropbox","description":"Official implementation of Half-Quadratic Quantization (HQQ)","archived":false,"fork":false,"pushed_at":"2026-02-26T12:50:29.000Z","size":496,"stargazers_count":926,"open_issues_count":2,"forks_count":90,"subscribers_count":15,"default_branch":"master","last_synced_at":"2026-04-13T11:02:06.484Z","etag":null,"topics":["llm","machine-learning","quantization"],"latest_commit_sha":null,"homepage":"https://dropbox.github.io/hqq_blog/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dropbox.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-11-07T20:15:00.000Z","updated_at":"2026-04-09T02:49:06.000Z","dependencies_parsed_at":"2025-11-15T10:00:46.064Z","dependency_job_id":null,"html_url":"https://github.com/dropbox/hqq","commit_stats":{"total_commits":297,"total_committers":15,"mean_commits":19.8,"dds":"0.37710437710437705","last_synced_commit":"28ce0273217e2b4473b5228c40554a076085afa8"},"previous_names":["mobiusml/hqq","dropbox/hqq"],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/dropbox/hqq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhqq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhqq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhqq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhqq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dropbox","download_url":"https://codeload.github.com/dropbox/hqq/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhqq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31826907,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"online","status_checked_at":"2026-04-15T02:00:06.175Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","machine-learning","quantization"],"created_at":"2026-04-15T05:02:45.856Z","updated_at":"2026-04-15T05:02:51.175Z","avatar_url":"https://github.com/dropbox.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## 
## Half-Quadratic Quantization (HQQ)
This repository contains the official implementation of Half-Quadratic Quantization (<b>HQQ</b>) presented in our articles:
* HQQ: https://dropbox.github.io/hqq_blog/
* HQQ+: https://dropbox.github.io/1bit_blog/

### What is HQQ?
<b>HQQ</b> is a fast and accurate model quantizer that skips the need for calibration data. Quantize even the largest models in just a few minutes 🚀.

<details>
  <summary>FAQ</summary>
  <b>Why should I use HQQ instead of other quantization methods?</b><br>
<ul>
<li> HQQ is very fast at quantizing models.</li>
<li> It supports 8, 4, 3, 2 and 1 bits.</li>
<li> You can use it on any model (LLMs, vision models, etc.).</li>
<li> The dequantization step is a linear operation, which means HQQ is compatible with various optimized CUDA/Triton kernels.</li>
<li> HQQ is compatible with PEFT training.</li>
<li> We try to make HQQ fully compatible with `torch.compile` for faster inference and training.</li>
</ul>

  <b>What is the quality of the quantized models?</b><br>
  We have detailed benchmarks on both language and vision models. Please refer to our blog posts: <a href="https://dropbox.github.io/hqq_blog/">HQQ</a>, <a href="https://dropbox.github.io/1bit_blog/">HQQ+</a>.<br>

  <b>What is the speed of the quantized models?</b><br>
  4-bit models with `axis=1` can use optimized fused inference kernels. Moreover, we focus on making hqq fully compatible with `torch.compile`, which speeds up both training and inference. For more details, please refer to the backend section below.<br>

  <b>What quantization settings should I use?</b><br>
  You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend; note that this setting is not supported for fast inference.<br>

  <b>What does the `axis` parameter mean?</b><br>
  The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.<br>

  <b>What is the difference between HQQ and HQQ+?</b><br>
  HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.<br>

</details>

### Installation
First, make sure you have a PyTorch 2 version that matches your CUDA version: https://pytorch.org/

You can install hqq via:
```
# Latest stable version
pip install hqq

# Latest updates - recommended
pip install git+https://github.com/dropbox/hqq.git

# Disable building the CUDA kernels for the ATEN backend
DISABLE_CUDA=1 pip install ...
```

Alternatively, clone the repo and run ```pip install .``` from the repository root.
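As a quick post-install sanity check (a minimal sketch, not part of the repo's examples), you can verify that the package imports and that a CUDA device is visible:

```Python
# Quick sanity check after installation (illustrative only)
import torch
import hqq

print("hqq imported from:", hqq.__name__)
print("CUDA available:", torch.cuda.is_available())
```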

### Basic Usage
To perform quantization with HQQ, you simply need to replace the linear layers (```torch.nn.Linear```) as follows:
```Python
import torch
from hqq.core.quantize import *

#Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

#Replace your linear layer
hqq_layer = HQQLinear(your_linear_layer,           #torch.nn.Linear or None
                      quant_config=quant_config,   #quantization configuration
                      compute_dtype=torch.float16, #compute dtype
                      device='cuda',               #cuda device
                      initialize=True,             #use False to quantize later
                      del_orig=True                #if True, delete the original layer
                      )

W_r = hqq_layer.dequantize()               #dequantize
W_q = hqq_layer.unpack(dtype=torch.uint8)  #unpack
y   = hqq_layer(x)                         #forward pass
```

The quantization parameters are set as follows:

- ```nbits``` (int): supports 8, 4, 3, 2, 1 bits.
- ```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by ```group_size```.
- ```view_as_float``` (bool): if True, the quantized parameter is viewed as a float instead of an int type.

### Usage with Models
#### Transformers 🤗
For usage with Hugging Face transformers, see the example below from the <a href="https://huggingface.co/docs/transformers/main/en/quantization#hqq">documentation</a>:
```Python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config
)
```
You can save/load quantized models as regular transformers models via `save_pretrained` / `from_pretrained`.

#### HQQ Lib
You can also use the HQQ library directly to quantize transformers models:
```Python
import torch

#Load the model on CPU
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)

#Quantize
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
```
You can save/load quantized models as follows:
```Python
from hqq.models.hf.base import AutoHQQHFModel

#Save: make sure to save the model BEFORE any patching
AutoHQQHFModel.save_quantized(model, save_dir)

#Save as safetensors (to be loaded via transformers or vllm)
AutoHQQHFModel.save_to_safetensors(model, save_dir)

#Load
model = AutoHQQHFModel.from_quantized(save_dir)
```

❗ Note that models saved via the hqq lib are not compatible with `.from_pretrained()`.

### Backends
#### Native Backends
The following native dequantization backends can be used by the `HQQLinear` module:
```Python
HQQLinear.set_backend(HQQBackend.PYTORCH)          #PyTorch backend - default
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  #Compiled PyTorch
HQQLinear.set_backend(HQQBackend.ATEN)             #ATEN/CUDA backend - only axis=0 supported
```
❗ Note that ```HQQBackend.ATEN``` only supports `axis=0`.
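As a minimal sketch (a hypothetical toy layer, not taken from the repo's examples), you can combine the `HQQLinear` API above with a native backend to eyeball the weight-reconstruction error of a single layer:

```Python
import torch
from hqq.core.quantize import HQQLinear, HQQBackend, BaseQuantizeConfig

#Use the default PyTorch dequantization backend
HQQLinear.set_backend(HQQBackend.PYTORCH)

#Toy fp16 layer; keep a copy of the original weights before quantization
layer  = torch.nn.Linear(4096, 4096, bias=False).half().cuda()
W_orig = layer.weight.data.clone()

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_layer    = HQQLinear(layer, quant_config=quant_config, compute_dtype=torch.float16, device='cuda')

#Mean absolute error between the original and dequantized weights
W_r = hqq_layer.dequantize()
print('mean |W - W_r|:', (W_orig - W_r).abs().mean().item())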

#### Optimized Inference
We support external backends with fused kernels for faster inference. You can enable one of these backends after the model has been quantized as follows:
```Python
from hqq.utils.patching import prepare_for_inference

#PyTorch backend that makes the model compatible with fullgraph torch.compile: works with any settings
#prepare_for_inference(model)

#Gemlite backend: nbits=4/2/1, compute_dtype=float16, axis=1
prepare_for_inference(model, backend="gemlite")

#Torchao's tiny_gemm backend (fast for batch-size<4): nbits=4, compute_dtype=bfloat16, axis=1
#prepare_for_inference(model, backend="torchao_int4")
```
Note that these backends only work with `axis=1`. Additional restrictions on the group-size values apply depending on the backend. You should expect ~158 tokens/sec with a 4-bit quantized Llama3-8B model on an RTX 4090.

When a quantization config is not supported by the specified inference backend, hqq will fall back to the native backend.

### Custom Quantization Configurations ⚙️
You can set up different quantization configurations for different layers by specifying the settings for each layer name (a usage sketch follows the two examples below):
#### Transformers 🤗
```Python
# Each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits':4, 'group_size':64}
q3_config = {'nbits':3, 'group_size':32}

quant_config = HqqConfig(dynamic_config={
  'self_attn.q_proj':q4_config,
  'self_attn.k_proj':q4_config,
  'self_attn.v_proj':q4_config,
  'self_attn.o_proj':q4_config,

  'mlp.gate_proj':q3_config,
  'mlp.up_proj'  :q3_config,
  'mlp.down_proj':q3_config,
})
```
#### HQQ lib
```Python
from hqq.core.quantize import *
q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
q3_config = BaseQuantizeConfig(nbits=3, group_size=32)

quant_config = {'self_attn.q_proj':q4_config,
                'self_attn.k_proj':q4_config,
                'self_attn.v_proj':q4_config,
                'self_attn.o_proj':q4_config,

                'mlp.gate_proj':q3_config,
                'mlp.up_proj'  :q3_config,
                'mlp.down_proj':q3_config,
               }
```
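Either config is then passed where a single quantization config would go. A minimal sketch using the HQQ-lib path (assuming the `model_id` placeholder from the earlier examples and the `quant_config` dict built above):

```Python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

#Load on CPU, then quantize with the per-layer config built above
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
AutoHQQHFModel.quantize_model(model,
                              quant_config=quant_config,   #dict: layer name -> BaseQuantizeConfig
                              compute_dtype=torch.float16,
                              device='cuda')
```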

### VLLM
You can use HQQ in <a href="https://github.com/vllm-project/vllm/">vLLM</a>. Make sure to install <a href="https://github.com/dropbox/gemlite/">GemLite</a> before using this backend.

```Python
import torch
from vllm import LLM

#Quantize on-the-fly
from hqq.utils.vllm import set_vllm_onthefly_hqq_quant
skip_modules = ['lm_head', 'visual', 'vision']

#Select one of the following modes:

#INT/FP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weightonly', skip_modules=skip_modules) #A16W8 - INT8 weight-only
set_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128, quant_mode='int4_weightonly', skip_modules=skip_modules)  #A16W4 - HQQ weight-only
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules)                     #A8W8 - INT8 x INT8 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic', skip_modules=skip_modules)                      #A8W8 - FP8 x FP8 dynamic

#MXFP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)   #A8W8 - MXFP8 x MXFP8 - post_scale=True
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)     #A8W8 - MXFP8 x MXFP8 - post_scale=False
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weightonly', skip_modules=skip_modules)                 #A16W4 - MXFP4 weight-only
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)                    #A8W4 - MXFP8 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules)                    #A4W4 - MXFP4 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules)                    #A4W4 - NVFP4 x NVFP4 dynamic


llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)
```
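Once a quantization mode is selected and the `LLM` engine is built, generation follows standard vLLM usage. A short smoke-test sketch (the prompt and sampling values are arbitrary):

```Python
from vllm import SamplingParams

#Greedy decoding with a small token budget, just to smoke-test the quantized engine
sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs  = llm.generate(["Explain half-quadratic quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```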

### Peft Training
PEFT training is directly supported in Hugging Face's <a href="https://huggingface.co/docs/peft/v0.12.0/en/developer_guides/quantization#hqq-quantization">peft library</a>. If you still want to use hqq-lib's peft utilities, here's how:

```Python
#First, quantize/load a quantized HQQ model (see the sections above)
from hqq.core.peft import PeftUtils

base_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32}
lora_params      = {'self_attn.q_proj': base_lora_params,
                    'self_attn.k_proj': base_lora_params,
                    'self_attn.v_proj': base_lora_params,
                    'self_attn.o_proj': base_lora_params,
                    'mlp.gate_proj'   : None,
                    'mlp.up_proj'     : None,
                    'mlp.down_proj'   : None}


#Add LoRA to linear/HQQ modules
PeftUtils.add_lora(model, lora_params)

#Optional: set your backend
HQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)

#Train ...

#Convert LoRA weights to the same model dtype for faster inference
model.eval()
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)

#Save LoRA weights
PeftUtils.save_lora_weights(model, filename)

#Load LoRA weights: automatically calls add_lora
PeftUtils.load_lora_weights(model, filename)
```

We provide a complete example of training a model with HQQ/LoRA in ```examples/hqq_plus.py```.

If you want to use multi-GPU training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora

### Examples
We provide a variety of examples demonstrating model quantization across different backends in the ```examples``` directory.

### Citation 📜
```
@misc{badri2023hqq,
title  = {Half-Quadratic Quantization of Large Machine Learning Models},
url    = {https://dropbox.github.io/hqq_blog/},
author = {Hicham Badri and Appu Shaji},
month  = {November},
year   = {2023}
}
```