{"id":23451401,"url":"https://github.com/coloquinte/sleekit","last_synced_at":"2026-02-14T02:03:18.420Z","repository":{"id":242273109,"uuid":"792329380","full_name":"Coloquinte/sleekit","owner":"Coloquinte","description":"Bag of Tricks for NN Quantization","archived":false,"fork":false,"pushed_at":"2024-12-09T16:44:46.000Z","size":4250,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-20T23:14:19.319Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Coloquinte.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-26T12:54:00.000Z","updated_at":"2024-12-16T13:38:18.000Z","dependencies_parsed_at":"2024-07-16T17:28:14.934Z","dependency_job_id":"765e5295-7795-4033-92e5-ceaf67662d19","html_url":"https://github.com/Coloquinte/sleekit","commit_stats":null,"previous_names":["coloquinte/sleekit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Coloquinte%2Fsleekit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Coloquinte%2Fsleekit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Coloquinte%2Fsleekit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Coloquinte%2Fsleekit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Coloquinte","download_url":"https://codeload.gi
thub.com/Coloquinte/sleekit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":231033465,"owners_count":18317982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-24T00:25:54.730Z","updated_at":"2026-02-14T02:03:18.408Z","avatar_url":"https://github.com/Coloquinte.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bag of Tricks for NN Quantization\n\nNeural network quantization is the process that compresses the weights in a neural network to use a smaller number representation.\nThis makes its representation smaller, both on disk and in memory, and can make the computation less expensive for accelerators, typically by using small integer weights for the coefficients.\nAt the same time, it reduces the precision of the computations, so that good algorithm design is necessary to maintain good quality.\n\nThis repository contains tools to research post-training neural networks quantization, with methods to improve over the current state-of-the-art.\nIt is purely for analysis purpose: complete implementations will be made available on other repositories.\nOur main contributions are two simple improvements that are compatible with most quantization methods: an improved scaling method, and making better use of the bias during quantization.\n\n## Quantization method\n\nSleekit uses a very generic quantization method. 
The steps to quantize a layer are:\n* gathering sample data: we run the network on some data samples to gather statistical information for each layer;\n* choosing a codebook: a codebook gives a limited number of values that can be represented, and we round the weights to one of the values in the codebook;\n* scaling the weights: we apply a scaling factor so that the weights are close to the chosen codebook;\n* optimizing the weights: to maintain a good quality for the neural network, we use a specialized algorithm to tweak the weights after rounding.\n\n## Improvements\n\nWe present several generic improvements that can be applied to any quantization method.\nThey target both the scaling step, to select better scaling factors, and the weight optimization step, to reduce the layer error.\n\n### Methodology: layer-per-layer analysis\n\nTo develop our methods we analyze the effect of quantization decisions on a per-layer basis.\nAlthough many previous works use network-level metrics, post-training quantization methods minimize the error at the layer level.\nAnalyzing the error at the layer level is therefore the natural approach.\nMoreover, network-level metrics tend to be noisy, and can either hide small quantization errors or, on the contrary, be over-sensitive to some layers.\n\nOur baseline for comparison is the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm with 3-bit and 1.5-bit weights.\nWe use GPTQ's given parameters for the heuristic (diagonal ordering and 1% dampening).\nFor the layer weights and metrics, we use layer statistics from a full-accuracy run on several smaller networks (OPT-125M, OPT-350M, BLOOM-560M).\nWe compare the error introduced by the quantization with and without our methods.\n\n### Trick 1: better scaling\n\nA good scaling factor minimizes the error introduced by quantization.\nThe typical method is to choose a scaling factor that minimizes the mean squared error on the weights (MSE).\nWe introduce a more precise approach, that 
optimizes the layer's result directly.\n\nFor weight optimization, we already have access to an accurate measure of the layer's error (the hessian matrix $H$ obtained from input samples).\nOur idea is to reuse it for scaling optimization.\nWe test four different approaches to scaling, and compare the layer error after applying GPTQ:\n* minimizing the mean squared error after rounding to the nearest codebook value;\n* using the full hessian matrix to compute the error, which is computationally expensive;\n* using the diagonal of the hessian matrix to compute the error, which has the same computational cost as the MSE;\n* using the full weight optimization to compute the error for each scaling value, which is extremely expensive but is theoretically optimal.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"results/scaling_1.5b.png\" width=45%\u003e\u003cimg src=\"results/scaling_3b.png\" width=45%\u003e\n\u003c/div\u003e\n\nThe usual approach of minimizing the MSE yields results that are far from optimal.\nUsing the full hessian matrix or its diagonal yields similar results that are on average much better than MSE alone.\nThese results remain far from the theoretical optimum, and are even slightly degraded for some layers, leaving room for improvement. 
\n\n### Trick 2: combining with bias correction\n\n[Bias correction](https://arxiv.org/abs/1810.05723) is a method used to reduce the impact of quantization on a layer.\nNewer quantization methods behave much better, and it is not used much anymore.\nHowever, it is compatible with them, and there is no reason not to use both.\nThe effect of bias correction can even be integrated in the cost function used for weight optimization, using $H=\\frac{1}{n} X^\\intercal X -M^\\intercal M$, where $X$ are the input samples and $M = \\frac{1}{n}1^\\intercal X$ is the average value of the samples for each input.\n\nWe test three different ways to update the bias:\n* applying weight optimization alone (GPTQ) without bias correction;\n* applying bias correction after weight optimization, yielding a slightly smaller layer error;\n* taking the effect of bias correction into account during weight optimization.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"results/correction_1.5b.png\" width=45%\u003e\u003cimg src=\"results/correction_3b.png\" width=45%\u003e\n\u003c/div\u003e\n\nAdding back bias correction greatly improves certain layers, in particular some attention layers in all networks.\nUnsurprisingly, it has more impact with a more aggressive quantization and yields better results if taken into account for weight optimization.\n\n### Trick 3: adding local search\n\nThe weight optimization problem is NP-hard, and can only be solved at scale in an approximate manner.\nGPTQ provides a good heuristic for it; however, the heuristic of choice to obtain good solutions to similar problems (QUBO) is a simple local search.\nFor this reason, we test the effect of applying a few local search moves after GPTQ, in a best-first manner.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"results/local_search_1.5b.png\" width=45%\u003e\u003cimg src=\"results/local_search_3b.png\" width=45%\u003e\n\u003c/div\u003e\n\nThe effect of just a few local search moves is notable on many layers, and 
applying them after GPTQ can drastically reduce layer error.\n\n### Minor tricks\n\nOther tricks yield smaller but useful improvements:\n* Using a different ordering for GPTQ. GPTQ makes rounding decisions for the weights in a greedy manner; the original authors obtain a good ordering using the diagonal of the matrix in decreasing order.\nInstead, we multiply this value by the sum of squares of the quantization error; this takes better account of the effect of saturation.\n* Using a different dampening for GPTQ. GPTQ does not behave well on ill-conditioned matrices, and adding a larger penalty term to the matrix paradoxically yields better results.\nThe original paper uses a 1% penalty, but penalties of 3-10% behave better, while removing the penalty altogether degrades results significantly.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"results/ordering_3b.png\" width=40%\u003e\u003cimg src=\"results/dampening_3b.png\" width=40%\u003e\n\u003c/div\u003e\n\n### The many tricks that do not work\n\nThe following approaches did not yield promising results and were abandoned:\n* Improved codebooks: the data is far from being gaussian-distributed, but training a codebook naively on the data is not better than an [NF4 codebook](https://arxiv.org/abs/2305.14314).\nGood codebook training needs to take the hessian (or its diagonal) into account.\n* Entropy coding: it is tempting to combine codebook optimization with entropy coding to reduce storage needs. However, the gain in entropy is not huge compared to an error-optimized codebook, and does not seem worth the effort.\n* GPTQ reordering: clever heuristic orderings for GPTQ based on the hessian matrix do not bring a reduction in layer error, compared to using its diagonal as the original paper does. 
We tested several variations using the diagonal of the inverse and pivoted Cholesky decompositions.\n* More complex algorithms for weight optimization: it just doesn't scale, but if you want to go in this direction you probably want to use the [MQLib](https://github.com/MQLib/MQLib) as a solver.\n\n### Putting it all together\n\nFinally, we put these algorithms together in Sleekit:\n\n1. the hessian matrix is modified to represent the effect of bias correction;\n2. scaling is performed based on the hessian diagonal;\n3. weight optimization uses our slightly improved ordering and dampening.\n\nThe computational cost of the algorithm is not increased so far compared to GPTQ: this is the \"Sleekit light\" version.\n\nAt the cost of additional computations we add the following for the \"Sleekit heavy\" version:\n\n4. scaling is performed based on a weight optimization computation;\n5. local search is performed during the final weight optimization for 1000 moves.\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"results/compare_1.5b.png\" width=45%\u003e\u003cimg src=\"results/compare_3b.png\" width=45%\u003e\n\u003c/div\u003e\n\nMost of the improvement is due to the better scaling method, but the various methods stack well. Together, they yield a reduced error on almost all layers.\nMoreover, a minority of layers experience a huge improvement, with error reduced by 80% or more. It is still unclear what the impact is for the neural network as a whole.\n\n### Numerical results\n\nAll results are available in the `results` folder. 
The geometric mean impact of each trick on the mean squared error against the default GPTQ is shown below.\n\n\u003cdetails\u003e\n\u003csummary\u003eExpand numerical results\u003c/summary\u003e\n\n| Scaling method | 3b      | 2b      | 1.5b    | 1b      |\n| -------------- | ------- | ------- | ------- | ------- |\n| Diagonal       | -20.25% | -16.66% | -15.52% | -7.78%  |\n| Hessian        | -20.50% | -18.41% | -16.36% | -19.48% |\n| Exhaustive     | -29.68% | -29.35% | -26.03% | -30.64% |\n\n| Correction method   | 3b      | 2b      | 1.5b    | 1b      |\n| ------------------- | ------- | ------- | ------- | ------- |\n| After optimization  | -1.72%  | -4.11%  | -5.01%  | -10.78% |\n| During optimization | -4.01%  | -6.72%  | -7.90%  | -13.44% |\n\n| Local search duration | 3b      | 2b      | 1.5b    | 1b      |\n| --------------------- | ------- | ------- | ------- | ------- |\n| 10 moves              | -4.51%  | -6.05%  | -7.07%  | -9.57%  |\n| 100 moves             | -9.42%  | -13.47% | -15.64% | -20.25% |\n\n| Ordering                 | 3b     | 2b     | 1.5b   | 1b     |\n| ------------------------ | ------ | ------ | ------ | ------ |\n| Diagonal * Error         | -0.57% | -0.62% | -0.59% | -0.50% |\n| Diagonal * Squared Error | -1.95% | -1.69% | -1.35% | -1.40% |\n\n| Dampening  | 3b      | 2b      | 1.5b    | 1b      |\n| ---------- | ------- | ------- | ------- | ------- |\n| 0.001      | +2.52%  | +3.50%  | +2.72%  | +3.17%  |\n| 0.003      | +1.29%  | +1.73%  | +1.57%  | +1.63%  |\n| 0.03       | -0.91%  | -1.49%  | -1.54%  | -1.91%  |\n| 0.1        | -0.03%  | -1.91%  | -2.14%  | -3.86%  |\n| 0.3        | +5.42%  | +1.42%  | +0.45%  | -3.67%  |\n| 1.0        | +19.78% | +12.48% | +10.08% | +1.47%  |\n\n| Method                   | 3b      | 2b      | 1.5b    | 1b      |\n| ------------------------ | ------- | ------- | ------- | ------- |\n| Correction only          | -4.01%  | -6.72%  | -7.90%  | -13.44% |\n| Diagonal scaling only    | -20.25% | 
-16.66% | -15.52% | -7.78%  |\n| Sleekit light            | -25.04% | -23.90% | -22.43% | -20.50% |\n| Sleekit heavy            | -34.86% | -36.49% | -34.33% | -41.94% |\n\n\u003c/details\u003e\n\n## References\n\nThe algorithms in this repository build on the following works:\n* [Bias correction](https://arxiv.org/abs/1810.05723) and [GPTQ](https://arxiv.org/abs/2210.17323) for the approach to weight quantization, as well as similar works such as [AdaRound](https://arxiv.org/abs/2004.10568), [AdaQuant](https://arxiv.org/abs/2006.10518), [OBQ](https://arxiv.org/abs/2208.11580) or [GPTVQ](https://arxiv.org/abs/2402.15319);\n* [Lloyd](https://en.wikipedia.org/wiki/Lloyd%27s_algorithm) and [LBG](https://en.wikipedia.org/wiki/Linde%E2%80%93Buzo%E2%80%93Gray_algorithm) for the choice of quantization grids;\n* The [GPTQ repository](https://github.com/IST-DASLab/gptq) was used for data and testing.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoloquinte%2Fsleekit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoloquinte%2Fsleekit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoloquinte%2Fsleekit/lists"}