{"id":31644493,"url":"https://github.com/tao71-ai/autoquantizer","last_synced_at":"2026-05-19T07:32:49.295Z","repository":{"id":306678139,"uuid":"1025531068","full_name":"TAO71-AI/AutoQuantizer","owner":"TAO71-AI","description":"Quantize LLMs automatically.","archived":false,"fork":false,"pushed_at":"2025-09-10T15:58:49.000Z","size":18,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-10T19:51:35.814Z","etag":null,"topics":["cli","gguf","gguf-quantization","llamacpp","llm","llms","python","python3","quantization"],"latest_commit_sha":null,"homepage":"https://tao71.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TAO71-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-24T11:52:41.000Z","updated_at":"2025-09-10T15:58:52.000Z","dependencies_parsed_at":"2025-07-27T02:45:56.829Z","dependency_job_id":"4fcb2b79-8304-44dc-ba19-167c9eb92005","html_url":"https://github.com/TAO71-AI/AutoQuantizer","commit_stats":null,"previous_names":["tao71-ai/autoquantizer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TAO71-AI/AutoQuantizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TAO71-AI%2FAutoQuantizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TAO71-AI%2FAutoQuantizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TAO71-AI%2FAutoQuantizer/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/TAO71-AI%2FAutoQuantizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TAO71-AI","download_url":"https://codeload.github.com/TAO71-AI/AutoQuantizer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TAO71-AI%2FAutoQuantizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278722768,"owners_count":26034461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","gguf","gguf-quantization","llamacpp","llm","llms","python","python3","quantization"],"created_at":"2025-10-07T04:53:24.212Z","updated_at":"2025-10-07T04:53:25.587Z","avatar_url":"https://github.com/TAO71-AI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Installation\n## 1. Install llama.cpp\n```bash\ngit clone https://github.com/ggml-org/llama.cpp.git\ncd llama.cpp\npip install -r requirements.txt\ncmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\ncmake --build build --config Release\ncd ..\n```\n\n## 2. 
Install this script's requirements\n```bash\npip install -r requirements.txt\n```\n\n# Usage\n## LLMs quantization\n```bash\npython quantize_llm.py [ARGUMENTS]\n```\n\n|Argument|Description|Type|Default value|\n|---|---|---|---|\n|--repo=REPOSITORY|Set the model repository. *Required*|str|None|\n|--outtype=QUANT|Set the outtype of the GGUF file. Do not include this quant in the **--quants** argument. *Required*|str|None|\n|--gguf=FILE|Set a current GGUF file to quantize.|str|None|\n|--quants=\"QUANT-1 QUANT-2 ...\"|Set the list of quants to quantize the model with. Separated by spaces, the quant names must be valid. If not set, the model will only be converted to GGUF.|list (str separated by spaces)|\"\"|\n|--output-dir=DIRECTORY|Override the default output directory.|str|\"output\"|\n|--cache-dir=DIRECTORY|Override the default cache directory.|str|\"cache\"|\n|--lcpp-dir=DIRECTORY|Override the default llama.cpp directory.|str|\"llama.cpp\"|\n|--lcpp-pre-gguf=COMMAND|Override the default command to execute when converting to GGUF.|str|\"python\"|\n|--lcpp-gguf=FILE|Override the default script file to execute when converting to GGUF.|str|\"convert\\_hf\\_to\\_gguf.py\"|\n|--lcpp-pre-quant=COMMAND|Override the default command to execute when quantizing.|str|\"\"|\n|--lcpp-quant=FILE|Override the default script file to execute when quantizing.|str|\"build/bin/llama-quantize\"|\n|--model-card-template=TEMPLATE|Override the default model card template.|str|Check the script.|\n|--repo-name-template=TEMPLATE|Override the default repository name template.|str|Check the script.|\n|--repo-public|Make the created repository public.|-|False|\n|--test|Test the script to make sure it works without executing commands.|-|False|\n|--as-dir|Uploads the entire model directory in a single commit, instead of uploading files one by one.|-|False|\n|--force-reinstall|Force the reinstallation of the LLM.|-|False|\n\n### LLM quantization methods\n- **Q2_K_XXS**: Uses Q2\\_K for the 
embedding and output weights, and mostly Q2\\_K for everything else.\n- **Q2_K_XS**: Uses Q4\\_K for the output weights, Q2\\_K for embedding, and mostly Q2\\_K for everything else.\n- **Q2_K**: Normal Q2\\_K quantization. Most weights are in Q2\\_K. Not recommended for most LLMs due to its low precision.\n- **Q2_K_L**: Uses Q8\\_0 for the embedding and output weights, and mostly Q2\\_K for everything else. Has more precision, but it is still not recommended because it's mostly Q2\\_K.\n- **Q2_K_XL**: Uses F16 for the embedding and output weights, and mostly Q2\\_K for everything else. Has even more precision, but still not recommended because it's mostly Q2\\_K.\n- **Q3_K_XXS**: Uses Q3\\_K for embedding and output weights, and Q3\\_K\\_S for everything else.\n- **Q3_K_XS**: Uses Q4\\_K for the output weights, Q3\\_K for embedding, and Q3\\_K\\_S for everything else.\n- **Q3_K_S**: Normal Q3\\_K\\_S quantization. Most weights are in Q3\\_K. Not recommended for most use cases due to its low precision.\n- **Q3_K_M**: Normal Q3\\_K\\_M quantization. There are more weights in other quantizations like Q5\\_K and others, but mostly it is Q3\\_K. Not recommended for most use cases due to its low precision.\n- **Q3_K_L**: Normal Q3\\_K\\_L quantization. There are even more weights in other quantizations, but mostly it is Q3\\_K. If possible, prefer Q3\\_K\\_XL, but this might have decent results in some use cases. Only recommended if you have a very slow CPU or GPU, or limited RAM.\n- **Q3_K_XL**: Uses Q8\\_0 for the embedding and output weights, and Q3\\_K\\_L for everything else. This might have decent results in some use cases.\n- **Q3_K_XXL**: Uses F16 for the embedding and output weights, and Q3\\_K\\_L for everything else. Prefer this only if you want more precision for the embedding or output weights. For most models the size of this quant is similar to Q4\\_K\\_S or Q4\\_K\\_M. 
Prefer Q4\\_K\\_S or Q4\\_K\\_M if the size is similar.\n- **Q4_K_XS**: Uses Q4\\_K for embedding and output weights, and Q4\\_K\\_S for everything else.\n- **Q4_K_S**: Normal Q4\\_K\\_S quantization. Most weights are in Q4\\_K. Gives decent results for most use cases. Slightly lower quality than Q4\\_K\\_M, but requires less CPU, GPU, and RAM.\n- **Q4_K_M**: Normal Q4\\_K\\_M quantization. There are more weights in other quantizations like Q5\\_K and others, but mostly it is Q4\\_K. Gives decent results for most use cases. Good quality.\n- **Q4_K_L**: Uses Q8\\_0 for the embedding and output weights, and Q4\\_K\\_M for everything else. More precision than Q4\\_K\\_M.\n- **Q4_K_XL**: Uses F16 for the embedding and output weights, and Q4\\_K\\_M for everything else. Prefer Q5\\_K\\_S or Q5\\_K\\_M if the size is similar.\n- **Q5_K_XXS**: Uses Q4\\_K for output weights, Q5\\_K for embedding, and Q5\\_K\\_S for everything else.\n- **Q5_K_S**: Normal Q5\\_K\\_S quantization. Most weights are in Q5\\_K. High quality and very good results. Very similar to Q5\\_K\\_M, but uses slightly less memory.\n- **Q5_K_M**: Normal Q5\\_K\\_M quantization. There are more weights in other quantizations like Q6\\_K and others, but mostly it is Q5\\_K. High quality and very good results.\n- **Q5_K_L**: Uses Q8\\_0 for the embedding and output weights, and Q5\\_K\\_M for everything else. High quality and very good results.\n- **Q5_K_XL**: Uses F16 for the embedding and output weights, and Q5\\_K\\_M for everything else. Prefer Q6\\_K if the size is similar.\n- **Q6_K_S**: Uses Q4\\_K for output weights, Q6\\_K for embedding, and mostly Q6\\_K for everything else.\n- **Q6_K**: Normal Q6\\_K quantization. Most weights are in Q6\\_K. Very high quality. Results similar to Q8\\_0.\n- **Q6_K_L**: Uses Q8\\_0 for the embedding and output weights, and mostly Q6\\_K for everything else. Very high quality. 
Results more similar to Q8\\_0 or F16.\n- **Q6_K_XL**: Uses F16 for the embedding and output weights, and mostly Q6\\_K for everything else. Prefer Q6\\_K\\_L.\n- **Q8_K_XS**: Uses Q4\\_K for output weights, Q8\\_0 for embedding, and mostly Q8\\_0 for everything else.\n- **Q8_K_S**: Uses Q6\\_K for output weights, Q8\\_0 for embedding, and mostly Q8\\_0 for everything else.\n- **Q8_0**: Normal Q8\\_0 quantization. Most weights are in Q8\\_0. Quality almost like F16, saving around half the memory required for F16.\n- **Q8_K_XL**: Uses F16 for the embedding and output weights, and mostly Q8\\_0 for everything else. Prefer Q8\\_0.\n- **F16**: Normal F16 quantization. Most weights are in F16. If the model has been trained in BF16 or F32, prefer BF16. Not recommended because Q8\\_0 has almost the same quality. Only use this if the model has been trained in F16 and you really need full precision.\n- **BF16**: Normal BF16 quantization. Most weights are in F16 or BF16. Not recommended because Q8\\_0 has almost the same quality. Only use this if the model has been trained in BF16 or F32 and you really need full precision.\n- **F32**: Normal F32 quantization. Most, if not all, weights are in F32. Not recommended because F16, BF16, and Q8\\_0 have almost the same quality. Most LLMs are more than fine if you use F16, BF16, or Q8\\_0. Only use this if the model has been trained in F32 and you really need full precision.\n\n\u003e [!NOTE]\n\u003e In most cases, Q8\\_0 for embedding and output weights is enough; F16 doesn't make much difference.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftao71-ai%2Fautoquantizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftao71-ai%2Fautoquantizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftao71-ai%2Fautoquantizer/lists"}