{"id":13562914,"url":"https://github.com/pytorch/ao","last_synced_at":"2025-05-12T13:11:31.760Z","repository":{"id":206588398,"uuid":"714073518","full_name":"pytorch/ao","owner":"pytorch","description":"PyTorch native quantization and sparsity for training and inference","archived":false,"fork":false,"pushed_at":"2025-05-07T16:40:48.000Z","size":31987,"stargazers_count":2020,"open_issues_count":351,"forks_count":257,"subscribers_count":43,"default_branch":"main","last_synced_at":"2025-05-07T17:23:59.782Z","etag":null,"topics":["brrr","cuda","dtypes","float8","inference","llama","mx","offloading","optimizer","pytorch","quantization","sparsity","training","transformer"],"latest_commit_sha":null,"homepage":"https://pytorch.org/ao/stable/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pytorch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-11-03T21:27:36.000Z","updated_at":"2025-05-07T16:07:13.000Z","dependencies_parsed_at":"2024-01-17T00:44:20.071Z","dependency_job_id":"3aafdda6-22d8-45a5-96d2-5871ef8bf4ae","html_url":"https://github.com/pytorch/ao","commit_stats":{"total_commits":569,"total_committers":77,"mean_commits":"7.3896103896103895","dds":0.7785588752196837,"last_synced_commit":"d9abbf682c6d0bacbe66933eb568f4ad4c93782d"},"previous_names":["pytorch-labs/ao","pytorch/ao"],"tags_count":46,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fao","ta
gs_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fao/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fao/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pytorch%2Fao/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pytorch","download_url":"https://codeload.github.com/pytorch/ao/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253745170,"owners_count":21957318,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["brrr","cuda","dtypes","float8","inference","llama","mx","offloading","optimizer","pytorch","quantization","sparsity","training","transformer"],"created_at":"2024-08-01T13:01:13.358Z","updated_at":"2025-05-12T13:11:30.853Z","avatar_url":"https://github.com/pytorch.png","language":"Python","funding_links":[],"categories":["Python","Canon","Optimizations and fine-tuning","Repos","A01_文本生成_文本对话","1. 
Core Frameworks \u0026 Libraries"],"sub_categories":["Memory-Efficient Optimizers","大语言对话模型及数据"],"readme":"# torchao: PyTorch Architecture Optimization\n\n[![](https://dcbadge.vercel.app/api/server/gpumode?style=flat)](https://discord.gg/gpumode)\n\n[Introduction](#introduction) | [Inference](#inference) | [Training](#training)  | [Composability](#composability) | [Custom Kernels](#custom-kernels) | [Alpha Features](#alpha-features) | [Installation](#installation) | [Integrations](#integrations) | [Videos](#videos) | [License](#license) | [Citation](#citation)\n\n## Introduction\n\ntorchao: PyTorch library for custom data types \u0026 optimizations. Quantize and sparsify weights, gradients, optimizers \u0026 activations for inference and training.\n\nFrom the team that brought you the fast series\n* 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)\n* 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)\n* 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)\n\ntorchao works for training too, with [up to 1.5x e2e speedups](https://pytorch.org/blog/training-using-float8-fsdp2/) on large scale (512 GPU / 405B parameter count) pretraining jobs with `torchao.float8`!\n\ntorchao just works with `torch.compile()` and `FSDP2` over most PyTorch models on Huggingface out of the box.\n\n## Inference\n\n### Post Training Quantization\n\nQuantizing and Sparsifying your models is a 1 liner that should work on any model with an `nn.Linear` including your favorite HuggingFace model.\n\nThere are 2 methods of post-training quantization, shown in the code snippets below:\n1. Using torchao APIs directly.\n2. 
Loading a huggingface model with a quantization config.\n\n#### Quantizing for inference with torchao APIs\n```python\nfrom torchao.quantization.quant_api import (\n    quantize_,\n    Int8DynamicActivationInt8WeightConfig,\n    Int4WeightOnlyConfig,\n    Int8WeightOnlyConfig\n)\nquantize_(m, Int4WeightOnlyConfig())\n```\n\nYou can find more comprehensive usage instructions for quantization [here](torchao/quantization/) and for sparsity [here](/torchao/_models/sam/README.md).\n\n#### Quantizing for inference with huggingface configs\n\nSee [docs](https://huggingface.co/docs/transformers/main/en/quantization/torchao) for more details.\n\nFor inference, we have three options:\n1. Quantize only the weights: works best for memory bound models\n2. Quantize the weights and activations: works best for compute bound models\n3. Quantize the activations and weights and sparsify the weight\n\nFor gpt-fast `Int4WeightOnlyConfig()` is the best option at bs=1 as it **doubles the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.\n\nIf you don't have enough VRAM to quantize your entire model on GPU and you find CPU quantization to be too slow, then you can use the device argument like so `quantize_(model, Int8WeightOnlyConfig(), device=\"cuda\")`, which will send each layer to your GPU and quantize it individually.\n\nIf you see slowdowns with any of these techniques or you're unsure which option to use, consider using [autoquant](./torchao/quantization/README.md#autoquantization), which will automatically profile layers and pick the best way to quantize each layer.\n\n```python\nmodel = torchao.autoquant(torch.compile(model, mode='max-autotune'))\n```\n\nWe also provide a developer-facing API so you can implement your own quantization algorithms; please use the excellent [HQQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) algorithm as a motivating example.\n\n### Evaluation\n\nYou can also use the EleutherAI [LM evaluation 
harness](https://github.com/EleutherAI/lm-evaluation-harness) to directly evaluate models\nquantized with post training quantization, by following these steps:\n\n1. Quantize your model with a [post training quantization strategy](#post-training-quantization).\n2. Save your model to disk or upload to huggingface hub ([instructions](https://huggingface.co/docs/transformers/main/en/quantization/torchao?torchao=manual#serialization)).\n3. [Install](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#install) lm-eval.\n4. Run an evaluation. Example:\n\n```bash\nlm_eval --model hf --model_args pretrained=${HF_USER}/${MODEL_ID} --tasks hellaswag --device cuda:0 --batch_size 8\n```\n\nCheck out the lm-eval [usage docs](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#basic-usage) for more details.\n\n### KV Cache Quantization\n\nWe've added KV cache quantization and other features in order to enable long context length (and necessarily memory efficient) inference.\n\nIn practice these features alongside int4 weight only quantization allow us to **reduce peak memory by ~55%**, meaning we can run Llama3.1-8B inference with a **130k context length with only 18.9 GB of peak memory.** More details can be found [here](torchao/_models/llama/README.md).\n\n## Training\n\n### Quantization Aware Training\n\nPost-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). We've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/). 
For more details, please see the [QAT README](./torchao/quantization/qat/README.md).\n\n```python\nimport torch\nfrom torchao.quantization import (\n    quantize_,\n    Int8DynamicActivationInt4WeightConfig,\n)\nfrom torchao.quantization.qat import (\n    FakeQuantizeConfig,\n    FromIntXQuantizationAwareTrainingConfig,\n    IntXQuantizationAwareTrainingConfig,\n)\n\n# Insert fake quantization\nactivation_config = FakeQuantizeConfig(torch.int8, \"per_token\", is_symmetric=False)\nweight_config = FakeQuantizeConfig(torch.int4, group_size=32)\nquantize_(\n    my_model,\n    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),\n)\n\n# Run training... (not shown)\n\n# Convert fake quantization to actual quantized operations\nquantize_(my_model, FromIntXQuantizationAwareTrainingConfig())\nquantize_(my_model, Int8DynamicActivationInt4WeightConfig(group_size=32))\n```\n\n### Float8\n\n[torchao.float8](torchao/float8) implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.\n\nWith ``torch.compile`` on, current results show throughput speedups of up to **1.5x on up to 512 GPU / 405B parameter count scale** ([details](https://pytorch.org/blog/training-using-float8-fsdp2/)).\n\n```python\nfrom torchao.float8 import convert_to_float8_training\nconvert_to_float8_training(m, module_filter_fn=...)\n```\n\nFor a minimal end-to-end recipe for pretraining with float8, you can check out [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).\n\n#### Blog posts about float8 training\n\n* [Supercharging Training using float8 and FSDP2](https://pytorch.org/blog/training-using-float8-fsdp2/)\n* [Efficient Pre-training of Llama 3-like model architectures using torchtitan on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/efficient-pre-training-of-llama-3-like-model-architectures-using-torchtitan-on-amazon-sagemaker/)\n* [Float8 in 
PyTorch](https://dev-discuss.pytorch.org/t/float8-in-pytorch-1-x/1815)\n\n\n### Sparse Training\n\nWe've added support for semi-structured 2:4 sparsity with **6% end-to-end speedups on ViT-L**. Full blog [here](https://pytorch.org/blog/accelerating-neural-network-training/).\n\nThe code change is a 1 liner with the full example available [here](torchao/sparsity/training/).\n\n```python\nswap_linear_with_semi_sparse_linear(model, {\"seq.0\": SemiSparseLinear})\n```\n\n### Memory-efficient optimizers\n\nADAM takes 2x as much memory as the model params, so we can quantize the optimizer state to either 8 or 4 bits, effectively reducing the optimizer VRAM requirements by 2x or 4x respectively over an fp16 baseline.\n\n```python\nfrom torchao.optim import AdamW8bit, AdamW4bit, AdamWFp8\noptim = AdamW8bit(model.parameters()) # replace with AdamW4bit and AdamWFp8 for the 4-bit / fp8 versions\n```\n\nIn practice, we are a tiny bit slower than expertly written kernels, but the implementations for these optimizers were written in a **few hundred lines of PyTorch code** and compiled, so please use them or copy-paste them for your quantized optimizers. Benchmarks [here](https://github.com/pytorch/ao/tree/main/torchao/optim).\n\nWe also have support for [single GPU CPU offloading](https://github.com/pytorch/ao/tree/main/torchao/optim#optimizer-cpu-offload) where both the gradients (same size as weights) and the optimizers will be efficiently sent to the CPU. This alone can **reduce your VRAM requirements by 60%**.\n\n```python\noptim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)\noptim.load_state_dict(ckpt[\"optim\"])\n```\n\n## Composability\n\n1. `torch.compile`: A key design principle for us is composability, as in any new dtype or layout we provide needs to work with our compiler. It shouldn't matter if the kernels are written in pure PyTorch, CUDA, C++, or Triton - things should just work! 
So we write the dtype, layout, or bit packing logic in pure PyTorch and code-generate efficient kernels.\n2. [FSDP2](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md): Historically most quantization has been done for inference, but there is now a thriving area of research combining distributed algorithms and quantization.\n\nThe best example we have combining the composability of lower bit dtype with compile and fsdp is [NF4](torchao/dtypes/nf4tensor.py) which we used to implement the [QLoRA](https://www.youtube.com/watch?v=UvRl4ansfCg) algorithm. So if you're doing research at the intersection of these areas we'd love to hear from you.\n\n## Custom Kernels\n\nWe've added support for authoring and releasing [custom ops](./torchao/csrc/) that do not graph break with `torch.compile()`, so if you love writing kernels but hate packaging them so they work on all operating systems and CUDA versions, we'd love to accept contributions for your custom ops. We have a few examples you can follow:\n\n1. [fp6](torchao/dtypes/floatx) for 2x faster inference over fp16 with an easy to use API `quantize_(model, FPXWeightOnlyConfig(3, 2))`\n2. [2:4 Sparse Marlin GEMM](https://github.com/pytorch/ao/pull/733) 2x speedups for FP16xINT4 kernels even at batch sizes up to 256\n3. [int4 tinygemm unpacker](https://github.com/pytorch/ao/pull/415) which makes it easier to switch quantized backends for inference\n\nIf you believe there are other CUDA kernels we should be taking a closer look at, please leave a comment on [this issue](https://github.com/pytorch/ao/issues/697).\n\n\n## Alpha features\n\nThings we're excited about but need more time to cook in the oven\n\n1. [MX](torchao/prototype/mx_formats) training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. 
This work is prototype as the hardware support is not available yet.\n2. [Int8 Quantized Training](https://github.com/pytorch/ao/tree/main/torchao/prototype/quantized_training): We're trying out full int8 training. This is easy to use with `quantize_(model, int8_weight_only_quantized_training())`. This work is prototype as the memory benchmarks are not compelling yet.\n3. [IntX](https://github.com/pytorch/ao/tree/main/torchao/dtypes/uintx): We've managed to support all the ints by doing some clever bitpacking in pure PyTorch and then compiling it. This work is prototype as unfortunately without some more investment in either the compiler or low-bit kernels, int4 is more compelling than any smaller dtype.\n4. [Bitnet](https://github.com/pytorch/ao/blob/main/torchao/prototype/dtypes/bitnet.py): Mostly this is very cool to people on the team. This is prototype because how useful these kernels are is highly dependent on better hardware and kernel support.\n\n## Installation\n\n`torchao` makes liberal use of several new features in PyTorch, so it's recommended to use it with the current nightly or latest stable version of PyTorch.\n\nStable release from PyPI which will default to CUDA 12.4\n\n```Shell\npip install torchao\n```\n\nStable Release from the PyTorch index\n```Shell\npip install torchao --extra-index-url https://download.pytorch.org/whl/cu124 # full options are cpu/cu118/cu124/cu126\n```\n\nNightly Release\n```Shell\npip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu126/cu128\n```\n\n*Most* developers probably want to skip building custom C++/CUDA extensions for faster iteration\n\n```Shell\nUSE_CPP=0 pip install -e .\n```\n\n## OSS Integrations\n\nWe're also fortunate to be integrated into some of the leading open-source libraries including\n1. 
Hugging Face transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)\n2. Hugging Face diffusers best practices with torch.compile and torchao in a standalone repo [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao)\n3. Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)\n4. [TorchTune](https://github.com/pytorch/torchtune) for our QLoRA and QAT recipes\n5. [torchchat](https://github.com/pytorch/torchchat) for post training quantization\n6. SGLang for LLM serving: [usage](https://github.com/sgl-project/sglang/blob/4f2ee48ed1c66ee0e189daa4120581de324ee814/docs/backend/backend.md?plain=1#L83) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).\n\n## Videos\n* [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1\u0026t=1009)\n* [Low precision dtypes at PyTorch conference](https://youtu.be/xcKwEZ77Cps?si=7BS6cXMGgYtFlnrA)\n* [Slaying OOMs at the Mastering LLM's course](https://www.youtube.com/watch?v=UvRl4ansfCg)\n* [Advanced Quantization at CUDA MODE](https://youtu.be/1u9xUK3G4VM?si=4JcPlw2w8chPXW8J)\n* [Chip Huyen's GPU Optimization Workshop](https://www.youtube.com/live/v_q2JTIqE20?si=mf7HeZ63rS-uYpS6)\n* [Cohere for AI community talk](https://www.youtube.com/watch?v=lVgrE36ZUw0)\n\n\n## License\n\n`torchao` is released under the [BSD 3](https://github.com/pytorch-labs/ao/blob/main/LICENSE) license.\n\n# Citation\n\nIf you find the torchao library useful, please cite it in your work as below.\n\n```bibtex\n@software{torchao,\n  title = {torchao: PyTorch native quantization and sparsity for training and inference},\n  author = {torchao maintainers and contributors},\n  url = {https://github.com/pytorch/torchao},\n  license = {BSD-3-Clause},\n  month = oct,\n  year = 
{2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpytorch%2Fao","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpytorch%2Fao","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpytorch%2Fao/lists"}