{"id":45644828,"url":"https://github.com/NVIDIA/Model-Optimizer","last_synced_at":"2026-03-09T09:00:53.219Z","repository":{"id":238880695,"uuid":"790916393","full_name":"NVIDIA/Model-Optimizer","owner":"NVIDIA","description":"A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed.","archived":false,"fork":false,"pushed_at":"2026-02-28T07:02:37.000Z","size":20138,"stargazers_count":2071,"open_issues_count":167,"forks_count":284,"subscribers_count":26,"default_branch":"main","last_synced_at":"2026-02-28T18:05:06.750Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://nvidia.github.io/Model-Optimizer/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG-Windows.rst","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-04-23T19:00:54.000Z","updated_at":"2026-02-28T07:57:41.000Z","dependencies_parsed_at":"2024-05-21T19:27:44.831Z","dependency_job_id":"504eda53-dc09-4d47-b544-81bc2010170e","html_url":"https://github.com/NVIDIA/Model-Optimizer","commit_stats":{"total_commits":13,"total_committers":4,"mean_commits":3.25,"dds":0.3076923076923077,"last_synced_commit":"f77cae9ffb0f7014c2d8f9d7a94ae7e043d0bbc8"},"previous_names":["n
vidia/tensorrt-model-optimizer"],"tags_count":53,"template":false,"template_full_name":null,"purl":"pkg:github/NVIDIA/Model-Optimizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FModel-Optimizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FModel-Optimizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FModel-Optimizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FModel-Optimizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/Model-Optimizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2FModel-Optimizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30208987,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T05:23:27.321Z","status":"ssl_error","status_checked_at":"2026-03-07T05:00:17.256Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-24T04:00:23.157Z","updated_at":"2026-03-09T09:00:53.198Z","avatar_url":"https://github.com/NVIDIA.png","language":"Python","funding_links":[],"categories":["others","7. 
Training \u0026 Fine-tuning Ecosystem","Tools 🛠️"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n![Banner image](docs/source/assets/model-optimizer-banner.png)\n\n# NVIDIA Model Optimizer\n\n[![Documentation](https://img.shields.io/badge/Documentation-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/Model-Optimizer)\n[![version](https://img.shields.io/pypi/v/nvidia-modelopt?label=Release)](https://pypi.org/project/nvidia-modelopt/)\n[![license](https://img.shields.io/badge/License-Apache%202.0-blue)](./LICENSE)\n\n[Documentation](https://nvidia.github.io/Model-Optimizer) |\n[Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)\n\n\u003c/div\u003e\n\n______________________________________________________________________\n\n**NVIDIA Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.\n\n**[Input]** Model Optimizer currently accepts a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model as input.\n\n**[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.\nModel Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for inference optimization techniques that require training.\n\n**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), 
[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.\n\n## Latest News\n\n- [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)\n- [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.\n- [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)\n- [2025/09/17] [BLOG: An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)\n- [2025/09/11] [BLOG: How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)\n- [2025/08/29] [BLOG: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)\n- [2025/08/01] [BLOG: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)\n- [2025/06/24] [BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)\n- [2025/05/14] [NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series 
GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/)\n- [2025/04/21] [Adobe optimized deployment using Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)\n- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)\n- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 \u0026 Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)\n- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)\n- [2025/01/28] Model Optimizer has added support for NVFP4. 
Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).\n- [2025/01/28] Model Optimizer is now open source!\n\n\u003cdetails close\u003e\n\u003csummary\u003ePrevious News\u003c/summary\u003e\n\n- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: [8B](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [70B](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8), [405B](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8).\n- [2024/09/10] [Post-Training Quantization of LLMs with NVIDIA NeMo and Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/).\n- [2024/08/28] [Boosting Llama 3.1 405B Performance up to 44% with Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)\n- [2024/08/28] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)\n- [2024/08/15] New features in recent releases: [Cache Diffusion](./examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.\n- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. 
Check out the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)\n- [2024/05/08] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)\n- [2024/03/27] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)\n- [2024/03/18] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)\n- [2024/03/07] [Model Optimizer's 8-bit Post-Training Quantization enables TensorRT to accelerate Stable Diffusion to nearly 2x faster](https://developer.nvidia.com/blog/tensorrt-accelerates-stable-diffusion-nearly-2x-faster-with-8-bit-post-training-quantization/)\n- [2024/02/01] [Speed up inference with Model Optimizer quantization techniques in TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md)\n\n\u003c/details\u003e\n\n## Install\n\nTo install stable release packages for Model Optimizer with `pip` from [PyPI](https://pypi.org/project/nvidia-modelopt/):\n\n```bash\npip install -U nvidia-modelopt[all]\n```\n\nTo install from source in editable mode with all development dependencies or to use the latest features, run:\n\n```bash\n# Clone the Model Optimizer repository\ngit clone git@github.com:NVIDIA/Model-Optimizer.git\ncd Model-Optimizer\n\npip install -e .[dev]\n```\n\nYou can also directly use the [TensorRT-LLM docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)\n(e.g., `nvcr.io/nvidia/tensorrt-llm/release:\u003cversion\u003e`), which have Model Optimizer pre-installed.\nMake sure to upgrade Model Optimizer to the 
latest version as described above.\nVisit our [installation guide](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for\nmore fine-grained control over installed dependencies or for alternative Docker images and environment variables to set up.\n\n## Techniques\n\n\u003cdiv align=\"center\"\u003e\n\n| **Technique** | **Description** | **Examples** | **Docs** |\n| :------------: | :------------: | :------------: | :------------: |\n| Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \\[[LLMs](./examples/llm_ptq/)\\] \\[[diffusers](./examples/diffusers/)\\] \\[[VLMs](./examples/vlm_ptq/)\\] \\[[onnx](./examples/onnx_ptq/)\\] \\[[windows](./examples/windows/)\\] | \\[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\\] |\n| Quantization Aware Training | Refine accuracy even further with a few training steps! | \\[[NeMo](./examples/llm_qat#nemo-qatqad-simplified-flow-example)\\] \\[[Hugging Face](./examples/llm_qat/)\\] | \\[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\\] |\n| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \\[[PyTorch](./examples/pruning/)\\] | \\[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\\] |\n| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \\[[NeMo](./examples/llm_distill#knowledge-distillation-kd-for-nvidia-nemo-models)\\] \\[[Hugging Face](./examples/llm_distill/)\\] | \\[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\\] |\n| Speculative Decoding | Train draft modules to predict extra tokens during inference! 
| \\[[Megatron](./examples/speculative_decoding#mlm-example)\\] \\[[Hugging Face](./examples/speculative_decoding/)\\] | \\[[docs](https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html)\\] |\n| Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | \\[[PyTorch](./examples/llm_sparsity/)\\] | \\[[docs](https://nvidia.github.io/Model-Optimizer/guides/6_sparsity.html)\\] |\n\n\u003c/div\u003e\n\n## Pre-Quantized Checkpoints\n\n- Ready-to-deploy checkpoints \\[[🤗 Hugging Face - Nvidia Model Optimizer Collection](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer)\\]\n- Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang)\n- More models coming soon!\n\n## Resources\n\n- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)\n- 📖 [Documentation](https://nvidia.github.io/Model-Optimizer)\n- 🎯 [Benchmarks](./examples/benchmark.md)\n- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)\n- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)\n- ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)\n\n## Model Support Matrix\n\n| Model Type | Support Matrix |\n|------------|----------------|\n| LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) |\n| Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) |\n| VLM Quantization | [View Support Matrix](./examples/vlm_ptq/README.md#support-matrix) |\n| ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) |\n| Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) |\n| Quantization Aware Training | 
[View Support Matrix](./examples/llm_qat/README.md#support-matrix) |\n| Pruning | [View Support Matrix](./examples/pruning/README.md#support-matrix) |\n| Distillation | [View Support Matrix](./examples/llm_distill/README.md#support-matrix) |\n| Speculative Decoding | [View Support Matrix](./examples/speculative_decoding/README.md#support-matrix) |\n\n## Contributing\n\nModel Optimizer is now open source! We welcome any feedback, feature requests and PRs.\nPlease read our [Contributing](./CONTRIBUTING.md) guidelines for details on how to contribute to this project.\n\n### Top Contributors\n\n[![Contributors](https://contrib.rocks/image?repo=NVIDIA/Model-Optimizer)](https://github.com/NVIDIA/Model-Optimizer/graphs/contributors)\n\nHappy optimizing!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2FModel-Optimizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA%2FModel-Optimizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2FModel-Optimizer/lists"}