{"id":23949883,"url":"https://github.com/vllm-project/llm-compressor","last_synced_at":"2025-05-14T14:09:31.091Z","repository":{"id":245882699,"uuid":"817962136","full_name":"vllm-project/llm-compressor","owner":"vllm-project","description":"Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM","archived":false,"fork":false,"pushed_at":"2025-05-13T22:54:40.000Z","size":23720,"stargazers_count":1340,"open_issues_count":84,"forks_count":127,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-05-14T00:35:35.735Z","etag":null,"topics":["compression","quantization","sparsity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vllm-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-06-20T20:13:34.000Z","updated_at":"2025-05-13T15:12:49.000Z","dependencies_parsed_at":"2024-06-24T16:40:57.870Z","dependency_job_id":"9fd99eac-f0e7-4dbc-828c-eb10a2a6d33a","html_url":"https://github.com/vllm-project/llm-compressor","commit_stats":null,"previous_names":["vllm-project/llm-compressor"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fllm-compressor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fllm-compressor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fllm-compressor/releases
","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fllm-compressor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vllm-project","download_url":"https://codeload.github.com/vllm-project/llm-compressor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254160371,"owners_count":22024568,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","quantization","sparsity"],"created_at":"2025-01-06T11:52:11.992Z","updated_at":"2025-05-14T14:09:26.070Z","avatar_url":"https://github.com/vllm-project.png","language":"Python","funding_links":[],"categories":["📦 Compression","Python","A01_文本生成_文本对话","7. Training \u0026 Fine-tuning Ecosystem"],"sub_categories":["Quantization","大语言对话模型及数据"],"readme":"# \u003cimg width=\"40\" alt=\"tool icon\" src=\"https://github.com/user-attachments/assets/f9b86465-aefa-4625-a09b-54e158efcf96\" /\u003e  LLM Compressor\n`llmcompressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including:\n\n* Comprehensive set of quantization algorithms for weight-only and activation quantization\n* Seamless integration with Hugging Face models and repositories\n* `safetensors`-based file format compatible with `vllm`\n* Large model support via `accelerate`\n\n**✨ Read the announcement blog [here](https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/)! 
✨**\n\n\u003cp align=\"center\"\u003e\n   \u003cimg alt=\"LLM Compressor Flow\" src=\"https://github.com/user-attachments/assets/adf07594-6487-48ae-af62-d9555046d51b\" width=\"80%\" /\u003e\n\u003c/p\u003e\n\n### Supported Formats\n* Activation Quantization: W8A8 (int8 and fp8)\n* Mixed Precision: W4A16, W8A16\n* 2:4 Semi-structured and Unstructured Sparsity\n\n### Supported Algorithms\n* Simple PTQ\n* GPTQ\n* SmoothQuant\n* SparseGPT\n\n### When to Use Which Optimization\n\nPlease refer to [docs/schemes.md](./docs/schemes.md) for detailed information about available optimization schemes and their use cases.\n\n\n## Installation\n\n```bash\npip install llmcompressor\n```\n\n## Get Started\n\n### End-to-End Examples\n\nApplying quantization with `llmcompressor`:\n* [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)\n* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)\n* [Weight-only quantization to `int4`](examples/quantization_w4a16/README.md)\n* [Quantizing MoE LLMs](examples/quantizing_moe/README.md)\n* [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)\n* [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)\n\n### User Guides\nDeep dives into advanced usage of `llmcompressor`:\n* [Quantizing large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)\n\n\n## Quick Tour\nLet's quantize `TinyLlama` with 8-bit weights and activations using the `GPTQ` and `SmoothQuant` algorithms.\n\nNote that the model can be swapped for a local or remote HF-compatible checkpoint, and the `recipe` may be changed to target different quantization algorithms or formats.\n\n### Apply Quantization\nQuantization is applied by selecting an algorithm and calling the `oneshot` API.\n\n```python\nfrom llmcompressor.modifiers.smoothquant import SmoothQuantModifier\nfrom llmcompressor.modifiers.quantization import GPTQModifier\nfrom llmcompressor import 
oneshot\n\n# Select quantization algorithm. In this case, we:\n#   * apply SmoothQuant to make the activations easier to quantize\n#   * quantize the weights to int8 with GPTQ (static per-channel)\n#   * quantize the activations to int8 (dynamic per-token)\nrecipe = [\n    SmoothQuantModifier(smoothing_strength=0.8),\n    GPTQModifier(scheme=\"W8A8\", targets=\"Linear\", ignore=[\"lm_head\"]),\n]\n\n# Apply quantization using the built-in open_platypus dataset.\n#   * See examples for demos showing how to pass a custom calibration set\noneshot(\n    model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n    dataset=\"open_platypus\",\n    recipe=recipe,\n    output_dir=\"TinyLlama-1.1B-Chat-v1.0-INT8\",\n    max_seq_length=2048,\n    num_calibration_samples=512,\n)\n```\n\n### Inference with vLLM\n\nThe checkpoints created by `llmcompressor` can be loaded and run in `vllm`:\n\nInstall:\n\n```bash\npip install vllm\n```\n\nRun:\n\n```python\nfrom vllm import LLM\nmodel = LLM(\"TinyLlama-1.1B-Chat-v1.0-INT8\")\noutput = model.generate(\"My name is\")\n```\n\n## Questions / Contribution\n\n- If you have any questions or requests, open an [issue](https://github.com/vllm-project/llm-compressor/issues) and we will add an example or documentation.\n- We appreciate contributions to the code, examples, integrations, and documentation, as well as bug reports and feature requests! [Learn how here](CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fllm-compressor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvllm-project%2Fllm-compressor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fllm-compressor/lists"}