{"id":32595013,"url":"https://github.com/dropbox/gemlite","last_synced_at":"2026-04-15T05:03:00.120Z","repository":{"id":253221253,"uuid":"823578790","full_name":"dropbox/gemlite","owner":"dropbox","description":"Fast low-bit matmul kernels in Triton","archived":false,"fork":false,"pushed_at":"2026-04-02T14:59:06.000Z","size":10614,"stargazers_count":442,"open_issues_count":0,"forks_count":34,"subscribers_count":9,"default_branch":"master","last_synced_at":"2026-04-03T10:05:39.946Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dropbox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-07-03T09:50:13.000Z","updated_at":"2026-04-02T15:37:52.000Z","dependencies_parsed_at":"2024-12-09T16:31:33.220Z","dependency_job_id":"a0ca2514-00ee-4573-ac23-80c60e620188","html_url":"https://github.com/dropbox/gemlite","commit_stats":null,"previous_names":["mobiusml/gemlite","dropbox/gemlite"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/dropbox/gemlite","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fgemlite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fgemlite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fgemlite/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2
Fgemlite/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dropbox","download_url":"https://codeload.github.com/dropbox/gemlite/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fgemlite/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31826907,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"online","status_checked_at":"2026-04-15T02:00:06.175Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-30T03:02:17.987Z","updated_at":"2026-04-15T05:03:00.101Z","avatar_url":"https://github.com/dropbox.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# GemLite\n\n\u003cdiv align=\"center\" style=\"margin-bottom: 1em;\"\u003e\n\u003ch2\u003eTriton Kernels for Efficient Low-Bit Matrix Multiplication\u003c/h2\u003e\n\n  \u003cimg src=\"images/gemlite%20banner.png\" alt=\"GemLite Logo\" height=\"150\"\u003e\n  \n  [![Twitter][mobius-twitter-badge]][mobius-twitter]\n\n  Made with ❤ by the team at [Mobius Labs](https://www.mobiuslabs.com/) for the 'Aana' (ആന : Elephant) suite of multimodal products.  \n  \n\u003c/div\u003e\n\n**GemLite** is a collection of Triton kernels designed for efficient low-bit matrix multiplication, emphasizing simplicity and reusability. 
It provides a practical solution for achieving significant performance gains, delivering up to **7-8x faster prefill** and **3-6x faster decoding** compared to default Torch AO kernels. For more detailed benchmarks, check the [Performance](#performance) section.\n\nGemLite strikes the perfect balance between **flexibility** and **performance**, allowing users to easily use and modify the codebase to develop high-performance kernels optimized for their specific hardware. We have included multiple versions of the kernels to maximize performance across different matrix shapes.\n\nThe project started with CUDA kernels, but we have switched to \u003ca href=\"https://github.com/triton-lang/triton/\"\u003eTriton\u003c/a\u003e for enhanced flexibility. For the old CUDA version, please refer to \u003ca href=\"https://github.com/dropbox/gemlite/tree/stable_cuda_only\"\u003ethis branch.\u003c/a\u003e\n\n### Result Teaser \n| End-to-end Performance (Llama3 8-bit)              | Matmul Performance (A16W8)               |\n| --------------------------------------------------- | ---------------------------------------- |\n| ![End to End Performance](https://github.com/dropbox/gemlite/blob/master/images/llama3_8bit.svg) | ![Matmul Performance](https://github.com/dropbox/gemlite/blob/master/images/8bit_gs=infeatures_32768x32768_4090RTX.svg) |\n\nExtensive performance results across different bitwidths, batch sizes, and devices are available in the [Performance](#performance) section below.\n\n# Table of Contents\n- [Recent Highlights](#recent-highlights)\n- [Getting Started](#getting-started)\n- [Deep Dive](#deep-dive)\n- [Performance](#performance)\n- [Talks and Resources](#talks-and-resources)\n- [Contributing](#contributing)\n\n# Recent Highlights\n\n- Improved performance with a focus on `sm_120`.\n- GemLite now supports MXFP4/NVFP4 for Blackwell.\n- GemLite now supports vLLM V1 and is `torch.compile` compatible.\n- GemLite now supports `bfloat16`.\n- GemLite is now available 
in \u003ca href=\"https://github.com/vllm-project/vllm/\"\u003evLLM\u003c/a\u003e via the \u003ca href=\"https://github.com/dropbox/hqq/\"\u003eHQQ\u003c/a\u003e library.\n- GemLite is now integrated with \u003ca href=\"https://github.com/pytorch/ao\"\u003eTorchAO\u003c/a\u003e/\u003ca href=\"https://github.com/sgl-project/sglang\"\u003eSGLang\u003c/a\u003e for 4-bit quantization. Check out the \u003ca href=\"https://pytorch.org/blog/accelerating-llm-inference/\"\u003eblog post\u003c/a\u003e.\n- **Major performance improvements**, especially on the A100 and H100.\n- **Flexible bit packing**: use 8-bit packing for improved batched performance on the A100 and H100 with packed data.\n- **Autotune caching**: save and load the best autotune configs across all kernels with a single line of code.\n- **Helper functions**: make it easier to get started, especially for dynamic quantization.\n- **New GEMV RevSplit-K algorithm**: outperforms GEMM Split-K and GEMV for batch size = 1 with packed data.\n- **Channel-wise scaling**: added support for channel-wise scaling for weights, activations, or both.\n- **Precision support**: includes FP16 × Wn, FP8 × FP8, FP8 × Wn, INT8 × INT8, INT8 × Wn, and MXFPn × MXFPn.\n- **`torch.compile()` support**.\n\n# Getting Started\n\n## Installation\n\n### Latest (Recommended)\n\n```bash\npip install git+https://github.com/dropbox/gemlite/\n```\n\n### Latest Stable Version\n\n```bash\npip install gemlite\n```\n\n## Usage\n\n```python\nimport gemlite\nfrom gemlite import DType, GemLiteLinear\n\ngemlite_linear = GemLiteLinear(\n    W_nbits,  # weight quantization bit width. 
supported: [8, 4, 2, 1]\n    group_size=group_size,  # any group_size divisible by 32 - enable autotune for group_size \u003c 128 (!)\n    in_features=in_features,  # input size\n    out_features=out_features,  # output size\n    input_dtype=DType.FP16,  # FP16, BF16, FP8, INT8\n    output_dtype=DType.FP16,  # FP16, BF16, FP32, FP8, INT32\n    scaled_activations=False,  # whether the activations are scaled\n)\n\n# Packing: we follow the HQQ format (W_q - zeros) * scales ~= W\n# https://github.com/dropbox/hqq/\ngemlite_linear.pack(W_q, scales, zeros, bias)\n\n# Forward\nout = gemlite_linear(x)\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eSettings\u003c/summary\u003e\n\n```python\n# Set packing width for packed data - recommended to leave this at the default value\ngemlite.set_packing_bitwidth(int)\n\n# Set the accumulation dtype - this is configured automatically.\n# On consumer GPUs, fp16 is used by default.\ngemlite.set_acc_dtype(DType)\n\n# Enable TMA - disabled by default. Only supported for MXFP/NVFP kernels\ngemlite.enable_tma(True)\n\n# Enable/disable native bfp16 atomic addition - recommended to leave this at the default value\ngemlite.set_native_atomic_bfp16(True)\n\n# Enable optimized PTX FP4 packing in the MXFP4/NVFP4 activation quant kernel - requires CUDA 13 ptxas\ngemlite.set_ptx_fp4_pack(True)\n\n# Experimental fast mode for NVFP4, using a static meta scale for activations\ngemlite.set_fast_nvfp4(True)\n\n# Use CUDA graphs for autotuning - this will slow down autotuning\ngemlite.enable_cudagraph_autotune(True)\n\n# Enable activation quantization only from a specified batch size onward.\n# Smaller batch sizes will use weight-only quantization.\ngemlite.enable_activation_scaling(int)\n\n# Enable kernel caching: makes some GEMV kernels faster,\n# but might break with some torch.compile settings\ngemlite.set_kernel_caching(True)\n```\n\n\u003c/details\u003e\n\n### Helper Functions\n\nAdditionally, we offer helper functions that operate as 
follows:\n\n```python\nfrom gemlite.helper import *\ndevice, dtype = 'cuda:0', torch.float16\n\n# AxWy: x = activation precision in bits, y = weight precision in bits.\n\n# Weight-only\ngemlite_linear = A16W8_INT8(device=device, dtype=dtype).from_linear(layer)\ngemlite_linear = A16W8_FP8(device=device, dtype=dtype).from_linear(layer)\ngemlite_linear = A16W8_HQQ_INT(device=device, dtype=dtype).from_hqqlinear(hqq_layer)\ngemlite_linear = A16W4_HQQ_INT(device=device, dtype=dtype).from_hqqlinear(hqq_layer)\ngemlite_linear = A16W2_HQQ_INT(device=device, dtype=dtype).from_hqqlinear(hqq_layer)\ngemlite_linear = A16W158_INT(device=device, dtype=dtype).from_bitlinear(bitlinear_layer)\n\n# 8-bit activation dynamic quant\ngemlite_linear = A8W8_INT8_dynamic(device=device, dtype=dtype).from_linear(layer)\ngemlite_linear = A8W8_FP8_dynamic(device=device, dtype=dtype).from_linear(layer)\ngemlite_linear = A8W4_HQQ_INT_dynamic(device=device, dtype=dtype).from_hqqlinear(hqq_layer)\ngemlite_linear = A8W158_INT_dynamic(device=device, dtype=dtype).from_bitlinear(bitlinear_layer)\n\n# MXFP weight-only\ngemlite_linear = A16W8_MXFP(device=device, dtype=dtype).from_linear(layer)\ngemlite_linear = A16W4_MXFP(device=device, dtype=dtype).from_linear(layer)\n\n# MXFP/NVFP dynamic quant - if post_scale=True, uses channel-wise activation quantization.\n# Support depends on Triton's ability to support native MXFP/NVFP MMA.\ngemlite_linear = A8W8_MXFP_dynamic(device=device, dtype=dtype, post_scale=False).from_linear(layer)\ngemlite_linear = A8W8_MXFP_dynamic(device=device, dtype=dtype, post_scale=True).from_linear(layer)\ngemlite_linear = A8W4_MXFP_dynamic(device=device, dtype=dtype, post_scale=False).from_linear(layer)\ngemlite_linear = A8W4_MXFP_dynamic(device=device, dtype=dtype, post_scale=True).from_linear(layer)\ngemlite_linear = A4W4_MXFP_dynamic(device=device, dtype=dtype).from_linear(layer)\ngemlite_linear = A4W4_NVFP_dynamic(device=device, dtype=dtype).from_linear(layer)\n```\n\nYou can 
also patch the whole model, even from CPU, as follows:\n\n```python\nfrom gemlite.helper import *\npatch_model(model, device=device, processor=A8W8_INT8_dynamic())\n```\n\n### Config Caching\n\nTriton autotuning can be time-consuming. To accelerate this process, we provide tools to automatically cache and load the optimal autotuning configurations for all kernels:\n\n```python\nimport gemlite\ngemlite.reset_config()  # resets cached configs for all kernels\ngemlite.cache_config('gemlite_config.json')  # cache\ngemlite.load_config('gemlite_config.json')  # load\n```\n\nEnsure that you use one JSON cache file per GPU model. When the cache is loaded, the kernels will skip autotuning, leading to faster startup times.\n\nYou can warm up specific shapes using the following helper function:\n\n```python\nimport gemlite\nfrom gemlite.helper import *  # warmup and the processor classes used below\n\n# Ignore pre-loaded configs if you want to start from scratch (optional)\n# gemlite.reset_config()\n\n# Set autotune mode: fast or max\n# gemlite.set_autotune(\"max\")\n\n# Autotune with the default batch sizes\nwarmup(A8W8_INT8_dynamic(), shapes=[(4096, 4096), (2048, 4096)])\n\n# You can specify batch sizes too\nwarmup(A8W8_INT8_dynamic(), shapes=[(4096, 4096), (2048, 4096)], batch_sizes=[1, 8, 64, 128])\n\n# If you want to specify the group size for HQQ-style quantization\nwarmup(A16W4_HQQ_INT(), shapes=[(4096, 4096), (2048, 4096)], group_size=64)\n\n# Cache your new config\ngemlite.cache_config('new_config.json')\n```\n\n## vLLM\n\nYou can use GemLite with vLLM via \u003ca href=\"https://github.com/pytorch/ao/\"\u003eTorchAO\u003c/a\u003e or \u003ca href=\"https://github.com/dropbox/hqq/\"\u003eHQQ\u003c/a\u003e as follows:\n\n```python\nimport torch\nfrom vllm import LLM\nfrom hqq.utils.vllm import set_vllm_onthefly_hqq_quant\nskip_modules = ['lm_head', 'visual', 'vision']\n\n# Select one of the following modes:\n\n# INT/FP format\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weightonly', skip_modules=skip_modules)  # A16W8 - INT8 
weight-only\nset_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128, quant_mode='int4_weightonly', skip_modules=skip_modules)  # A16W4 - HQQ weight-only\nset_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules)  # A8W8 - INT8 x INT8 dynamic\nset_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic', skip_modules=skip_modules)  # A8W8 - FP8 x FP8 dynamic\n\n# MXFP format\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)  # A8W8 - MXFP8 x MXFP8 - post_scale=True\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)  # A8W8 - MXFP8 x MXFP8 - post_scale=False\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weightonly', skip_modules=skip_modules)  # A16W4 - MXFP4 weight-only\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)  # A8W4 - MXFP8 x MXFP4 dynamic\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules)  # A4W4 - MXFP4 x MXFP4 dynamic\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules)  # A4W4 - NVFP4 x NVFP4 dynamic\n\n# Load your vLLM model\nllm = LLM(model=\"meta-llama/Llama-3.1-8B-Instruct\", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)\n```\n\n## Deep Dive\n\nWe implement various versions of Triton kernels:\n\n- \u003cb\u003e\u003ca href=\"https://github.com/dropbox/gemlite/blob/master/gemlite/triton_kernels/gemm.py\"\u003eGEMM\u003c/a\u003e\u003c/b\u003e: This GEMM kernel is implemented similarly to \u003ca href=\"https://github.com/fpgaminer/GPTQ-triton\"\u003eGPTQ-triton\u003c/a\u003e. Since it uses tensor cores, activations must be padded with zeros along the batch dimension to at least 16 rows. 
It supports both float32 and float16 accumulation for fp16 inputs, but only float32 accumulation for bfloat16.\n\n- \u003cb\u003e\u003ca href=\"https://github.com/dropbox/gemlite/blob/master/gemlite/triton_kernels/gemm_splitK.py\"\u003eGEMM Split-K\u003c/a\u003e\u003c/b\u003e: This Split-K GEMM kernel is implemented similarly to \u003ca href=\"https://github.com/foundation-model-stack/foundation-model-stack/blob/triton/triton/kernels/gptq/splitk_dequant_gemm.py\"\u003ethe GPTQ Split-K version\u003c/a\u003e. We build on the GEMM version above and add another grid dimension that splits the K dimension into multiple jobs that calculate partial sums, which are atomically added and then stored. Split-K performs particularly well for batched LLM decoding (batch sizes between 2 and 32).\n\n- \u003cb\u003e\u003ca href=\"https://github.com/dropbox/gemlite/blob/master/gemlite/triton_kernels/gemv.py\"\u003eGEMV\u003c/a\u003e\u003c/b\u003e: This GEMV kernel splits activations into 1D chunks, performs the dot product using `tl.sum`, and accumulates via atomic addition. It is primarily intended for use with small batch sizes (`M == 1`).\n\n- \u003cb\u003e\u003ca href=\"https://github.com/dropbox/gemlite/blob/master/gemlite/triton_kernels/gemv_revsplitK.py\"\u003eGEMV RevSplit-K\u003c/a\u003e\u003c/b\u003e:\n  This algorithm, newly introduced in GemLite, operates in contrast to the GEMM Split-K approach, but within a GEMV context. By doubling the workload per Triton program launched in the GEMV kernel, it reduces the frequency of loading scales/zeros and lowers the number of threads needed. 
As a result, this method delivers the best performance for batch size = 1 decoding.\n\nAll kernels are flexible, supporting 8-, 4-, 2-, and 1-bit weight precision, as well as float16, bfloat16, and int8/fp8 activations.\n\n## Performance\n\n### End-to-End vLLM benchmarks\n\nMake sure to use CUDA 13 `ptxas` for Blackwell:\n\n```bash\nexport TRITON_PTXAS_BLACKWELL_PATH=/usr/local/cuda-13.0/bin/ptxas\n```\n\n### Prefill (`in=1024`, `out=1`) — Llama-3.1-8B · RTX PRO 6000\n\n| Batch Size | FP16 | GemLite FP8 | RedHat FP8 | GemLite MXFP4 | GemLite NVFP4 | RedHat NVFP4 |\n|:----------:|:----:|:-----------:|:----------:|:-------------:|:-------------:|:------------:|\n| 1   | 15.4 ms  | 10.3 ms | 9.9 ms  | 7.3 ms  | 8.3 ms  | 10.5 ms |\n| 8   | 32.4 ms  | 23.3 ms | 23.6 ms | 20.5 ms | 20.4 ms | 22.2 ms |\n| 16  | 36.9 ms  | 29.8 ms | 29.2 ms | 27.1 ms | 27.7 ms | 28.5 ms |\n| 32  | 56.7 ms  | 48.1 ms | 48.0 ms | 42.6 ms | 43.9 ms | 44.4 ms |\n| 64  | 104.0 ms | 86.6 ms | 93.7 ms | 87.6 ms | 87.1 ms | 75.3 ms |\n| 128 | 198.4 ms | 164.9 ms | 153.5 ms | 151.4 ms | 143.1 ms | 141.2 ms |\n\n### Decode (`in=1`, `out=1024`) — Llama-3.1-8B · RTX PRO 6000\n\n| Batch Size | FP16 | GemLite FP8 | RedHat FP8 | GemLite MXFP4 | GemLite NVFP4 | RedHat NVFP4 |\n|:----------:|:----:|:-----------:|:----------:|:-------------:|:-------------:|:------------:|\n| 1   | 11.75s | 6.75s  | 8.00s  | 4.84s  | 5.94s  | 8.19s  |\n| 8   | 11.92s | 7.41s  | 7.78s  | 5.19s  | 6.32s  | 8.40s  |\n| 16  | 12.44s | 7.89s  | 8.23s  | 5.66s  | 6.77s  | 8.76s  |\n| 32  | 13.83s | 8.74s  | 9.53s  | 6.68s  | 7.71s  | 9.38s  |\n| 64  | 15.69s | 10.41s | 11.08s | 8.96s  | 9.24s  | 10.62s |\n| 128 | 19.32s | 14.71s | 14.65s | 12.39s | 13.34s | 13.81s |\n\n## Talks and Resources\n\nCheck out the talk by lead author \u003ca href=\"https://github.com/mobicham/\"\u003eDr. Hicham Badri\u003c/a\u003e about GemLite at [GPU MODE](https://www.youtube.com/watch?v=7c3c3bCGzKU\u0026t=4838s\u0026ab_channel=GPUMODE). 
You can also find the slides [here](https://docs.google.com/presentation/d/1R9B6RLOlAblyVVFPk9FtAq6MXR1ufj1NaT0bjjib7Vc/edit#slide=id.g310b85e2148_0_135).\n\nPlease note that GemLite is under active development, and the content discussed in the talk may evolve as the library continues to improve.\n\n## Contributing\n\nContributions are always welcome. Please feel free to raise issues, submit pull requests, or start a discussion.\n\nIf you're looking to integrate GemLite with major inference and AI libraries, we'd love to hear from you!\n\n[mobius-twitter-badge]: https://img.shields.io/twitter/follow/Mobius_Labs?style=social\n[mobius-twitter]: https://twitter.com/Mobius_Labs","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdropbox%2Fgemlite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdropbox%2Fgemlite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdropbox%2Fgemlite/lists"}