{"id":23587684,"url":"https://github.com/IST-DASLab/marlin","last_synced_at":"2025-08-30T04:31:16.235Z","repository":{"id":217704239,"uuid":"744514418","full_name":"IST-DASLab/marlin","owner":"IST-DASLab","description":"FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.","archived":false,"fork":false,"pushed_at":"2024-09-04T13:35:00.000Z","size":725,"stargazers_count":661,"open_issues_count":29,"forks_count":52,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-12-27T05:03:10.652Z","etag":null,"topics":["4bit","kernel","llm","quantization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IST-DASLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-17T13:07:53.000Z","updated_at":"2024-12-23T07:14:33.000Z","dependencies_parsed_at":"2024-04-04T16:41:49.430Z","dependency_job_id":"9b3d999a-9f02-43fe-aba4-13bfcae728b4","html_url":"https://github.com/IST-DASLab/marlin","commit_stats":null,"previous_names":["ist-daslab/marlin"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/IST-DASLab/marlin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IST-DASLab%2Fmarlin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IST-DASLab%2Fmarlin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IST-DASLab%2Fmarlin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/IST-DASLab%2Fmarlin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IST-DASLab","download_url":"https://codeload.github.com/IST-DASLab/marlin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IST-DASLab%2Fmarlin/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272805294,"owners_count":24995909,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["4bit","kernel","llm","quantization"],"created_at":"2024-12-27T05:01:40.084Z","updated_at":"2025-08-30T04:31:15.874Z","avatar_url":"https://github.com/IST-DASLab.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/marlin.png\" width=\"250\"/\u003e\n\u003c/div\u003e\n\n# Marlin\n\nThis is Marlin, a **M**ixed **A**uto-**R**egressive **Lin**ear kernel (and the name of one of the planet's fastest fish), an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close to ideal (4x)\nspeedups up to batchsizes of 16-32 tokens (in contrast to the 1-2 tokens of prior work with comparable speedup). 
This makes Marlin well suited for larger-scale\nserving, speculative decoding or advanced multi-inference schemes such as CoT-Majority.\n\n## Techniques:\n\nMost modern GPUs feature FLOP-to-byte ratios of around 100-200.\nHence, as long as we perform fewer than 25-50 (tensor core) multiply-accumulates per 4-bit quantized weight, it should (theoretically) be possible to maintain a near-ideal 4x speedup over FP16 weights.\nThis means that the full performance benefits of weight-only quantization should, in principle, extend to batchsizes 4-8x larger than what is currently achieved by existing kernels.\nHowever, actually realizing this in practice is very challenging, since we essentially need to fully utilize all available GPU resources (global memory, L2 cache, shared memory, tensor cores, vector cores), *simultaneously*.\nMarlin accomplishes this through numerous techniques and optimizations, briefly sketched below:\n\n* We organize computation in such a way that all activations are essentially always fetched from L2 cache and are further reused several times within registers to make sure that repeated loading from shared memory does not become a bottleneck either.\n* We execute global weight loads asynchronously with respect to all compute operations, and also to activation loads, with a cache policy that allows immediate eviction in order to avoid unnecessarily polluting the L2 cache with values that are never reused.\n* We perform shared memory loads, whose footprint is quite significant due to relatively large activations, via double buffering to overlap them with computation and global loads.\n* We carefully order dequantization and tensor core instructions to ensure that both GPU pipelines are well saturated and do not bottleneck each other.\n* In general, both quantized weights and group scales are reshuffled offline, into a layout that gives ideal access patterns during execution, allowing, for instance, weights to be dequantized directly into tensor core organization.\n* We have 
multiple warps in a threadblock compute partial results of the same output tile, in order to achieve higher warp counts, maximizing compute and latency hiding, without increasing the output tile size, which would make good partitioning on realistic matrices difficult.\n* All loads use maximum vector length for peak efficiency and we also perform several layout transformations to guarantee that all shared memory reads and writes are conflict-free, in particular for matrix loading instructions, and that global reduction happens at minimal memory overhead.\n* We set up and unroll loops such that the majority of memory offsets are static, minimizing runtime index calculations.\n* We implement a \"striped\" partitioning scheme where the segment of tiles processed by each SM may (partially) span over multiple column \"slices\". This leads to good SM utilization on most matrix shapes, while minimizing required global reduction steps.\n* Global reduction happens directly in the output buffer (temporarily downcasting FP32 accumulators to FP16) which is kept in L2 cache; reduction operations are generally optimized to avoid any unnecessary reads or writes as well.\n* Overall, the kernel's PTX assembly was extensively analyzed in Nsight Compute, and the CUDA code features several more redundant or slightly suboptimal constructions that nevertheless compile to faster PTX.\n\n## Benchmarks:\n\nWe first compare the performance of Marlin with other popular 4-bit inference kernels, on a large matrix that can be\nideally partitioned on an NVIDIA A10 GPU. 
This allows all kernels to reach pretty much their best possible performance.\nAll kernels are executed at groupsize 128 (however, we note that scale formats are not 100% identical).\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/peak.png\" width=\"500\"/\u003e\n\u003c/div\u003e\n\nWhile existing kernels come relatively close to the optimal 3.87x speedup at batchsize 1 (note the 0.125 bits of storage\noverhead per weight from the group scales), their performance degrades quickly as the number of inputs is increased. In\ncontrast, Marlin delivers essentially ideal speedups at all batchsizes, enabling the maximum possible 3.87x speedup up\nto batchsizes around 16-32.\n\nDue to its striped partitioning scheme, Marlin also performs strongly on real (smaller) matrices and various GPUs.\nThis is demonstrated by the results below, where we benchmark, at batchsize 16, the overall runtime across all linear\nlayers in Transformer blocks of popular open-source models.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/models.png\" width=\"500\"/\u003e\n\u003c/div\u003e\n\nFinally, we also study what performance can be sustained over longer periods of time, at locked base GPU clock.\nInterestingly, we find that reduced clock speeds significantly harm the relative speedups of prior kernels, but have no\neffect on Marlin's virtually optimal performance (relative to the lower clock setting).\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/sustained.png\" width=\"500\"/\u003e\n\u003c/div\u003e\n\n## Requirements:\n\n* CUDA \u003e= 11.8 (in particular also for the `nvcc` compiler, the version of which should match the CUDA version used by torch)\n* NVIDIA GPU with compute capability \u003e= 8.0 (Ampere or Ada, Marlin is not yet optimized for Hopper)\n* `torch\u003e=2.0.0`\n* `numpy`\n\nTo run the quantization script, one also needs:\n\n* `transformers`\n* `datasets`\n* `sentencepiece`\n\n## Usage:\n\nIf all requirements are met, it should be possible to install Marlin by 
calling\n\n```\npip install .\n```\n\nin the root folder of this repository.\n\nAfterwards, the easiest way to use the Marlin kernel is via a `marlin.Layer`, a torch module representing a Marlin\nquantized layer. It allows converting a \"fake-quantized\" (dequantized values stored in FP16) `torch.nn.Linear` layer into\nthe compressed Marlin format via `marlin.Layer.pack(linear, scales)`. Alternatively, the kernel can also be called\ndirectly through `marlin.mul(...)`, provided that weights and scales have already been appropriately preprocessed (see\n`marlin.Layer.pack(...)`). The kernel itself can be found in the self-contained `marlin/marlin_cuda_kernel.cu` file,\nwhich has no dependencies beyond base CUDA and should thus be easy to integrate into other lower-level\nframeworks.\n\nCorrectness tests can be executed via `python test.py` and benchmarks via `python bench.py`. Please note that in order\nto reproduce our \"sustainable performance\" benchmarks, the GPU clocks need to be locked to their respective base values\nusing:\n\n```\nsudo nvidia-smi --lock-gpu-clocks=BASE_GPU_CLOCK --lock-memory-clocks=BASE_MEM_CLOCK\n```\n\nAdditionally, if ECC is enabled (e.g., on an A10), then the maximum achievable memory bandwidth will be 10-15% lower\nthan in the official spec sheet, as every memory request incurs checksum overhead. 
ECC can be disabled via\n\n```\nsudo nvidia-smi -e 0\n```\n\nwhich we do in our A10 benchmarks.\n\n## GPTQ Example:\n\nIn the `gptq` subfolder, we also provide a slightly improved version of the [GPTQ](https://github.com/IST-DASLab/gptq) algorithm, with better group grid clipping and non-uniform calibration sample length, that can produce Marlin-compatible 4-bit versions of Llama2 models.\nAdditionally, there is a script to evaluate such compressed models (using Marlin kernels) in the popular [LLM eval harness](https://github.com/EleutherAI/lm-evaluation-harness).\nThe script below was tested with `lm-eval-harness==0.4.0` and may not work with newer or older versions.\nHere are the corresponding sample commands (the `marlin`, `transformers` and `datasets` packages must be installed):\n\n```\n# Compress Llama2 model and export model in Marlin format.\npython llama2.py LLAMA2_CHECKPOINT --wbits 4 --save checkpoint.pt\n# Perform perplexity evaluation of uncompressed model.\npython llama2.py LLAMA2_CHECKPOINT\n# Evaluate compressed model (with Marlin kernels) in the eval harness.\npython eval.py --model hf --model_args pretrained=LLAMA2_CHECKPOINT --tasks mmlu \\\n  --marlin_checkpoint checkpoint.marlin.g128\n# Evaluate full precision baseline.\npython eval.py --model hf --model_args pretrained=LLAMA2_CHECKPOINT --tasks mmlu\n```\n\nWe measure the following WikiText and Red-Pajama perplexities, as well as MMLU zero-shot accuracy, for 4-bit (group=128) Marlin models:\n\n| Llama2 | Wiki2 (FP16) | Wiki2 (INT4) | RedPaj (FP16) | RedPaj (INT4) | MMLU (FP16) | MMLU (INT4)  |\n|:---:|:----:|:----:|:----:|:----:|:-----:|:-----:|\n| 7B  | 5.12 | 5.27 | 6.14 | 6.30 | 41.80 | 40.07 |\n| 13B | 4.57 | 4.67 | 5.67 | 5.79 | 52.10 | 51.13 |\n| 70B | 3.12 | 3.21 | 4.74 | 4.81 | 65.43 | 64.81 |\n\nWe note that this GPTQ example is currently intended mostly as a demonstration of how to produce accurate Marlin models and as an end-to-end validation of kernel correctness (rather than to be a 
flexible compression tool).\n\n## Cite:\n\nIf you found this work useful, please consider citing:\n\n```\n@article{frantar2024marlin,\n  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},\n  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},\n  journal={arXiv preprint arXiv:2408.11743},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIST-DASLab%2Fmarlin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIST-DASLab%2Fmarlin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIST-DASLab%2Fmarlin/lists"}