{"id":45181230,"url":"https://github.com/efeslab/Atom","last_synced_at":"2026-03-05T08:01:29.810Z","repository":{"id":215071534,"uuid":"715445399","full_name":"efeslab/Atom","owner":"efeslab","description":"[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving","archived":false,"fork":false,"pushed_at":"2024-07-02T05:54:54.000Z","size":16042,"stargazers_count":305,"open_issues_count":4,"forks_count":26,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-05-26T05:12:44.342Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/efeslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-07T06:46:09.000Z","updated_at":"2025-05-17T16:33:57.000Z","dependencies_parsed_at":"2024-06-18T16:33:05.789Z","dependency_job_id":"981f3500-0038-4611-aa5e-cc91bf4f51af","html_url":"https://github.com/efeslab/Atom","commit_stats":null,"previous_names":["efeslab/atom"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/efeslab/Atom","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FAtom","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FAtom/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FAtom/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FAtom/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/efeslab","download_url":"https://codel
oad.github.com/efeslab/Atom/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FAtom/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30115662,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T03:40:26.266Z","status":"ssl_error","status_checked_at":"2026-03-05T03:39:15.902Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-20T10:00:30.841Z","updated_at":"2026-03-05T08:01:29.791Z","avatar_url":"https://github.com/efeslab.png","language":"Cuda","funding_links":[],"categories":["Cuda"],"sub_categories":[],"readme":"# Atom: Low-bit Quantization for Efficient and Accurate LLM Serving\n[[paper](https://arxiv.org/abs/2310.19102)] [[slides](./figures/atom_mlsys_slides.pdf)]  [[poster](./figures/atom_mlsys_poster.pdf)]\n\n![overview](figures/overview_and_ppl.png)\n\nAtom is an accurate low-bit weight-activation quantization algorithm that combines (1) mixed-precision, (2) fine-grained group quantization, (3) dynamic activation quantization, (4) KV-cache quantization, and (5) efficient CUDA kernels co-design. \n\nThis codebase utilizes [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git) to evaluate perplexity and zero-shot accuracy. 
Code segments from [SmoothQuant](https://github.com/mit-han-lab/smoothquant.git), [GPTQ](https://github.com/IST-DASLab/gptq.git), and [SparseGPT](https://github.com/IST-DASLab/sparsegpt.git) are integrated to reproduce results. Our kernels are modified from a previous version of [FlashInfer](https://github.com/flashinfer-ai/flashinfer) and benchmarked with [NVBench](https://github.com/NVIDIA/nvbench/tree/main). The [Punica](https://github.com/punica-ai/punica) serving framework is integrated to evaluate end-to-end throughput and latency. We also use [BitsandBytes](https://github.com/TimDettmers/bitsandbytes) for new data-type evaluations (e.g., FP4). We thank the authors for their great work.\n\nThe current release features:\n* Simulated quantization for accuracy evaluation.\n* Perplexity and zero-shot accuracy evaluation.\n* Kernel benchmarks and end-to-end evaluation.\n\nTo do:\n- [x] Release code for reproducing results.\n- [x] Release code for end-to-end throughput evaluation.\n- [x] Add FP4 accuracy evaluation for both weight and activation quantization.\n- [x] Add support for Mixtral models.\n- [ ] Optimize kernels for different GPUs.\n- [ ] Support the full inference workflow in real production scenarios.\n\n## Abstract\nThe growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. 
However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance.\n\nTo maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to the FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.\n\n## Installation\n1. Run in a container and mount the model directory.\n```\ndocker pull nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04\ndocker run -it --gpus all -v /PATH2MODEL:/model nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 /bin/bash\n```\n2. Clone this repo (make sure Git and Conda are installed)\n```\ngit clone --recurse-submodules https://github.com/efeslab/Atom\ncd Atom\n```\n3. Prepare environment\n```\ncd model\nconda create -n atom python=3.10\nconda activate atom\npip install -r requirements.txt\n```\n4. 
Compile kernel benchmarks (optional): Install gcc-11 and CMake (\u003e= 3.24)\n```\napt-get update\napt install software-properties-common lsb-release\n\ncurl -s https://apt.kitware.com/keys/kitware-archive-latest.asc 2\u003e/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg \u003e/dev/null\napt-add-repository \"deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main\"\napt update\napt install cmake\n\ncd /PATH_TO_ATOM/kernels\nadd-apt-repository -y ppa:ubuntu-toolchain-r/test\napt-get update\napt install -y gcc-11 g++-11\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake -j\n```\n## Usage\n### Accuracy Evaluation\nBefore running the commands below, download a Llama model from the [Hugging Face website](https://huggingface.co/models?sort=trending\u0026search=llama).\nWe recommend downloading from [Deca-Llama](https://huggingface.co/linhvu/decapoda-research-llama-7b-hf/tree/main).\n\nWe provide several scripts to reproduce the results in our paper:\n\nTo run our W4A4 perplexity evaluation, please execute\n```\nbash scripts/run_atom_ppl.sh /Path/To/Llama/Model\n```\n\nTo get our W4A4 zero-shot accuracy on common-sense tasks, please execute\n```\nbash scripts/run_atom_zeroshot_acc.sh /Path/To/Llama/Model\n```\n\nTo run our ablation study on different quantization optimizations, please run\n```\nbash scripts/run_atom_ablation.sh /Path/To/Llama/Model\n```\n\n\nYou can also customize your own quantization setup by modifying the parameters. 
Check [model/main.py](./model/main.py) for a description of each parameter.\n```\npython model/main.py /Path/To/Llama/Model wikitext2 \\\n    --wbits 4 --abits 4 --a_sym --w_sym \\\n    --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \\\n    --reorder --act_sort_metric hessian \\\n    --a_clip_ratio 0.9 --w_clip_ratio 0.85 \\\n    --keeper 128 --keeper_precision 3 --kv_cache --use_gptq \\\n    --eval_ppl --eval_common_sense\n```\n### Efficiency Evaluation\nWe evaluate Atom on an RTX4090 GPU. The results below were obtained in a [cu113](https://hub.docker.com/layers/nvidia/cuda/11.3.1-cudnn8-devel-ubuntu20.04/images/sha256-052b3b515d9653f9c6e358e5b70f8bb9d75c17a8b2039055674dfa7caa970791?context=explore) Docker container. Note that the current kernels are optimized only for the RTX4090.\n\nTo get the INT4 GEMM kernel results, please execute:\n```\ncd kernels/build\n./bench_gemm_i4_o16\n```\nCheck the `Elem/s` column for the kernel's computation throughput (Flop/s).\n![gemm](figures/bench_gemm.png)\n\nOther Atom kernels can be evaluated similarly, e.g., `./bench_reorder`. We also evaluate baseline kernels. 
Please check [baselines/README.md](./kernels/baselines/README.md) to reproduce the results.\n\nTo reproduce the end-to-end throughput and latency evaluation, please check [e2e/README.md](./e2e/README.md).\n## Key Results\n### Perplexity\nWe evaluate Atom's accuracy on several model families, including Llama, Llama-2, and Mixtral, with INT4 and FP4 data types.\n* WikiText2, PTB, and C4 datasets on the Llama family:\n![perplexity](figures/atom_ppl.png)\n* WikiText2 perplexity on Llama-2 and Mixtral:\n\n  \u003cimg src=\"figures/atom_ppl_new.png\" style=\"width:75%;\"\u003e\n\n### End-to-end throughput and latency\n* Atom achieves up to 7.7x higher throughput than `FP16` at similar latency, with a fixed GPU memory budget in the serving scenario.\n![e2e](figures/atom_e2e_eval.png)\n\n## Reference\nIf you find this project helpful to your research, please consider citing our paper:\n```\n@inproceedings{MLSYS2024_5edb57c0,\n author = {Zhao, Yilong and Lin, Chien-Yu and Zhu, Kan and Ye, Zihao and Chen, Lequn and Zheng, Size and Ceze, Luis and Krishnamurthy, Arvind and Chen, Tianqi and Kasikci, Baris},\n booktitle = {Proceedings of Machine Learning and Systems},\n editor = {P. Gibbons and G. Pekhimenko and C. De Sa},\n pages = {196--209},\n title = {Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving},\n url = {https://proceedings.mlsys.org/paper_files/paper/2024/file/5edb57c05c81d04beb716ef1d542fe9e-Paper-Conference.pdf},\n volume = {6},\n year = {2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefeslab%2FAtom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fefeslab%2FAtom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefeslab%2FAtom/lists"}